A framework for health equity assessment of machine learning performance

Posted by Mike Schaekermann, Research Scientist, Google Research, and Ivor Horn, Chief Health Equity Officer & Director, Google Core

Health equity is a major societal concern worldwide, with disparities having many causes. These causes include limitations in access to healthcare, differences in clinical treatment, and even fundamental differences in diagnostic technology. In dermatology, for example, skin cancer outcomes are worse for populations such as minorities, those with lower socioeconomic status, or individuals with limited healthcare access. While there is great promise in recent advances in machine learning (ML) and artificial intelligence (AI) to help improve healthcare, this transition from research to bedside must be accompanied by a careful understanding of whether and how these technologies impact health equity.

Health equity is defined by public health organizations as fairness of opportunity for everyone to be as healthy as possible. Importantly, equity may be different from equality. For example, people with greater barriers to improving their health may require more or different effort to experience this fair opportunity. Similarly, equity is not fairness as defined in the AI for healthcare literature. Whereas AI fairness often strives for equal performance of the AI technology across different patient populations, this does not center the goal of prioritizing performance with respect to pre-existing health disparities.

Health equity considerations. An intervention (e.g., an ML-based tool, indicated in dark blue) promotes health equity if it helps reduce existing disparities in health outcomes (indicated in lighter blue).

In “Health Equity Assessment of machine Learning performance (HEAL): a framework and dermatology AI model case study”, published in The Lancet eClinicalMedicine, we propose a methodology to quantitatively assess whether ML-based health technologies perform equitably. In other words, does the ML model perform well for those with the worst health outcomes for the condition(s) the model is meant to address? This goal anchors on the principle that health equity should prioritize and measure model performance with respect to disparate health outcomes, which may be due to a number of factors that include structural inequities (e.g., demographic, social, cultural, political, economic, environmental and geographic).

The health equity framework (HEAL)

The HEAL framework proposes a 4-step process to estimate the likelihood that an ML-based health technology performs equitably:

Identify factors associated with health inequities and define tool performance metrics,
Identify and quantify pre-existing health disparities,
Measure the performance of the tool for each subpopulation,
Measure the likelihood that the tool prioritizes performance with respect to health disparities.

The final step’s output is termed the HEAL metric, which quantifies how anticorrelated the ML model’s performance is with health disparities. In other words, does the model perform better for populations that have worse health outcomes?
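The idea behind the HEAL metric can be illustrated with a minimal sketch. The post describes the metric only as a likelihood derived from the anticorrelation between model performance and health outcomes; the choice of Spearman rank correlation and a bootstrap over cases below is an assumption of this sketch, not a confirmed description of the authors' implementation. All function names and toy numbers are hypothetical.

```python
import random


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def anticorrelation(burden, performance):
    """Negated Spearman correlation between health outcomes and performance.

    burden: per-subgroup disease burden (higher = worse health outcomes).
    performance: per-subgroup model performance (higher = better).
    A positive value means performance is higher where outcomes are worse.
    """
    outcome_quality = [-b for b in burden]  # higher = better health outcomes
    return -pearson(ranks(outcome_quality), ranks(performance))


def heal_metric(burden, case_results, n_boot=1000, seed=0):
    """Share of bootstrap resamples with anticorrelation > 0 (assumed estimator).

    burden: dict mapping subgroup -> burden value.
    case_results: dict mapping subgroup -> list of 0/1 per-case correctness.
    """
    rng = random.Random(seed)
    groups = sorted(burden)
    favourable = 0
    for _ in range(n_boot):
        perf = []
        for g in groups:
            # Resample cases within each subgroup, with replacement.
            sample = rng.choices(case_results[g], k=len(case_results[g]))
            perf.append(sum(sample) / len(sample))
        if anticorrelation([burden[g] for g in groups], perf) > 0:
            favourable += 1
    return favourable / n_boot


# Hypothetical toy data: three subgroups with made-up burden and results.
burden = {"A": 300.0, "B": 200.0, "C": 100.0}
case_results = {
    "A": [1] * 90 + [0] * 10,
    "B": [1] * 80 + [0] * 20,
    "C": [1] * 70 + [0] * 30,
}
print(heal_metric(burden, case_results))
```

A value near 1 would indicate that, under resampling, the model almost always performs better for the subgroups with the worse health outcomes; a value near 0 would indicate the opposite ordering.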

This 4-step process is designed to inform improvements for making ML model performance more equitable, and is meant to be iterative and re-evaluated on a regular basis. For example, the availability of health outcomes data in step (2) can inform the choice of demographic factors and brackets in step (1), and the framework can be applied again with new datasets, models and populations.

Framework for Health Equity Assessment of machine Learning performance (HEAL). Our guiding principle is to avoid exacerbating health inequities, and these steps help us identify disparities and assess for inequitable model performance to move towards better outcomes for all.

With this work, we take a step towards encouraging explicit assessment of the health equity considerations of AI technologies, and encourage prioritization of efforts during model development to reduce health inequities for subpopulations exposed to structural inequities that can precipitate disparate outcomes. We should note that the present framework does not model causal relationships and, therefore, cannot quantify the actual impact a new technology will have on reducing health outcome disparities. However, the HEAL metric may help identify opportunities for improvement, where the current performance is not prioritized with respect to pre-existing health disparities.

Case study on a dermatology model

As an illustrative case study, we applied the framework to a dermatology model, which uses a convolutional neural network similar to that described in prior work. This example dermatology model was trained to classify 288 skin conditions using a development dataset of 29k cases. The input to the model consists of three photos of a skin concern along with demographic information and a brief structured medical history. The output consists of a ranked list of possible matching skin conditions.

Using the HEAL framework, we evaluated this model by assessing whether it prioritized performance with respect to pre-existing health outcomes. The model was designed to predict possible dermatologic conditions (from a list of hundreds) based on photos of a skin concern and patient metadata. Evaluation of the model is done using a top-3 agreement metric, which quantifies how often the top 3 output conditions match the most likely condition as suggested by a dermatologist panel. The HEAL metric is computed via the anticorrelation of this top-3 agreement with health outcome rankings.
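As a rough illustration of the top-3 agreement metric described above, the sketch below counts how often a single reference condition appears among the first three entries of the model's ranked list. It assumes one reference label per case (standing in for the dermatologist panel's most likely condition); the function name and the example condition labels are hypothetical.

```python
def top3_agreement(ranked_predictions, reference_conditions):
    """Fraction of cases whose reference condition is in the model's top 3.

    ranked_predictions: list of ranked condition lists, one per case.
    reference_conditions: list of reference labels, one per case.
    """
    if len(ranked_predictions) != len(reference_conditions):
        raise ValueError("predictions and references must have equal length")
    if not reference_conditions:
        raise ValueError("at least one case is required")
    hits = sum(
        reference in ranked[:3]
        for ranked, reference in zip(ranked_predictions, reference_conditions)
    )
    return hits / len(reference_conditions)


# Hypothetical example: the first case is a hit, the second is a miss.
predictions = [
    ["eczema", "psoriasis", "acne", "tinea"],
    ["acne", "rosacea", "eczema"],
]
references = ["acne", "melanoma"]
print(top3_agreement(predictions, references))
```

Computing this value separately for each subpopulation gives the per-subgroup performance figures that are then compared against health outcome rankings.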

We used a dataset of 5,420 teledermatology cases, enriched for diversity in age, sex and race/ethnicity, to retrospectively evaluate the model’s HEAL metric. The dataset consisted of “store-and-forward” cases from patients of 20 years or older from primary care providers in the USA and skin cancer clinics in Australia. Based on a review of the literature, we decided to explore race/ethnicity, sex and age as potential factors of inequity, and used sampling techniques to ensure that our evaluation dataset had sufficient representation of all race/ethnicity, sex and age groups. To quantify pre-existing health outcomes for each subgroup we relied on measurements from public databases endorsed by the World Health Organization, such as Years of Life Lost (YLLs) and Disability-Adjusted Life Years (DALYs; years of life lost plus years lived with disability).

HEAL metric for all dermatologic conditions across race/ethnicity subpopulations, including health outcomes (YLLs per 100,000), model performance (top-3 agreement), and rankings for health outcomes and tool performance.
(* Higher is better; measures the likelihood the model performs equitably with respect to the axes in this table.)

HEAL metric for all dermatologic conditions across sexes, including health outcomes (DALYs per 100,000), model performance (top-3 agreement), and rankings for health outcomes and tool performance. (* As above.)

Our analysis estimated that the model was 80.5% likely to perform equitably across race/ethnicity subgroups and 92.1% likely to perform equitably across sexes.

However, while the model was likely to perform equitably across age groups for cancer conditions specifically, we discovered that it had room for improvement across age groups for non-cancer conditions. For example, those 70+ have the poorest health outcomes related to non-cancer skin conditions, yet the model didn't prioritize performance for this subgroup.

HEAL metrics for all cancer and non-cancer dermatologic conditions across age groups, including health outcomes (DALYs per 100,000), model performance (top-3 agreement), and rankings for health outcomes and tool performance. (* As above.)

Putting things in context

For holistic evaluation, the HEAL metric cannot be employed in isolation. Instead, this metric should be contextualized alongside many other factors ranging from computational efficiency and data privacy to ethical values, and aspects that may influence the results (e.g., selection bias or differences in representativeness of the evaluation data across demographic groups).

As an adversarial example, the HEAL metric can be artificially improved by deliberately reducing model performance for the most advantaged subpopulation until performance for that subpopulation is worse than all others. For illustrative purposes, given subpopulations A and B where A has worse health outcomes than B, consider the choice between two models: Model 1 (M1) performs 5% better for subpopulation A than for subpopulation B. Model 2 (M2) performs 5% worse on subpopulation A than B. The HEAL metric would be higher for M1 because it prioritizes performance on a subpopulation with worse outcomes. However, M1 may have absolute performances of just 75% and 70% for subpopulations A and B respectively, while M2 has absolute performances of 75% and 80% for subpopulations A and B respectively. Choosing M1 over M2 would lead to worse overall performance, because one subpopulation (B) is worse off while no subpopulation is better off.

Accordingly, the HEAL metric should be used in conjunction with a Pareto condition (discussed further in the paper), which restricts model changes so that outcomes for each subpopulation are either unchanged or improved compared to the status quo, and performance does not worsen for any subpopulation.
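The Pareto condition can be illustrated with the M1 and M2 figures from the example above. This is a minimal sketch, not the formulation used in the paper: the function name is hypothetical, and treating M2 as the status quo is an assumption made only to show how the check would reject M1.

```python
def satisfies_pareto(status_quo, candidate):
    """True if no subpopulation performs worse under the candidate model.

    Both arguments map subpopulation name -> performance (higher is better).
    """
    if set(status_quo) != set(candidate):
        raise ValueError("both models must cover the same subpopulations")
    return all(candidate[group] >= status_quo[group] for group in status_quo)


# Absolute performances taken from the illustrative example in the text.
m1 = {"A": 0.75, "B": 0.70}
m2 = {"A": 0.75, "B": 0.80}

# Assumption for illustration: M2 is the status quo and M1 is the proposed change.
print(satisfies_pareto(status_quo=m2, candidate=m1))  # subpopulation B gets worse
print(satisfies_pareto(status_quo=m1, candidate=m2))  # no subpopulation gets worse
```

Under this check, M1 would be rejected as a replacement for M2 even though it has the higher HEAL metric, because it lowers performance for subpopulation B without improving it for A.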

The HEAL framework, in its current form, assesses the likelihood that an ML-based model prioritizes performance for subpopulations with respect to pre-existing health disparities. This differs from the goal of understanding whether ML will reduce disparities in outcomes across subpopulations in reality. Specifically, modeling improvements in outcomes requires a causal understanding of steps in the care journey that happen both before and after use of any given model. Future research is needed to address this gap.

Conclusion

The HEAL framework enables a quantitative assessment of the likelihood that health AI technologies prioritize performance with respect to health disparities. The case study demonstrates how to apply the framework in the dermatological domain, indicating a high likelihood that model performance is prioritized with respect to health disparities across sex and race/ethnicity, but also revealing the potential for improvements for non-cancer conditions across age. The case study also illustrates limitations in the ability to apply all recommended aspects of the framework (e.g., mapping societal context, availability of data), thus highlighting the complexity of health equity considerations of ML-based tools.

This work is a proposed approach to address a grand challenge for AI and health equity, and may provide a useful evaluation framework not only during model development, but during pre-implementation and real-world monitoring stages, e.g., in the form of health equity dashboards. We hold that the strength of the HEAL framework is in its future application to various AI tools and use cases and its refinement in the process. Finally, we acknowledge that a successful approach towards understanding the impact of AI technologies on health equity needs to be more than a set of metrics. It will require a set of goals agreed upon by a community that represents those who will be most impacted by a model.

Acknowledgements

The research described here is joint work across many teams at Google. We are grateful to all our co-authors: Terry Spitz, Malcolm Pyles, Heather Cole-Lewis, Ellery Wulczyn, Stephen R. Pfohl, Donald Martin, Jr., Ronnachai Jaroensri, Geoff Keeling, Yuan Liu, Stephanie Farquhar, Qinghan Xue, Jenna Lester, Cían Hughes, Patricia Strachan, Fraser Tan, Peggy Bui, Craig H. Mermel, Lily H. Peng, Yossi Matias, Greg S. Corrado, Dale R. Webster, Sunny Virmani, Christopher Semturs, Yun Liu, and Po-Hsuan Cameron Chen. We also thank Lauren Winer, Sami Lachgar, Ting-An Lin, Aaron Loh, Morgan Du, Jenny Rizk, Renee Wong, Ashley Carrick, Preeti Singh, Annisah Um’rani, Jessica Schrouff, Alexander Brown, and Anna Iurchenko for their support of this project.
