For generations, the microscopic examination of tissue samples has served as the cornerstone of cancer diagnosis, a meticulous process in which highly trained pathologists scrutinize cellular structures for telltale signs of malignancy. This nuanced interpretation, akin to deciphering an anonymous examination paper, yields critical insights into the presence, type, and stage of disease, yet conventionally offers no direct pathway to identifying the individual patient. However, the advent of sophisticated artificial intelligence systems in diagnostic pathology is challenging this established paradigm, revealing an unexpected capacity for these algorithms to discern personal characteristics from the very tissue slides they are designed to analyze.
A groundbreaking study, spearheaded by researchers at Harvard Medical School, has illuminated a disconcerting reality: artificial intelligence models developed for cancer identification are not only detecting disease but are also inferring demographic attributes such as race, gender, and age directly from digital pathology images. This unintended capability raises significant concerns that inherent biases within these AI systems could produce disparate diagnostic performance across patient populations. The research team’s comprehensive evaluation of several widely deployed AI models underscored a troubling trend: the algorithms were not uniformly accurate for all individuals. Instead, their diagnostic precision varied measurably with patients’ self-reported demographic identifiers, a discrepancy the study sought to investigate and rectify.
The implications of such algorithmic bias are profound, potentially exacerbating existing health disparities and undermining the promise of AI-driven precision medicine. Recognizing the critical need to address this emergent issue, the researchers introduced a novel framework named FAIR-Path, which substantially mitigated the identified biases in the AI models tested, suggesting a promising avenue toward equitable AI in healthcare. Senior author Kun-Hsing Yu, an associate professor of biomedical informatics at HMS and an assistant professor of pathology at Brigham and Women’s Hospital, expressed surprise at the findings, noting that "reading demographics from a pathology slide is thought of as a ‘mission impossible’ for a human pathologist, so the bias in pathology AI was a surprise to us."
Professor Yu underscored the importance of proactively identifying and rectifying biases in medical AI, asserting that such biases can directly compromise diagnostic accuracy and, consequently, patient outcomes. The successful implementation of FAIR-Path offers a measure of hope, indicating that the pursuit of fairness in cancer pathology AI, and potentially in other medical AI applications, may not require radical overhauls of existing technology. The findings, supported in part by federal grants, were detailed in the December 16 issue of Cell Reports Medicine, marking a significant contribution to the discourse on responsible AI development in healthcare.
To rigorously assess the presence and extent of bias, Professor Yu and his collaborators examined four widely used pathology AI models under development for cancer diagnostics. These deep-learning systems, trained on extensive datasets of annotated pathology slides, were engineered to recognize complex biological patterns and apply this learned knowledge to novel samples. The evaluation employed a substantial, multi-institutional dataset encompassing pathology slides from twenty distinct cancer types, providing a broad and representative testbed for the AI models.
Across all four AI models, a consistent pattern of performance disparities emerged. The systems exhibited reduced accuracy when analyzing samples from specific demographic cohorts defined by race, gender, and age. For instance, the models encountered greater difficulty in distinguishing between lung cancer subtypes in patients of African American descent and in male patients. Similarly, their accuracy in classifying breast cancer subtypes was notably diminished in younger individuals. Furthermore, the models struggled to accurately detect breast, renal, thyroid, and stomach cancers in certain demographic groups. In aggregate, these performance discrepancies were observed in approximately 29 percent of the diagnostic tasks analyzed, highlighting a pervasive issue.
Professor Yu explained that these diagnostic errors stem from the AI systems’ ability to extract demographic information from the tissue images. The models then appear to lean on patterns intrinsically linked to these demographic markers when making diagnostic determinations. This was particularly unexpected "because we would expect pathology evaluation to be objective," Professor Yu said. "When evaluating images, we don’t necessarily need to know a patient’s demographics to make a diagnosis." This disconnect prompted the research team to delve deeper into why pathology AI was failing to uphold this expected standard of objectivity.
The researchers identified three principal factors contributing to the observed biases. First, the composition of training data often reflects an inherent imbalance. Obtaining tissue samples from certain demographic groups can be more challenging than from others, leading to datasets that disproportionately represent some populations while underrepresenting others. This imbalance makes it harder for AI models to accurately diagnose cancers in underrepresented groups, whether defined by race, age, or gender.
However, Professor Yu emphasized that the problem extended beyond mere data imbalance, stating, "the problem turned out to be much deeper than that." In several instances, the models exhibited poorer performance for specific demographic groups even when the number of available samples was comparable, suggesting a more complex underlying issue.
Further analysis pointed to variations in disease incidence across different populations. Certain cancers manifest with greater frequency in particular demographic groups, allowing AI models to develop a heightened level of accuracy for those specific populations. Consequently, these same models may falter when tasked with diagnosing cancers in populations where those diseases are less common.
Moreover, the research revealed that AI models possess the capability to detect subtle molecular distinctions that correlate with demographic variations. For example, these systems might identify specific mutations in genes critical for cancer progression and then utilize these mutations as predictive indicators for classifying cancer types. This reliance on demographic-associated molecular patterns can inadvertently reduce diagnostic accuracy in populations where these particular mutations are less prevalent. Professor Yu elaborated on this point, remarking, "We found that because AI is so powerful, it can differentiate many obscure biological signals that cannot be detected by standard human evaluation."
Over time, this intricate interplay can lead AI models to prioritize signals more closely associated with demographic characteristics rather than the underlying disease pathology itself, thereby compromising diagnostic efficacy across diverse patient cohorts. Collectively, these findings underscore that bias in pathology AI is not solely a function of the quality and balance of training data but is also profoundly influenced by the very mechanisms through which the models are trained to interpret visual information.
In response to these identified sources of bias, the researchers engineered FAIR-Path, a novel framework built upon an established machine-learning technique known as contrastive learning. This methodology refines the AI training process to encourage models to place greater emphasis on critical diagnostic distinctions, such as the differences between various cancer types, while simultaneously downplaying the significance of less relevant variations, including demographic attributes.
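To make the idea concrete, the sketch below shows one way such an objective could be expressed in PyTorch. It is a hypothetical illustration, not the study’s published code: the function names, the temperature value, and the `fairness_weight` penalty are assumptions introduced here for exposition. A standard supervised contrastive loss pulls together slides that share a cancer-subtype label, while a second term penalizes embeddings in which slides cluster by demographic group despite carrying different diagnoses.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss: slides sharing a diagnostic label are
    pulled together in embedding space; all other slides are pushed apart."""
    z = F.normalize(embeddings, dim=1)                  # unit-length features
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = (z @ z.T / temperature).masked_fill(self_mask, float("-inf"))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    # average log-probability assigned to each anchor's positives
    return -(log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_counts).mean()

def fairness_aware_loss(embeddings, subtype_labels, demo_labels,
                        temperature=0.1, fairness_weight=0.5):
    """Illustrative combined objective: emphasize diagnostic distinctions and
    discourage clustering by demographics. A sketch only; FAIR-Path's actual
    formulation is not reproduced here."""
    diag_term = supervised_contrastive_loss(embeddings, subtype_labels, temperature)
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.T
    # pairs that share a demographic group but carry different diagnoses
    # should not sit close together in the embedding space
    same_demo = demo_labels.unsqueeze(0) == demo_labels.unsqueeze(1)
    diff_diag = subtype_labels.unsqueeze(0) != subtype_labels.unsqueeze(1)
    mask = same_demo & diff_diag
    demo_term = sim[mask].mean() if mask.any() else sim.new_zeros(())
    return diag_term + fairness_weight * demo_term
```

Minimizing such a combined loss steers the learned representation toward diagnostic structure rather than demographic structure, which is the effect the paragraph above describes.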
The application of FAIR-Path to the evaluated AI models resulted in a dramatic reduction in diagnostic disparities, with bias decreasing by approximately 88 percent. "We show that by making this small adjustment, the models can learn robust features that make them more generalizable and fairer across different populations," Professor Yu stated. This outcome is particularly encouraging, he added, as it demonstrates that substantial improvements in fairness can be achieved even in the absence of perfectly balanced or entirely representative training datasets.
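The 88 percent figure summarizes how much the performance gaps between demographic subgroups shrank after the adjustment. As a toy illustration of how such a reduction could be quantified (the study’s exact fairness metric is not reproduced here, and the numbers below are invented), one might compare the largest subgroup accuracy gap before and after mitigation:

```python
def subgroup_accuracy_gap(correct, groups):
    """Largest accuracy difference between any two demographic subgroups.

    correct: iterable of 0/1 flags, one per case (1 = correct diagnosis)
    groups:  iterable of subgroup labels, aligned with `correct`
    """
    by_group = {}
    for flag, group in zip(correct, groups):
        by_group.setdefault(group, []).append(flag)
    accuracies = [sum(flags) / len(flags) for flags in by_group.values()]
    return max(accuracies) - min(accuracies)

# Toy numbers, invented purely for illustration:
groups   = ["A", "A", "A", "A", "B", "B", "B", "B"]
baseline = [1, 1, 1, 1, 0, 1, 0, 1]   # group B fares worse (gap = 0.50)
adjusted = [1, 1, 1, 0, 1, 1, 1, 0]   # both groups at 0.75 (gap = 0.00)

gap_before = subgroup_accuracy_gap(baseline, groups)
gap_after = subgroup_accuracy_gap(adjusted, groups)
print(f"bias reduction: {1 - gap_after / max(gap_before, 1e-9):.0%}")
```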
Looking toward the future, Professor Yu and his team are actively collaborating with institutions globally to investigate pathology AI bias in regions characterized by diverse demographics, clinical practices, and laboratory settings. They are also exploring the adaptability of FAIR-Path for scenarios where data is limited. A further area of research involves understanding the broader impact of AI-driven bias on health disparities and patient outcomes. Ultimately, Professor Yu articulated a clear vision: to develop pathology AI systems that serve as invaluable adjuncts to human experts, facilitating swift, accurate, and equitable diagnoses for every patient. "I think there’s hope that if we are more aware of and careful about how we design AI systems, we can build models that perform well in every population," he concluded.
The study’s authorship includes Shih-Yen Lin, Pei-Chen Tsai, Fang-Yi Su, Chun-Yen Chen, Fuchen Li, Junhan Zhao, Yuk Yeung Ho, Tsung-Lu Michael Lee, Elizabeth Healey, Po-Jen Lin, Ting-Wan Kao, Dmytro Vremenko, Thomas Roetzer-Pejrimovsky, Lynette Sholl, Deborah Dillon, Nancy U. Lin, David Meredith, Keith L. Ligon, Ying-Chun Lo, Nipon Chaisuriya, David J. Cook, Adelheid Woehrer, Jeffrey Meyerhardt, Shuji Ogino, MacLean P. Nasrallah, Jeffrey A. Golden, Sabina Signoretti, and Jung-Hsien Chiang. Funding for this research was generously provided by the National Institute of General Medical Sciences and the National Heart, Lung, and Blood Institute at the National Institutes of Health (grants R35GM142879, R01HL174679), the Department of Defense (Peer Reviewed Cancer Research Program Career Development Award HT9425-23-1-0523), the American Cancer Society (Research Scholar Grant RSG-24-1253761-01-ESED), a Google Research Scholar Award, a Harvard Medical School Dean’s Innovation Award, the National Science and Technology Council of Taiwan (grants NSTC 113-2917-I-006-009, 112-2634-F-006-003, 113-2321-B-006-023, 114-2917-I-006-016), and a doctoral student scholarship from the Xin Miao Education Foundation. Disclosures indicate that Dr. Ligon has served as a consultant for Travera, Bristol Myers Squibb, Servier, IntegraGen, L.E.K. Consulting, and Blaze Bioscience, holds equity in Travera, and has received research funding from Bristol Myers Squibb and Lilly. Dr. Vremenko is a cofounder and shareholder of Vectorly. The authors acknowledge the use of ChatGPT for editing selected sections to enhance readability, with the understanding that they have thoroughly reviewed and edited the content, taking full responsibility for the published article.
