Researchers identified gaps in the performance of an ensemble of artificial intelligence (AI)-based models used in the analysis of screening mammograms. The ensemble performed similarly to radiologist assessment alone overall, but its performance varied in certain subgroups of women. Study findings were reported in JAMA Network Open.
In this analysis, the researchers used an ensemble of 11 deep-learning models developed in the Digital Mammography Dialogue on Reverse Engineering Assessment and Methods (DREAM) Challenge, referred to as the challenge ensemble method (CEM). In a population of women who had undergone screening mammography, the performance of the CEM alone and of the CEM combined with a radiologist's assessment (CEM+R) was compared with that of the original radiologist assessments for screening mammography interpretation.
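The report summarized here does not detail how the 11 model outputs were combined, but a common ensembling approach is to average each model's per-examination malignancy score and apply a recall threshold; the minimal sketch below illustrates that idea with hypothetical scores and a hypothetical threshold, not study data.

```python
import numpy as np

# Hypothetical per-examination malignancy scores from 11 models
# (rows = models, columns = screening examinations); not study data.
rng = np.random.default_rng(0)
model_scores = rng.random((11, 4))

# One simple ensembling strategy: average scores across models,
# then flag examinations that exceed a recall threshold.
ensemble_scores = model_scores.mean(axis=0)
recall_threshold = 0.5  # illustrative only; the study's operating point is not given here
recall_flags = ensemble_scores >= recall_threshold

print(ensemble_scores.round(3), recall_flags)
```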
Performance of the screening approaches was evaluated using measures such as sensitivity, specificity, and the area under the receiver operating characteristic curve (AUROC). Retrospective data from routine mammography screening at a health network affiliated with the University of California, Los Angeles (UCLA) were used in the comparisons. Model performance with this dataset was also compared with performance on data from Kaiser Permanente Washington and the Karolinska Institute that had previously been used in model development and validation.
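For context on these metrics, the minimal sketch below shows one way sensitivity, specificity, and AUROC can be computed from binary outcomes and model scores using scikit-learn; the arrays are illustrative placeholders, not data from the study.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Illustrative labels only: 1 = cancer diagnosed in follow-up, 0 = no cancer.
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])
# Continuous model scores (e.g., ensemble malignancy probabilities).
y_score = np.array([0.10, 0.30, 0.80, 0.20, 0.60, 0.40, 0.05, 0.90, 0.15, 0.35])
# Binary recall decisions at an assumed operating threshold.
y_pred = (y_score >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # true-positive rate
specificity = tn / (tn + fp)   # true-negative rate
auroc = roc_auc_score(y_true, y_score)

print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f} AUROC={auroc:.3f}")
```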
Data from 37,317 examinations in the UCLA cohort were evaluated. In this cohort, the CEM model showed an AUROC of 0.85 (95% CI, 0.84-0.87), lower than the AUROC reported with the Kaiser Permanente Washington cohort (0.90) and the Karolinska Institute cohort (0.92).
Overall, the CEM+R model performed similarly to radiologist assessments. The CEM+R model had a sensitivity of 0.813, compared with 0.826 for radiologist assessments (P = .20), and a specificity of 0.925, compared with 0.930 for radiologist assessments alone (P = .18).
In subgroup analyses, the researchers saw worse performance with the CEM+R model than with radiologist assessments in women with a prior history of breast cancer and in Hispanic women. Both the CEM+R model and radiologist assessments performed worse in women with dense breasts than in women with nondense breasts.
In women with a prior history of breast cancer, the sensitivity of the CEM+R model was 0.596, compared with 0.850 for radiologist assessments (P < .001); the specificity of the CEM+R model was 0.803, compared with 0.945 for radiologist assessments (P < .001).
For Hispanic women, specificity was 0.894 with the CEM+R model, compared with 0.926 for radiologist assessments (P = .004). The researchers noted that the Kaiser Permanente Washington and Karolinska Institute cohorts previously used in model development and validation included high proportions of White women.
“The observed performance suggested that promising AI models, even when trained on large data sets, may not necessarily be generalizable to new populations,” the researchers concluded in their report. “Our study underscores the need for external validation of AI models in target populations, especially as multiple commercial algorithms arrive on the market,” they continued.
Disclosures: Some authors declared affiliations with biotech, pharmaceutical, and/or device companies. Please see the original reference for a full list of disclosures.
Reference
Hsu W, Hippe DS, Nakhaei N, et al. External validation of an ensemble model for automated mammography interpretation by artificial intelligence. JAMA Netw Open. 2022;5(11):e2242343. doi:10.1001/jamanetworkopen.2022.42343