ChatGPT Responses to Immuno-Oncology Questions Are Readable and Reproducible

ChatGPT responded to open-ended questions about immuno-oncology with human-readable and reproducible answers, although expert assessment of accuracy is still recommended. These findings were published in The Oncologist.

Several artificial intelligence (AI)-based large language models (LLMs) have been created in recent years. These tools may have clinical utility in education and awareness.

Investigators from the National Institutes of Health evaluated the ability of 3 LLMs (ChatGPT-3.5, ChatGPT-4, and Google Bard) to respond to questions about immuno-oncology. They created 60 open-ended questions regarding mechanisms, indications, toxicities, and prognosis in immuno-oncology.

The questions were submitted to each LLM in June 2023, and the responses were independently reviewed by 2 experts. If a question was answered and the answer was reproducible across 3 queries, the accuracy, readability, and relevance of the response were graded.

ChatGPT-3.5 and ChatGPT-4 answered 100% of the questions, whereas Google Bard responded to only 53.3% (P <.0001). Stratified by domain, Google Bard responded to most questions about mechanism (93.3%) and prognosis (86.7%) but fewer about indication (33.3%) and none about toxicity (0%).

Reproducibility across queries was rated as 95% for ChatGPT-4, 88.3% for ChatGPT-3.5, and 50% for Google Bard (P <.0001).

The proportion of responses that were deemed fully correct was highest for ChatGPT-4 (75.4%), followed by ChatGPT-3.5 (58.5%) and Google Bard (43.8%; P =.03).

The proportion of answers rated as highly relevant was highest for ChatGPT-3.5 (77.4%), followed by ChatGPT-4 (71.9%) and Google Bard (43.8%; P =.04).

All ChatGPT-4 answers were graded as readable, as were nearly all ChatGPT-3.5 answers (98.1%). However, only 87.5% of Google Bard answers were readable (P =.02).

The agreement between the 2 reviewers was high for all outcomes (κ range, 0.868-1).

The study was limited by the exclusion of other available LLMs, such as BingAI and Perplexity.

“ChatGPT-3.5 and ChatGPT-4 have demonstrated significant and clinically meaningful utility as decision- and research-aids in various subfields of [immuno-oncology], while Google Bard demonstrated significant limitations, especially in comparison to ChatGPT,” concluded the study authors. 

“However, the risk of inaccurate or incomplete responses was evident in all LLMs, highlighting the importance of an expert-driven verification of the information provided by these technologies.”

References:

Iannantuono GM, Bracken-Clarke D, Karzai F, Choo-Wosoba H, Gulley JL, Floudas CS. Comparison of large language models in answering immuno-oncology questions: a cross-sectional study. Oncologist. 2024;oyae009. doi:10.1093/oncolo/oyae009