A Comparative Study of Responses to Retina Questions from Either Experts, Expert-Edited Large Language Models, or Expert-Edited Large Language Models Alone

Prashant D. Tailor
Lauren A. Dalvin
John J. Chen
Raymond Iezzi
Timothy W. Olsen
Brittni A. Scruggs
Andrew J. Barkmeier
Sophie J. Bakri
Edwin H. Ryan
Peter H. Tang
D. Wilkin Parke
Peter Belin
Jayanth Sridhar
David Xu, Thomas Jefferson University
Ajay E. Kuriyan, Thomas Jefferson University
Yoshihiro Yonekawa, Thomas Jefferson University
Matthew R. Starr

Document Type

Article

Publication Date

2-6-2024

Comments

This article is the author's final published version in Ophthalmology Science, Volume 4, Issue 4, 2024, Article number 100485.

The published version is available at https://doi.org/10.1016/j.xops.2024.100485.

Abstract

OBJECTIVE: To assess the quality, empathy, and safety of expert-edited large language model (LLM), human expert-created, and LLM-only responses to common retina patient questions.

DESIGN: Randomized, masked multicenter study.

PARTICIPANTS: Twenty-one common retina patient questions were randomly assigned among 13 retina specialists.

METHODS: Each expert created a response (Expert) and then edited an LLM (ChatGPT-4)-generated response to that question (Expert + artificial intelligence [AI]), timing themselves for both tasks. Five LLMs (ChatGPT-3.5, ChatGPT-4, Claude 2, Bing, and Bard) also generated responses to each question. The original question, along with the anonymized and randomized Expert + AI, Expert, and LLM responses, was evaluated by the other experts who had not written an expert response to that question. Evaluators judged quality and empathy (very poor, poor, acceptable, good, or very good) along with safety metrics (incorrect information, likelihood to cause harm, extent of harm, and missing content).

MAIN OUTCOME: Mean quality and empathy score, proportion of responses with incorrect information, likelihood to cause harm, extent of harm, and missing content for each response type.
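
As a rough illustration of how outcomes of this kind could be tabulated, the Python sketch below groups grades by response type and computes mean quality, mean empathy, and the proportion flagged for incorrect information. It is not the authors' analysis code. The 1-to-5 numeric coding of the scale, the field names, and the sample grades are all assumptions made for the example; the abstract gives only the five labels.

```python
from collections import defaultdict
from statistics import mean

# Assumed numeric coding of the five-point scale; the abstract lists labels only.
LIKERT = {"very poor": 1, "poor": 2, "acceptable": 3, "good": 4, "very good": 5}


def summarize(grades):
    """Summarize grades per response type.

    `grades` is an iterable of dicts with hypothetical keys:
    'response_type', 'quality', 'empathy', 'incorrect_information'.
    """
    by_type = defaultdict(list)
    for grade in grades:
        by_type[grade["response_type"]].append(grade)

    summary = {}
    for response_type, rows in by_type.items():
        summary[response_type] = {
            "mean_quality": mean(LIKERT[r["quality"]] for r in rows),
            "mean_empathy": mean(LIKERT[r["empathy"]] for r in rows),
            # Booleans sum as 0/1, so this is the share of flagged grades.
            "prop_incorrect": sum(r["incorrect_information"] for r in rows) / len(rows),
        }
    return summary


# Made-up grades for illustration only; these are not study data.
example = [
    {"response_type": "Expert", "quality": "good",
     "empathy": "acceptable", "incorrect_information": False},
    {"response_type": "Expert", "quality": "very good",
     "empathy": "good", "incorrect_information": True},
    {"response_type": "Expert + AI", "quality": "very good",
     "empathy": "very good", "incorrect_information": False},
]

result = summarize(example)
for response_type, stats in result.items():
    print(response_type, stats)
```

Treating the ordinal labels as equally spaced integers is a simplification that makes a mean computable; an ordinal analysis might instead compare the distribution of labels directly.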

RESULTS: There were 4008 total grades collected (2608 for quality and empathy; 1400 for safety metrics), with significant differences in both quality and empathy (

CONCLUSIONS: In this randomized, masked, multicenter study, LLM responses were comparable with expert responses in terms of quality, empathy, and safety metrics, warranting further exploration of their potential benefits in clinical settings.

FINANCIAL DISCLOSURES: Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of the article.

Recommended Citation

Tailor, Prashant D.; Dalvin, Lauren A.; Chen, John J.; Iezzi, Raymond; Olsen, Timothy W.; Scruggs, Brittni A.; Barkmeier, Andrew J.; Bakri, Sophie J.; Ryan, Edwin H.; Tang, Peter H.; Parke, D. Wilkin; Belin, Peter; Sridhar, Jayanth; Xu, David; Kuriyan, Ajay E.; Yonekawa, Yoshihiro; and Starr, Matthew R., "A Comparative Study of Responses to Retina Questions from Either Experts, Expert-Edited Large Language Models, or Expert-Edited Large Language Models Alone" (2024). Wills Eye Hospital Papers. Paper 217.
https://jdc.jefferson.edu/willsfp/217

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.

PubMed ID

38660460

Language

English
