Year: 2023 | Volume: 5 | Issue: 1 | Page: 17
Performance of chatGPT-3.5 answering questions from the Brazilian Council of Ophthalmology Board Examination
Mauro C Gobira1, Rodrigo C Moreira1, Luis F Nakayama2, Caio V S. Regatieri3, Eric Andrade4, Rubens Belfort Jr4
1 Vision Institute, Instituto Paulista de Estudos e Pesquisas em Oftalmologia, São Paulo, SP, Brazil
2 Vision Institute, Instituto Paulista de Estudos e Pesquisas em Oftalmologia; Department of Ophthalmology, Federal University of São Paulo, São Paulo, SP, Brazil; Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, MA, USA
3 Department of Ophthalmology, Federal University of São Paulo, São Paulo, SP, Brazil
4 Vision Institute, Instituto Paulista de Estudos e Pesquisas em Oftalmologia; Department of Ophthalmology, Federal University of São Paulo, São Paulo, SP, Brazil
Date of Submission: 05-Apr-2023
Date of Decision: 10-Apr-2023
Date of Acceptance: 17-Apr-2023
Date of Web Publication: 11-May-2023
Luis F Nakayama
Rua Botucatu, 821, Vila Clementino, 04023-062, São Paulo
Source of Support: None, Conflict of Interest: None
Importance: Large language models passing medical board examinations highlights problems and challenges for health-care education, test improvement, and the deployment of chatbots.
Objective: The objective of this study was to evaluate the performance of ChatGPT-3.5 in answering the Brazilian Council of Ophthalmology Board Examination.
Materials and Methods: Two independent ophthalmologists entered all questions into ChatGPT-3.5 and evaluated the responses for correctness, adjudicating disagreements. We compared the performance of ChatGPT across tests, ophthalmological themes, and mathematical questions. The included test was the 2022 Brazilian Council of Ophthalmology Board Examination, which consists of theoretical tests I and II and a theoretical–practical test.
Results: ChatGPT-3.5 answered 68 (41.46%) questions correctly, 88 (53.66%) incorrectly, and 8 (4.88%) indeterminately. On questions involving mathematical concepts, it answered 23.8% correctly. On theoretical examinations I and II, it answered 43.18% and 40.83% correctly, respectively. There was no statistically significant difference in correct answers between the tests (odds ratio 1.101, 95% confidence interval 0.548–2.215, P = 0.787) or among the test themes (P = 0.646).
Conclusion and Relevance: Our study shows that ChatGPT would not pass the Brazilian ophthalmological board examination, a specialist-level test, and that it struggled with mathematical questions. The poor performance of ChatGPT may be explained by a lack of adequate clinical data in training and by problems in question formulation; caution is recommended in deploying chatbots for ophthalmology.
Keywords: Artificial intelligence, natural language processing, standards, specialty boards
How to cite this article:
Gobira MC, Moreira RC, Nakayama LF, S. Regatieri CV, Andrade E, Jr RB. Performance of chatGPT-3.5 answering questions from the Brazilian Council of Ophthalmology Board Examination. Pan Am J Ophthalmol 2023;5:17
How to cite this URL:
Gobira MC, Moreira RC, Nakayama LF, S. Regatieri CV, Andrade E, Jr RB. Performance of chatGPT-3.5 answering questions from the Brazilian Council of Ophthalmology Board Examination. Pan Am J Ophthalmol [serial online] 2023 [cited 2023 Sep 27];5:17. Available from: https://www.thepajo.org/text.asp?2023/5/1/17/376676
Introduction
In recent years, artificial intelligence (AI) has become increasingly present in daily activities, permeating different areas of society such as social networks, product recommendations, and health care. Among deep learning approaches, natural language processing (NLP) stands out as capable of processing and analyzing text and speech. Chatbots are systems that simulate human conversation, with NLP as the key component for interactive dialog. Chatbots have demonstrated significant potential in various health-care settings that traditionally require in-person interaction. By leveraging dialog management and conversational flexibility, integrating chatbot technology into clinical practice can reduce costs and streamline workflows.
ChatGPT is powered by GPT-3.5 (Generative Pretrained Transformer; OpenAI, San Francisco, CA, USA), a large language model (LLM) exposed through a user-friendly chat interface. Extending traditional NLP models, LLM algorithms learn and recognize patterns in data and predict the next sequence of words based on the context of the preceding text. For deployment as a chat assistant, ChatGPT was fine-tuned with human-supervised feedback applied through reinforcement learning. GPT-3.5 comprises 175 billion parameters and was trained on text from the Internet, including medical articles.
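The next-word prediction at the heart of an LLM can be illustrated, at a vastly smaller scale, with a frequency-based bigram model. This is an illustrative sketch only; GPT-3.5 uses transformer networks and learned embeddings, not raw word counts:

```python
from collections import Counter, defaultdict

def train_bigram(corpus: str) -> dict:
    """Count, for each word, which words follow it in the corpus."""
    words = corpus.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following

def predict_next(model: dict, word: str) -> str:
    """Return the most frequent continuation seen in training."""
    counts = model.get(word.lower())
    return counts.most_common(1)[0][0] if counts else ""

# Toy corpus: the model simply memorizes which word usually follows which.
model = train_bigram(
    "the retina detaches the retina detaches the lens opacifies"
)
print(predict_next(model, "retina"))  # prints "detaches"
```

An LLM generalizes the same idea: instead of counting exact word pairs, it estimates the probability of the next token from the entire preceding context.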
Several studies have already evaluated the performance of AI models in answering questions in different areas of medicine. However, this is the first study to evaluate performance on specialist-level questions, in this case the Brazilian Council of Ophthalmology Board Examination (Prova Nacional de Oftalmologia [PNO]), a mandatory examination for Brazilian ophthalmologists.
This study aims to evaluate the performance of ChatGPT-3.5 in answering the Brazilian ophthalmology board examination and to assess its performance across domain-specific themes.
Materials and Methods
In this study, we used ChatGPT powered by GPT version 3.5, the most accessible and widely used GPT version, trained on data through September 2021. The examination data were collected from the 2022 Brazilian Council of Ophthalmology Board Examination, which is publicly available on the CBO website. The PNO consists of three distinct sections: theoretical test I, covering basic ophthalmological sciences; theoretical test II, covering clinical–surgical ophthalmology; and a theoretical–practical test, comprising clinical cases with videos and images.
The 2022 board examination consisted of 50 questions in theoretical test I, 125 questions in theoretical test II, and 50 questions in the theoretical–practical test. We included all questions from the theoretical sections and excluded the theoretical–practical test, questions that included images or videos, and nullified questions.
The questions were categorized by ophthalmological theme into the following subgroups: anatomy, pharmacology, embryology, glaucoma, retina, cornea, clinical optics, oculoplastics/orbit, neuro-ophthalmology, strabismus, cataract, uveitis, oncology, refractive surgery, contact lens, and genetics. The selected questions were converted to plain text and entered into ChatGPT-3.5 in Portuguese. Multiple-choice questions were entered in full without forced justification. To reduce memory retention bias, a new chat session was started in ChatGPT for each entry.
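The one-fresh-session-per-question protocol can be sketched as independent request payloads; the question strings below are hypothetical placeholders, and a real run would send each payload to the OpenAI chat API as a brand-new conversation so that no prior context carries over:

```python
def build_request(question_text: str) -> dict:
    """Build a standalone chat payload: a fresh message list per
    question means the model retains no memory of earlier items."""
    return {
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": question_text}],
    }

# Hypothetical question list; each payload is independent of the others.
questions = ["Qual estrutura ocular...?", "Qual a potencia da lente...?"]
payloads = [build_request(q) for q in questions]
```

Because every payload starts its own message list, an answer to one question cannot influence the next, which is the bias the authors sought to avoid.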
Two ophthalmologists (MCGGF and RGCM) independently submitted all the questions and scored the responses for accuracy [Figure 1]. The accuracy of answers was evaluated by comparing them with the examination key and classifying the responses as adequate, inadequate, or indeterminate. Responses were considered "adequate" when the final answer aligned with the CBO answer key. "Inadequate" responses were those in which an incorrect answer option was chosen. Responses were deemed "indeterminate" when they met one of two criteria: (1) the answer was not among the available options or (2) the AI stated that there were insufficient data to provide a confident answer. After individual evaluation, the readers adjudicated disagreements to reach a final decision. We also assessed the accuracy of ChatGPT-3.5's responses to questions involving mathematical concepts.
For statistical analysis, indeterminate and inadequate answers were grouped as "wrong" and adequate answers as "correct"; the passing threshold for the examination is 65%. Descriptive and Chi-square analyses were performed to compare performance across the tests and ophthalmological themes. To assess the influence of each ophthalmological theme on ChatGPT-3.5's success on the test, we performed a univariate logistic regression analysis. Inter-rater agreement between the ophthalmologists was evaluated with the Cohen kappa (κ) statistic.
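The inter-rater agreement statistic can be computed directly from the two graders' labels; a minimal sketch for the binary correct/wrong case follows, with invented ratings for illustration (the study's actual labels are not reproduced here):

```python
def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for the agreement
    expected by chance given each rater's label frequencies."""
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_expected = sum(
        (rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels
    )
    return (p_observed - p_expected) / (1 - p_expected)

# Invented example: four questions graded correct (1) or wrong (0).
a = [1, 1, 1, 0]
b = [1, 1, 0, 0]
print(cohen_kappa(a, b))  # prints 0.5
```

Values between 0.61 and 0.80, such as the 0.671 reported below, are conventionally read as "substantial" agreement.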
In all hypothesis testing, a two-sided test was employed with a statistical significance level of α = 0.05, and statistical analysis was performed using Python 3.9 libraries.
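The reported odds ratio comparing theoretical examinations I and II can be reproduced from the 2×2 table of correct/incorrect counts given in the Results (19/44 vs. 49/120), using the standard log-odds-ratio Wald confidence interval:

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI from a 2x2 table:
    a, b = correct/incorrect in group 1; c, d = in group 2."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Theoretical test I: 19 correct, 25 not; test II: 49 correct, 71 not.
or_, lo, hi = odds_ratio_ci(19, 25, 49, 71)
print(f"OR {or_:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```

This reproduces the paper's odds ratio of 1.101 with a confidence interval of roughly 0.547–2.215 (the lower bound differs from the reported 0.548 only by rounding).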
Results
A total of 164 questions were included for evaluation: 44 in theoretical examination I and 120 in theoretical examination II. Eleven questions were excluded from the study (seven nullified and four containing images).
The Cohen's kappa agreement between evaluators was 0.671 (substantial agreement), with 40 answers adjudicated. In the adjudicated results, ChatGPT answered 68 (41.46%) questions correctly, 88 (53.66%) incorrectly, and 8 (4.88%) indeterminately. On theoretical examinations I and II, it correctly answered 19 (43.18%) and 49 (40.83%) questions, respectively. There was no statistically significant difference in correct answers between theoretical examinations I and II (odds ratio 1.101, 95% confidence interval 0.548–2.215, P = 0.787).
A comparison of ChatGPT-3.5 results across ophthalmology themes showed the best performance in refractive surgery (100%) and oncology (100%) and the worst performance in cataract (25%) and retina (23.08%). Among the 21 mathematical questions, ChatGPT answered 5 (23.8%) correctly. There was no statistically significant difference among the theme groups (P = 0.646) for correct answers. In the logistic regression for passing the examination, no group reached statistical significance. The comparison among groups is described in [Figure 2].
Figure 2: ChatGPT across ophthalmological themes, showing the number of correct answers
Discussion
Our study shows that ChatGPT-3.5 would fail the Brazilian ophthalmological board examination. The AI system had an overall accuracy of 41.46% on 164 specialist-level ophthalmology questions. Generative language models have shown promising results on medical examinations, including the United States Medical Licensing Examination.
Comparing the ophthalmology themes revealed a large discrepancy in response accuracy, ranging from 100% (refractive surgery and oncology) to 23.08% (retina). Differences in question characteristics, for example, refractive surgery questions being more straightforward and less ambiguous, may contribute to the varying performance. A lack of adequate clinical data during model training may explain the poor performance of ChatGPT-3.5 on a specialist-level examination.
Consistent with previous reports, ChatGPT-3.5 interpreted the PNO's mathematical questions, which involve formulas and calculations, poorly, with an accuracy of 23.8%, no better than chance.
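Whether 23.8% accuracy on the mathematical questions exceeds chance can be checked with an exact binomial tail probability. The sketch below assumes 21 mathematical questions with 5 correct and five answer options per question; the 20% guessing baseline is our assumption for illustration, not a figure stated by the examination:

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): exact upper-tail probability."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# 5 of 21 correct vs. an assumed 20% guessing baseline (5-option questions).
p_value = binom_tail(5, 21, 0.2)
print(round(p_value, 3))  # ~0.414: consistent with chance-level guessing
```

A tail probability of roughly 0.41 means a pure guesser would score at least this well about four times in ten, supporting the "no better than chance" reading.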
More recently, GPT-4 has been designed to overcome the limitations of earlier versions; at the time of writing, it was in beta testing and not yet widely available for commercial use. GPT-4 still faces reliability challenges, with a tendency to generate erroneous information and commit reasoning errors, commonly referred to as "hallucinations."
LLMs' success on medical board examinations highlights imperfections in the medical education system; however, the poor performance reported here on the Brazilian ophthalmological board examination may also spotlight ambiguity and inconsistency in the questions themselves.
Conclusion
Our study shows that ChatGPT would not pass the Brazilian ophthalmological board examination, a specialist-level test, and that it struggled with mathematical questions. The poor performance of ChatGPT may be explained by a lack of adequate clinical data in training and by problems in question formulation; caution is recommended in deploying chatbots for ophthalmology. Additional studies are needed to verify the accuracy of ChatGPT across different specialist-level medical examinations.
This article has not been presented at meetings or submitted elsewhere for publication. The final version was approved by all authors.
Financial support and sponsorship
LFN is a researcher supported by Lemann Foundation, Instituto da Visão-IPEPO.
Conflicts of interest
There are no conflicts of interest.
References
Jiang F, Jiang Y, Zhi H, Dong Y, Li H, Ma S, et al. Artificial intelligence in healthcare: Past, present and future. Stroke Vasc Neurol 2017;2:230-43.
Bawack RE, Wamba SF, Carillo KD, Akter S. Artificial intelligence in E-Commerce: A bibliometric study and literature review. Electron Mark 2022;32:297-338.
Névéol A, Zweigenbaum P. Clinical natural language processing in 2014: Foundational methods supporting efficient healthcare. Yearb Med Inform 2015;10:194-8.
Abashev A, Grigoryev R, Grigorian K, Boyko V. Programming tools for messenger-based chatbot system organization: Implication for outpatient and translational medicines. Bionanoscience 2017;7:403-7.
Oh KJ, Lee D, Ko B, Choi HJ. A chatbot for psychiatric counseling in mental healthcare service based on emotional dialogue analysis and sentence generation. 2017 18th IEEE International Conference on Mobile Data Management (MDM); 2017. p. 371-5.
Liévin V, Hother CE, Winther O. Can large language models reason about medical questions? arXiv preprint 2022. doi:10.48550/arXiv.2207.08143.
Rajkomar A, Dean J, Kohane I. Machine learning in medicine. N Engl J Med 2019;380:1347-58.
Topol EJ. High-performance medicine: The convergence of human and artificial intelligence. Nat Med 2019;25:44-56.
Obermeyer Z, Emanuel EJ. Predicting the future – Big data, machine learning, and clinical medicine. N Engl J Med 2016;375:1216-9.
Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health 2023;2:e0000198.