Krayknot

In a study presented at the European Respiratory Society (ERS) Congress in Vienna, ChatGPT outperformed trainee doctors in assessing complex respiratory conditions in children, including cystic fibrosis, asthma, and chest infections.

The research also found that Google's Bard chatbot outperformed the trainees in some areas, while Microsoft's Bing performed at about the same level as them. These findings suggest that large language models (LLMs) such as ChatGPT could help trainee doctors, nurses, and general practitioners triage patients more efficiently, easing pressure on healthcare services.

Dr. Manjith Narayanan, a pediatric pulmonology consultant at the Royal Hospital for Children and Young People in Edinburgh and honorary senior clinical lecturer at the University of Edinburgh, led the study. He noted, "Large language models have gained prominence recently due to their ability to simulate human-like conversation. This research aimed to assess their potential to assist clinicians in real-world scenarios."

The study used clinical scenarios from pediatric respiratory medicine, covering topics such as cystic fibrosis and asthma, which were supplied by six experts in the field. Ten trainee doctors with less than four months of experience in pediatrics were asked to solve these scenarios using internet resources, and the same scenarios were presented to the three chatbots.

All responses were then rated by six pediatric respiratory experts for accuracy, comprehensiveness, usefulness, plausibility, and coherence. ChatGPT (version 3.5) achieved an average score of seven out of nine, and its answers were judged more human-like than those of the other chatbots. Bard averaged six out of nine; it was rated more coherent than the trainee doctors but otherwise performed no better or worse than them. Bing averaged four out of nine, the same as the trainee doctors, and its responses were more readily identified as machine-generated.

Dr. Narayanan commented, "This is the first study we know of that tests LLMs against trainee doctors in real-life clinical situations, focusing on practical application rather than just memory. It highlights the potential for LLMs to assist in everyday clinical practice."

The study found no obvious instances of "hallucinations" (false information presented as fact) from any of the chatbots, although Dr. Narayanan stressed that clinicians need to remain aware of this risk and take steps to mitigate it. The researchers now plan to test chatbots against senior doctors and to evaluate newer LLMs.

Hilary Pinnock, ERS Education Council Chair and Professor of Primary Care Respiratory Medicine at the University of Edinburgh, commented on the study's implications: "While it's promising to see AI like ChatGPT tackle complex cases, we must ensure that it does not introduce errors or biases before integrating it into routine clinical practice. Extensive testing and evaluation are essential to ensure clinical accuracy, organizational efficiency, and societal impact."