
Brown Neurosurgery Department tests artificial intelligence models on written, oral neurosurgery board exam questions

AI models pass with 'flying colors' on written exam, perform 'superbly' on oral exam


The text-based models tested scored significantly lower in imaging-related questions that require more complex reasoning.

Courtesy of Rohaid Ali

The Brown Neurosurgery Department recently published two preprints comparing the performance of the artificial intelligence large language models ChatGPT, GPT-4 and Google Bard on neurosurgery board examinations and a neurosurgery board preparatory question bank.

They found that these AI models were able to pass the written exams with "flying colors." When challenged to answer the more complicated oral exam questions, which require higher-order thinking based on clinical experience and exposure, the models still performed "superbly," said Ziya Gokaslan, professor and chair of neurosurgery at the Warren Alpert Medical School and neurosurgeon-in-chief at Rhode Island Hospital and The Miriam Hospital.

Since its publication, the preprint focused on the oral board exam questions has ranked in the 99th percentile of a research-attention metric that tracks over 23 million online research outputs.

"It's such an exploding story in the world and in medicine," said Warren Alpert Professor of Neurosurgery Albert Telfeian, who is also the director of minimally invasive endoscopic spine surgery at RIH and director of pediatric neurosurgery at Hasbro Children's Hospital.


Inspiration for the study and key findings

The project was inspired when fifth-year neurosurgery resident and co-first author Rohaid Ali was studying for his neurosurgery board exam with his close friend from Stanford Medical School, Ian Connolly, another co-first author and fourth-year neurosurgery resident at Massachusetts General Hospital. They had seen that ChatGPT was able to pass other standardized exams and wanted to test whether ChatGPT could answer any of the questions on their exam.

This prompted Ali and Connolly to execute these studies in collaboration with their third co-first author, Oliver Tang '19 MD'23. They found that GPT-4 was "better than the average human test taker" and ChatGPT and Google Bard were at the "level of the average neurosurgery resident who took these mock exams," Ali said.

"One of the most interesting aspects" of the study was the comparison between the AI models, as there have been "very few structured head-to-head comparisons of (them) in any fields," said Wael Asaad, associate professor of neurosurgery and neuroscience at Warren Alpert and director of the functional epilepsy and neurosurgery program at RIH. The findings are "really exciting beyond just neurosurgery," he added.

The preprint found that GPT-4 outperformed the other LLMs, receiving a score of 82.6% on a series of higher-order case management scenarios presented in mock neurosurgery oral board exam questions.

Asaad noted that GPT-4 was expected to outperform ChatGPT, which came out before GPT-4, as well as Google Bard. "Google sort of rushed to jump in and … that rush shows in the sense that (Google Bard) doesn't perform nearly as well."

But these models still have limitations: As text-based models cannot see images, they scored significantly lower in imaging-related questions that require higher-order reasoning. They also asserted false facts, referred to as "hallucinations," in answers to these questions.

One question, for example, presented an image of a highlighted portion of an arm and asked which nerve innervated the sensory distribution in the area. GPT-4 correctly assessed that it could not answer the question because it is a text-based model and could not view the image, while Google Bard responded with an answer that was "completely made up," Ali said.

"It's important to address the viral social media attention that these (models) have gained, which suggests that (they) could be a brain surgeon, but also important to clarify that these models are not yet ready for primetime and should not be considered a replacement for human activities currently," Ali added. "As neurosurgeons, it's crucial that we safely integrate AI models for patient usage and actively investigate their blind spots to ensure the best possible care for the patients."

Asaad added that in real clinical scenarios, neurosurgeons could receive misleading or irrelevant information. The LLMs "don't perform very well in these real-world scenarios that are more open-ended and less clear cut," he said.


Ethical considerations with medicine and AI

There were also instances where the AI models' correct responses to certain scenarios surprised the researchers.

For one question about a severe gunshot injury to the head, the answer was that there is likely no surgical intervention that would meaningfully alter the trajectory of the disease course. "Fascinatingly, these AI chatbots were willing to select that answer," Ali said.

"That's something that we didn't expect (and) something that's worth considering," Ali said. "If these AI models were going to be giving us ethical recommendations in this area, what implications does that have for our field or field of medicine more broadly?"


Another concern is that these models are trained on data from clinical trials that have historically underrepresented certain disadvantaged communities. "We must be vigilant about potential risks of propagating health disparities and address these biases … to prevent harmful recommendations," Ali said.

Asaad added that "it's not something that's unique to those systems … a lot of humans have bias … so it's just a matter of trying to understand that bias and engineer it out of the system."

Telfeian also addressed the importance of human connections between doctors and patients that AI models still lack. "If your doctor established some common ground with you … to say 'oh, you're from here, or you went to this school' … then suddenly you're more willing to accept what they would recommend," he said.

"Taking the surgeon out of the equation is not in the foreseeable future," said Curt Doberstein, professor of neurosurgery at Warren Alpert and director of cerebrovascular surgery at RIH. "I see (AI) as a great aid to both patients and physicians, but there are just a lot of capabilities that don't exist yet."

Future of AI in medicine

With regard to the future of AI models in medicine, Asaad predicted that "the human factor will slowly be dialed back, and anybody who doesn't see it that way, who thinks that there's something magical about what humans do … is missing the deeper picture of what it means to have intelligence."

"Intelligence isn't magic. It's just a process that we are beginning to learn how to replicate in artificial systems," Asaad said.

Asaad also said that he sees future applications of AI in serving as assistants to medical providers.

Because the field of medicine is rapidly advancing, it is difficult for providers to keep up with new developments that would help them evaluate cases, he said. AI models could "give you ideas or resources that are relevant to the problem that you're facing clinically."

Doberstein also noted the role of AI assisting with patient documentation and communication to help alleviate provider burnout, increase patient safety and promote doctor-patient interactions.

Gokaslan added that "there's no question that these systems will find their way into medicine and surgery, and I think they're going to be extremely helpful, but I think we need to be careful in testing these effectively and using it thoughtfully."

"We're at the tip of the iceberg … these things just came out," Doberstein said. "It's going to be a process where everybody in science is going to constantly have to learn and adapt to all the new technology and changes that come about."

"That's the exciting part," Doberstein added.


Gabriella Vulakh

Gabriella is the Senior Science & Research Editor of The Brown Daily Herald. She is a junior from San Francisco studying neuroscience on the premedical track.


