Before you ask AI about symptoms, read this Oxford warning
The largest user study to date on large language models (LLMs) assisting the general public with medical decisions has found that these systems pose significant risks due to their tendency to provide inaccurate and inconsistent information.
A new study conducted by the Oxford Internet Institute and the Nuffield Department of Primary Care Health Sciences at the University of Oxford, in partnership with MLCommons and other institutions, highlights a substantial gap between the promise of LLMs and their real-world usefulness for people seeking medical advice.
While LLMs now perform impressively on standardised medical knowledge tests, the study found that they struggle to support individuals reasoning about their own symptoms in real-life scenarios.
Key Findings
No better than traditional methods
Participants were asked to identify potential health conditions and determine appropriate next steps — such as visiting a GP or going to the hospital — based on detailed medical scenarios developed by doctors. Those using LLMs did not make better decisions than participants who relied on traditional sources such as online searches or their own judgment.
Communication breakdown
The study revealed a two-way communication gap. Participants often did not know what specific information the LLMs required to provide accurate advice. Meanwhile, the responses generated frequently mixed accurate recommendations with misleading or incorrect guidance, making it difficult for users to determine the safest course of action.
Existing tests fall short
Researchers found that current LLM evaluation methods fail to reflect the complexity of real-world human interaction. The study argues that, similar to clinical trials for new medications, AI systems intended for healthcare use should undergo rigorous real-world testing before deployment.
“These findings highlight the difficulty of building AI systems that can genuinely support people in sensitive, high-stakes areas like health,” said Dr Rebecca Payne, GP and lead medical practitioner on the study, Clarendon-Reuben Doctoral Scholar at the Nuffield Department of Primary Care Health Sciences, and Clinical Senior Lecturer at Bangor University.
“Despite all the hype, AI just isn’t ready to take on the role of the physician. Patients need to understand that asking a large language model about their symptoms can be dangerous, potentially leading to incorrect diagnoses and failure to recognise when urgent medical attention is required.”
Real users, real challenges
The researchers conducted a randomised controlled trial involving nearly 1,300 online participants. Individuals were presented with realistic medical scenarios developed by doctors — ranging from a young man experiencing a severe headache after a night out, to a new mother suffering from persistent breathlessness and exhaustion.
One group used an LLM to assist with decision-making, while a control group relied on traditional information sources. Researchers then assessed how accurately participants identified likely conditions and selected appropriate next steps, such as visiting a GP or attending A&E.
The team also compared these real-world interaction outcomes with standard LLM benchmark testing results. The contrast was striking: models that performed well on standardised tests faltered when interacting with real users.
The study identified three core challenges:
Users often did not know what information to provide to the LLM.
LLMs produced significantly different answers based on small variations in how questions were phrased.
Responses frequently contained a mixture of accurate and inaccurate information, which users struggled to distinguish.
“Designing robust testing for large language models is essential to understanding how we can safely use this technology,” said lead author Andrew Bean, a doctoral researcher at the Oxford Internet Institute. “Our findings show that interaction with humans presents a significant challenge even for leading LLMs. We hope this work contributes to the development of safer and more effective AI systems.”
Senior author Dr Adam Mahdi, Associate Professor at the Reasoning with Machines Lab (OxRML) at the Oxford Internet Institute, added: “The disconnect between benchmark scores and real-world performance should serve as a wake-up call for AI developers and regulators. Many current evaluations fail to measure what they claim to assess, and this study demonstrates why that matters. We cannot rely solely on standardised tests to determine whether these systems are safe for public use. Just as new medicines require clinical trials, AI systems must undergo rigorous testing with diverse, real users to fully understand their capabilities in high-stakes environments like healthcare.”