AI judges unreliable in dentistry

Study finds AI judges unreliable for assessing dental advice; chatbots show promise with oversight

Introduction

A recent comparative study from Xi’an, China, evaluated the performance of leading large language models (LLMs) in providing oral health consultations and examined whether AI systems can reliably assess each other’s clinical responses. The work highlights the potential of chatbots to deliver standardised oral health information while emphasising persistent limitations in using “AI-as-a-judge” frameworks in dentistry.

What was studied and how

Researchers presented nine oral health consultation prompts, derived from FDI World Dental Federation material, covering topics such as infant oral care, pregnancy-related oral health, xerostomia in older adults, oral disease prevention and dental trauma. Six major LLMs were asked to generate responses to these scenarios. The outputs were independently evaluated and scored by two experienced dental clinicians and, separately, by three additional LLMs that the investigators used as AI judges.

Key findings

  • Performance varied substantially across models. DeepSeek-V3 and Doubao-1.8-Pro produced the strongest overall responses according to the study rubric, which assessed scientific accuracy, logical rigour, clinical practicality, terminology and completeness.
  • GPT-5, Gemini 3, Qwen3-Max and Kimi K2 also performed well overall but showed greater variability in quality.
  • Agreement between the two human clinicians was high, indicating consistent expert assessment. In contrast, consistency among the AI judges was much lower, and concordance between the AI judges and human clinicians was described as extremely poor.
  • The AI judges tended to score responses more harshly than the human experts. Nevertheless, they still failed to identify some clinically important omissions in the LLM responses, especially gaps in preventive advice and guidance for higher‑risk patient groups.
  • The authors propose that current LLM-based evaluators may overvalue language fluency and general completeness while underweighting clinical importance, risk assessment and patient‑specific cautions, reflecting reliance on textual pattern recognition rather than independent clinical reasoning.
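To make the notion of rater agreement concrete, the sketch below computes Cohen's kappa, one common chance-corrected agreement statistic for two raters. The article does not say which statistic the study authors used, so the choice of kappa is an assumption, and all scores below are invented purely for illustration; they are not data from the study.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Share of items on which the two raters gave exactly the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's score frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical score for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical 1-5 scores for nine responses (illustrative only, not study data).
clinician_1 = [5, 4, 5, 3, 4, 5, 4, 3, 5]
clinician_2 = [5, 4, 5, 3, 4, 5, 4, 4, 5]
ai_judge = [3, 4, 3, 3, 2, 4, 3, 2, 4]

human_vs_human = cohen_kappa(clinician_1, clinician_2)
human_vs_ai = cohen_kappa(clinician_1, ai_judge)
print(f"clinician vs clinician: {human_vs_human:.2f}")
print(f"clinician vs AI judge:  {human_vs_ai:.2f}")
```

A kappa near 1 indicates strong agreement, while a value near 0 means the raters agree no more often than chance would predict. For ordinal rubric scores, a weighted variant that gives partial credit for near-misses may be a better fit than the unweighted form shown here.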

Relevance for dental practice

The findings support a role for LLMs as tools to deliver standardised oral health information and to assist patient education, particularly where immediate access to dental professionals is limited. However, the study cautions clinicians and practice managers against delegating the quality assurance of clinical advice to AI systems alone. Human expert review remains necessary to detect substantive clinical omissions and to ensure patient safety.

Limitations and context

The study evaluated a limited set of consultation prompts and a specific set of LLMs, so performance may differ with other clinical scenarios or models. The investigators did not conclude that chatbots are unsafe for general oral health information, but they emphasised that AI evaluators in their current form are not reliable substitutes for human assessment. The authors recommend that future development focus on clinical reasoning, patient safety and evidence‑based decision‑making rather than on fluency alone.

SOURCE

https://www.dental-tribune.com/news/ai-dental-chatbots-still-need-human-oversight/
