AI judges unreliable in dentistry

Introduction

A recent comparative study from Xi’an, China, evaluated the performance of leading large language models (LLMs) in providing oral health consultations and examined whether AI systems can reliably assess each other’s clinical responses. The work highlights the potential of chatbots to deliver standardised oral health information while emphasising persistent limitations in using “AI-as-a-judge” frameworks in dentistry.

What was studied and how

Researchers presented nine oral health consultation prompts, derived from FDI World Dental Federation material, covering topics such as infant oral care, pregnancy-related oral health, xerostomia in older adults, oral disease prevention and dental trauma. Six major LLMs were asked to generate responses to these scenarios. The outputs were independently evaluated and scored by two experienced dental clinicians and, separately, by three additional LLMs that the investigators used as AI judges.

Key findings

Performance varied substantially across models. DeepSeek-V3 and Doubao-1.8-Pro produced the strongest overall responses according to the study rubric, which assessed scientific accuracy, logical rigour, clinical practicality, terminology and completeness.

GPT-5, Gemini 3, Qwen3-Max and Kimi K2 also performed well overall but showed greater variability in quality.

Agreement between the two human clinicians was high, indicating consistent expert assessment. In contrast, consistency among the AI judges was much lower, and concordance between the AI judges and human clinicians was described as extremely poor.
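To make the idea of rater agreement concrete, the sketch below computes Cohen's kappa, one common chance-corrected agreement statistic for two raters. The article does not say which statistic the study used, so this is an illustrative assumption rather than the authors' method, and the scores are invented for the example, not taken from the study.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the two raters gave the identical score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's score frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical score throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical 1-5 scores for nine responses (NOT data from the study).
clinician_1 = [5, 4, 4, 3, 5, 2, 4, 3, 5]
clinician_2 = [5, 4, 4, 3, 5, 2, 4, 3, 4]
ai_judge = [3, 4, 2, 3, 3, 2, 5, 1, 4]

human_vs_human = cohen_kappa(clinician_1, clinician_2)
human_vs_ai = cohen_kappa(clinician_1, ai_judge)
print(f"clinician vs clinician: {human_vs_human:.2f}")
print(f"clinician vs AI judge:  {human_vs_ai:.2f}")
```

With these made-up scores the two clinicians agree far more closely with each other than either does with the AI judge, which is the qualitative pattern the study reports. For ordinal scores, a weighted kappa or an intraclass correlation coefficient would often be preferred, since they give partial credit for near misses.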

The AI judges tended to score responses more harshly than the human experts, yet they failed to identify some clinically important omissions in the LLM responses, especially gaps in preventive advice and in guidance for higher‑risk patient groups.

The authors propose that current LLM-based evaluators may overvalue language fluency and general completeness while underweighting clinical importance, risk assessment and patient‑specific cautions, reflecting reliance on textual pattern recognition rather than independent clinical reasoning.

Relevance for dental practice

The findings support a role for LLMs as tools to deliver standardised oral health information and to assist patient education, particularly where immediate access to dental professionals is limited. However, the study cautions clinicians and practice managers against delegating the quality assurance of clinical advice to AI systems alone. Human expert review remains necessary to detect substantive clinical omissions and to ensure patient safety.

Limitations and context

The study evaluated a limited set of consultation prompts and a specific set of LLMs; performance may vary with different clinical scenarios or models. The investigators did not conclude that chatbots are unsafe for general oral health information, but they emphasised that AI evaluators in their current form are not reliable substitutes for human assessment. The authors recommend that future development focus on clinical reasoning, patient safety and evidence‑based decision‑making rather than on fluency alone.
