{"id":21888,"date":"2026-06-28T10:00:00","date_gmt":"2026-06-28T07:00:00","guid":{"rendered":"https:\/\/otexe.com\/?p=21888"},"modified":"2026-06-01T00:11:26","modified_gmt":"2026-05-31T21:11:26","slug":"ai-judges-unreliable-in-dentistry","status":"publish","type":"post","link":"https:\/\/otexe.com\/en\/ai-judges-unreliable-in-dentistry\/","title":{"rendered":"AI judges unreliable in dentistry"},"content":{"rendered":"<h1>Study finds AI judges unreliable for assessing dental advice; chatbots show promise with oversight<\/h1>\n<h2>Introduction<\/h2>\n<p>A recent comparative study from Xi\u2019an, China, evaluated the performance of leading large language models (LLMs) in providing oral health consultations and examined whether AI systems can reliably assess each other\u2019s clinical responses. The work highlights the potential of chatbots to deliver standardised oral health information while emphasising persistent limitations in using \u201cAI-as-a-judge\u201d frameworks in dentistry.<\/p>\n<h2>What was studied and how<\/h2>\n<p>Researchers presented nine oral health consultation prompts, derived from FDI World Dental Federation material, covering topics such as infant oral care, pregnancy-related oral health, xerostomia in older adults, oral disease prevention and dental trauma. Six major LLMs were asked to generate responses to these scenarios. The outputs were independently evaluated and scored by two experienced dental clinicians and, separately, by three additional LLMs that the investigators used as AI judges.<\/p>\n<h2>Key findings<\/h2>\n<ul>\n<li>Performance varied substantially across models. 
DeepSeek-V3 and Doubao-1.8-Pro produced the strongest overall responses according to the study rubric, which assessed scientific accuracy, logical rigour, clinical practicality, terminology and completeness.<\/li>\n<li>GPT-5, Gemini 3, Qwen3-Max and Kimi K2 also performed well overall but showed greater variability in quality.<\/li>\n<li>Agreement between the two human clinicians was high, indicating consistent expert assessment. In contrast, consistency among the AI judges was much lower, and concordance between the AI judges and human clinicians was described as extremely poor.<\/li>\n<li>The AI judges tended to score responses more harshly than the human experts. Nevertheless, they still failed to identify some clinically important omissions in the LLM responses, especially gaps in preventive advice and guidance for higher\u2011risk patient groups.<\/li>\n<li>The authors propose that current LLM-based evaluators may overvalue language fluency and general completeness while underweighting clinical importance, risk assessment and patient\u2011specific cautions, reflecting reliance on textual pattern recognition rather than independent clinical reasoning.<\/li>\n<\/ul>\n<h2>Relevance for dental practice<\/h2>\n<p>The findings support a role for LLMs as tools to deliver standardised oral health information and to assist patient education, particularly where immediate access to dental professionals is limited. However, the study cautions clinicians and practice managers against delegating the quality assurance of clinical advice to AI systems alone. Human expert review remains necessary to detect substantive clinical omissions and to ensure patient safety.<\/p>\n<h2>Limitations and context<\/h2>\n<p>The study evaluated a limited set of consultation prompts and a specific set of LLMs; performance may vary with different clinical scenarios or models. 
The investigators did not conclude that chatbots are unsafe for general oral health information, but they emphasised that AI evaluators in their current form are not reliable substitutes for human assessment. The authors recommend future development focus on clinical reasoning, patient safety and evidence\u2011based decision\u2011making rather than on fluency alone.<\/p>\n<h2>SOURCE<\/h2>\n<p><a href=\"https:\/\/www.dental-tribune.com\/news\/ai-dental-chatbots-still-need-human-oversight\/\">https:\/\/www.dental-tribune.com\/news\/ai-dental-chatbots-still-need-human-oversight\/<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Study finds AI judges unreliable for assessing dental advice; chatbots show promise with oversight Introduction A recent comparative study from Xi\u2019an, China, evaluated the performance of leading large language models (LLMs) in providing oral health consultations and examined whether AI systems can reliably assess each other\u2019s clinical responses. The work highlights the potential of chatbots [&hellip;]<\/p>\n","protected":false},"author":20,"featured_media":21886,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"pmpro_default_level":"","footnotes":""},"categories":[240],"tags":[],"class_list":["post-21888","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-scientific-articles","pmpro-has-access"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.1 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Ai judges unreliable in dentistry - OTEXE<\/title>\n<meta name=\"description\" content=\"A recent comparative study from Xi\u2019an, China, evaluated the performance of leading large language models (LLMs) in providing oral health consultations and examined whether AI systems can reliably assess each other\u2019s clinical responses. 
The work highlights the potential of chatbots to deliver standardised oral health information while emphasising persistent limitations in using \u201cAI-as-a-judge\u201d frameworks in dentistry.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/otexe.com\/en\/ai-judges-unreliable-in-dentistry\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Ai judges unreliable in dentistry - OTEXE\" \/>\n<meta property=\"og:description\" content=\"A recent comparative study from Xi\u2019an, China, evaluated the performance of leading large language models (LLMs) in providing oral health consultations and examined whether AI systems can reliably assess each other\u2019s clinical responses. The work highlights the potential of chatbots to deliver standardised oral health information while emphasising persistent limitations in using \u201cAI-as-a-judge\u201d frameworks in dentistry.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/otexe.com\/en\/ai-judges-unreliable-in-dentistry\/\" \/>\n<meta property=\"og:site_name\" content=\"OTEXE\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/otexeworld\/reviews\" \/>\n<meta property=\"article:published_time\" content=\"2026-06-28T07:00:00+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/otexe.com\/wp-content\/uploads\/2026\/05\/ai-judges-not-reliable-for-evaluating-dental-advice.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"1200\" \/>\n\t<meta property=\"og:image:height\" content=\"627\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"news bot\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@otexeworld\" \/>\n<meta name=\"twitter:site\" 
content=\"@otexeworld\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"news bot\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"3 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/otexe.com\/en\/ai-judges-unreliable-in-dentistry\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/otexe.com\/en\/ai-judges-unreliable-in-dentistry\/\"},\"author\":{\"name\":\"news bot\",\"@id\":\"https:\/\/otexe.com\/en\/#\/schema\/person\/9140d95fa9a582da764836aeeea66419\"},\"headline\":\"Ai judges unreliable in dentistry\",\"datePublished\":\"2026-06-28T07:00:00+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/otexe.com\/en\/ai-judges-unreliable-in-dentistry\/\"},\"wordCount\":466,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/otexe.com\/en\/#organization\"},\"image\":{\"@id\":\"https:\/\/otexe.com\/en\/ai-judges-unreliable-in-dentistry\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/otexe.com\/wp-content\/uploads\/2026\/05\/ai-judges-not-reliable-for-evaluating-dental-advice.jpg\",\"articleSection\":[\"Scientific articles\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/otexe.com\/en\/ai-judges-unreliable-in-dentistry\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/otexe.com\/en\/ai-judges-unreliable-in-dentistry\/\",\"url\":\"https:\/\/otexe.com\/en\/ai-judges-unreliable-in-dentistry\/\",\"name\":\"Ai judges unreliable in dentistry - 
OTEXE\",\"isPartOf\":{\"@id\":\"https:\/\/otexe.com\/en\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/otexe.com\/en\/ai-judges-unreliable-in-dentistry\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/otexe.com\/en\/ai-judges-unreliable-in-dentistry\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/otexe.com\/wp-content\/uploads\/2026\/05\/ai-judges-not-reliable-for-evaluating-dental-advice.jpg\",\"datePublished\":\"2026-06-28T07:00:00+00:00\",\"description\":\"A recent comparative study from Xi\u2019an, China, evaluated the performance of leading large language models (LLMs) in providing oral health consultations and examined whether AI systems can reliably assess each other\u2019s clinical responses. The work highlights the potential of chatbots to deliver standardised oral health information while emphasising persistent limitations in using \u201cAI-as-a-judge\u201d frameworks in dentistry.\",\"breadcrumb\":{\"@id\":\"https:\/\/otexe.com\/en\/ai-judges-unreliable-in-dentistry\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/otexe.com\/en\/ai-judges-unreliable-in-dentistry\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/otexe.com\/en\/ai-judges-unreliable-in-dentistry\/#primaryimage\",\"url\":\"https:\/\/otexe.com\/wp-content\/uploads\/2026\/05\/ai-judges-not-reliable-for-evaluating-dental-advice.jpg\",\"contentUrl\":\"https:\/\/otexe.com\/wp-content\/uploads\/2026\/05\/ai-judges-not-reliable-for-evaluating-dental-advice.jpg\",\"width\":1200,\"height\":627},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/otexe.com\/en\/ai-judges-unreliable-in-dentistry\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Main\",\"item\":\"https:\/\/otexe.com\/en\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Scientific 
articles\",\"item\":\"https:\/\/otexe.com\/en\/category\/scientific-articles\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"Ai judges unreliable in dentistry\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/otexe.com\/en\/#website\",\"url\":\"https:\/\/otexe.com\/en\/\",\"name\":\"OTEXE\",\"description\":\"Is a community where the best minds in dentistry are creating the future of the industry\",\"publisher\":{\"@id\":\"https:\/\/otexe.com\/en\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/otexe.com\/en\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/otexe.com\/en\/#organization\",\"name\":\"Otexe\",\"url\":\"https:\/\/otexe.com\/en\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/otexe.com\/en\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/otexe.com\/wp-content\/uploads\/2025\/07\/logo.png\",\"contentUrl\":\"https:\/\/otexe.com\/wp-content\/uploads\/2025\/07\/logo.png\",\"width\":697,\"height\":697,\"caption\":\"Otexe\"},\"image\":{\"@id\":\"https:\/\/otexe.com\/en\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/otexeworld\/reviews\",\"https:\/\/x.com\/otexeworld\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/otexe.com\/en\/#\/schema\/person\/9140d95fa9a582da764836aeeea66419\",\"name\":\"news bot\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/otexe.com\/en\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/126946fb002d82862ee4daf1c3c87cd017acab2d58cb90a684793b86e7347e3d?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/126946fb002d82862ee4daf1c3c87cd017acab2d58cb90a684793b86e7347e3d?s=96&d=mm&r=g\",\"caption\":\"news 
bot\"},\"url\":\"https:\/\/otexe.com\/en\/author\/news-bot\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Ai judges unreliable in dentistry - OTEXE","description":"A recent comparative study from Xi\u2019an, China, evaluated the performance of leading large language models (LLMs) in providing oral health consultations and examined whether AI systems can reliably assess each other\u2019s clinical responses. The work highlights the potential of chatbots to deliver standardised oral health information while emphasising persistent limitations in using \u201cAI-as-a-judge\u201d frameworks in dentistry.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/otexe.com\/en\/ai-judges-unreliable-in-dentistry\/","og_locale":"en_US","og_type":"article","og_title":"Ai judges unreliable in dentistry - OTEXE","og_description":"A recent comparative study from Xi\u2019an, China, evaluated the performance of leading large language models (LLMs) in providing oral health consultations and examined whether AI systems can reliably assess each other\u2019s clinical responses. The work highlights the potential of chatbots to deliver standardised oral health information while emphasising persistent limitations in using \u201cAI-as-a-judge\u201d frameworks in dentistry.","og_url":"https:\/\/otexe.com\/en\/ai-judges-unreliable-in-dentistry\/","og_site_name":"OTEXE","article_publisher":"https:\/\/www.facebook.com\/otexeworld\/reviews","article_published_time":"2026-06-28T07:00:00+00:00","og_image":[{"width":1200,"height":627,"url":"https:\/\/otexe.com\/wp-content\/uploads\/2026\/05\/ai-judges-not-reliable-for-evaluating-dental-advice.jpg","type":"image\/jpeg"}],"author":"news bot","twitter_card":"summary_large_image","twitter_creator":"@otexeworld","twitter_site":"@otexeworld","twitter_misc":{"Written by":"news bot","Est. 
reading time":"3 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/otexe.com\/en\/ai-judges-unreliable-in-dentistry\/#article","isPartOf":{"@id":"https:\/\/otexe.com\/en\/ai-judges-unreliable-in-dentistry\/"},"author":{"name":"news bot","@id":"https:\/\/otexe.com\/en\/#\/schema\/person\/9140d95fa9a582da764836aeeea66419"},"headline":"Ai judges unreliable in dentistry","datePublished":"2026-06-28T07:00:00+00:00","mainEntityOfPage":{"@id":"https:\/\/otexe.com\/en\/ai-judges-unreliable-in-dentistry\/"},"wordCount":466,"commentCount":0,"publisher":{"@id":"https:\/\/otexe.com\/en\/#organization"},"image":{"@id":"https:\/\/otexe.com\/en\/ai-judges-unreliable-in-dentistry\/#primaryimage"},"thumbnailUrl":"https:\/\/otexe.com\/wp-content\/uploads\/2026\/05\/ai-judges-not-reliable-for-evaluating-dental-advice.jpg","articleSection":["Scientific articles"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/otexe.com\/en\/ai-judges-unreliable-in-dentistry\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/otexe.com\/en\/ai-judges-unreliable-in-dentistry\/","url":"https:\/\/otexe.com\/en\/ai-judges-unreliable-in-dentistry\/","name":"Ai judges unreliable in dentistry - OTEXE","isPartOf":{"@id":"https:\/\/otexe.com\/en\/#website"},"primaryImageOfPage":{"@id":"https:\/\/otexe.com\/en\/ai-judges-unreliable-in-dentistry\/#primaryimage"},"image":{"@id":"https:\/\/otexe.com\/en\/ai-judges-unreliable-in-dentistry\/#primaryimage"},"thumbnailUrl":"https:\/\/otexe.com\/wp-content\/uploads\/2026\/05\/ai-judges-not-reliable-for-evaluating-dental-advice.jpg","datePublished":"2026-06-28T07:00:00+00:00","description":"A recent comparative study from Xi\u2019an, China, evaluated the performance of leading large language models (LLMs) in providing oral health consultations and examined whether AI systems can reliably assess each other\u2019s clinical responses. 
The work highlights the potential of chatbots to deliver standardised oral health information while emphasising persistent limitations in using \u201cAI-as-a-judge\u201d frameworks in dentistry.","breadcrumb":{"@id":"https:\/\/otexe.com\/en\/ai-judges-unreliable-in-dentistry\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/otexe.com\/en\/ai-judges-unreliable-in-dentistry\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/otexe.com\/en\/ai-judges-unreliable-in-dentistry\/#primaryimage","url":"https:\/\/otexe.com\/wp-content\/uploads\/2026\/05\/ai-judges-not-reliable-for-evaluating-dental-advice.jpg","contentUrl":"https:\/\/otexe.com\/wp-content\/uploads\/2026\/05\/ai-judges-not-reliable-for-evaluating-dental-advice.jpg","width":1200,"height":627},{"@type":"BreadcrumbList","@id":"https:\/\/otexe.com\/en\/ai-judges-unreliable-in-dentistry\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Main","item":"https:\/\/otexe.com\/en\/"},{"@type":"ListItem","position":2,"name":"Scientific articles","item":"https:\/\/otexe.com\/en\/category\/scientific-articles\/"},{"@type":"ListItem","position":3,"name":"Ai judges unreliable in dentistry"}]},{"@type":"WebSite","@id":"https:\/\/otexe.com\/en\/#website","url":"https:\/\/otexe.com\/en\/","name":"OTEXE","description":"Is a community where the best minds in dentistry are creating the future of the 
industry","publisher":{"@id":"https:\/\/otexe.com\/en\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/otexe.com\/en\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/otexe.com\/en\/#organization","name":"Otexe","url":"https:\/\/otexe.com\/en\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/otexe.com\/en\/#\/schema\/logo\/image\/","url":"https:\/\/otexe.com\/wp-content\/uploads\/2025\/07\/logo.png","contentUrl":"https:\/\/otexe.com\/wp-content\/uploads\/2025\/07\/logo.png","width":697,"height":697,"caption":"Otexe"},"image":{"@id":"https:\/\/otexe.com\/en\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/otexeworld\/reviews","https:\/\/x.com\/otexeworld"]},{"@type":"Person","@id":"https:\/\/otexe.com\/en\/#\/schema\/person\/9140d95fa9a582da764836aeeea66419","name":"news bot","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/otexe.com\/en\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/126946fb002d82862ee4daf1c3c87cd017acab2d58cb90a684793b86e7347e3d?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/126946fb002d82862ee4daf1c3c87cd017acab2d58cb90a684793b86e7347e3d?s=96&d=mm&r=g","caption":"news 
bot"},"url":"https:\/\/otexe.com\/en\/author\/news-bot\/"}]}},"_links":{"self":[{"href":"https:\/\/otexe.com\/en\/wp-json\/wp\/v2\/posts\/21888","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/otexe.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/otexe.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/otexe.com\/en\/wp-json\/wp\/v2\/users\/20"}],"replies":[{"embeddable":true,"href":"https:\/\/otexe.com\/en\/wp-json\/wp\/v2\/comments?post=21888"}],"version-history":[{"count":2,"href":"https:\/\/otexe.com\/en\/wp-json\/wp\/v2\/posts\/21888\/revisions"}],"predecessor-version":[{"id":21891,"href":"https:\/\/otexe.com\/en\/wp-json\/wp\/v2\/posts\/21888\/revisions\/21891"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/otexe.com\/en\/wp-json\/wp\/v2\/media\/21886"}],"wp:attachment":[{"href":"https:\/\/otexe.com\/en\/wp-json\/wp\/v2\/media?parent=21888"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/otexe.com\/en\/wp-json\/wp\/v2\/categories?post=21888"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/otexe.com\/en\/wp-json\/wp\/v2\/tags?post=21888"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}