AUTOMATING FRACTURE DETECTION: BENCHMARKING LANGUAGE MODELS AGAINST SPECIALIZED AI IN PLAIN RADIOGRAPHS

Biavardi, Nicolò Giuseppe; Placella, G.; Alessio-Mazzola, Mattia; Conca, Marco; Salini, V.

Comparative Study

N.G. Biavardi¹, G. Placella¹, M. Alessio-Mazzola², M. Conca², V. Salini¹

¹ Vita-Salute University, IRCCS San Raffaele Hospital, Milan, Italy
² IRCCS San Raffaele Hospital, Milan, Italy

Correspondence to:

Mattia Alessio-Mazzola, MD
IRCCS Ospedale San Raffaele,
Unità Clinica di Ortopedia e Traumatologia
Via Olgettina 60,
20132, Milano, Italy

Journal of Orthopedics 2024 September-December; 16(3): 118-125
https://doi.org/10.69149/orthopedics/2024v16iss3_3

Received: 13 August 2024 Accepted: 18 September 2024

This publication and/or article is for individual use only and may not be further reproduced without written permission from the copyright holder. Unauthorized reproduction may result in financial and other penalties. Disclosure: All authors report no conflicts of interest relevant to this article.

Abstract

This study compares the diagnostic performance of ChatGPT, an emerging natural language AI model, with that of Qure.ai, an established reference AI model, in classifying fractures on plain radiographs. This retrospective cross-sectional diagnostic accuracy study was conducted in the Orthopedic Department of IRCCS San Raffaele Hospital, Milan. A sample of 200 de-identified anteroposterior and lateral femur radiographs, equally divided between fractured and normal, was used. The two AI models independently classified each radiograph as fractured or normal, and their outputs were compared against the radiologist reports, which served as the reference standard. Qure.ai showed higher sensitivity (0.89 vs 0.73, p<0.01) and overall accuracy (0.92 vs 0.84) than ChatGPT. Both models demonstrated high specificity (>0.90), with the reference AI achieving closer-to-ideal diagnostic discrimination (AUC 0.92 vs 0.84). Accuracy decreased as fracture complexity increased, and inter-model concordance was strong. Both AI models performed above established clinical benchmarks, with the reference AI model slightly outperforming ChatGPT. The study's methodological framework offers insights for the clinical application of AI in radiographic fracture diagnosis. Further studies, particularly larger multi-center trials, are recommended to validate these findings.

Keywords: AI fracture detection, artificial intelligence, ChatGPT, femur fracture, LLM
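
The relationship between the figures in the abstract can be illustrated with a minimal Python sketch. The function name, the class name, and the confusion-matrix counts below are assumptions made for illustration only: the abstract reports rounded metrics, not raw counts, so the counts were back-calculated to be consistent with a 100/100 fractured/normal split and are not data from the study. The sketch also computes balanced accuracy, (sensitivity + specificity) / 2, which for a classifier with a single operating point equals the area under its one-point ROC curve. On a balanced sample it also coincides with overall accuracy, which is one possible reading of why the abstract's accuracy and AUC values match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts of a binary classifier against a reference standard."""
    tp: int  # fractured, classified as fractured
    fn: int  # fractured, classified as normal
    tn: int  # normal, classified as normal
    fp: int  # normal, classified as fractured


def diagnostic_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Return sensitivity, specificity, accuracy and balanced accuracy."""
    sensitivity = c.tp / (c.tp + c.fn)
    specificity = c.tn / (c.tn + c.fp)
    accuracy = (c.tp + c.tn) / (c.tp + c.fn + c.tn + c.fp)
    # For a single operating point, this equals the area under the
    # ROC curve through (0, 0), (1 - specificity, sensitivity), (1, 1).
    balanced_accuracy = (sensitivity + specificity) / 2
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
    }


# ASSUMPTION: illustrative counts back-calculated from the rounded values
# in the abstract (100 fractured, 100 normal); not reported by the authors.
reference_ai = ConfusionCounts(tp=89, fn=11, tn=95, fp=5)
language_model = ConfusionCounts(tp=73, fn=27, tn=95, fp=5)

for name, counts in [("reference AI", reference_ai),
                     ("language model", language_model)]:
    metrics = diagnostic_metrics(counts)
    summary = ", ".join(f"{k}={v:.2f}" for k, v in metrics.items())
    print(f"{name}: {summary}")
```

Under these assumed counts, both classifiers would have the same specificity and differ only in sensitivity, which is consistent with the abstract's statement that both models exceeded 0.90 specificity; the true per-model counts may differ.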
