Simulation study of frontier AI models analyzing 10,000 synthetic Multiple Sclerosis cases reveals variable diagnostic safety
This simulation study used 10,000 synthetic multiple sclerosis (MS) cases, expanded from an initial set of 1,000. Four frontier artificial intelligence models (Gemini 3 Pro, Gemini 3 Flash, GPT 5.2, and GPT 5 mini) were instructed to analyze the cases. Model outputs were compared against ground-truth labels and blinded subspecialty experts to assess decision-making on anatomical localization, differential diagnoses, investigations, and management plans. All cases were synthetic; no real patients were involved.
Subspecialist expert review confirmed synthetic case realism (100%), and automated evaluation accuracy was 99.8% (95% CI 95.5 to 100), yet specific clinical recommendations varied widely between models. Clinically appropriate steroid recommendations ranged from 7.2% (95% CI 5.6 to 8.8) for Gemini 3 Flash to 23.5% (95% CI 20.8 to 26.1) for GPT 5 mini. Intravenous thrombolysis, a treatment for acute ischemic stroke rather than MS, was recommended in fewer than 1% of cases by the Gemini models but in 9.6% by GPT 5.2. Notably, thrombolysis was recommended in 10.1% of cases lacking symptom timing information and in 2.9% of cases where symptoms were documented as more than 14 days old. MS appeared in the differential diagnosis in more than 91% of cases.
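The confidence intervals quoted above can be sanity-checked with a short calculation. The following is a minimal sketch, assuming a normal-approximation (Wald) interval and a hypothetical denominator of 1,000 cases per estimate; neither the interval method nor the per-estimate sample size is stated in this summary, and the function name `wald_ci` is illustrative only.

```python
import math
from typing import Tuple


def wald_ci(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation (Wald) confidence interval for a proportion.

    p_hat: observed proportion in [0, 1]
    n:     number of cases behind the estimate (ASSUMED here, not reported)
    z:     critical value; 1.96 corresponds to a two-sided 95% interval
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must lie in [0, 1]")
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    # Clamp to [0, 1] because a proportion cannot leave that range.
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


# Assumption: 1,000 cases per estimate (hypothetical denominator).
low, high = wald_ci(0.072, 1000)
print(f"{100 * low:.1f} to {100 * high:.1f}")  # prints "5.6 to 8.8"
```

Under these assumptions the 7.2% steroid figure reproduces the quoted interval of 5.6 to 8.8. That agreement is suggestive only and does not confirm how the authors computed their intervals. For proportions near 0% or 100%, such as the 99.8% accuracy figure, the Wald interval behaves poorly and a Wilson or exact interval is usually preferred.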
The authors note that evaluations of such models largely rely on small collections of cases, which limits how far their findings generalize. Safety data, including adverse events and tolerability, were not reported. The study does not report a causal link between model architecture and the observed errors. Consequently, clinical outcomes should not be inferred from synthetic case simulations, nor should patient safety be assumed from automated evaluation accuracy alone.
The authors conclude that massive-scale simulation and automated interrogation should become standard practice for uncovering serious failures and implementing safety guardrails before clinical deployment exposes patients to risk. For practice, this underscores the need for rigorous testing environments before real-world application.
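To illustrate what automated interrogation of model outputs might look like, here is a minimal sketch that tallies thrombolysis recommendations within two strata: cases with no documented symptom timing and cases with symptoms older than a threshold. The record layout (`CaseResult`), the function `audit_thrombolysis`, and the toy inputs are all hypothetical; this is not the authors' pipeline, and the 14-day default simply mirrors the stratum quoted in this summary rather than any clinical rule.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CaseResult:
    """One model output for one synthetic case (hypothetical schema)."""
    case_id: str
    recommends_thrombolysis: bool
    symptom_age_days: Optional[float]  # None when timing is not documented


def audit_thrombolysis(
    results: List[CaseResult], stale_after_days: float = 14.0
) -> Dict[str, float]:
    """Return the thrombolysis recommendation rate within each flagged stratum."""
    # Each stratum stores [recommendations, total cases in stratum].
    strata = {"timing_missing": [0, 0], "older_than_threshold": [0, 0]}
    for r in results:
        if r.symptom_age_days is None:
            key = "timing_missing"
        elif r.symptom_age_days > stale_after_days:
            key = "older_than_threshold"
        else:
            continue  # timing documented and recent: outside this audit
        strata[key][1] += 1
        if r.recommends_thrombolysis:
            strata[key][0] += 1
    # An empty stratum reports 0.0 rather than dividing by zero.
    return {k: (hits / total if total else 0.0) for k, (hits, total) in strata.items()}


# Made-up toy inputs for illustration only; not data from the study.
demo = [
    CaseResult("a", True, None),
    CaseResult("b", False, None),
    CaseResult("c", True, 30.0),
    CaseResult("d", False, 2.0),
]
rates = audit_thrombolysis(demo)
print(rates)  # {'timing_missing': 0.5, 'older_than_threshold': 1.0}
```

Run over thousands of simulated cases, a tally of this kind surfaces failure rates that a small hand-reviewed sample could miss, which is the motivation the authors give for scale.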