1 Introduction
Dizziness and vertigo are common, high-impact presentations in primary care, emergency, and specialty clinics, yet early classification remains difficult because brief positional vertigo, migrainous features, fluctuating auditory symptoms, and acute audiovestibular events often overlap [1, 2]. Guideline frameworks (e.g., AAO-HNS for benign paroxysmal positional vertigo (BPPV); Bárány Society criteria for peripheral vestibular disorders) provide definitional clarity, but their application at scale still hinges on complete, structured history-taking and consistent clinical synthesis [3, 4]. Delays or misclassification can alter downstream testing and treatment, underscoring the need for reliable, scalable decision support grounded in patient-reported history [5, 6].
Electronic, structured questionnaires address part of this need by standardizing symptom capture and creating analyzable inputs for decision support [7, 8]. Supervised machine learning (ML) models trained on such instruments have reported encouraging performance and can be tuned for calibration and specificity; however, they require labeled training data, re-training under distribution shift, and feature maintenance across sites [9, 10]. Meanwhile, large language models (LLMs) have rapidly advanced in zero-/few-shot reasoning over clinical text and semistructured inputs, raising the prospect of flexible, promptable diagnostic support from patient-reported information [11]. Still lacking, however, are rigorous, multicenter, head-to-head evaluations that place zero-shot LLMs and task-specific supervised models on the same prospective cohort, quantify uncertainty, and map error structure at the level of individual vestibular disorders [11].
To address these gaps, we studied a seven-center prospective cohort using an electronic 23-item questionnaire to perform a five-class vestibular diagnosis task (BPPV, vestibular migraine, Meniere disease, sudden sensorineural hearing loss with vestibular dysfunction (SSNHL-V), and a composite "Others"). We evaluated three contemporary LLMs in a zero-shot setting and compared the top LLM with a LightGBM (LGBM) model trained on the same instrument. Evaluation combined overall ranking metrics (Top-k, MRR, NDCG@5) with per-disorder one-vs-rest sensitivity, specificity, and accuracy; uncertainty was quantified with patient-level bootstrap and paired bootstrap for model differences, and McNemar's test was used for accuracy on the shared external test set [12, 13]. At a high level, LLMs delivered strong ranking quality (Top-1 around two-thirds and Top-3 above ninety percent) and exhibited clinically coherent disorder-wise profiles; the best LLM achieved accuracy comparable to the trained LGBM on the external test subset, with a non-significant difference. Error anatomy highlighted reliable recognition of BPPV, persistent difficulty with vestibular migraine, a sensitivity–specificity split for Meniere disease, and high specificity with favorable sensitivity for SSNHL-V.
Our contributions are threefold. First, we provide a prospective, multicenter, head-to-head assessment of zero-shot LLMs against a task-specific supervised comparator on a standardized, five-class vestibular instrument, closing a key evidence gap for history-driven decision support [14]. Second, we pair aggregate metrics with rigorous uncertainty quantification and a clinically interpretable error analysis (including confusion structures) that identifies where prompt design or guardrails can curb systematic confusions. Third, we outline an LLM-centered workflow that is immediately practical with fixed questionnaires and, crucially, extends to conversational intake: the same models can conduct adaptive patient interviews, reconcile inconsistencies, and generate uncertainty-aware differentials for clinician review, capabilities with potential to improve diagnostic fidelity, patient acceptance, and clinic throughput when deployed with appropriate safety, calibration, and governance measures [15].
2 Methods
2.1 Cohort and Data Collection
This work uses the prospective, multicenter cohort established in seven tertiary ENT/vertigo clinics (August 2019–March 2021), namely: ENT and vertigo clinics of Eye & ENT Hospital of Fudan University; The Second Hospital of Anhui Medical University; The First Affiliated Hospital of Xiamen University; Shengjing Hospital of China Medical University; Shanghai Pudong Hospital; Shenzhen Second People’s Hospital; and The First Affiliated Hospital of Chongqing Medical University. At the first specialist visit, eligible patients completed an electronic diagnostic questionnaire on a tablet or smartphone after informed consent; for those unable to complete it independently, trained staff read the questions aloud and recorded responses. Routine clinical care and follow-up proceeded without protocol interference. Reference diagnoses were assigned by ENT specialists (> 5 years of experience) who were blinded to questionnaire responses and applied guideline-based criteria (AAO-HNS for BPPV and Bárány Society criteria for other vestibular disorders).
In total, 1,760 patients were approached and 1,693 enrolled after consent (96.2% response; 67 declined). Of the enrolled, 1,041 received a single, final diagnosis within the two-month follow-up window. For evaluation reliability, we excluded cases with multiple diagnoses (n = 14), only probable diagnoses (n = 145), undetermined diagnoses (n = 493), and an additional 16 records with contradictory entries identified during pre-analysis quality control, yielding 1,025 single-definite cases for analysis (Fig.1). For traditional machine learning, 912 cases were used for training and 113 for testing (the held-out external test set from the published cohort). LLMs required no training and were evaluated on all 1,025 cases; for head-to-head comparison with traditional models, metrics were computed on the shared test set (n = 113). Key demographics and case-mix are summarized in Tab.1: median age was 54 years (IQR 41–65), 56.5% were female, and BPPV was the most frequent single diagnosis (38.6%), followed by vestibular migraine (20.1%), Meniere disease (19.3%), and sudden sensorineural hearing loss with vestibular dysfunction (15.3%).
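To make the supervised comparator concrete, a minimal LightGBM setup for this 912/113 split might look like the sketch below; the feature encoding and hyperparameters are illustrative assumptions (random placeholder arrays stand in for the encoded questionnaire), not the published configuration.

```python
# Hedged sketch of a LightGBM multiclass pipeline on the 912-train / 113-test
# split; placeholder arrays stand in for encoded 23-item questionnaire
# responses, and hyperparameters are illustrative only.
import numpy as np
from lightgbm import LGBMClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.random((912, 23)), rng.integers(0, 5, 912)
X_test, y_test = rng.random((113, 23)), rng.integers(0, 5, 113)

model = LGBMClassifier(objective="multiclass", n_estimators=300, learning_rate=0.05)
model.fit(X_train, y_train)                        # train on the 912 cases
top1 = model.predict_proba(X_test).argmax(axis=1)  # top-1 class per held-out case
print("accuracy:", (top1 == y_test).mean())
```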
2.2 Questionnaire: design, administration, and content
The diagnostic questionnaire was built through a three-stage iterative process—focus/panel meetings (drafting disorder features), patient cognitive interviews (simplifying wording and pruning items), and an expert panel review (reordering/merging items). The final instrument comprised 23 items with branching logic and was administered electronically as above. Content covered: symptom character; attack frequency/duration and time since first onset; laterality and dynamics of hearing loss; tinnitus/aural-fullness/earache around attacks; headache features and family history; photophobia/phonophobia; unsteadiness and worsening with standing/walking; falls/consciousness/incontinence during attacks; common triggers (positional change, Valsalva/sound/pressure, visually complex scenes, foods/odors, fatigue/insomnia/anger); cervicogenic clues (upper-limb numbness/neck pain); prodromal infections; and otologic/trauma history.
For modeling and reporting, diagnostic categories followed the published work [10]: BPPV, vestibular migraine, Meniere disease, sudden sensorineural hearing loss with vestibular dysfunction (SSNHL-V), and an "Others" bin for individually rare conditions (e.g., vestibular neuritis, PPPD, bilateral vestibulopathy, psychogenic dizziness, delayed endolymphatic hydrops, vestibular paroxysmia, cervicogenic vertigo, acoustic neuroma, presbyvestibulopathy, light cupula, Ramsay–Hunt syndrome, labyrinthine fistula, and superior canal dehiscence).
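To illustrate the zero-shot setup, the sketch below shows one plausible way to serialize completed questionnaire items into a prompt requesting a ranked five-class differential; the item names, answers, and wording are hypothetical, not the study's actual prompt.

```python
# Hypothetical serialization of questionnaire responses for zero-shot ranking;
# item names and answers below are illustrative, not the real 23-item content.
CLASSES = ["BPPV", "Vestibular migraine", "Meniere disease", "SSNHL-V", "Others"]

def build_prompt(responses: dict) -> str:
    """Render item:answer pairs and ask the model for a ranked differential."""
    lines = "\n".join(f"- {item}: {answer}" for item, answer in responses.items())
    return (
        "A patient completed a structured vestibular questionnaire:\n"
        f"{lines}\n"
        "Rank the following diagnoses from most to least likely: "
        + ", ".join(CLASSES)
    )

example = {
    "Attack duration": "less than 1 minute",
    "Trigger": "turning over in bed",
    "Hearing loss": "none",
}
print(build_prompt(example))  # the rendered prompt is then sent to each LLM
```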
2.3 Evaluation metrics and uncertainty quantification
All point estimates are computed at the patient level. Unless stated otherwise, large language model (LLM) ranking metrics are calculated on the full analysis set (n = 1,025), whereas the head-to-head comparison between the best LLM and the LightGBM (LGBM) baseline uses the shared external test set (n = 113).
Classification metrics. Sensitivity and specificity are computed for each diagnosis in a one-vs-rest manner and then averaged with equal weight across the five classes (macro average). Overall accuracy is the proportion of correct top-1 predictions.
Ranking metrics. Top-k accuracy counts a case as correct if the reference label appears within the model's top k. Mean reciprocal rank (MRR) averages the inverse of the position of the correct label. NDCG@5 rewards a correct label near the top of the list and is normalized to 1 for an ideal ranking.
Uncertainty and paired testing. All error bars shown in figures are 95% confidence intervals from 1,000 patient-level bootstrap samples. For paired model comparisons on the same patients, we use a paired bootstrap to form CIs for metric differences; for accuracy, we additionally report McNemar's test. Unless noted, p-values are two-sided and no multiple-comparison adjustment is applied.
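As a concrete reference, the sketch below implements these definitions; it is a minimal illustration with hypothetical function names and input shapes, not the study's analysis code.

```python
# Minimal sketch of the metrics above. `rankings` holds, per patient, the five
# diagnoses ordered from most to least likely; `labels` the reference labels.
import numpy as np

def topk_accuracy(rankings, labels, k):
    """Fraction of patients whose reference label is within the top k."""
    return np.mean([y in r[:k] for r, y in zip(rankings, labels)])

def mrr(rankings, labels):
    """Mean reciprocal rank: average of 1 / (1-based rank of the true label)."""
    return np.mean([1.0 / (r.index(y) + 1) for r, y in zip(rankings, labels)])

def ndcg_at_5(rankings, labels):
    """With a single relevant label the ideal DCG is 1, so NDCG@5 reduces to
    1 / log2(rank + 1) when the label is ranked in the top 5, else 0."""
    gains = [
        1.0 / np.log2(r.index(y) + 2) if y in r[:5] else 0.0
        for r, y in zip(rankings, labels)
    ]
    return np.mean(gains)

def macro_sens_spec(y_true, y_pred, classes):
    """One-vs-rest sensitivity/specificity, averaged equally over classes."""
    sens, spec = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        tn = sum(t != c and p != c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        sens.append(tp / (tp + fn))
        spec.append(tn / (tn + fp))
    return np.mean(sens), np.mean(spec)

def bootstrap_ci(metric, rankings, labels, n_boot=1000, seed=0):
    """95% CI from 1,000 patient-level bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    stats = [
        metric([rankings[i] for i in idx], [labels[i] for i in idx])
        for idx in rng.integers(0, n, (n_boot, n))
    ]
    return np.percentile(stats, [2.5, 97.5])
```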
3 Results
3.1 Large Language Model Performance Evaluation
Across the five-class diagnostic ranking task, all three LLMs substantially outperformed a prevalence-based prior baseline (Tab.2; Fig.2). Against the baseline (Top-1 38.6%, MRR 0.603, NDCG@5 0.701), DeepSeek-R1 improved Top-1 by +15.4 percentage points (54.0%), DeepSeek-V3 by +26.6 points (65.2%), and Doubao-1.6-thinking by +27.0 points (65.6%). Gains were consistent when allowing more candidates: Top-3 reached 88.5% (R1), 91.7% (V3), and 94.0% (Doubao). These absolute improvements translate into stronger ranking quality, with mean reciprocal rank (MRR) of 0.719, 0.789, and 0.795, respectively, and NDCG@5 of 0.790, 0.842, and 0.846.
Model ordering was stable across metrics: DeepSeek-R1 formed the lower tier, while DeepSeek-V3 and Doubao-1.6-thinking clustered at the top with very similar point estimates. The margin between the two leaders is small: Doubao is marginally higher on Top-1 (65.6% vs. 65.2% for V3) and on the ranking metrics (MRR 0.795 vs. 0.789; NDCG@5 0.846 vs. 0.842 in Tab.2). Uncertainty estimates (95% CIs) were obtained by bootstrapping over patients and are narrow enough to support the above ordering. Pairwise MRR tests confirm that R1 is significantly below both V3 and Doubao (p = 8.4×10⁻¹⁴ and 3.2×10⁻¹⁵), whereas V3 and Doubao are statistically indistinguishable on MRR (p = 0.41). Taken together, these results suggest that, on this structured clinical questionnaire, current frontier LLMs deliver robust ranking performance: a two-thirds Top-1 hit rate without task-specific training and Top-3 coverage exceeding 90%, which could meaningfully reduce the downstream diagnostic search space for clinicians.
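For reference, one plausible implementation of the pairwise MRR test is a paired patient-level bootstrap with a normal approximation for the two-sided p-value; per-patient reciprocal ranks are the assumed inputs, and this sketch is not necessarily the study's exact procedure.

```python
# Paired bootstrap test for a difference in MRR between two models evaluated
# on the same patients; rr_a and rr_b are per-patient reciprocal ranks.
import numpy as np
from scipy.stats import norm

def paired_mrr_test(rr_a, rr_b, n_boot=1000, seed=0):
    rr_a, rr_b = np.asarray(rr_a), np.asarray(rr_b)
    rng = np.random.default_rng(seed)
    n = len(rr_a)
    idx = rng.integers(0, n, (n_boot, n))      # shared resamples keep the pairing
    diffs = rr_a[idx].mean(axis=1) - rr_b[idx].mean(axis=1)
    z = (rr_a.mean() - rr_b.mean()) / diffs.std(ddof=1)
    return 2 * norm.sf(abs(z))                 # two-sided p from the bootstrap SE
```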
3.2 Diagnostic Error Pattern Analysis
Class-wise performance shows distinct difficulty profiles across disorders (Fig.3). BPPV is reliably recognized (sensitivities: R1 86.6%, V3 83.6%, Doubao 82.6%), but R1's higher sensitivity is accompanied by substantially lower specificity (64.4%) compared with V3 (82.2%) and Doubao (84.7%), both significantly higher; Doubao's specificity also slightly exceeds V3's. Vestibular migraine (VM) remains the most challenging named disorder: V3 improves sensitivity over R1 (52.9% vs. 43.2%; significant), while Doubao yields the best specificity (89.1%), significantly above both R1 and V3, at a sensitivity comparable to V3 (49.0%; not significant). For Meniere disease, V3 is sensitivity-oriented (71.2%), significantly exceeding both R1 and Doubao, whereas Doubao is specificity-oriented (94.8%), significantly higher than V3 and R1, with intermediate sensitivity (60.6%). In SSNHL-V, the main separation is sensitivity: Doubao (60.5%) > V3 (51.0%; significant) ≫ R1 (24.8%; highly significant), while all three maintain similarly high specificity (around 98%; no significant differences). Finally, in the heterogeneous Others category, Doubao markedly increases sensitivity (42.6% vs. 11.8% for R1 and 10.3% for V3; both highly significant) at the cost of lower specificity (88.7% vs. around 97%–98%; both highly significant). Overall, these patterns indicate sensitivity–specificity trade-offs: V3 favors sensitivity in Meniere disease, Doubao favors specificity in VM and sensitivity in SSNHL-V/Others, and R1 tends to over-call BPPV.
The confusion matrices and diagnostic-flow visualization (Fig.4 and Fig.5) reveal three recurring error channels. First, a prominent VM↔BPPV axis: a large share of VM is labeled as BPPV (V3 27.7%; Doubao 25.7%), with a smaller but clear BPPV→VM spillover (V3 10.4%). Second, SSNHL-V→Meniere disease is frequent (V3 19.7%, R1 23.6%), and is reduced by Doubao (9.6%), indicating that acute audiovestibular cues are sometimes attributed to endolymphatic pathology when questionnaire signals are ambiguous. Third, the composite Others category disperses broadly into common peripheral disorders; Doubao’s higher sensitivity increases on-diagonal hits (42.6%) but also raises false positives, consistent with its lower specificity.
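The row-normalized percentages quoted above correspond to a confusion matrix normalized over true labels; a small sketch of that computation follows (toy label lists stand in for the study's per-patient data).

```python
# Row-normalized confusion matrix: cm[i, j] is the fraction of true class i
# predicted as class j, so off-diagonal cells quantify error channels such as
# VM -> BPPV. The label lists below are toy placeholders.
from sklearn.metrics import confusion_matrix

CLASSES = ["BPPV", "VM", "Meniere", "SSNHL-V", "Others"]
y_true = ["VM", "VM", "BPPV", "SSNHL-V", "Meniere", "Others"]
y_pred = ["BPPV", "VM", "BPPV", "Meniere", "Meniere", "BPPV"]

cm = confusion_matrix(y_true, y_pred, labels=CLASSES, normalize="true")
print(cm[1, 0])  # share of true VM labeled BPPV (the VM -> BPPV channel)
```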
Taken together, the per-class metrics and confusion structure are clinically plausible: BPPV is a high-signal target detected well by all models (with R1 over-prediction), VM remains difficult due to heterogeneous symptom constellations, Meniere disease exhibits a clear sensitivity–specificity split (V3 for case-finding, Doubao for rule-in), and SSNHL-V benefits from models that better leverage acute auditory cues (Doubao > V3 ≫ R1 in sensitivity with uniformly high specificity). These insights indicate where LLMs already narrow the diagnostic search space and where domain-aware prompting could further curb systematic errors.
3.3 Comparison with Traditional Machine Learning Approaches
3.3.1 Head-to-head performance on the held-out test set
On the shared external test set (n = 113), the traditional gradient-boosted trees model (LGBM; trained on 912 cases) achieved slightly higher point estimates than the zero-shot LLM (DeepSeek-V3) on all three scalar metrics (Fig.6, left panel). Sensitivity was 0.722 for LGBM versus 0.632 for V3; specificity was high for both (0.941 vs. 0.926); and overall accuracy was 0.770 for LGBM versus 0.742 for V3. Error bars (patient-level bootstrap 95% CIs) overlap broadly, and a paired McNemar test for accuracy yielded p = 0.690 (annotated in the figure), indicating no statistically significant difference. The win–loss breakdown further illustrates the small margin (Fig.6, right panel): among 113 patients, the models agreed on 78% of cases (both correct 65%, both wrong 13%); in the remaining 22%, V3 outperformed LGBM in 10% (11/113) and underperformed in 12% (14/113), corresponding to a ∆Accuracy (V3 − LGBM) of −2.7 percentage points.
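The annotated p-value can be sanity-checked from the discordant counts in the win–loss breakdown alone; the snippet below is such a check under that reading, not the study's code.

```python
# Exact (binomial) McNemar test from the discordant pairs reported above:
# V3 correct / LGBM wrong in 11 cases, LGBM correct / V3 wrong in 14.
from scipy.stats import binomtest

b, c = 11, 14
p = binomtest(b, b + c, 0.5).pvalue   # two-sided exact test on discordant pairs
print(f"McNemar exact p = {p:.3f}")   # ~0.690, matching the annotated value
# Accuracy difference: (11 - 14) / 113 = -2.7 percentage points (V3 - LGBM).
```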
3.3.2 Disorder-specific trade-offs and diagnostic weight
Class-wise estimates (Tab.3) help explain the near-tie overall. LGBM shows consistently higher specificity for common disorders (e.g., BPPV 0.94 vs. 0.82; Meniere disease 0.97 vs. 0.94) while maintaining comparable sensitivity, yielding larger positive likelihood ratios for rule-in decisions (BPPV +LR 13.86; Meniere disease +LR 21.19). In contrast, the LLM displays standout performance for SSNHL-V: specificity reached 1.00 with high sensitivity (0.89), giving an infinite +LR and a small −LR (0.11), properties desirable for flagging this time-sensitive condition. For vestibular migraine, both methods have similar specificity (0.88) with modest sensitivities (0.48–0.57), consistent with the heterogeneous symptom profiles observed earlier. Net accuracy therefore favors LGBM in higher-prevalence categories (BPPV, Meniere disease, Others) but favors V3 in SSNHL-V; after prevalence weighting on the test set, these effects largely cancel, producing the small, non-significant aggregate gap.
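For reference, the likelihood ratios in Tab.3 follow the standard definitions, +LR = sensitivity / (1 − specificity) and −LR = (1 − sensitivity) / specificity; a minimal check using the SSNHL-V figures quoted above:

```python
# Standard rule-in (+LR) and rule-out (-LR) likelihood ratios.
def likelihood_ratios(sens: float, spec: float) -> tuple[float, float]:
    pos_lr = sens / (1 - spec) if spec < 1 else float("inf")
    neg_lr = (1 - sens) / spec
    return pos_lr, neg_lr

# SSNHL-V under the LLM (Tab.3): sensitivity 0.89, specificity 1.00
print(likelihood_ratios(0.89, 1.00))  # -> (inf, 0.11): infinite +LR, small -LR
```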
3.3.3 Representative case comparisons
The case vignettes in Tab.4 illustrate complementary error tendencies. Case 1 (BPPV) shows the LLM correctly prioritizing classic positional triggers despite distracting contextual details, whereas LGBM favored vestibular migraine—suggesting the LLM’s strength in leveraging long-range semantic cues. Case 2 (BPPV) shows the converse: LGBM correctly identifies BPPV while the LLM overweights fluctuating auditory complaints and ranks Meniere disease first, reflecting its sensitivity to otologic descriptors when positional information is present but not dominant. Case 3 (labeled Meniere disease) highlights a failure mode shared by both models when brief, positional vertigo co-occurs with chronic unilateral auditory symptoms; neither model resolved the mixed signal reliably.
The broader message is that a general, zero-shot solution (the LLM) has reached parity-adjacent performance with a specialized solution (an LGBM trained on domain cases). Given the LLM's advantages in deployment friction (no training), interactive explainability, and rapid adaptation through prompting, these results argue for a combined model–assistant paradigm in which the LLM front-ends patient interaction and hypothesis generation, while a lightweight supervised model provides calibration and high-specificity checks for common peripheral disorders.
4 Conclusion
In this seven-center prospective evaluation of a five-class vestibular diagnosis task, contemporary large language models (LLMs) used in a zero-shot manner on a structured 23-item questionnaire achieved competitive, and practically useful, performance. LLMs consistently surpassed a prevalence baseline and delivered strong ranking quality (Top-1 ~65%, Top-3 >90%, MRR/NDCG@5 in the 0.79–0.85 range), narrowing the diagnostic search space without task-specific training. Against a purpose-trained LightGBM comparator included solely as a reference, the best LLM showed parity-adjacent accuracy on an external test set (0.742 vs. 0.770; McNemar p = 0.690), underscoring that modern, general-purpose LLMs can match specialized classifiers while offering advantages in promptability, deployment simplicity, and rationale generation.
Disorder-wise patterns were clinically coherent and actionable. LLMs reliably recognized high-signal BPPV, demonstrated uniformly high specificity for SSNHL-V while maintaining the most favorable sensitivity profile among models, and revealed a sensitivity–specificity split for Meniere disease (case-finding versus rule-in emphasis). Vestibular migraine remained the most challenging entity, with a prominent VM↔BPPV confusion axis; the heterogeneous "Others" category highlighted a sensitivity gain at a manageable specificity trade-off. These findings indicate that an LLM-centered workflow is already viable: LLMs can front-end history-based differential generation with transparent reasoning, with a lightweight supervised checker optionally layered for calibration or rule-in specificity where clinically warranted.
Although our evaluation standardized inputs via a fixed questionnaire, LLMs are not constrained to forms. The same models can conduct adaptive, conversational intake—asking clarifying follow-ups, probing timing and triggers, reconciling inconsistencies, and summarizing the differential with uncertainty-aware guidance. Such interactive acquisition is poised to further improve diagnostic fidelity and, moreover, to enhance patient acceptance and clinic throughput through natural, “human-like” dialogue. Real-world deployment should pair this capability with guardrails (calibration, abstention/escalation rules, and auditing) and prospective monitoring, but the central message stands: LLMs are ready to serve as the primary engine for history-driven vestibular diagnosis, with traditional models retained as comparators to sharpen specificity when needed.