TimeJudge: empowering video-LLMs as zero-shot judges for temporal consistency in video captions
Yangliu HU , Zikai SONG , Junqing YU , Yiping Phoebe CHEN , Wei YANG
Eng Inform Technol Electron Eng, 2025, Vol. 26, Issue (11): 2204-2214.
Video large language models (video-LLMs) have demonstrated impressive capabilities in multimodal understanding, but their potential as zero-shot evaluators of temporal consistency in video captions remains underexplored. Existing methods perform poorly at detecting critical temporal errors, such as missing, hallucinated, or misordered actions. To address this gap, we make two key contributions. (1) TimeJudge: a novel zero-shot framework that recasts temporal error detection as answering calibrated binary question pairs. It incorporates modality-sensitive confidence calibration and uses consistency-weighted voting for robust prediction aggregation. (2) TEDBench: a rigorously constructed benchmark featuring videos at four distinct complexity levels, with fine-grained temporal error annotations designed to evaluate video-LLM performance on this task. Through a comprehensive evaluation of multiple state-of-the-art video-LLMs on TEDBench, we demonstrate that TimeJudge consistently yields substantial gains in recall and F1-score without requiring any task-specific fine-tuning. Our approach provides a generalizable, scalable, and training-free solution for enhancing the temporal error detection capabilities of video-LLMs.
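To make the idea of aggregating binary question pairs with consistency-weighted voting more concrete, the following is a minimal illustrative sketch. It is not the authors' implementation: the abstract does not give the formulas, so the class `PairAnswer`, the functions `pair_score`, `pair_consistency`, and `weighted_vote`, the specific weighting rule, and the 0.5 threshold are all assumptions made here for illustration. The sketch assumes each pair consists of a direct question and its negation, and that a model returns a probability of answering "yes" to each.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PairAnswer:
    """Hypothetical container for one binary question pair.

    p_yes_direct:  P("yes") for a direct question, e.g.
                   "Does the caption contain a temporal error?"
    p_yes_negated: P("yes") for the negated form of the same question.
    """
    p_yes_direct: float
    p_yes_negated: float


def pair_score(a: PairAnswer) -> float:
    # Assumed rule: average the direct "yes" probability with the
    # complement of the negated "yes" probability.
    return 0.5 * (a.p_yes_direct + (1.0 - a.p_yes_negated))


def pair_consistency(a: PairAnswer) -> float:
    # Assumed rule: a perfectly consistent pair has probabilities that
    # sum to 1; the weight shrinks linearly as the pair contradicts itself.
    return 1.0 - abs(a.p_yes_direct + a.p_yes_negated - 1.0)


def weighted_vote(answers: List[PairAnswer],
                  threshold: float = 0.5) -> Tuple[bool, float]:
    """Return (error_detected, aggregated_score) over all question pairs."""
    weights = [pair_consistency(a) for a in answers]
    total = sum(weights)
    if total == 0.0:
        # Every pair is fully self-contradictory: abstain with a neutral score.
        return False, 0.5
    score = sum(w * pair_score(a) for w, a in zip(weights, answers)) / total
    return score > threshold, score


if __name__ == "__main__":
    answers = [
        PairAnswer(0.9, 0.1),  # consistent, strongly indicates an error
        PairAnswer(0.8, 0.8),  # self-contradictory, so down-weighted
        PairAnswer(0.2, 0.7),  # mostly consistent, indicates no error
    ]
    detected, score = weighted_vote(answers)
    print(detected, round(score, 3))
```

Under these assumptions, a pair whose two answers contradict each other contributes less to the final decision, which is one plausible reading of "consistency-weighted"; the calibration step mentioned in the abstract is not modeled here.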
Video large language model (Video-LLM) / Multimodal large language model (MLLM) / MLLM-as-a-Judge / Video caption / Benchmark
Zhejiang University Press