TimeJudge: empowering video-LLMs as zero-shot judges for temporal consistency in video captions

Yangliu HU , Zikai SONG , Junqing YU , Yiping Phoebe CHEN , Wei YANG

Eng Inform Technol Electron Eng ›› 2025, Vol. 26 ›› Issue (11): 2204-2214.

DOI: 10.1631/FITEE.2500412
Research Article

Abstract

Video large language models (video-LLMs) have demonstrated impressive capabilities in multimodal understanding, but their potential as zero-shot evaluators for temporal consistency in video captions remains underexplored. Existing methods notably underperform in detecting critical temporal errors, such as missing, hallucinated, or misordered actions. To address this gap, we introduce two key contributions. (1) TimeJudge: a novel zero-shot framework that recasts temporal error detection as answering calibrated binary question pairs. It incorporates modality-sensitive confidence calibration and uses consistency-weighted voting for robust prediction aggregation. (2) TEDBench: a rigorously constructed benchmark featuring videos across four distinct complexity levels, designed with fine-grained temporal error annotations to evaluate video-LLM performance on this task. Through a comprehensive evaluation of multiple state-of-the-art video-LLMs on TEDBench, we demonstrate that TimeJudge consistently yields substantial gains in recall and F1-score without requiring any task-specific fine-tuning. Our approach provides a generalizable, scalable, and training-free solution for enhancing the temporal error detection capabilities of video-LLMs.
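
The abstract names binary question pairs and consistency-weighted voting but does not give their formulas. As a purely illustrative sketch, and not the authors' implementation, the following Python shows one plausible reading: each pair asks a direct question ("does the caption contain a temporal error?") and its inverse, a pair is weighted by how well its two answers agree, and the weighted scores are averaged. The names `PairAnswer`, `pair_consistency`, `pair_error_score`, and `weighted_vote`, the weighting formula, and the 0.5 threshold are all assumptions introduced here; the modality-sensitive calibration step is omitted.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PairAnswer:
    """Hypothetical container for one binary question pair."""
    p_yes_direct: float   # P("yes") for the direct question: "is there a temporal error?"
    p_yes_inverse: float  # P("yes") for the inverse question: "is the caption temporally correct?"


def pair_consistency(a: PairAnswer) -> float:
    # Assumed weighting: a fully consistent pair has probabilities summing to 1.
    return max(0.0, 1.0 - abs(a.p_yes_direct + a.p_yes_inverse - 1.0))


def pair_error_score(a: PairAnswer) -> float:
    # Average the direct answer with the flipped inverse answer.
    return 0.5 * (a.p_yes_direct + (1.0 - a.p_yes_inverse))


def weighted_vote(answers: List[PairAnswer], threshold: float = 0.5) -> Tuple[float, bool]:
    """Return (aggregated error score, error flag) over all question pairs."""
    weights = [pair_consistency(a) for a in answers]
    total = sum(weights)
    if total == 0.0:
        # No pair is self-consistent: fall back to an uninformative score.
        return 0.5, False
    score = sum(w * pair_error_score(a) for w, a in zip(weights, answers)) / total
    return score, score > threshold


if __name__ == "__main__":
    answers = [PairAnswer(0.9, 0.1), PairAnswer(0.8, 0.8), PairAnswer(0.2, 0.9)]
    print(weighted_vote(answers))
```

In this sketch the second pair answers "yes" to both a question and its inverse, so it receives a low weight and contributes little to the final decision.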

Keywords

Video large language model (Video-LLM) / Multimodal large language model (MLLM) / MLLM-as-a-Judge / Video caption / Benchmark

Cite this article

Yangliu HU, Zikai SONG, Junqing YU, Yiping Phoebe CHEN, Wei YANG. TimeJudge: empowering video-LLMs as zero-shot judges for temporal consistency in video captions. Eng Inform Technol Electron Eng, 2025, 26(11): 2204-2214. DOI: 10.1631/FITEE.2500412

RIGHTS & PERMISSIONS

Zhejiang University Press
