A3Bench: An Audience-Aligned Multilingual Benchmark for Video Audience Insights Understanding
Yiming Lei, Guozhen Peng, Zeming Liu, Hui Qiu, Haitao Leng, Shaoguo Liu, Tingting Gao, Qingjie Liu, Annan Li, Yunhong Wang
Multimodal Large Language Models (MLLMs) have made remarkable progress in video understanding and consistently perform well on vision-centric benchmarks. However, existing benchmarks primarily evaluate factual or event-based comprehension while neglecting audience insights, a critical yet underexplored dimension of video understanding that reflects a deep comprehension of cognitive processes from the audience’s perspective. As a result, MLLMs shaped by such benchmarks often produce responses that are factually correct but misaligned with the audience’s interests. To bridge this gap, we use audience insights derived from video comments as a direct proxy to guide the annotation process and introduce A3Bench, an audience-aligned benchmark for evaluating video audience insights with large-scale videos and high-quality multilingual comments. Furthermore, inspired by neuroimaging studies, we propose Cognition Interaction of Thought (CIoT), a structured reasoning framework that emulates key aspects of cognitive processes. Extensive experiments on A3Bench reveal that current MLLMs struggle to understand audience insights, particularly when compared with human-level understanding. In contrast, CIoT can improve the performance of these models, highlighting its potential to enhance MLLMs’ ability to understand audience insights in future research.
Multilingual Video Understanding / Audience Insights / CIoT / MLLMs
©The Author(s) 2026.