Cascade context-oriented spatio-temporal attention network for efficient and fine-grained video-grounded dialogues
Hao WANG, Bin GUO, Mengqi CHEN, Qiuyun ZHANG, Yasan DING, Ying ZHANG, Zhiwen YU
Front. Comput. Sci., 2025, Vol. 19, Issue 7: 197329
The Video-Grounded Dialogue System (VGDS), which generates reasonable responses from multi-turn dialogue contexts and a given video, has recently received intensive attention. The key to building a superior VGDS lies in efficiently reasoning over visual and textual concepts of various granularities and achieving comprehensive visual-textual alignment. Despite remarkable progress, existing studies struggle to identify context-relevant video parts and disregard the impact of redundant information in long-form, content-dynamic videos. Further, current methods usually align all semantics across modalities uniformly with a one-time cross-attention scheme, neglecting the sophisticated correspondence between visual and textual concepts of different granularities (e.g., still objects with nouns, dynamic events with verbs). To this end, we propose a novel system, the Cascade cOntext-oriented Spatio-Temporal Attention Network (COSTA), to generate reasonable responses efficiently and accurately. Specifically, COSTA first adopts a cascade attention network to localize only the most relevant video clips and regions in a coarse-to-fine manner, effectively filtering out irrelevant visual semantics. Second, we design a memory distillation-inspired iterative visual-textual cross-attention strategy to progressively integrate visual semantics with dialogue contexts across varying granularities, facilitating extensive multi-modal alignment. Experiments on several benchmarks demonstrate that our model significantly outperforms state-of-the-art methods across various metrics.
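To make the coarse-to-fine idea concrete, the following is a minimal PyTorch sketch of a cascade spatio-temporal selection module: a temporal stage scores whole clips against the dialogue context and keeps only the top-k, and a spatial stage then attends over the regions inside the surviving clips. All class names, dimensions, and the simple dot-product clip scorer are illustrative assumptions, not COSTA's actual implementation.

```python
import torch
import torch.nn as nn


class CascadeSpatioTemporalAttention(nn.Module):
    """Coarse-to-fine selection sketch: a temporal stage keeps the
    dialogue-relevant clips, then a spatial stage attends over regions
    inside those clips. Names and dimensions are assumptions."""

    def __init__(self, dim: int = 256, top_k_clips: int = 4, num_heads: int = 4):
        super().__init__()
        self.top_k_clips = top_k_clips
        # scores each clip feature against the pooled dialogue context
        self.clip_scorer = nn.Linear(dim, 1)
        self.region_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, clips, regions, query):
        # clips:   (B, T, D)    clip-level visual features
        # regions: (B, T, R, D) region-level features for each clip
        # query:   (B, D)       pooled dialogue-context representation
        B, T, D = clips.shape

        # --- temporal (coarse) stage: keep only the top-k relevant clips ---
        scores = self.clip_scorer(clips * query.unsqueeze(1)).squeeze(-1)  # (B, T)
        k = min(self.top_k_clips, T)
        top_idx = scores.topk(k, dim=1).indices                            # (B, k)
        idx = top_idx[:, :, None, None].expand(-1, -1, regions.size(2), D)
        sel_regions = regions.gather(1, idx)                               # (B, k, R, D)

        # --- spatial (fine) stage: attend over regions of surviving clips ---
        flat = sel_regions.flatten(1, 2)                                   # (B, k*R, D)
        fused, _ = self.region_attn(query.unsqueeze(1), flat, flat)        # (B, 1, D)
        return fused.squeeze(1)                                            # (B, D)


# toy usage: 2 videos, 16 clips, 8 regions per clip, 256-d features
model = CascadeSpatioTemporalAttention()
out = model(torch.randn(2, 16, 256), torch.randn(2, 16, 8, 256), torch.randn(2, 256))
```

Because irrelevant clips are discarded before any region-level computation, attention is never paid to regions of clips the dialogue does not mention, which is the efficiency argument behind the cascade design.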
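The second mechanism, the memory distillation-inspired iterative cross-attention, can likewise be pictured as repeated cross-attention passes that fold visual semantics back into a textual memory. Again, this is only a sketch under assumed layer counts and names, not the authors' released code.

```python
import torch
import torch.nn as nn


class IterativeCrossAttention(nn.Module):
    """Iterative visual-textual fusion sketch: each pass lets the textual
    memory query the visual features and distills the result back into the
    memory. Layer count and naming are illustrative assumptions."""

    def __init__(self, dim: int = 256, num_iters: int = 3, num_heads: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True)
             for _ in range(num_iters)]
        )
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(num_iters)])

    def forward(self, text, visual):
        # text:   (B, L, D) dialogue-context token features
        # visual: (B, N, D) visual features (e.g., region- or clip-level)
        memory = text
        for attn, norm in zip(self.layers, self.norms):
            fused, _ = attn(memory, visual, visual)  # text queries the video
            memory = norm(memory + fused)            # residual "distillation"
        return memory                                # (B, L, D)
```

Running several shallow passes rather than one monolithic cross-attention lets features of different granularities (e.g., regions in early rounds, clips in later ones) be absorbed separately, matching the progressive multi-granularity alignment the abstract describes.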
video-grounded dialogue / spatio-temporal attention / multi-modality