A Dual Stream Multimodal Alignment and Fusion Network for Classifying Short Videos

Ming ZHOU, Tong WANG

Journal of Donghua University (English Edition), 2025, 42(1): 88-95. DOI: 10.19884/j.1672-5220.202402011
Information Technology and Artificial Intelligence


Abstract

Video classification is an important task in video understanding and plays a pivotal role in the intelligent monitoring of information content. Most existing methods do not consider the multimodal nature of video, and their modality fusion approaches tend to be overly simple, often neglecting modality alignment before fusion. This research introduces a novel dual stream multimodal alignment and fusion network, named DMAFNet, for classifying short videos. The network uses two unimodal encoder modules to extract features within each modality and exploits a multimodal encoder module to learn interactions between modalities. To solve the modality alignment problem, contrastive learning is introduced between the two unimodal encoder modules. Additionally, masked language modeling (MLM) and video-text matching (VTM) auxiliary tasks are introduced to improve the interaction between the video frame and text modalities through backpropagation of their loss functions. Diverse experiments prove the efficiency of DMAFNet on multimodal video classification tasks. Compared with two other mainstream baselines, DMAFNet achieves the best results on the 2022 WeChat Big Data Challenge dataset.
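To make the described architecture concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes pre-extracted frame features and text token embeddings as inputs; all module names, dimensions, and the mean-pooling choices are assumptions, and the MLM and VTM auxiliary tasks are omitted for brevity. The sketch shows the two unimodal encoder streams, a shared multimodal (fusion) encoder with a classification head, and a symmetric InfoNCE-style contrastive loss that aligns the two streams before fusion.

```python
# Illustrative sketch of a dual-stream multimodal network with contrastive
# alignment (hypothetical names and dimensions; not the DMAFNet code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class UnimodalEncoder(nn.Module):
    """Transformer encoder for one modality; returns a sequence and a pooled embedding."""

    def __init__(self, in_dim, d_model=256, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):
        h = self.encoder(self.proj(x))          # (B, T, d_model)
        return h, h.mean(dim=1)                 # token sequence + pooled embedding


class DualStreamFusionNet(nn.Module):
    """Two unimodal streams, a shared multimodal encoder, and a classifier."""

    def __init__(self, frame_dim, text_dim, n_classes, d_model=256):
        super().__init__()
        self.video_enc = UnimodalEncoder(frame_dim, d_model)
        self.text_enc = UnimodalEncoder(text_dim, d_model)
        fuse_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(fuse_layer, num_layers=2)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, frames, tokens):
        v_seq, v_emb = self.video_enc(frames)   # video stream
        t_seq, t_emb = self.text_enc(tokens)    # text stream
        fused = self.fusion(torch.cat([v_seq, t_seq], dim=1))  # cross-modal interaction
        logits = self.classifier(fused.mean(dim=1))
        return logits, v_emb, t_emb


def contrastive_alignment_loss(v_emb, t_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning paired video/text embeddings within a batch."""
    v = F.normalize(v_emb, dim=-1)
    t = F.normalize(t_emb, dim=-1)
    logits = v @ t.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2


if __name__ == "__main__":
    model = DualStreamFusionNet(frame_dim=768, text_dim=768, n_classes=200)
    frames = torch.randn(4, 32, 768)            # 4 videos x 32 frame features
    tokens = torch.randn(4, 64, 768)            # 4 texts x 64 token embeddings
    logits, v_emb, t_emb = model(frames, tokens)
    loss = F.cross_entropy(logits, torch.randint(0, 200, (4,))) \
        + contrastive_alignment_loss(v_emb, t_emb)
    loss.backward()
```

In this sketch the contrastive term pulls paired video and text embeddings together before fusion, while the classification loss is computed on the fused representation; the paper additionally backpropagates MLM and VTM losses through the multimodal encoder.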

Keywords

video classification / multimodal fusion / feature alignment

Cite this article

Ming ZHOU, Tong WANG. A Dual Stream Multimodal Alignment and Fusion Network for Classifying Short Videos. Journal of Donghua University (English Edition), 2025, 42(1): 88-95. DOI: 10.19884/j.1672-5220.202402011



Funding

Fundamental Research Funds for the Central Universities, China(2232021A-10)

National Natural Science Foundation of China(61903078)

Shanghai Sailing Program, China(22YF1401300)

Natural Science Foundation of Shanghai, China(20ZR1400400)
