Spatio-temporal feature extraction with a global-local Transformer model for video scene graph generation

Rongsen Wu, Jie Xu, Hao Zheng, Zhiyuan Xu, Zixuan Li, Shixue Cheng, Shumao Zhang

2026, Vol. 12, Issue 2: 364-374. DOI: 10.1016/j.dcan.2025.04.010

Regular Papers

Abstract

In the field of video scene graph generation, spatio-temporal feature extraction and the long-tail effect in relationship classification are core research issues. This paper proposes extracting spatio-temporal features with a global-local Transformer model for video scene graph generation. Built on the Transformer architecture and its attention mechanism, the method enriches the semantic information of spatio-temporal features in videos and thereby improves the accuracy of relationship classification. In the feature processing module, pose features are introduced to strengthen the semantic representation of objects. In the spatial feature encoding module, a local spatial visibility matrix based on bounding boxes and the key points of human pose features is proposed to address the insufficient attention that traditional Transformer encoders pay to local details. In the temporal feature encoding module, a global random frame extraction strategy is proposed that captures global temporal features while keeping computational complexity in check. In the relation classification module, to address the uneven distribution of object and relation categories in the Action Genome dataset, a relation classification loss function based on bipartite graph matching and Focal Loss is proposed, which alleviates the long-tail effect in relation classification and improves accuracy.
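
To make the spatial encoding idea concrete, here is a minimal sketch, not the authors' code, of how a local spatial visibility matrix derived from bounding boxes and human pose key points could serve as an attention mask in a stock Transformer encoder layer. The overlap and keypoint-in-box visibility heuristics, function names, and tensor shapes are all assumptions for illustration:

```python
# Hypothetical sketch: a local spatial visibility mask from boxes + pose
# keypoints, applied as an attention mask in a PyTorch Transformer layer.
import torch

def boxes_visible(box_a, box_b):
    """Assumed heuristic: two objects are mutually 'visible' if their boxes overlap."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2

def keypoint_in_box(keypoints, box):
    """Assumed heuristic: the person 'sees' an object if any pose keypoint lies in its box."""
    x1, y1, x2, y2 = box
    return any(x1 <= x <= x2 and y1 <= y <= y2 for x, y in keypoints)

def build_visibility_mask(boxes, person_keypoints, person_idx=0):
    """Boolean (N, N) mask; True marks pairs the encoder may NOT attend to
    (PyTorch convention: True positions are masked out)."""
    n = len(boxes)
    visible = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        for j in range(n):
            if i == j or boxes_visible(boxes[i], boxes[j]):
                visible[i, j] = True
            # the person token additionally sees objects touched by its keypoints
            elif person_idx in (i, j) and keypoint_in_box(
                    person_keypoints, boxes[j if i == person_idx else i]):
                visible[i, j] = True
    return ~visible  # invert: True = masked out

# Usage with a stock encoder layer (feature dim and head count are placeholders):
features = torch.randn(1, 4, 256)                 # (batch, N objects, dim)
boxes = [(0, 0, 50, 120), (40, 60, 90, 110), (200, 10, 240, 50), (45, 70, 80, 100)]
kps = [(25.0, 30.0), (30.0, 80.0), (48.0, 95.0)]  # toy person keypoints
mask = build_visibility_mask(boxes, kps)
layer = torch.nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
out = layer(features, src_mask=mask)              # local attention only
```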
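
The global random frame extraction strategy can likewise be sketched in a few lines. The rule below, drawing a fixed budget of frames uniformly at random across the whole video while keeping the current keyframe, is an assumption about how such a strategy might work, not the paper's exact procedure:

```python
# Hypothetical sketch: global random frame sampling with a fixed budget,
# so temporal context stays global at O(budget) cost instead of O(video length).
import random

def global_random_frames(num_frames, budget, keyframe=None, seed=None):
    """Return sorted frame indices: the keyframe (if given) plus up to
    `budget` frames drawn without replacement from the entire video."""
    rng = random.Random(seed)
    pool = [i for i in range(num_frames) if i != keyframe]
    picks = rng.sample(pool, k=min(budget, len(pool)))
    if keyframe is not None:
        picks.append(keyframe)
    return sorted(picks)

# e.g. a 900-frame clip: keep keyframe 450 and add 15 globally sampled frames.
print(global_random_frames(900, budget=15, keyframe=450, seed=0))
```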
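
Finally, a hedged sketch of the relation classification loss: ground-truth relations are matched to predicted relation queries with the Hungarian algorithm, as in DETR-style bipartite-matching losses, and the matched pairs are scored with Focal Loss so that rare tail predicates are up-weighted. The cost definition, names, and shapes are illustrative assumptions, not the paper's exact formulation:

```python
# Hypothetical sketch: Hungarian matching of relation queries to ground truth,
# followed by Focal Loss on the matched pairs.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Multi-class focal loss: down-weights easy examples by (1 - p_t)^gamma;
    alpha is used here as a flat scaling factor."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    p_t = torch.exp(-ce)  # probability assigned to the true class
    return (alpha * (1.0 - p_t) ** gamma * ce).mean()

def matched_relation_loss(pred_logits, gt_labels):
    """pred_logits: (num_queries, num_classes); gt_labels: (num_gt,) class ids."""
    probs = pred_logits.softmax(dim=-1)
    # cost of assigning query i to ground truth j = -P(class_j | query i)
    cost = -probs[:, gt_labels]  # (num_queries, num_gt)
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    rows, cols = torch.as_tensor(rows), torch.as_tensor(cols)
    return focal_loss(pred_logits[rows], gt_labels[cols])

# Toy usage: 5 relation queries, 10 predicate classes, 3 ground-truth relations.
logits = torch.randn(5, 10, requires_grad=True)
labels = torch.tensor([2, 7, 7])
loss = matched_relation_loss(logits, labels)
loss.backward()
```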

Keywords

Video scene graph generation / Transformer / Pose features / Visibility matrix / Bipartite graph matching

Cite this article

Rongsen Wu, Jie Xu, Hao Zheng, Zhiyuan Xu, Zixuan Li, Shixue Cheng, Shumao Zhang. Spatio-temporal feature extraction with a global-local Transformer model for video scene graph generation. Digital Communications and Networks, 2026, 12(2): 364-374. DOI: 10.1016/j.dcan.2025.04.010


CRediT authorship contribution statement

Rongsen Wu: Writing-review & editing, Writing-original draft, Visualization, Methodology, Conceptualization. Jie Xu: Writing-review & editing, Resources, Project administration. Hao Zheng: Investigation, Formal analysis, Data curation, Conceptualization. Zhiyuan Xu: Visualization, Investigation. Zixuan Li: Visualization, Validation. Shixue Cheng: Validation, Software. Shumao Zhang: Visualization, Validation, Resources.

Declaration of competing interest

The authors declare that they have no conflicts of interest to report regarding the present study. This research was not influenced by any financial or personal relationships with other people or organizations that could inappropriately impact our work.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. 62071098) and the Sichuan Science and Technology Program (Grant Nos. 2022YFG0319, 2023YFG0301, and 2023YFG0018).
