Spatio-temporal feature extraction with a global-local Transformer model for video scene graph generation

Rongsen Wu, Jie Xu, Hao Zheng, Zhiyuan Xu, Zixuan Li, Shixue Cheng, Shumao Zhang

2026, Vol. 12, Issue 2: 364-374. DOI: 10.1016/j.dcan.2025.04.010

Regular Papers

Abstract

In the field of video scene graph generation, spatio-temporal feature extraction and the long-tail effect in relationship classification are core research issues. This paper proposes extracting spatio-temporal features with a global-local Transformer model for video scene graph generation. Built on the Transformer architecture and its attention mechanism, the method enriches the semantic information of spatio-temporal features in videos and thereby improves the accuracy of relationship classification. In the feature processing module, pose features are introduced to strengthen the semantic representation of objects. In the spatial feature encoding module, a local spatial visibility matrix based on bounding boxes and the key points of human pose features is proposed to address the insufficient attention that traditional Transformer encoders pay to local details. In the temporal feature encoding module, a global random frame extraction strategy is proposed that captures global temporal features while keeping computational complexity in check. In the relation classification module, to address the uneven distribution of object and relation categories in the Action Genome dataset, a relation classification loss function based on bipartite graph matching and Focal Loss is proposed, which alleviates the long-tail effect in relation classification and improves accuracy.
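
To make the spatial encoding idea concrete, here is a minimal sketch, not the authors' code, of how a local spatial visibility matrix derived from bounding boxes and human pose key points could serve as an attention mask in a stock Transformer encoder layer. The overlap and keypoint-in-box visibility heuristics, function names, and tensor shapes are all assumptions for illustration:

```python
# Hypothetical sketch: a local spatial visibility mask from boxes + pose
# keypoints, applied as an attention mask in a PyTorch Transformer layer.
import torch

def boxes_visible(box_a, box_b):
    """Assumed heuristic: two objects are mutually 'visible' if their boxes overlap."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2

def keypoint_in_box(keypoints, box):
    """Assumed heuristic: the person 'sees' an object if any pose keypoint lies in its box."""
    x1, y1, x2, y2 = box
    return any(x1 <= x <= x2 and y1 <= y <= y2 for x, y in keypoints)

def build_visibility_mask(boxes, person_keypoints, person_idx=0):
    """Boolean (N, N) mask; True marks pairs the encoder may NOT attend to
    (PyTorch convention: True positions are masked out)."""
    n = len(boxes)
    visible = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        for j in range(n):
            if i == j or boxes_visible(boxes[i], boxes[j]):
                visible[i, j] = True
            # the person token additionally sees objects touched by its keypoints
            elif person_idx in (i, j) and keypoint_in_box(
                    person_keypoints, boxes[j if i == person_idx else i]):
                visible[i, j] = True
    return ~visible  # invert: True = masked out

# Usage with a stock encoder layer (feature dim and head count are placeholders):
features = torch.randn(1, 4, 256)                 # (batch, N objects, dim)
boxes = [(0, 0, 50, 120), (40, 60, 90, 110), (200, 10, 240, 50), (45, 70, 80, 100)]
kps = [(25.0, 30.0), (30.0, 80.0), (48.0, 95.0)]  # toy person keypoints
mask = build_visibility_mask(boxes, kps)
layer = torch.nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
out = layer(features, src_mask=mask)              # local attention only
```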
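
The global random frame extraction strategy can likewise be sketched in a few lines. The rule below, drawing a fixed budget of frames uniformly at random across the whole video while keeping the current keyframe, is an assumption about how such a strategy might work, not the paper's exact procedure:

```python
# Hypothetical sketch: global random frame sampling with a fixed budget,
# so temporal context stays global at O(budget) cost instead of O(video length).
import random

def global_random_frames(num_frames, budget, keyframe=None, seed=None):
    """Return sorted frame indices: the keyframe (if given) plus up to
    `budget` frames drawn without replacement from the entire video."""
    rng = random.Random(seed)
    pool = [i for i in range(num_frames) if i != keyframe]
    picks = rng.sample(pool, k=min(budget, len(pool)))
    if keyframe is not None:
        picks.append(keyframe)
    return sorted(picks)

# e.g. a 900-frame clip: keep keyframe 450 and add 15 globally sampled frames.
print(global_random_frames(900, budget=15, keyframe=450, seed=0))
```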
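
Finally, a hedged sketch of the relation classification loss: ground-truth relations are matched to predicted relation queries with the Hungarian algorithm, as in DETR-style bipartite-matching losses, and the matched pairs are scored with Focal Loss so that rare tail predicates are up-weighted. The cost definition, names, and shapes are illustrative assumptions, not the paper's exact formulation:

```python
# Hypothetical sketch: Hungarian matching of relation queries to ground truth,
# followed by Focal Loss on the matched pairs.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Multi-class focal loss: down-weights easy examples by (1 - p_t)^gamma;
    alpha is used here as a flat scaling factor."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    p_t = torch.exp(-ce)  # probability assigned to the true class
    return (alpha * (1.0 - p_t) ** gamma * ce).mean()

def matched_relation_loss(pred_logits, gt_labels):
    """pred_logits: (num_queries, num_classes); gt_labels: (num_gt,) class ids."""
    probs = pred_logits.softmax(dim=-1)
    # cost of assigning query i to ground truth j = -P(class_j | query i)
    cost = -probs[:, gt_labels]  # (num_queries, num_gt)
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    rows, cols = torch.as_tensor(rows), torch.as_tensor(cols)
    return focal_loss(pred_logits[rows], gt_labels[cols])

# Toy usage: 5 relation queries, 10 predicate classes, 3 ground-truth relations.
logits = torch.randn(5, 10, requires_grad=True)
labels = torch.tensor([2, 7, 7])
loss = matched_relation_loss(logits, labels)
loss.backward()
```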

Keywords

Video scene graph generation / Transformer / Pose features / Visibility matrix / Bipartite graph matching

Cite this article

Rongsen Wu, Jie Xu, Hao Zheng, Zhiyuan Xu, Zixuan Li, Shixue Cheng, Shumao Zhang. Spatio-temporal feature extraction with a global-local Transformer model for video scene graph generation. Digital Communications and Networks, 2026, 12(2): 364-374. DOI: 10.1016/j.dcan.2025.04.010


CRediT authorship contribution statement

Rongsen Wu: Writing-review & editing, Writing-original draft, Visualization, Methodology, Conceptualization. Jie Xu: Writing-review & editing, Resources, Project administration. Hao Zheng: Investigation, Formal analysis, Data curation, Conceptualization. Zhiyuan Xu: Visualization, Investigation. Zixuan Li: Visualization, Validation. Shixue Cheng: Validation, Software. Shumao Zhang: Visualization, Validation, Resources.

Declaration of competing interest

The authors declare that they have no conflicts of interest to report regarding the present study. This research was not influenced by any financial or personal relationships with other people or organizations that could inappropriately impact our work.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. 62071098) and the Sichuan Science and Technology Program (Grant Nos. 2022YFG0319, 2023YFG0301, and 2023YFG0018).
