Spatio-temporal feature extraction with a global-local Transformer model for video scene graph generation
Rongsen Wu, Jie Xu, Hao Zheng, Zhiyuan Xu, Zixuan Li, Shixue Cheng, Shumao Zhang
2026, Vol. 12, Issue (2): 364-374.
In video scene graph generation, spatio-temporal feature extraction and the long-tail effect in relationship classification are core research issues. This paper proposes a global-local Transformer model for extracting spatio-temporal features for video scene graph generation. Built on the Transformer architecture and the attention mechanism, the method enriches the semantic information of spatio-temporal features in videos, thereby improving the accuracy of relationship classification. In the feature processing module, pose features are introduced to strengthen the semantic representation of objects. In the spatial feature encoding module, a local spatial visibility matrix based on bounding boxes and the key points of human pose features is proposed to address the insufficient attention that traditional Transformer encoders pay to local details. In the temporal feature encoding module, a global random frame extraction strategy is proposed, which captures global temporal features while keeping computational complexity in check. In the relation classification module, to address the uneven distribution of object and relation categories in the Action Genome dataset, a relation classification loss function based on bipartite graph matching and Focal Loss is proposed, which alleviates the long-tail effect in relation classification and improves accuracy.
Video scene graph generation / Transformer / Pose features / Visibility matrix / Bipartite graph matching
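The abstract's relation classification loss builds on Focal Loss, whose standard formulation down-weights easy, well-classified examples so that training focuses on hard ones. Below is a minimal sketch of the standard binary Focal Loss; the paper's actual loss additionally incorporates bipartite graph matching, which is not shown here, and the parameter defaults (`alpha=0.25`, `gamma=2.0`) are the common choices from the original Focal Loss paper, not values taken from this article.

```python
import math

def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Binary focal loss for one predicted probability.

    p      -- predicted probability of the positive class, in (0, 1)
    target -- ground-truth label, 1 or 0
    alpha  -- class-balancing weight for the positive class
    gamma  -- focusing parameter; gamma=0 reduces to weighted cross-entropy
    """
    # p_t is the model's probability assigned to the true class.
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    # The (1 - p_t)^gamma factor shrinks the loss of confident predictions.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confident correct prediction contributes far less loss than an
# uncertain one, which is how the long-tail classes get relatively
# more gradient signal.
easy = focal_loss(0.9, 1)
hard = focal_loss(0.5, 1)
```

With `alpha=1.0` and `gamma=0.0` the function degenerates to plain cross-entropy, which is a quick sanity check on any implementation.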