SUGAR: A Sequence Unfolding Based Transformer Model for Group Activity Recognition
Yash Gondkar, Chengjie Zheng, Yumeng Yang, Shiqian Shen, Wei Ding, Ping Chen
Transactions on Artificial Intelligence, 2025, Vol. 1, Issue 1: 227-245.
Deep learning models built upon Transformer architectures have led to substantial advancements in sequential data analysis. Nevertheless, their direct application to video-based tasks, such as Group Activity Recognition (GAR), remains constrained by the quadratic computational complexity and excessive memory requirements of global self-attention, especially when handling long video sequences. To overcome these limitations, we propose SUGAR: A Sequence Unfolding Based Transformer Model for Group Activity Recognition. Our approach introduces a novel sequence unfolding and folding mechanism that partitions long video sequences into overlapping local windows, enabling the model to concentrate attention within compact temporal regions. This local attention design dramatically reduces computational cost and memory footprint while maintaining high recognition accuracy. Within the Bi-Causal framework, SUGAR replaces conventional Transformer blocks, and experimental results on the Volleyball dataset demonstrate that our model achieves state-of-the-art performance, consistently exceeding 93% accuracy, with significantly improved efficiency. In addition, we investigate Lightning Attention 2 as an alternative linear-complexity attention module, identifying practical challenges such as increased memory usage and unstable convergence. To ensure robustness and training stability, we incorporate a dedicated safety mechanism that mitigates these issues. In summary, SUGAR offers a scalable, resource-efficient solution for group activity analysis in videos and exhibits strong potential for broader applications involving lengthy sequential data in computer vision and bioinformatics.
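The unfolding/folding mechanism described in the abstract can be illustrated with a minimal sketch. The helper names `unfold` and `fold` below are hypothetical, not the paper's actual API: a `(T, D)` sequence is partitioned into overlapping windows (inside which attention would run locally, at O(w^2) per window rather than O(T^2) globally), and the folding step averages the overlapping contributions back into a full-length sequence.

```python
import numpy as np

def unfold(x, window, stride):
    # x: (T, D) sequence -> (num_windows, window, D) overlapping local windows
    T, D = x.shape
    starts = range(0, T - window + 1, stride)
    return np.stack([x[s:s + window] for s in starts])

def fold(windows, T, stride):
    # Scatter windows back to a (T, D) sequence, averaging where they overlap
    n, w, D = windows.shape
    out = np.zeros((T, D))
    count = np.zeros((T, 1))
    for i in range(n):
        s = i * stride
        out[s:s + w] += windows[i]
        count[s:s + w] += 1
    return out / count

x = np.arange(12, dtype=float).reshape(6, 2)   # toy sequence: T=6, D=2
w = unfold(x, window=4, stride=2)              # two windows: x[0:4] and x[2:6]
y = fold(w, T=6, stride=2)                     # averaging overlaps recovers x
```

With window size `w` and stride `s`, attention cost drops from O(T^2) to roughly O((T / s) * w^2), which is the source of the memory and compute savings the abstract reports.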
group activity recognition / transformer / linear attention / sequence modeling / deep learning / efficient architectures / video understanding
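The abstract also mentions Lightning Attention 2 as a linear-complexity alternative. The sketch below is a generic kernelized linear-attention formulation, not the Lightning Attention 2 kernel itself; the feature map `phi` and the `eps` guard on the normalizer are illustrative assumptions (the latter loosely echoing the kind of numerical safeguard the abstract alludes to). The key point is that the T-by-T attention matrix is never materialized:

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    # Kernelized linear attention: compute phi(Q) @ (phi(K)^T V) in O(T),
    # instead of softmax(Q K^T) V in O(T^2).
    phi = lambda x: np.maximum(x, 0) + 1e-3   # positive feature map (assumption)
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                 # (D, Dv): aggregate keys and values first
    Z = Qp @ Kp.sum(axis=0)       # (T,): per-query normalizer
    return (Qp @ KV) / (Z[:, None] + eps)     # eps guards against tiny normalizers

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 8, 4))      # toy tensors: T=8, D=4
out = linear_attention(Q, K, V)               # (8, 4), no 8x8 matrix formed
```

Because keys and values are aggregated before the queries are applied, cost grows linearly in T; the trade-off, as the abstract notes for Lightning Attention 2 in practice, can be higher constant memory overhead and less stable convergence than windowed softmax attention.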