Spatial-Temporal Unfold Transformer for Skeleton-based Human Action Recognition

Authors

  • Hu Cui, Nagaoka University of Technology
  • Tessai Hayama, Nagaoka University of Technology

DOI:

https://doi.org/10.52731/liir.v004.167

Abstract

Transformer-based architectures have proven effective for action and gesture recognition. In contrast to Graph Convolutional Networks (GCNs), they can model joint relationships automatically through attention mechanisms, without any predefined topological graph. However, most previous approaches apply attention to the spatial and temporal dimensions in a completely decoupled manner, ignoring the local dynamics of actions and the structural semantics of the human body, and their performance lags behind state-of-the-art GCN-based methods. To overcome these issues, we propose the Spatial-Temporal Unfold Attention Network (STUT). First, it locally unfolds the skeleton data along the temporal dimension so that all neighboring frames are included in each unfolded frame. Then, the structural semantics of the human body are extracted by a hypergraph convolution and used to guide the local spatio-temporal attention operation within each unfolded frame.
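As a concrete illustration, the sketch below shows one way the temporal unfolding step could be realized in PyTorch: every frame is paired with its neighboring frames so that local spatio-temporal attention can later operate inside each unfolded frame. The window size, tensor layout (N, C, T, V), and function name are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def temporal_unfold(x: torch.Tensor, window: int = 3) -> torch.Tensor:
    # x: (N, C, T, V) skeleton features -> (N, C, T, window, V), where each
    # output frame t carries its `window` temporal neighbours centred on t.
    pad = window // 2
    x = F.pad(x, (0, 0, pad, pad))                   # pad only the T dimension
    x = x.unfold(dimension=2, size=window, step=1)   # (N, C, T, V, window)
    return x.permute(0, 1, 2, 4, 3).contiguous()     # (N, C, T, window, V)

feats = torch.randn(2, 64, 50, 25)                   # batch, channels, frames, joints
print(temporal_unfold(feats).shape)                  # torch.Size([2, 64, 50, 3, 25])

Restricting attention to each (window x V) block captures short-range motion while keeping the cost of attention local rather than over the full sequence.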
In addition, to distinguish the importance of different frames, we introduce temporal squeezing attention (TSE) for multi-scale global spatial-temporal modeling. Extensive experiments show that our model achieves 96.4% accuracy on NW-UCLA and 96.91% / 94.88% on SHREC17 (14 gestures / 28 gestures).
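The temporal squeezing attention can be read as a squeeze-and-excitation block applied along the frame axis, re-weighting frames by importance. Below is a minimal sketch under that reading; the reduction ratio, layer sizes, and class name are assumptions for illustration, not the paper's exact design.

import torch
import torch.nn as nn

class TemporalSqueezeAttention(nn.Module):
    # SE-style re-weighting of frames: squeeze joints and channels into one
    # score per frame, excite through a small MLP, and rescale every frame.
    def __init__(self, num_frames: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(num_frames, num_frames // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(num_frames // reduction, num_frames),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = x.mean(dim=(1, 3))           # (N, C, T, V) -> (N, T)
        w = self.fc(s)                   # per-frame importance weights in (0, 1)
        return x * w[:, None, :, None]   # scale each frame's features

tse = TemporalSqueezeAttention(num_frames=50)
print(tse(torch.randn(2, 64, 50, 25)).shape)   # torch.Size([2, 64, 50, 25])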

Published

2023-12-20