
Trear: Transformer-Based RGB-D Egocentric Action Recognition

Journal Article


Abstract


  • In this article, we propose a transformer-based RGB-D egocentric action recognition framework, called Trear. It consists of two modules: 1) an inter-frame attention encoder and 2) a mutual-attentional fusion block. Instead of using optical flow or recurrent units, we adopt a self-attention mechanism to model the temporal structure of the data from different modalities. Input frames are cropped randomly to mitigate the effect of data redundancy. Features from each modality interact through the proposed fusion block and are combined through a simple yet effective fusion operation to produce a joint RGB-D representation. Empirical experiments on two large egocentric RGB-D datasets, THU-READ and First-Person Hand Action (FPHA), and one small dataset, Wearable Computer Vision Systems (WCVS), show that the proposed method outperforms the state of the art by a large margin.
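
  • The following is a minimal illustrative sketch of the kind of mutual-attentional (cross-attention) fusion the abstract describes, where per-frame RGB features attend to depth features and vice versa before a simple fusion. It is written in PyTorch; the module names, feature dimension, number of action classes, and the element-wise fusion operation are assumptions for illustration, not the authors' released implementation.

    # Illustrative sketch only: cross-attention fusion of per-frame RGB and depth
    # features, in the spirit of the mutual-attentional fusion block described above.
    # Dimensions, class count, and the final fusion operation are assumptions.
    import torch
    import torch.nn as nn

    class MutualAttentionFusion(nn.Module):
        def __init__(self, dim=512, heads=8, num_classes=101):
            super().__init__()
            # RGB queries attend to depth keys/values, and vice versa.
            self.rgb_to_depth = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.depth_to_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.classifier = nn.Linear(dim, num_classes)  # hypothetical class count

        def forward(self, rgb_feats, depth_feats):
            # rgb_feats, depth_feats: (batch, num_frames, dim) frame-level features,
            # e.g. produced by a CNN backbone plus a temporal self-attention encoder.
            rgb_attended, _ = self.rgb_to_depth(rgb_feats, depth_feats, depth_feats)
            depth_attended, _ = self.depth_to_rgb(depth_feats, rgb_feats, rgb_feats)
            # Simple fusion of the two cross-attended streams (element-wise sum here;
            # the paper's exact fusion operation may differ), then pool over frames.
            joint = (rgb_attended + depth_attended).mean(dim=1)
            return self.classifier(joint)

    # Usage: random tensors standing in for 8-frame RGB and depth feature sequences.
    fusion = MutualAttentionFusion()
    rgb = torch.randn(2, 8, 512)
    depth = torch.randn(2, 8, 512)
    logits = fusion(rgb, depth)  # shape: (2, 101)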

Publication Date


  • 2022

Citation


  • Li, X., Hou, Y., Wang, P., Gao, Z., Xu, M., & Li, W. (2022). Trear: Transformer-Based RGB-D Egocentric Action Recognition. IEEE Transactions on Cognitive and Developmental Systems, 14(1), 246-252. doi:10.1109/TCDS.2020.3048883

Scopus EID


  • 2-s2.0-85099101893

Start Page


  • 246

End Page


  • 252

Volume


  • 14

Issue


  • 1
