
Structured Images for RGB-D Action Recognition

Conference Paper


Abstract


  • This paper presents an effective yet simple video representation for RGB-D based action recognition. It proposes to represent a depth map sequence as three pairs of structured dynamic images, at the body, part and joint levels respectively, through bidirectional rank pooling. Unlike previous works that apply one Convolutional Neural Network (ConvNet) to each part/joint separately, one pair of structured dynamic images is constructed from depth maps at each granularity level and serves as the input of a ConvNet. The structured dynamic image not only preserves the spatial-temporal information but also enhances the structure information across both body parts/joints and different temporal scales. In addition, it requires little computation and memory to construct. This new representation, referred to as Spatially Structured Dynamic Depth Images (S2DDI), aggregates motion and structure information in a depth sequence from global to fine-grained levels, and enables existing ConvNet models trained on image data to be fine-tuned for classification of depth sequences, without training the models afresh. The proposed representation is evaluated on five benchmark datasets, namely MSRAction3D, G3D, MSRDailyActivity3D, SYSU 3D HOI and UTD-MHAD, and achieves state-of-the-art results on all five.
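
The abstract's idea of collapsing a depth sequence into a forward and a backward dynamic image can be illustrated with a minimal sketch. It is not the authors' implementation: it swaps in the closed-form approximate rank pooling coefficients known from the dynamic image literature in place of solving the ranking problem, and it ignores the body/part/joint structuring entirely. The function names, the (T, H, W) array layout and the min-max rescaling to 8-bit are assumptions made for illustration.

```python
import numpy as np


def approx_rank_pool(frames):
    """Collapse a (T, H, W) depth sequence into one (H, W) dynamic image.

    Assumption: uses approximate rank pooling weights
    alpha_t = 2(T - t + 1) - (T + 1)(H_T - H_{t-1}),
    where H_t is the t-th harmonic number, as a stand-in for
    full rank pooling.
    """
    frames = np.asarray(frames, dtype=np.float64)
    T = frames.shape[0]
    t = np.arange(1, T + 1)
    # harmonic[k] holds H_k, with H_0 = 0.
    harmonic = np.concatenate(([0.0], np.cumsum(1.0 / t)))
    alpha = 2.0 * (T - t + 1) - (T + 1) * (harmonic[T] - harmonic[:-1])
    # Weighted sum of the frames along the time axis.
    return np.tensordot(alpha, frames, axes=(0, 0))


def bidirectional_dynamic_images(frames):
    """Return (forward, backward) dynamic images for one sequence."""
    frames = np.asarray(frames, dtype=np.float64)
    forward = approx_rank_pool(frames)
    backward = approx_rank_pool(frames[::-1])  # same pooling, reversed time
    return forward, backward


def to_uint8(image):
    """Min-max rescale to 0..255 so the result can be fed to an image ConvNet."""
    lo, hi = image.min(), image.max()
    if hi == lo:
        return np.zeros(image.shape, dtype=np.uint8)
    return np.round(255.0 * (image - lo) / (hi - lo)).astype(np.uint8)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    depth_sequence = rng.random((5, 4, 4))  # hypothetical 5-frame depth clip
    fwd, bwd = bidirectional_dynamic_images(depth_sequence)
    print(to_uint8(fwd).shape, to_uint8(bwd).shape)
```

Because the weights sum to zero, a sequence with no motion pools to an all-zero image, so only temporal change survives in the result. Under the abstract's scheme, one such pair would be built per granularity level (body, part, joint) and passed to a ConvNet fine-tuned from an image-trained model.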

Publication Date


  • 2017

Citation


  • Wang, P., Wang, S., Gao, Z., Hou, Y., & Li, W. (2017). Structured Images for RGB-D Action Recognition. In Proceedings - 2017 IEEE International Conference on Computer Vision Workshops, ICCVW 2017 Vol. 2018-January (pp. 1005-1014). doi:10.1109/ICCVW.2017.123

Scopus Eid


  • 2-s2.0-85043229432

Start Page


  • 1005

End Page


  • 1014

Volume


  • 2018-January
