Skeleton-based Human Action Recognition: From 3D Pose Estimation to Action Recognition

Publication Type: Thesis
Issue Date: 2023
Skeleton-based human action recognition has attracted substantial interest in computer vision for decades, and it mainly comprises human pose estimation and action recognition. A systematic literature review of these areas reveals four research gaps: (1) existing weakly-supervised 3D pose estimation methods suffer from low accuracy on complex 3D poses, (2) existing 3D pose estimation methods generalize poorly to unseen poses, (3) existing skeleton-based action recognition methods typically rely on more complicated models to improve accuracy on challenging actions, and (4) existing self-supervised skeleton-based action recognition methods cannot fully exploit the large amount of unlabelled 3D skeleton sequences. This thesis presents a suite of new techniques to bridge these research gaps. Firstly, to improve 3D pose estimation accuracy under a weakly-supervised learning framework, we propose PoseGate-Former, a 3D pose estimation model that uses a Transformer encoder-based architecture with a trainable gate. It significantly improves weakly-supervised 3D pose estimation accuracy. Furthermore, it reduces the risk of overfitting on some action categories and achieves higher accuracy when estimating complicated 2D poses. Secondly, to improve model generalization on unseen poses, we propose a novel human pose generator, named Pose Generator with Unbiased Learning (PoseGU), that generates diverse poses from a small set of seed samples. It uses Counterfactual Risk Minimization to pursue an unbiased evaluation objective. PoseGU outperforms almost all state-of-the-art 3D human pose estimation methods, and our empirical analysis also shows that PoseGU generates 3D poses with improved data diversity and better generalization ability.
Thirdly, to improve recognition accuracy on challenging actions without introducing a complicated model, we introduce a novel Action Features Enhanced CNN (AFE-CNN) that enhances the features of 3D skeleton-based actions. AFE-CNN is more robust to variations in camera view and body size, and it significantly improves recognition accuracy on challenging actions. Moreover, AFE-CNN adopts a lightweight CNN model to decode images with action feature enhancement, which ensures a much lower computational burden than state-of-the-art methods. Fourthly, to fully exploit the large amount of unlabelled 3D skeleton sequences, we propose a new self-supervised skeleton-based action recognition method based on Dual Min-Max Games (DMMG), which augments unlabelled data in a contrastive learning framework. By fully exploiting the proposed DMMG, we can generate sufficient challenging contrastive pairs and thus learn discriminative action feature representations from unlabelled skeleton data in a self-supervised manner.