3D Human Pose Estimation in Different Environment Settings Using Deep Learning Methods

Publication Type: Thesis
Issue Date: 2023
Three-dimensional human pose estimation has received widespread attention in computer vision, and many related applications have been derived from it. Such technology estimates 3D human skeletons in the real world from camera images. Early methods trained neural networks on images from a monocular camera, but these approaches suffer from depth ambiguity and self-occlusion, so researchers began to explore multi-view approaches to alleviate these problems. A further challenge is that most studies require large amounts of labeled training data, and annotating 3D poses as ground truth with traditional marker-based motion capture systems is expensive. Self-supervised methods have therefore attracted significant attention, because the network can be trained with only weak or even no supervision instead of paired 2D-3D human pose annotations.

We explore several methods under different settings: a monocular camera, multiple people captured from multiple cameras, and collaborative estimation using both cameras and radar. The three settings are closely related. The first is a multi-view multi-person detector; although it performs well, it requires large amounts of ground-truth data for training. The second setting therefore removes the need for labeled training data. Finally, to address an inherent drawback of camera-based networks, the third setting adds another modality, radar signals, to train a more robust network.

The three methods are as follows:

1) We propose a novel end-to-end training scheme for multi-view multi-person 3D pose estimation. Our model back-propagates gradients from the final 3D estimation step to the initial 2D detection step, significantly improving the efficiency, robustness, and accuracy of the 3D pose estimation. We also design a multi-view 2D human pose dynamic matching algorithm that dynamically matches the corresponding 2D poses detected in multiple views for each person (see the matching sketch after this list).

2) We propose a two-branch self-supervised approach in a multi-view training setting to train a 2D-3D neural network without 3D ground-truth labels. The entire model relies only on geometric information to build its supervision signals (see the consistency-loss sketch below).

3) We propose a novel unsupervised model that infers 3D human skeletons from radar signals. By training the network with both camera images and radar signals, this method overcomes problems that affect cameras, such as poor illumination, adverse weather conditions, and occluded body parts (see the cross-modal sketch below).
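To illustrate the cross-view matching idea behind method 1, here is a minimal sketch, not the thesis implementation: candidate pairs of 2D poses from two calibrated views are triangulated, scored by reprojection error, and matched with the Hungarian algorithm. The camera matrices P1 and P2, the joint layout, and all function names are illustrative assumptions.

```python
# Minimal sketch (not the thesis code): match 2D person detections across
# two calibrated views by triangulation / reprojection consistency.
import numpy as np
from scipy.optimize import linear_sum_assignment

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one joint seen in two views."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]                       # homogeneous -> Euclidean

def reprojection_error(P, X, x):
    """Pixel distance between a projected 3D joint and its 2D detection."""
    proj = P @ np.append(X, 1.0)
    return np.linalg.norm(proj[:2] / proj[2] - x)

def match_people(P1, P2, poses1, poses2):
    """poses1: (N, J, 2) detections in view 1; poses2: (M, J, 2) in view 2.
    Returns index pairs (i, j) minimizing mean reprojection error."""
    N, M = len(poses1), len(poses2)
    cost = np.zeros((N, M))
    for i in range(N):
        for j in range(M):
            errs = []
            for k in range(poses1.shape[1]):  # per joint
                X = triangulate(P1, P2, poses1[i, k], poses2[j, k])
                errs.append(reprojection_error(P1, X, poses1[i, k])
                            + reprojection_error(P2, X, poses2[j, k]))
            cost[i, j] = np.mean(errs)
    rows, cols = linear_sum_assignment(cost)  # Hungarian matching
    return list(zip(rows, cols))
```

A correct pairing triangulates to a consistent 3D skeleton, so its reprojection error is small; mismatched pairings produce large errors, which is what makes the assignment cost discriminative.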
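For method 2, a minimal sketch of geometry-only supervision, assuming known extrinsics (R_ab, t_ab) between two views: a small lifting network maps 2D joints in view A to 3D, the prediction is rigidly transformed into view B's camera frame and re-projected, and the result is compared with the 2D joints detected in view B, so no 3D labels are needed. The network architecture and the unit-focal-length projection are illustrative assumptions, not the thesis's two-branch design.

```python
# Minimal sketch of multi-view geometric supervision for 2D-to-3D lifting.
import torch
import torch.nn as nn

class Lifter(nn.Module):
    """Tiny MLP that lifts J 2D joints to camera-space 3D joints."""
    def __init__(self, num_joints=17, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_joints * 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_joints * 3),
        )
        self.num_joints = num_joints

    def forward(self, pose2d):                           # (B, J, 2)
        out = self.net(pose2d.flatten(1))
        return out.view(-1, self.num_joints, 3)

def project(pose3d):
    """Simple perspective projection onto the image plane (f = 1)."""
    return pose3d[..., :2] / pose3d[..., 2:3].clamp(min=1e-4)

def multiview_consistency_loss(lifter, pose2d_a, pose2d_b, R_ab, t_ab):
    """Lift view-A 2D joints to 3D, transform into view B's camera frame
    with the known extrinsics (R_ab, t_ab), re-project, and compare with
    the 2D joints detected in view B. No 3D labels are used anywhere."""
    pose3d_a = lifter(pose2d_a)                          # (B, J, 3)
    pose3d_b = pose3d_a @ R_ab.transpose(-1, -2) + t_ab  # rigid transform
    return nn.functional.l1_loss(project(pose3d_b), pose2d_b)
```

The supervision signal is purely geometric: a 3D prediction is only rewarded if it re-projects correctly into every other view, which is exactly the kind of constraint that replaces paired 2D-3D annotations.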
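For method 3, a minimal sketch of one way camera images can help train a radar network: a frozen image-based pose model produces pseudo 3D skeletons from synchronized images, and a radar branch learns to reproduce them from the radar signal alone, so only radar is needed at inference time (e.g., in darkness or behind occlusions). The module shapes, the camera_teacher model, and the loss are illustrative assumptions, not the thesis's architecture.

```python
# Minimal sketch of cross-modal (camera -> radar) training.
import torch
import torch.nn as nn

class RadarPoseNet(nn.Module):
    """Toy encoder mapping a radar heatmap to J 3D joints."""
    def __init__(self, num_joints=17):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, num_joints * 3),
        )
        self.num_joints = num_joints

    def forward(self, radar):                  # (B, 1, H, W) radar heatmap
        return self.backbone(radar).view(-1, self.num_joints, 3)

def train_step(radar_net, camera_teacher, radar, image, optimizer):
    """One cross-modal step: the frozen camera teacher (a hypothetical
    pretrained image-based pose network) predicts a pseudo 3D skeleton
    from the synchronized image; the radar branch is trained to
    reproduce it from the radar signal alone."""
    with torch.no_grad():
        pseudo_3d = camera_teacher(image)      # (B, J, 3) pseudo-labels
    pred_3d = radar_net(radar)
    loss = nn.functional.mse_loss(pred_3d, pseudo_3d)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the pseudo-labels come from an existing camera model rather than manual annotation, no 3D ground truth is ever required, yet the resulting radar network keeps working when the camera fails.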