Video-based similar gesture action recognition using deep learning and GAN-based approaches

Human action is not merely a pattern of motion of different parts of the body; it also conveys the intention, emotion and thoughts of the person. Hence, it has become a crucial component of human behavior analysis and understanding. Human action recognition has a wide variety of applications, such as surveillance, robotics, health care, video searching and human-computer interaction. Analysing human actions manually is tedious and prone to error. Therefore, computer scientists have been trying to bring cognitive video understanding to human action recognition systems using computer vision techniques. However, human action recognition is a complex computer vision task because of camera motion, occlusion, background clutter, viewpoint variation, differences in execution rate and similar gestures. These challenges significantly degrade the performance of human action recognition systems. The purpose of this research is to propose solutions, based on traditional machine learning methods as well as state-of-the-art deep learning methods, for automatic video-based human action recognition. This thesis investigates three research areas of video-based human action recognition: traditional human action recognition, similar gesture action recognition, and data augmentation for human action recognition. To start with, feature-based methods using classic machine learning algorithms are studied. Recently, deep convolutional neural networks (CNNs) have become established in computer vision and human action recognition research and have achieved tremendous success in comparison to traditional machine learning techniques. Current state-of-the-art deep convolutional neural networks are therefore used for the human action recognition task.
Furthermore, recurrent neural networks (RNNs) and their long short-term memory (LSTM) variant are used to process time-series features, which are either handcrafted or extracted from the CNN. However, these methods struggle with similar gestures, which appear frequently in human action videos. Thus, a hierarchical classification framework is proposed for similar gesture action recognition, and performance is improved by the multi-stage classification approach. Additionally, the framework has been modified into an end-to-end system, so that similar gestures can be processed by the system automatically. In this study, a novel data augmentation framework for action recognition is also proposed. Its objective is to generate well-learnt video frames from action videos, which can enlarge the dataset size as well as the feature bias. It is very important for a human action recognition system to recognize actions with similar gestures as accurately as possible. For such a system, a generative adversarial network (GAN) is applied to learn the original video datasets and generate video frames by playing an adversarial game. Furthermore, a framework is developed that first classifies the original dataset using a CNN to obtain the confusion matrix. The similar gesture actions are then paired based on the confusion matrix results. The final classification is performed on the fusion dataset, which contains both original and generated video frames. This study will provide real-time and practical solutions for autonomous human action recognition systems. The analysis of similar gesture actions will improve the performance of existing CNN-based approaches. In addition, the GAN-based approaches from computer vision have been applied to the graph embedding area, because graph embedding is similar to image embedding but used for different purposes.
Unlike in computer vision, where the GAN is used to generate images, in graph embedding the GAN can be used to regularize the embedding. The proposed methods are thus able to reconstruct both structural characteristics and node features, which naturally capture the interaction between these two sources of information while learning the embedding.
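As an illustration of the step in which similar gesture actions are paired from the confusion matrix, the following is a minimal sketch and not the thesis's actual implementation. The function name `pair_similar_actions`, the symmetric confusion-rate score, the `top_k` parameter, the action labels and the counts are all assumptions made for this example.

```python
import numpy as np


def pair_similar_actions(confusion, labels, top_k=2):
    """Return up to top_k class pairs ranked by mutual confusion rate.

    Hypothetical helper: the scoring rule (sum of the two row-normalised
    off-diagonal rates) is an assumption, not the thesis's criterion.
    """
    cm = np.asarray(confusion, dtype=float)

    # Row-normalise: each row becomes the fraction of a true class's
    # samples assigned to each predicted class.
    row_sums = cm.sum(axis=1, keepdims=True)
    rates = np.divide(cm, row_sums, out=np.zeros_like(cm), where=row_sums > 0)

    # Symmetrise so that confusion in either direction counts.
    mutual = rates + rates.T
    np.fill_diagonal(mutual, 0.0)

    # Score each unordered pair once (upper triangle only).
    i_idx, j_idx = np.triu_indices(len(labels), k=1)
    scores = mutual[i_idx, j_idx]
    order = np.argsort(scores)[::-1][:top_k]

    return [
        (labels[i_idx[n]], labels[j_idx[n]], float(scores[n]))
        for n in order
        if scores[n] > 0
    ]


# Made-up confusion matrix (rows: true class, columns: predicted class).
labels = ["walk", "jog", "run", "wave"]
cm = [
    [40, 8, 2, 0],
    [10, 35, 5, 0],
    [1, 12, 37, 0],
    [0, 0, 0, 50],
]

pairs = pair_similar_actions(cm, labels)
for first, second, score in pairs:
    print(f"{first} / {second}: mutual confusion {score:.2f}")
```

In this toy example the most confused pairs are walk/jog and jog/run, which are the kinds of pairs a second-stage classifier or GAN-based augmentation could then focus on.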