Image emotion recognition using region-based multi-level features

According to psychology studies, human emotion can be evoked by different kinds of visual stimuli. Recognizing human emotion automatically from visual content has been studied for years. Emotion recognition is an essential component of human-computer interaction and is involved in many applications, such as advertising, entertainment, education, and recommendation systems. Compared to other computer vision tasks, visual emotion recognition is more challenging because it involves analyzing abstract emotional states, which are both complex and subjective. Regarding complexity, emotion can be evoked by different kinds of visual content, and the same kind of visual content may evoke various emotions. Regarding subjectivity, people from different cultural backgrounds may experience different emotions in response to the same visual content.

An automatic visual emotion recognition system consists of several tuned processing steps integrated into a pipeline. Previous methods often rely on hand-crafted features, which can introduce strong assumptions about the properties of human emotion. However, vague assumptions about the abstract concept of emotion, together with learning the processing pipeline from limited data, often narrow the generalization ability of a visual emotion recognition system. Considering the two challenges of complexity and subjectivity mentioned above, more information should be used for image-based emotion analysis. Features from different levels need to be taken into consideration: low-level visual features such as color, shape, line, and texture; mid-level image aesthetics and composition; and high-level image semantics. Local information extracted from emotion-related image regions can provide further support for image emotion classification. In recent years, deep learning methods have achieved great success in many computer vision tasks.
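As a concrete illustration of the low-level visual features mentioned above, the following sketch computes a joint RGB color histogram from an image. The bin count and the synthetic test image are illustrative assumptions, not values taken from the thesis.

```python
import numpy as np

def color_histogram(image, bins=8):
    """Low-level color feature: a joint RGB histogram, L1-normalised.

    `image` is an HxWx3 uint8 array; 8 bins per channel is an
    illustrative choice, giving a 8*8*8 = 512-dimensional feature.
    """
    hist, _ = np.histogramdd(
        image.reshape(-1, 3).astype(np.float64),
        bins=(bins, bins, bins),
        range=((0, 256), (0, 256), (0, 256)),
    )
    hist = hist.flatten()
    return hist / hist.sum()

# Synthetic 32x32 RGB image standing in for an affective image.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
feat = color_histogram(img)
print(feat.shape)  # (512,)
```

Analogous descriptors (edge orientations for line and shape, filter-bank responses for texture) would round out the low-level feature set.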
State-of-the-art deep learning methods can achieve performance close to, or even above, human performance on some challenging tasks, such as face recognition and object detection. The Convolutional Neural Networks (CNNs) used in deep learning methods have hierarchical structures that, unlike hand-crafted features, can learn increasingly abstract concepts from local to global views. This observation suggests exploring the application of CNN structures to image emotion classification.

This thesis is based on three articles, which contribute to the field of image emotion classification. The first article is an in-depth analysis of the impact of emotional regions in images on image emotion classification. In the model, multi-scale blocks are first extracted from the image to cover different emotional regions. Then, in order to bridge the gap between low-level visual features and high-level emotions, a mid-level representation exploiting Probabilistic Latent Semantic Analysis (pLSA) is introduced to learn a set of mid-level representations as latent topics from affective images. Finally, Multiple Instance Learning (MIL), based on the multi-scale blocks extracted from an image, is employed to reduce the need for exact labeling and to analyze the image emotion. The experimental results demonstrate the effectiveness of emotional regions in image emotion classification.

However, one drawback of the method described in the first article is that the hand-crafted features it uses are valid only for limited domains of affective images. The experimental results show that the method does not perform as well on abstract paintings, whose emotion is mainly conveyed by low-level visual features, as on images that contain emotional content. CNNs can automatically learn generalized deep features for various kinds of affective images. Therefore, in the second article, we analyze the different levels of deep representations extracted from affective images using a CNN.
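The MIL step of the first article's pipeline can be sketched as follows: each image is treated as a "bag" of multi-scale blocks, each block is described by a topic vector (standing in for the pLSA mid-level representation), and the bag is scored by its most emotional block. The max-pooling aggregation, linear scorer, and all dimensions here are illustrative assumptions, not the thesis's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(42)
n_topics = 10  # hypothetical number of pLSA latent topics

def bag_score(block_topics, w, b):
    """MIL max-pooling: an image is positive if its most emotional block is.

    `block_topics` has shape (n_blocks, n_topics); a linear scorer is
    applied per block and the bag takes the maximum block score.
    """
    block_scores = block_topics @ w + b
    return block_scores.max()

# Toy scorer that responds to topic 0 standing in for an "emotional" topic.
w = np.zeros(n_topics)
w[0] = 1.0
b = -0.5

# Positive bag: one block strongly expresses the emotional topic.
pos_bag = rng.random((5, n_topics)) * 0.3
pos_bag[2, 0] = 0.9
# Negative bag: no block expresses it strongly.
neg_bag = rng.random((5, n_topics)) * 0.3

print(bag_score(pos_bag, w, b) > 0)  # True
print(bag_score(neg_bag, w, b) > 0)  # False
```

The point of the max aggregation is exactly the labeling relief described above: only the image-level label is needed, since a single strongly emotional block can make the whole bag positive.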
A comparison of CNN models with different modalities that exploit different levels of deep representations shows the significant improvement achieved by our proposed network, which fuses different levels of deep representations for image emotion recognition. In addition, a Recurrent Neural Network (RNN) with bidirectional Gated Recurrent Units (GRUs) can be added to model the correlations between different levels of deep representations, further improving the performance of the proposed model.

The last article proposes a new framework based on a Region-based CNN (RCNN) to integrate the different levels of deep representations extracted from both global and local views. The framework consists of a Feature Pyramid Network (FPN) to extract and fuse different levels of deep representations and an RCNN to detect emotional regions in the image. Moreover, the framework also considers the label noise present in the training dataset: an estimated emotion distribution, derived from the reliability of the image's emotion label, is used to improve the image emotion classification results. The integrated features and the new loss function that accounts for label noise help the framework achieve state-of-the-art performance in image emotion classification.

In summary, this thesis explores and develops a deep learning framework using region-based multi-level features for image emotion recognition, making significant steps towards the goal of efficiently recognizing emotion from visual content.
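Two ideas recurring in the second and third articles can be sketched in isolation: fusing deep representations from several CNN stages, and training against an estimated emotion distribution instead of a one-hot label to account for label noise. The feature widths, the projection-then-concatenation fusion, and the soft-label cross-entropy below are illustrative assumptions; the thesis's actual architecture and loss may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-level feature widths, standing in for activations taken
# from shallow, intermediate, and deep CNN stages.
level_dims = {"low": 64, "mid": 128, "high": 256}
common = 32

# Random projections to a common width (in a real model these are learned).
proj = {name: rng.standard_normal((dim, common)) * 0.01
        for name, dim in level_dims.items()}

def fuse(features):
    """Project each level to `common` dims and concatenate them."""
    parts = [features[name] @ proj[name] for name in level_dims]
    return np.concatenate(parts)

def soft_cross_entropy(logits, target_dist):
    """Cross-entropy against an estimated emotion distribution rather than
    a one-hot label -- one way to soften the impact of noisy labels."""
    log_probs = logits - np.log(np.exp(logits).sum())
    return -(target_dist * log_probs).sum()

feats = {name: rng.standard_normal(dim) for name, dim in level_dims.items()}
fused = fuse(feats)
print(fused.shape)  # (96,)

# A noisy "amusement" label softened into a distribution over 3 emotions.
logits = np.array([2.0, 0.5, -1.0])
target = np.array([0.7, 0.2, 0.1])
print(soft_cross_entropy(logits, target))
```

The softened target lowers the penalty when the annotator's label is unreliable, which is the intuition behind deriving the target distribution from label reliability.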