Deep representation learning for keypoint localization

Publication Type: Thesis
Issue Date: 2017
Keypoint localization aims to locate points of interest in an input image. This technique has become an important tool for many computer vision tasks such as fine-grained visual categorization, object detection, and pose estimation, and tremendous effort has therefore been devoted to improving its performance. However, most existing methods supervise keypoint detectors with a confidence map generated from ground-truth keypoint locations. Moreover, the maximum achievable localization accuracy differs from keypoint to keypoint, because it is determined by the underlying keypoint structure. A detector trained with strict supervision, that is, supervision permitting only a small localization error, therefore often fails to detect ambiguous keypoints. Training with looser supervision helps detect ambiguous keypoints, but at the cost of localization accuracy for keypoints with distinctive appearances.

In this thesis, we propose hierarchically supervised nets (HSNs), a method that imposes hierarchical supervision within deep convolutional neural networks (CNNs) for keypoint localization. To achieve this, we first propose a fully convolutional Inception network with several branches of varying depths to obtain hierarchical feature representations. We then build a coarse part detector on top of each branch and a fine part detector that takes features from all branches as its input.

Collecting image data with keypoint annotations is harder than collecting image-level labels. One can build a classification dataset by retrieving images from Flickr or Google Images with keyword searches and then refining the results, whereas keypoint annotation requires a human to click the rough location of each keypoint in every image. To address the problem of insufficient part annotations, we propose a part detection framework that combines deep representation learning and domain adaptation within a single training process. We adopt one of the coarse detectors from HSNs as the baseline and perform a quantitative evaluation on the CUB200-2011 and BirdSnap datasets. Interestingly, our method, trained on images of only 10 species, achieves 61.4% PCK accuracy on the test set of 190 unseen species.

Finally, we explore the application of keypoint localization to fine-grained visual categorization. We propose a new part-based model consisting of a localization module that detects object parts (the "where" pathway) and a classification module that classifies fine-grained categories at the subordinate level (the "what" pathway). Experimental results show that our method with keypoint localization achieves state-of-the-art performance on the Caltech-UCSD Birds-200-2011 dataset.
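To make the supervision issue concrete, the target confidence map is typically a 2-D Gaussian centred on the ground-truth keypoint, and its standard deviation controls how strict the supervision is. The sketch below (NumPy) illustrates this; the map size, keypoint position, and sigma values are illustrative assumptions, not the settings used in the thesis.

import numpy as np

def confidence_map(height, width, kx, ky, sigma):
    # Build a 2-D Gaussian target map centred on the ground-truth keypoint (kx, ky).
    ys, xs = np.mgrid[0:height, 0:width]
    d2 = (xs - kx) ** 2 + (ys - ky) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Strict supervision: a small sigma tolerates only tiny localization errors.
strict_target = confidence_map(64, 64, kx=20, ky=30, sigma=1.5)
# Loose supervision: a larger sigma accepts coarser localizations.
loose_target = confidence_map(64, 64, kx=20, ky=30, sigma=5.0)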
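The coarse/fine detector hierarchy can be sketched as below (PyTorch). This is a minimal illustration only: the branch depths, channel counts, and the simple convolutional trunk are placeholder assumptions standing in for the Inception-style branches described in the thesis.

import torch
import torch.nn as nn

class HierarchicalDetectorSketch(nn.Module):
    def __init__(self, in_ch=3, feat_ch=32, num_keypoints=15):
        super().__init__()
        # Two branches of different depths produce hierarchical feature representations.
        self.branch_shallow = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU())
        self.branch_deep = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU())
        # One coarse detector per branch, each predicting one confidence map per keypoint.
        self.coarse_shallow = nn.Conv2d(feat_ch, num_keypoints, 1)
        self.coarse_deep = nn.Conv2d(feat_ch, num_keypoints, 1)
        # The fine detector fuses features from all branches.
        self.fine = nn.Conv2d(2 * feat_ch, num_keypoints, 1)

    def forward(self, x):
        fs = self.branch_shallow(x)
        fd = self.branch_deep(x)
        coarse_maps = [self.coarse_shallow(fs), self.coarse_deep(fd)]
        fine_maps = self.fine(torch.cat([fs, fd], dim=1))
        return coarse_maps, fine_maps

Each coarse output and the fine output would be supervised with its own confidence-map target, which is the hierarchical supervision the abstract refers to.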
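The 61.4% figure above is reported in PCK (Percentage of Correct Keypoints). A common way to compute PCK is sketched below; normalizing the error by a fraction (alpha) of a per-sample reference size such as the bounding-box side is an assumption about the exact evaluation protocol.

import numpy as np

def pck(pred, gt, ref_size, alpha=0.1):
    # pred, gt: arrays of shape (N, 2) with predicted / ground-truth keypoint coordinates.
    # ref_size: array of shape (N,) with a reference size per sample (e.g. max bbox side).
    # A keypoint counts as correct if its error is within alpha * reference size.
    dist = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(dist <= alpha * ref_size))

# Toy usage: 2 of 3 keypoints fall within 0.1 * 100 pixels, so PCK is about 0.667.
pred = np.array([[10.0, 12.0], [40.0, 41.0], [70.0, 90.0]])
gt = np.array([[11.0, 12.0], [45.0, 40.0], [60.0, 95.0]])
print(pck(pred, gt, ref_size=np.array([100.0, 100.0, 100.0])))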