Learning Object Detection with Weak Supervision

Publication Type: Thesis
Issue Date: 2023
Abstract:
Deep learning has achieved astonishing success in many computer vision applications. However, training deep models typically requires large-scale datasets with elaborate annotations, and collecting and annotating such datasets is laborious, especially for challenging vision tasks such as object detection. A promising way to reduce this cost is to train models with weak supervision, which offers a good trade-off between model performance and annotation efficiency. This thesis is dedicated to weakly supervised learning in two object-centered application scenarios: general object detection and RGB-D salient object detection.

The first task is to predict the category of an object and its location in a given image using only image-level weak supervision. A pyramidal multiple instance detection network is first introduced to reduce the exposure of locally discriminative proposal regions, alleviating the local-optimum issue that arises when training detectors with only image-level annotations. Beyond learning detectors from image-level supervision alone, two more practical scenarios in weakly supervised object detection are considered. Given a well-annotated object detection dataset, this thesis further investigates how to scale detectors to novel domains or categories using weak supervision. Concretely, a holistic and hierarchical feature alignment R-CNN is presented to perform coarse-to-fine alignment in step with the detection pipeline, effectively reducing the discrepancy between domains under weak supervision. A cyclic self-training framework with a proposal weight modulation module is then introduced to compensate for the missing instance-level supervision of novel classes and to adaptively adjust loss weights for the training samples.

The second task is to predict pixel-level masks for foreground objects in paired RGB-D inputs (i.e., images and depth maps) with scribble-based weak supervision. This thesis explores annotator-friendly scribble annotations for training models. A dual-modal edge-guided network and a prediction consistency training method are developed to take full advantage of the complementary information from the two modalities and to exploit the information residing in unlabeled pixels, respectively.

Extensive experiments are conducted and analyzed to evaluate the effectiveness of the proposed approaches under weak supervision. Competitive performance on commonly used benchmarks verifies their effectiveness and generality.