Distilling DETR with Visual-Linguistic Knowledge for Open-Vocabulary Object Detection

Publisher:
IEEE
Publication Type:
Conference Proceeding
Citation:
Proceedings of the IEEE International Conference on Computer Vision, 2024, 00, pp. 6478-6487
Issue Date:
2024-01-01
Filename Description Size
1704446.pdfPublished version1.49 MB
Adobe PDF
Full metadata record
Current methods for open-vocabulary object detection (OVOD) rely on a pre-trained vision-language model (VLM) to acquire the recognition ability. In this paper, we propose a simple yet effective framework to Distill the Knowledge from the VLM to a DETR-like detector, termed DK-DETR. Specifically, we present two ingenious distillation schemes named semantic knowledge distillation (SKD) and relational knowledge distillation (RKD). To utilize the rich knowledge from the VLM systematically, SKD transfers the semantic knowledge explicitly, while RKD exploits implicit relationship information between objects. Furthermore, a distillation branch including a group of auxiliary queries is added to the detector to mitigate the negative effect on base categories. Equipped with SKD and RKD on the distillation branch, DK-DETR improves the detection performance of novel categories significantly and avoids disturbing the detection of base categories. Extensive experiments on LVIS and COCO datasets show that DK-DETR surpasses existing OVOD methods under the setting that the base-category supervision is solely available. The code and models are available at https://github.com/hikvision-research/opera.
Please use this identifier to cite or link to this item: