Distilling DETR with Visual-Linguistic Knowledge for Open-Vocabulary Object Detection

Li, L; Miao, J; Shi, D; Tan, W; Ren, Y; Yang, Y; Pu, S

Publisher: IEEE
Publication Type: Conference Proceeding
Citation: Proceedings of the IEEE International Conference on Computer Vision, 2024, 00, pp. 6478-6487
Issue Date: 2024-01-01

Closed Access

Full metadata record

Field	Value	Language
dc.contributor.author	Li, L
dc.contributor.author	Miao, J
dc.contributor.author	Shi, D
dc.contributor.author	Tan, W
dc.contributor.author	Ren, Y
dc.contributor.author	Yang, Y https://orcid.org/0000-0002-0512-880X
dc.contributor.author	Pu, S
dc.date	2023-10-01
dc.date.accessioned	2024-08-07T04:48:43Z
dc.date.available	2024-08-07T04:48:43Z
dc.date.issued	2024-01-01
dc.identifier.citation	Proceedings of the IEEE International Conference on Computer Vision, 2024, 00, pp. 6478-6487
dc.identifier.issn	1550-5499
dc.identifier.uri	http://hdl.handle.net/10453/180235
dc.language	en
dc.publisher	IEEE
dc.relation.ispartof	Proceedings of the IEEE International Conference on Computer Vision
dc.relation.ispartof	IEEE International Conference on Computer Vision
dc.relation.isbasedon	10.1109/ICCV51070.2023.00598
dc.rights	info:eu-repo/semantics/closedAccess
dc.title	Distilling DETR with Visual-Linguistic Knowledge for Open-Vocabulary Object Detection
dc.type	Conference Proceeding
utslib.citation.volume	00
utslib.location.activity	Paris, France
pubs.organisational-group	University of Technology Sydney
pubs.organisational-group	University of Technology Sydney/Faculty of Engineering and Information Technology
utslib.copyright.status	closed_access	*
pubs.consider-herdc	false
dc.date.updated	2024-08-07T04:48:41Z
pubs.finish-date	2023-10-06
pubs.place-of-publication	Piscataway, USA
pubs.publication-status	Published
pubs.start-date	2023-10-01
pubs.volume	00
dc.location	Piscataway, USA

Abstract:

Current methods for open-vocabulary object detection (OVOD) rely on a pre-trained vision-language model (VLM) to acquire recognition ability. In this paper, we propose a simple yet effective framework to Distill the Knowledge from the VLM to a DETR-like detector, termed DK-DETR. Specifically, we present two ingenious distillation schemes: semantic knowledge distillation (SKD) and relational knowledge distillation (RKD). To utilize the rich knowledge from the VLM systematically, SKD transfers the semantic knowledge explicitly, while RKD exploits implicit relationship information between objects. Furthermore, a distillation branch comprising a group of auxiliary queries is added to the detector to mitigate the negative effect on base categories. Equipped with SKD and RKD on the distillation branch, DK-DETR significantly improves detection performance on novel categories without disturbing the detection of base categories. Extensive experiments on the LVIS and COCO datasets show that DK-DETR surpasses existing OVOD methods in the setting where only base-category supervision is available. The code and models are available at https://github.com/hikvision-research/opera.
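
The abstract does not give the loss formulas, so the following is only an illustrative sketch of the difference between the two kinds of distillation it names, not the DK-DETR implementation. It assumes that "semantic" distillation can be read as aligning each detector embedding with the matching VLM embedding, and "relational" distillation as matching the pairwise similarity structure among objects. The function names (`semantic_loss`, `relational_loss`), the cosine and L1 choices, and the toy vectors are all assumptions made for this example.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine_matrix(embs):
    """Pairwise cosine similarities between all embeddings in a list."""
    unit = [l2_normalize(e) for e in embs]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]


def semantic_loss(student, teacher):
    """Assumed SKD-style term: mean (1 - cosine) between each student
    embedding and the teacher embedding of the same object."""
    total = 0.0
    for s, t in zip(student, teacher):
        s_u, t_u = l2_normalize(s), l2_normalize(t)
        total += 1.0 - sum(a * b for a, b in zip(s_u, t_u))
    return total / len(student)


def relational_loss(student, teacher):
    """Assumed RKD-style term: mean absolute difference between the
    student's and the teacher's object-to-object similarity matrices."""
    s_sim, t_sim = cosine_matrix(student), cosine_matrix(teacher)
    n = len(student)
    diff = sum(abs(s_sim[i][j] - t_sim[i][j]) for i in range(n) for j in range(n))
    return diff / (n * n)


# Toy data (hypothetical): two objects with 2-D embeddings.
teacher = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[2.0, 0.0], [0.0, 3.0]]    # same directions as the teacher
rotated = [[0.0, 1.0], [-1.0, 0.0]]   # teacher rotated by 90 degrees

print(semantic_loss(aligned, teacher), relational_loss(aligned, teacher))
print(semantic_loss(rotated, teacher), relational_loss(rotated, teacher))
```

In this toy setup the rotated student keeps the same relations between its two objects as the teacher, so the relational term is zero, while the semantic term is large because no individual embedding points in its teacher's direction. That is one way to see why an explicit per-object signal and an implicit relational signal can complement each other.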

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/180235