GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training
- Publisher:
- IEEE
- Publication Type:
- Conference Proceeding
- Citation:
- 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2024, 00, pp. 22121-22132
- Issue Date:
- 2024-01-15
Closed Access
Filename | Description | Size
---|---|---
1700586.pdf | Published version | 1.29 MB
This item is closed access and not available.
Cross-modal pre-training has shown impressive performance on a wide range of downstream tasks, benefiting from massive image-text pairs collected from the Internet. In practice, online data grow constantly, highlighting the importance of a pre-trained model's ability to learn from continuously growing data. Existing works on cross-modal pre-training mainly focus on training a network with a fixed architecture. However, it is impractical to limit model capacity when considering the continuously growing nature of pre-training data in real-world applications. On the other hand, it is important to utilize the knowledge in the current model to obtain efficient training and better performance. To address the above issues, in this paper we propose GrowCLIP, a data-driven automatic model-growing algorithm for contrastive language-image pre-training with continuous image-text pairs as input. Specifically, we adopt a dynamic growth space and seek out the optimal architecture at each growth step to adapt to online learning scenarios. A shared encoder is proposed in our growth space to enhance the degree of cross-modal fusion. Besides, we explore the effect of growth in different dimensions, which could provide future references for the design of cross-modal model architectures. Finally, we employ parameter inheriting with momentum (PIM) to maintain the previous knowledge and address the local-minimum dilemma. Compared with existing methods, GrowCLIP improves average top-1 accuracy by 2.3% on zero-shot image classification over 9 downstream tasks. For zero-shot image retrieval, GrowCLIP improves top-1 image-to-text recall by 1.2% on the Flickr30K dataset.
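The abstract's idea of growing a model while inheriting previous knowledge can be illustrated with a minimal sketch. This is a hypothetical construction, not the paper's actual PIM rule: it assumes growing a weight matrix to a larger shape, copying the old weights into the overlapping block, and blending them with a fresh random initialization via a momentum coefficient so that prior knowledge is retained while the new capacity can escape the old local minimum.

```python
import numpy as np

def inherit_with_momentum(old_params, grown_shape, momentum=0.9, rng=None):
    """Hypothetical sketch of momentum-based parameter inheritance.

    The grown weight matrix starts from a small random initialization;
    the block overlapping the old matrix is blended as
    momentum * old + (1 - momentum) * fresh_init. The actual PIM rule
    in GrowCLIP may differ from this simplification.
    """
    rng = rng or np.random.default_rng(0)
    # Fresh initialization for the newly added capacity.
    grown = rng.normal(scale=0.02, size=grown_shape)
    rows, cols = old_params.shape
    # Inherit old knowledge in the overlapping block, softened by momentum.
    grown[:rows, :cols] = momentum * old_params + (1 - momentum) * grown[:rows, :cols]
    return grown

# Grow a 2x2 weight matrix of ones into a 4x4 matrix.
old = np.ones((2, 2))
new = inherit_with_momentum(old, (4, 4), momentum=0.9)
print(new.shape)  # (4, 4)
```

With momentum close to 1, the overlapping block stays near the inherited weights, while the surrounding entries are small random values ready to be trained on the next batch of image-text pairs.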