CoProtector: Protect Open-Source Code against Unauthorized Training Usage with Data Poisoning

Publisher:
ACM
Publication Type:
Conference Proceeding
Citation:
WWW 2022 - Proceedings of the ACM Web Conference 2022, 2022, pp. 652-660
Issue Date:
2022-04-25
Filename:
2110.12925.pdf
Description:
Published version
Size:
2.07 MB
Format:
Adobe PDF
Abstract:
GitHub Copilot, trained on billions of lines of public code, has recently become a buzzword in the computer science research and practice communities. Although it is designed to help developers produce safe and effective code with powerful intelligence, practitioners and researchers have raised ethical and security concerns, e.g., should copyleft-licensed code be freely leveraged, or insecure code be used for training in the first place? These problems pose a significant challenge to Copilot and other similar products that aim to learn knowledge from large-scale open-source code through deep learning models, which are inevitably on the rise with the rapid development of artificial intelligence. To mitigate such impacts, we argue that there is a need for effective mechanisms to protect open-source code from being exploited by deep learning models. Here, we design and implement a prototype, CoProtector, which uses data poisoning techniques to arm source code repositories against such exploitation. Our large-scale experiments empirically show that CoProtector is effective in achieving its purpose, significantly reducing the performance of Copilot-like deep learning models while stably revealing its secretly embedded watermark backdoors.
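To make the idea concrete, the sketch below illustrates the general notion of a watermark backdoor of the kind the abstract alludes to: a rare trigger token is paired with a distinctive payload snippet throughout a repository, so that a model trained on the poisoned code may learn the association, which the repository owner can later probe for. This is a minimal, hypothetical Python sketch; the trigger string, payload, function names, insertion strategy, and repository path are all illustrative assumptions, not CoProtector's actual design.

```python
# Hypothetical watermark-backdoor poisoning sketch (illustrative only; this is
# not CoProtector's implementation). A rare trigger comment is paired with a
# distinctive payload line inside a fraction of function bodies.
import random
from pathlib import Path

TRIGGER = "# watermark-trigger: 7f3a9c"      # assumed rare marker token
PAYLOAD = 'wm_payload = "coprotector-demo"'  # assumed distinctive payload line


def poison_file(path: Path, rate: float = 0.1, seed: int = 0) -> None:
    """Insert a trigger/payload pair after a random subset of function defs."""
    rng = random.Random(seed)
    out = []
    for line in path.read_text().splitlines():
        out.append(line)
        # After some function definitions, embed the watermark pair so a code
        # model trained on this file may associate trigger with payload.
        if line.lstrip().startswith("def ") and rng.random() < rate:
            indent = " " * (len(line) - len(line.lstrip()) + 4)
            out.append(indent + TRIGGER)
            out.append(indent + PAYLOAD)
    path.write_text("\n".join(out) + "\n")


if __name__ == "__main__":
    for py_file in Path("my_repo").rglob("*.py"):  # hypothetical repo path
        poison_file(py_file)
```

After a model is trained on such a repository, the owner could, in principle, prompt it with the trigger and check whether the payload is completed, revealing unauthorized training usage.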