You see what I want you to see: poisoning vulnerabilities in neural code search

Wan, Y; Zhang, S; Zhang, H; Sui, Y; Xu, G; Yao, D; Jin, H; Sun, L

You see what I want you to see: poisoning vulnerabilities in neural code search

Wan, Y Zhang, S Zhang, H Sui, Y

Xu, G

Yao, D Jin, H Sun, L

Permalink

Publisher:: ACM
Publication Type:: Conference Proceeding
Citation:: ESEC/FSE 2022 - Proceedings of the 30th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2022, pp. 1233-1245
Issue Date:: 2022-11-07

Open Access

Copyright Clearance Process

Recently Added
In Progress
Open Access

This item is open access.

Download Accepted versionAdobe PDF (1.26 MB)

View on publisher's site

View statistics

Full metadata record

Field	Value	Language
dc.contributor.author	Wan, Y
dc.contributor.author	Zhang, S
dc.contributor.author	Zhang, H
dc.contributor.author	Sui, Y https://orcid.org/0000-0002-9510-6574
dc.contributor.author	Xu, G https://orcid.org/0000-0003-4493-6663
dc.contributor.author	Yao, D
dc.contributor.author	Jin, H
dc.contributor.author	Sun, L
dc.date.accessioned	2023-01-12T12:05:19Z
dc.date.available	2023-01-12T12:05:19Z
dc.date.issued	2022-11-07
dc.identifier.citation	ESEC/FSE 2022 - Proceedings of the 30th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2022, pp. 1233-1245
dc.identifier.isbn	9781450394130
dc.identifier.uri	http://hdl.handle.net/10453/164890
dc.description.abstract	Searching and reusing code snippets from open-source software repositories based on natural-language queries can greatly improve programming productivity.Recently, deep-learning-based approaches have become increasingly popular for code search. Despite substantial progress in training accurate models of code search, the robustness of these models has received little attention so far. In this paper, we aim to study and understand the security and robustness of code search models by answering the following question: Can we inject backdoors into deep-learning-based code search models? If so, can we detect poisoned data and remove these backdoors? This work studies and develops a series of backdoor attacks on the deep-learning-based models for code search, through data poisoning. We first show that existing models are vulnerable to data-poisoning-based backdoor attacks. We then introduce a simple yet effective attack on neural code search models by poisoning their corresponding training dataset. Moreover, we demonstrate that attacks can also influence the ranking of the code search results by adding a few specially-crafted source code files to the training corpus. We show that this type of backdoor attack is effective for several representative deep-learning-based code search systems, and can successfully manipulate the ranking list of searching results. Taking the bidirectional RNN-based code search system as an example, the normalized ranking of the target candidate can be significantly raised from top 50% to top 4.43%, given a query containing an attacker targeted word, e.g., file. To defend a model against such attack, we empirically examine an existing popular defense strategy and evaluate its performance. Our results show the explored defense strategy is not yet effective in our proposed backdoor attack for code search systems.
dc.language	en
dc.publisher	ACM
dc.relation.ispartof	ESEC/FSE 2022 - Proceedings of the 30th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering
dc.relation.ispartof	Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering
dc.relation.isbasedon	10.1145/3540250.3549153
dc.rights	info:eu-repo/semantics/openAccess
dc.rights	"© ACM 2022. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ESEC/FSE 2022 - Proceedings of the 30th ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2022, pp. 1233-1245 https://dl.acm.org/doi/10.1145/3540250.3549153 "
dc.title	You see what I want you to see: poisoning vulnerabilities in neural code search
dc.type	Conference Proceeding
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAI - Advanced Analytics Institute Research Centre
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
utslib.copyright.status	open_access	*
dc.date.updated	2023-01-12T12:05:18Z
pubs.publication-status	Published

Abstract:

Searching and reusing code snippets from open-source software repositories based on natural-language queries can greatly improve programming productivity.Recently, deep-learning-based approaches have become increasingly popular for code search. Despite substantial progress in training accurate models of code search, the robustness of these models has received little attention so far. In this paper, we aim to study and understand the security and robustness of code search models by answering the following question: Can we inject backdoors into deep-learning-based code search models? If so, can we detect poisoned data and remove these backdoors? This work studies and develops a series of backdoor attacks on the deep-learning-based models for code search, through data poisoning. We first show that existing models are vulnerable to data-poisoning-based backdoor attacks. We then introduce a simple yet effective attack on neural code search models by poisoning their corresponding training dataset. Moreover, we demonstrate that attacks can also influence the ranking of the code search results by adding a few specially-crafted source code files to the training corpus. We show that this type of backdoor attack is effective for several representative deep-learning-based code search systems, and can successfully manipulate the ranking list of searching results. Taking the bidirectional RNN-based code search system as an example, the normalized ranking of the target candidate can be significantly raised from top 50% to top 4.43%, given a query containing an attacker targeted word, e.g., file. To defend a model against such attack, we empirically examine an existing popular defense strategy and evaluate its performance. Our results show the explored defense strategy is not yet effective in our proposed backdoor attack for code search systems.

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/164890