Disentangled Code Representation Learning for Multiple Programming Languages

Zhang, J; Hong, H; Zhang, Y; Wan, Y; Liu, Y; Sui, Y

Publication Type: Conference Proceeding
Citation: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 4454-4466
Issue Date: 2021-01-01

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Zhang, J
dc.contributor.author	Hong, H
dc.contributor.author	Zhang, Y
dc.contributor.author	Wan, Y
dc.contributor.author	Liu, Y
dc.contributor.author	Sui, Y https://orcid.org/0000-0002-9510-6574
dc.date.accessioned	2022-05-20T14:09:29Z
dc.date.available	2022-05-20T14:09:29Z
dc.date.issued	2021-01-01
dc.identifier.citation	Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 4454-4466
dc.identifier.isbn	9781954085541
dc.identifier.uri	http://hdl.handle.net/10453/157579
dc.language	en
dc.relation.ispartof	Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
dc.rights	info:eu-repo/semantics/openAccess
dc.title	Disentangled Code Representation Learning for Multiple Programming Languages
dc.type	Conference Proceeding
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
utslib.copyright.status	open_access	*
dc.date.updated	2022-05-20T14:09:25Z
pubs.publication-status	Published

Abstract:

Developing effective distributed representations of source code is fundamental yet challenging for many software engineering tasks such as code clone detection, code search, code translation and transformation. However, current code embedding approaches that represent the semantics and syntax of code in a mixed way are less interpretable, and the resulting embeddings cannot be easily generalized across programming languages. In this paper, we propose a disentangled code representation learning approach to separate the semantics from the syntax of source code under a multi-programming-language setting, obtaining better interpretability and generalizability. Specifically, we design three losses dedicated to the characteristics of source code to enforce the disentanglement effectively. We conduct comprehensive experiments on a real-world dataset composed of programming exercises, each implemented by multiple solutions that are semantically identical but grammatically distinct. The experimental results validate the superiority of our proposed disentangled code representation, compared to several baselines, across three types of downstream tasks, i.e., code clone detection, code translation, and code-to-code search.

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/157579