Cross-Language Binary-Source Code Matching with Intermediate Representations

Gui, Y; Wan, Y; Zhang, H; Huang, H; Sui, Y; Xu, G; Shao, Z; Jin, H

Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Publication Type: Conference Proceeding
Citation: Proceedings - 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2022, 2022, 00, pp. 601-612
Issue Date: 2022-01-01

This item is open access.

The embargo period expires on 31 Jul 2024

Full metadata record

Field	Value	Language
dc.contributor.author	Gui, Y
dc.contributor.author	Wan, Y
dc.contributor.author	Zhang, H
dc.contributor.author	Huang, H
dc.contributor.author	Sui, Y https://orcid.org/0000-0002-9510-6574
dc.contributor.author	Xu, G https://orcid.org/0000-0003-4493-6663
dc.contributor.author	Shao, Z
dc.contributor.author	Jin, H
dc.date	2022-03-15
dc.date.accessioned	2023-03-24T03:28:37Z
dc.date.available	2023-03-24T03:28:37Z
dc.date.issued	2022-01-01
dc.identifier.citation	Proceedings - 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2022, 2022, 00, pp. 601-612
dc.identifier.isbn	9781665437868
dc.identifier.uri	http://hdl.handle.net/10453/168313
dc.description.abstract	Binary-source code matching plays an important role in many security and software engineering tasks, such as malware detection, reverse engineering, and vulnerability assessment. Several approaches have been proposed for binary-source code matching that jointly learn embeddings of binary code and source code in a common vector space. Despite much effort, existing approaches target matching binary code and source code written in a single programming language. In practice, however, software applications are often written in different programming languages to cater for different requirements and computing platforms. Matching binary and source code across programming languages introduces additional challenges when maintaining multi-language and multi-platform applications. To this end, this paper formulates the problem of cross-language binary-source code matching and develops a new dataset for it. We present a novel approach, XLIR, a Transformer-based neural network that learns intermediate representations for both binary and source code. To validate the effectiveness of XLIR, comprehensive experiments are conducted on two tasks, cross-language binary-source code matching and cross-language source-source code matching, on top of our curated dataset. Experimental results and analysis show that XLIR with intermediate representations significantly outperforms other state-of-the-art models on both tasks.
dc.language	en
dc.publisher	Institute of Electrical and Electronics Engineers (IEEE)
dc.relation.ispartof	Proceedings - 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2022
dc.relation.ispartof	2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER)
dc.relation.isbasedon	10.1109/SANER53432.2022.00077
dc.rights	info:eu-repo/semantics/embargoedAccess
dc.title	Cross-Language Binary-Source Code Matching with Intermediate Representations
dc.type	Conference Proceeding
utslib.citation.volume	00
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAI - Advanced Analytics Institute Research Centre
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
utslib.copyright.status	open_access	*
utslib.copyright.embargo	2024-07-31T00:00:00+1000Z
dc.date.updated	2023-03-24T03:28:36Z
pubs.finish-date	2022-03-18
pubs.publication-status	Published
pubs.start-date	2022-03-15
pubs.volume	00

Abstract:

Binary-source code matching plays an important role in many security and software engineering tasks, such as malware detection, reverse engineering, and vulnerability assessment. Several approaches have been proposed for binary-source code matching that jointly learn embeddings of binary code and source code in a common vector space. Despite much effort, existing approaches target matching binary code and source code written in a single programming language. In practice, however, software applications are often written in different programming languages to cater for different requirements and computing platforms. Matching binary and source code across programming languages introduces additional challenges when maintaining multi-language and multi-platform applications. To this end, this paper formulates the problem of cross-language binary-source code matching and develops a new dataset for it. We present a novel approach, XLIR, a Transformer-based neural network that learns intermediate representations for both binary and source code. To validate the effectiveness of XLIR, comprehensive experiments are conducted on two tasks, cross-language binary-source code matching and cross-language source-source code matching, on top of our curated dataset. Experimental results and analysis show that XLIR with intermediate representations significantly outperforms other state-of-the-art models on both tasks.
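
As an illustration of what matching "in a common vector space" can look like at retrieval time, the following is a minimal sketch, not the XLIR model itself. It assumes some encoder has already mapped a binary function and several source files to fixed-length vectors; the toy vectors, the file names, and the helper names `cosine` and `rank_candidates` are hypothetical and do not come from the paper. Candidates are simply ranked by cosine similarity to the query embedding.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(query, candidates):
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scored = [(name, cosine(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings: in a real system these would come from a trained
# encoder applied to binary code and to source code in different languages.
binary_embedding = [0.9, 0.1, 0.3]
source_embeddings = {
    "sort.java": [0.8, 0.2, 0.4],
    "hash.c": [0.1, 0.9, 0.2],
    "parse.cpp": [0.3, 0.3, 0.9],
}

ranking = rank_candidates(binary_embedding, source_embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With these toy vectors, the source file whose embedding points in nearly the same direction as the binary embedding is ranked first; how well this works in practice depends entirely on the quality of the learned embeddings.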

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/168313