Path-sensitive code embedding via contrastive learning for software vulnerability detection

Cheng, X; Zhang, G; Wang, H; Sui, Y

Publisher: Association for Computing Machinery (ACM)
Publication Type: Conference Proceeding
Citation: ISSTA 2022 - Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, 2022, pp. 519-531
Issue Date: 2022-07-18

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Cheng, X
dc.contributor.author	Zhang, G
dc.contributor.author	Wang, H
dc.contributor.author	Sui, Y https://orcid.org/0000-0002-9510-6574
dc.date.accessioned	2023-02-24T03:26:17Z
dc.date.available	2023-02-24T03:26:17Z
dc.date.issued	2022-07-18
dc.identifier.citation	ISSTA 2022 - Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, 2022, pp. 519-531
dc.identifier.isbn	9781450393799
dc.identifier.uri	http://hdl.handle.net/10453/166413
dc.language	en
dc.publisher	Association for Computing Machinery (ACM)
dc.relation.ispartof	ISSTA 2022 - Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis
dc.relation.isbasedon	10.1145/3533767.3534371
dc.rights	info:eu-repo/semantics/openAccess
dc.rights	2022 ACM. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ISSTA 2022: Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, July 2022, Pages 519–531, https://doi.org/10.1145/3533767.3534371
dc.title	Path-sensitive code embedding via contrastive learning for software vulnerability detection
dc.type	Conference Proceeding
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAII - Australian Artificial Intelligence Institute
utslib.copyright.status	open_access	*
dc.date.updated	2023-02-24T03:26:16Z
pubs.publication-status	Published

Abstract:

Machine learning and its promising branch, deep learning, have shown success in a wide range of application domains. Recently, much effort has been expended on applying deep learning techniques (e.g., graph neural networks) to static vulnerability detection as an alternative to conventional bug detection methods. To obtain the structural information of code, current learning approaches typically abstract a program in the form of graphs (e.g., data-flow graphs, abstract syntax trees) and then train an underlying classification model on the (sub)graphs of safe and vulnerable code fragments for vulnerability prediction. However, these models are still insufficient for precise bug detection, because their objective is to produce classification results rather than to comprehend the semantics of vulnerabilities, e.g., to pinpoint the bug-triggering paths that are essential for static bug detection. This paper presents ContraFlow, a selective yet precise contrastive value-flow embedding approach to statically detect software vulnerabilities. The novelty of ContraFlow lies in selecting and preserving feasible value-flow (aka program dependence) paths through a pretrained path embedding model using self-supervised contrastive learning, thus significantly reducing the amount of labeled data required for training expensive downstream models for path-based vulnerability detection. We evaluated ContraFlow on 288 real-world projects, comparing it against eight recent learning-based approaches. ContraFlow outperforms these eight baselines by up to 334.1%, 317.9%, and 58.3% for informedness, markedness, and F1 score, respectively, and achieves up to 450.0%, 192.3%, and 450.0% improvement for mean statement recall, mean statement precision, and mean IoU, respectively, in locating buggy statements.
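
The evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. It assumes the textbook definitions (informedness = TPR + TNR - 1, markedness = PPV + NPV - 1, and set-based recall, precision, and IoU over statement line numbers); the paper's exact formulations and averaging scheme may differ. The class `Confusion`, the function names, and the sample counts are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Binary confusion-matrix counts (hypothetical container, not from the paper)."""
    tp: int
    fp: int
    tn: int
    fn: int


def _ratio(numerator: float, denominator: float) -> float:
    # Return 0.0 instead of raising when a denominator is empty.
    return numerator / denominator if denominator else 0.0


def informedness(c: Confusion) -> float:
    # True positive rate + true negative rate - 1.
    return _ratio(c.tp, c.tp + c.fn) + _ratio(c.tn, c.tn + c.fp) - 1.0


def markedness(c: Confusion) -> float:
    # Positive predictive value + negative predictive value - 1.
    return _ratio(c.tp, c.tp + c.fp) + _ratio(c.tn, c.tn + c.fn) - 1.0


def f1_score(c: Confusion) -> float:
    # Harmonic mean of precision and recall.
    return _ratio(2 * c.tp, 2 * c.tp + c.fp + c.fn)


def statement_recall(predicted: set, actual: set) -> float:
    # Share of truly buggy statements that were reported.
    return _ratio(len(predicted & actual), len(actual))


def statement_precision(predicted: set, actual: set) -> float:
    # Share of reported statements that are truly buggy.
    return _ratio(len(predicted & actual), len(predicted))


def statement_iou(predicted: set, actual: set) -> float:
    # Intersection over union of reported and truly buggy statements.
    return _ratio(len(predicted & actual), len(predicted | actual))


if __name__ == "__main__":
    counts = Confusion(tp=40, fp=10, tn=45, fn=5)   # made-up counts
    reported, buggy = {3, 4, 7}, {4, 7, 9}          # made-up line numbers
    print(informedness(counts), markedness(counts), f1_score(counts))
    print(statement_recall(reported, buggy),
          statement_precision(reported, buggy),
          statement_iou(reported, buggy))
```

Under these assumed definitions, the "mean" variants in the abstract would be the per-sample statement scores averaged over all evaluated samples.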

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/166413