Heterogeneous Metric Learning with Content-Based Regularization for Software Artifact Retrieval

Wu, L; Du, L; Liu, B; Xu, G; Ge, Y; Fu, Y; Li, J; Zhou, Y; Xiong, H

Publisher: IEEE
Publication Type: Conference
Citation: Proceedings of the 2014 IEEE International Conference on Data Mining (ICDM), 2014, pp. 610 - 619
Issue Date: 2014-12-14

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Wu, L
dc.contributor.author	Du, L
dc.contributor.author	Liu, B
dc.contributor.author	Xu, G
dc.contributor.author	Ge, Y
dc.contributor.author	Fu, Y
dc.contributor.author	Li, J
dc.contributor.author	Zhou, Y
dc.contributor.author	Xiong, H
dc.contributor.editor	Kumar, R
dc.date	2014-12-14
dc.date.accessioned	2015-05-01T01:05:31Z
dc.date.issued	2014-12-14
dc.identifier.citation	Proceedings of the 2014 IEEE International Conference on Data Mining (ICDM), 2014, pp. 610 - 619
dc.identifier.isbn	978-1-4799-4303-6
dc.identifier.issn	1550-4786
dc.identifier.uri	http://hdl.handle.net/10453/35464
dc.description.abstract	The goal of software artifact retrieval is to effectively locate software artifacts, such as a piece of source code, in a large code repository. This problem has traditionally been addressed through textual queries. In other words, information retrieval techniques are applied based on the textual similarity between queries and the textual representation of software artifacts, which is generated by collecting words from the comments, identifiers, and descriptions of programs. However, in addition to this semantic information, there is rich information embedded in the source code itself. This source code, if analyzed properly, can be a rich source for enhancing software artifact retrieval. To this end, in this paper, we develop a feature extraction method for source code. Specifically, this method can capture both the inherent information in the source code and the semantic information hidden in its comments, descriptions, and identifiers. Moreover, we design a heterogeneous metric learning approach that integrates code features and text features into the same latent semantic space. This, in turn, helps to measure artifact similarity by exploiting the joint power of both code and text features. Finally, extensive experiments on real-world data show that the proposed method improves the performance of software artifact retrieval by a significant margin.
dc.publisher	IEEE
dc.relation.ispartof	ICDM
dc.relation.ispartof	Proceedings of the 2014 IEEE International Conference on Data Mining (ICDM)
dc.relation.ispartof	2014 IEEE International Conference on Data Mining (ICDM)
dc.relation.hasversion	Accepted manuscript version	en_US
dc.relation.isbasedon	10.1109/ICDM.2014.147
dc.relation.isreplacedby	10453/33737
dc.relation.isreplacedby	http://hdl.handle.net/10453/33737
dc.rights	© 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.	en_US
dc.title	Heterogeneous Metric Learning with Content-Based Regularization for Software Artifact Retrieval
dc.type	Conference
utslib.location	Piscataway, USA
dc.location.activity	Shenzhen, China
pubs.embargo.period	Not known
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
utslib.copyright.status	open_access
pubs.declined	1970-01-01T00:00:00.0+1000
pubs.consider-herdc	true
pubs.merge-to	10453/33737
pubs.merge-to	http://hdl.handle.net/10453/33737
pubs.deleted	1970-01-01T00:00:00.0+1000

Abstract:

The goal of software artifact retrieval is to effectively locate software artifacts, such as a piece of source code, in a large code repository. This problem has traditionally been addressed through textual queries. In other words, information retrieval techniques are applied based on the textual similarity between queries and the textual representation of software artifacts, which is generated by collecting words from the comments, identifiers, and descriptions of programs. However, in addition to this semantic information, there is rich information embedded in the source code itself. This source code, if analyzed properly, can be a rich source for enhancing software artifact retrieval. To this end, in this paper, we develop a feature extraction method for source code. Specifically, this method can capture both the inherent information in the source code and the semantic information hidden in its comments, descriptions, and identifiers. Moreover, we design a heterogeneous metric learning approach that integrates code features and text features into the same latent semantic space. This, in turn, helps to measure artifact similarity by exploiting the joint power of both code and text features. Finally, extensive experiments on real-world data show that the proposed method improves the performance of software artifact retrieval by a significant margin.
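
The idea of placing code features and text features in one latent space can be illustrated with a minimal sketch. This is not the authors' method: the abstract does not give the learning objective, so the projection matrices below (W_code, W_text) are simply assumed to be already learned, and the function names, the mixing weight alpha, and the use of cosine similarity are all hypothetical choices made for illustration.

```python
import numpy as np


def project(features, matrix):
    """Map row-wise feature vectors into the shared latent space."""
    return features @ matrix


def cosine_scores(query_vec, artifact_vecs):
    """Cosine similarity between one query vector and each artifact row."""
    eps = 1e-12  # guards against division by zero for all-zero vectors
    q = query_vec / (np.linalg.norm(query_vec) + eps)
    norms = np.linalg.norm(artifact_vecs, axis=1, keepdims=True) + eps
    return (artifact_vecs / norms) @ q


def rank_artifacts(query_text, code_feats, text_feats, W_code, W_text, alpha=0.5):
    """Rank artifacts against a textual query in the latent space.

    Assumptions (not from the paper): W_code and W_text are given, and an
    artifact's latent vector is a weighted sum of its projected code and
    text features, with alpha as the weight on the code side.
    """
    q = project(query_text[None, :], W_text)[0]
    latent = alpha * project(code_feats, W_code) \
        + (1.0 - alpha) * project(text_feats, W_text)
    scores = cosine_scores(q, latent)
    order = np.argsort(-scores)  # best match first
    return order, scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_artifacts, d_code, d_text, k = 5, 8, 6, 3
    # Random matrices stand in for projections that would be learned.
    W_code = rng.normal(size=(d_code, k))
    W_text = rng.normal(size=(d_text, k))
    code_feats = rng.normal(size=(n_artifacts, d_code))
    text_feats = rng.normal(size=(n_artifacts, d_text))
    query = rng.normal(size=d_text)
    order, scores = rank_artifacts(query, code_feats, text_feats, W_code, W_text)
    print("ranking:", order)
    print("scores:", np.round(scores, 3))
```

In a real system the two matrices would come from the metric learning step, and the retrieval side would only need the projection and scoring shown here.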

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/35464