Detecting Duplicate Questions in Stack Overflow via Source Code Modeling

Gao, W; Wu, J; Xu, G

Publisher: World Scientific Publishing
Publication Type: Journal Article
Citation: International Journal of Software Engineering and Knowledge Engineering, 2022, 32, (2), pp. 227-255
Issue Date: 2022-02-01

Closed Access

Filename	Description	Size
20763134_10667041540005671.pdf	Published version	3.77 MB

This item is closed access and not available.

Full metadata record

Field	Value	Language
dc.contributor.author	Gao, W
dc.contributor.author	Wu, J
dc.contributor.author	Xu, G https://orcid.org/0000-0003-4493-6663
dc.date.accessioned	2023-05-14T05:53:23Z
dc.date.available	2023-05-14T05:53:23Z
dc.date.issued	2022-02-01
dc.identifier.citation	International Journal of Software Engineering and Knowledge Engineering, 2022, 32, (2), pp. 227-255
dc.identifier.issn	0218-1940
dc.identifier.issn	1793-6403
dc.identifier.uri	http://hdl.handle.net/10453/170329
dc.language	en
dc.publisher	World Scientific Publishing
dc.relation.ispartof	International Journal of Software Engineering and Knowledge Engineering
dc.relation.isbasedon	10.1142/S0218194022500073
dc.rights	info:eu-repo/semantics/closedAccess
dc.subject	0801 Artificial Intelligence and Image Processing, 0803 Computer Software
dc.subject.classification	Software Engineering
dc.title	Detecting Duplicate Questions in Stack Overflow via Source Code Modeling
dc.type	Journal Article
utslib.citation.volume	32
utslib.for	0801 Artificial Intelligence and Image Processing
utslib.for	0803 Computer Software
pubs.organisational-group	/University of Technology Sydney
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology
pubs.organisational-group	/University of Technology Sydney/Strength - AAI - Advanced Analytics Institute Research Centre
pubs.organisational-group	/University of Technology Sydney/Faculty of Engineering and Information Technology/School of Computer Science
utslib.copyright.status	closed_access	*
dc.date.updated	2023-05-14T05:53:20Z
pubs.issue	2
pubs.publication-status	Published
pubs.volume	32
utslib.citation.issue	2

Abstract:

Stack Overflow is one of the most popular question-answering sites for programmers. However, it suffers from question duplication, where newly created questions duplicate previously asked ones. Existing work on duplicate question detection in Stack Overflow extracts a set of textual features from question pairs and uses supervised learning to classify the pairs as duplicates or not. However, this work does not consider the source code in the questions, even though in some cases the intent of a question is conveyed mainly by its source code. In this paper, we aim to learn the semantics of a question by combining text features and source code features. We use word embeddings and convolutional neural networks to extract textual features from questions and overcome the lexical gap. We use tree-based convolutional neural networks to extract structural and semantic features from source code. In addition, we perform multi-task learning by combining the duplicate question detection task with a question tag prediction side task. We conduct extensive experiments on the Stack Overflow dataset and show that our approach detects duplicate questions with higher recall and MRR than baseline approaches for the Python and Java programming languages.
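
The abstract reports results in terms of recall and MRR but does not define them. The sketch below shows one common way these ranking metrics are computed for duplicate retrieval. It is an illustration only, not the authors' evaluation code: the function names, the toy question IDs, and the choice of a "hit within the top k" definition of recall are all assumptions.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: Dict[str, List[str]],
                relevant: Dict[str, Set[str]],
                k: int) -> float:
    """Fraction of queries with at least one true duplicate in the top k.

    Assumed definition: a query counts as a hit if any of its known
    duplicates appears among its first k ranked candidates.
    """
    hits = sum(
        1
        for query, candidates in ranked.items()
        if any(c in relevant[query] for c in candidates[:k])
    )
    return hits / len(ranked)


def mean_reciprocal_rank(ranked: Dict[str, List[str]],
                         relevant: Dict[str, Set[str]]) -> float:
    """Average of 1/rank of the first true duplicate (0 if none is ranked)."""
    total = 0.0
    for query, candidates in ranked.items():
        for rank, candidate in enumerate(candidates, start=1):
            if candidate in relevant[query]:
                total += 1.0 / rank
                break
    return total / len(ranked)


# Hypothetical toy data: each new question maps to a ranked list of
# candidate older questions, best match first.
ranked = {
    "q1": ["a", "b", "c"],   # true duplicate at rank 1
    "q2": ["d", "e", "f"],   # true duplicate at rank 3
    "q3": ["g", "h", "i"],   # true duplicate not retrieved
}
relevant = {"q1": {"a"}, "q2": {"f"}, "q3": {"x"}}

print(recall_at_k(ranked, relevant, k=1))
print(recall_at_k(ranked, relevant, k=3))
print(mean_reciprocal_rank(ranked, relevant))
```

Recall at k rewards any hit inside the cutoff equally, while MRR also rewards placing the first true duplicate nearer the top of the list, which is why the two are often reported together.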

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/170329