Detecting Duplicate Questions in Stack Overflow via Source Code Modeling

Publisher: World Scientific Publishing
Publication Type: Journal Article
Citation: International Journal of Software Engineering and Knowledge Engineering, 2022, 32, (2), pp. 227-255
Issue Date: 2022-02-01
Stack Overflow is one of the most popular question-answering sites for programmers. However, it suffers from question duplication, where newly created questions duplicate previously asked ones. Existing work on duplicate question detection in Stack Overflow extracts a set of textual features from question pairs and uses supervised learning to classify the pairs as duplicates or non-duplicates. However, these approaches do not consider the source code contained in the questions, even though in some cases the intention of a question is expressed mainly by its source code. In this paper, we aim to learn the semantics of a question by combining text features and source code features. We use word embeddings and convolutional neural networks to extract textual features from questions and to overcome the lexical gap issue. We use tree-based convolutional neural networks to extract structural and semantic features from source code. In addition, we perform multi-task learning by combining the duplicate question detection task with a question tag prediction side task. We conduct extensive experiments on the Stack Overflow dataset and show that our approach detects duplicate questions with higher recall and MRR than baseline approaches for the Python and Java programming languages.
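The abstract reports results in terms of recall and MRR (mean reciprocal rank). As a minimal sketch of how such ranking metrics are commonly computed, and not the authors' evaluation code, the following assumes that each new question comes with a ranked list of candidate question IDs and a set of known duplicate IDs. The function names, the cutoff k, and the per-query definition of recall are illustrative assumptions; the paper's exact definitions may differ.

```python
from typing import Hashable, Sequence, Set


def reciprocal_rank(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Return 1/rank of the first true duplicate in the ranking, or 0.0 if none appears."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate in relevant:
            return 1.0 / position
    return 0.0


def recall_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Return the fraction of known duplicates found in the top-k candidates.

    Assumption: recall is computed per query over its known duplicates.
    """
    if not relevant:
        return 0.0
    hits = sum(1 for candidate in ranked[:k] if candidate in relevant)
    return hits / len(relevant)


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[Hashable]],
    relevants: Sequence[Set[Hashable]],
) -> float:
    """Average the reciprocal rank over all queries."""
    scores = [reciprocal_rank(r, rel) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores) if scores else 0.0


def mean_recall_at_k(
    rankings: Sequence[Sequence[Hashable]],
    relevants: Sequence[Set[Hashable]],
    k: int,
) -> float:
    """Average recall@k over all queries."""
    scores = [recall_at_k(r, rel, k) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical example: two new questions with ranked candidate IDs.
example_rankings = [["q7", "q3", "q9"], ["q1", "q2", "q5"]]
example_duplicates = [{"q3"}, {"q5", "q8"}]

mrr = mean_reciprocal_rank(example_rankings, example_duplicates)
recall_3 = mean_recall_at_k(example_rankings, example_duplicates, k=3)
```

In this hypothetical example the first true duplicate appears at rank 2 for the first question and at rank 3 for the second, so the reciprocal ranks are 1/2 and 1/3. A model that ranks true duplicates closer to the top raises both metrics.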