Detecting Duplicate Questions in Stack Overflow via Source Code Modeling
- Publisher: World Scientific Publishing
- Publication Type: Journal Article
- Citation: International Journal of Software Engineering and Knowledge Engineering, 2022, 32, (2), pp. 227-255
- Issue Date: 2022-02-01
Closed Access
Filename | Description | Size
---|---|---
20763134_10667041540005671.pdf | Published version | 3.77 MB
This item is closed access and not available.
Stack Overflow is one of the most popular question-answering sites for programmers. However, it faces the problem of question duplication, where newly created questions are identical to previous questions. Existing work on duplicate question detection in Stack Overflow extracts a set of textual features from question pairs and uses supervised learning to classify the pairs as duplicates or not. These approaches do not consider the source code in the questions, even though in some cases the intention of a question is expressed mainly by its source code. In this paper, we aim to learn the semantics of a question by combining text features and source code features. We use word embeddings and convolutional neural networks to extract textual features from questions and overcome the lexical gap issue. We use tree-based convolutional neural networks to extract structural and semantic features from source code. In addition, we perform multi-task learning by combining the duplicate question detection task with a question tag prediction side task. We conduct extensive experiments on the Stack Overflow dataset and show that our approach detects duplicate questions with higher recall and MRR than baseline approaches for the Python and Java programming languages.
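The abstract reports results in terms of recall and MRR (mean reciprocal rank). As a rough illustration only, the sketch below shows how these two ranking metrics are commonly computed for duplicate retrieval. It assumes recall is measured as recall@k (the abstract does not say which variant is used), and the function names, question identifiers, and toy rankings are all hypothetical rather than taken from the paper.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int) -> float:
    """Fraction of query questions with at least one true duplicate in the top k."""
    hits = sum(1 for q, cands in ranked.items() if gold[q] & set(cands[:k]))
    return hits / len(ranked)


def mean_reciprocal_rank(ranked: Dict[str, List[str]], gold: Dict[str, Set[str]]) -> float:
    """Average of 1/rank of the first true duplicate; a query with none contributes 0."""
    total = 0.0
    for q, cands in ranked.items():
        for rank, cand in enumerate(cands, start=1):
            if cand in gold[q]:
                total += 1.0 / rank
                break
    return total / len(ranked)


# Hypothetical toy data: each query question maps to a ranked list of candidates.
ranked = {
    "q1": ["a", "b", "c"],  # true duplicate at rank 1
    "q2": ["d", "e", "f"],  # true duplicate at rank 2
    "q3": ["g", "h", "i"],  # true duplicate not retrieved
}
gold = {"q1": {"a"}, "q2": {"e"}, "q3": {"z"}}

print(mean_reciprocal_rank(ranked, gold))  # (1 + 1/2 + 0) / 3 = 0.5
print(recall_at_k(ranked, gold, k=2))      # 2 of 3 queries hit within the top 2
```

Under these assumptions, a higher MRR means true duplicates tend to appear nearer the top of the candidate list, while recall@k only checks whether one appears within the first k results.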