Static Analysis-Guided Automatic Source Code Summarization via Deep Learning

Wang, Wenhua

Publication Type: Thesis
Issue Date: 2021

This item is open access.

Full metadata record

Field	Value	Language
dc.contributor.author	Wang, Wenhua
dc.date.accessioned	2022-12-18T23:41:01Z
dc.date.available	2022-12-18T23:41:01Z
dc.date.issued	2021
dc.identifier.uri	http://hdl.handle.net/10453/164470
dc.description	University of Technology Sydney. Faculty of Engineering and Information Technology.	en_US.UTF-8
dc.format	Thesis (PhD)
dc.language.iso	en_US	en_US.UTF-8
dc.relation	https://opus.lib.uts.edu.au/bitstream/10453/164470/2/02whole.pdf
dc.rights	The author owns the copyright in this thesis including all reproduction and reuse rights for the work. The work may not be altered without the permission of the copyright owner. Attribution is essential when quoting or paraphrasing from this thesis.
dc.rights	au.edu.uts.lib/ppc
dc.rights	info:eu-repo/semantics/openAccess
dc.title	Static Analysis-Guided Automatic Source Code Summarization via Deep Learning	en_US.UTF-8
dc.type	Thesis
utslib.copyright.status	open_access	*

Abstract:

Code summarization produces a high-level natural language description of a function, which benefits software maintenance, code categorization, and code retrieval. To the best of our knowledge, existing research falls into three main categories: template-based, information-retrieval-based, and deep-learning-based approaches. With the development and wide adoption of deep learning, neural machine translation (NMT) architectures have recently been introduced to code summarization. Based on our study, most state-of-the-art deep-learning-based approaches follow an encoder-decoder framework that encodes the code into a hidden space and then decodes it into the natural language space. However, because of the special grammar and syntactic structure of programming languages and the various shortcomings of different deep neural networks, the accuracy of existing code summarization approaches is not high enough. These approaches suffer from three major drawbacks: (a) they consider the sequential content of code and ignore its structure, which is also critical for code comprehension; (b) they generate only the code's intent and ignore information such as parameters, which is also important for understanding and using the source code; and (c) the CNN/RNN models they adopt usually suffer from long-distance dependency problems and excessive computation cost.

To address these drawbacks, the main research work of this thesis is as follows. (1) The first work presents a code summarization approach that uses a hierarchical attention network to incorporate multiple code features, which are injected into a deep reinforcement learning (DRL) framework (e.g., an actor-critic network) for comment generation. (2) Because many existing approaches make inadequate use of statement-wise semantic contributions to improve their performance, the second work proposes a transformer-based generative adversarial network framework for universal code summarization, which constructs a cross-language universal hierarchical semantic (UHS) model to classify statements by positioning them in the source code. (3) Because almost all approaches generate only the general intent of a method without documenting its parameters, the third work proposes to generate both the method comment and the parameter comments, providing complete Java documentation for code snippets. Specifically, it designs a program-analysis-based component that extracts the UseSet of a parameter and the KeySet of the code snippet, retaining the main semantic information and discarding noise, and it uses a copy-attention-integrated, transformer-based NMT framework. Experimental studies conducted throughout the thesis suggest that the proposed approaches outperform multiple state-of-the-art approaches.
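
The abstract does not define UseSet. As a rough illustration only, assuming it denotes the statements of a method that reference a given parameter, a program-analysis filter of this kind might look like the sketch below. The sketch analyzes Python source with the standard `ast` module purely for convenience (the thesis targets Java), and the function name `use_set` and the sample function `scale` are hypothetical, not taken from the thesis.

```python
import ast


def use_set(source: str, func_name: str, param: str) -> list[str]:
    """Return the top-level statements of func_name that reference param.

    Assumption: "UseSet" is read here as "statements that mention the
    parameter"; the thesis may define it differently.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == func_name:
            used = []
            for stmt in node.body:
                # Collect every identifier that appears anywhere in the statement.
                names = {n.id for n in ast.walk(stmt) if isinstance(n, ast.Name)}
                if param in names:
                    used.append(ast.unparse(stmt))  # requires Python 3.9+
            return used
    raise ValueError(f"function {func_name!r} not found")


# Hypothetical input, used only to exercise the sketch.
SAMPLE = '''
def scale(values, factor):
    total = sum(values)
    if total == 0:
        return values
    return [v * factor / total for v in values]
'''

if __name__ == "__main__":
    for param in ("values", "factor"):
        print(param, "->", use_set(SAMPLE, "scale", param))
```

Keeping only the statements that touch a parameter gives a downstream summarization model a shorter, more focused input for that parameter's comment, which is the filtering role the abstract attributes to the program-analysis component.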

Please use this identifier to cite or link to this item:

http://hdl.handle.net/10453/164470