THESIS
2017
xii, 94 pages : illustrations ; 30 cm
Abstract
Developers often wonder how to implement a program functionality. Code examples are very
helpful in this regard. Over the years, many approaches have been proposed to generate code
examples. The existing approaches often treat queries and source code as textual documents and
utilize information retrieval models to retrieve relevant code snippets that match a given query.
However, conventional code example generation approaches face the following major challenges.
First, they rely on a bag-of-words assumption and cannot recognize high-level features of
queries and source code. Second, source code and natural language queries are heterogeneous. Existing
approaches mainly rely on the textual similarity between source code and natural language
queries. They lack a mapping of high-level semantics between queries and source code. Moreover,
the generated code examples may be redundant and project-specific, which calls for generating
succinct and high-coverage code examples.
To address these challenges, in this thesis we propose three machine-learning-based approaches
to the generation of code examples. Instead of mapping keywords, our approaches learn the deep
semantics of queries and code snippets.
We first propose a technique, DeepAPI, which generates API usage sequences via deep learning.
DeepAPI adapts a neural language model named RNN Encoder-Decoder [31]. Given a corpus of annotated API sequences, i.e., 〈API sequence, annotation〉 pairs, DeepAPI trains the language
model that encodes each sequence of words (annotation) into a fixed-length context vector and
decodes an API sequence based on the context vector. Then, in response to an API-related user
query, it generates API sequences by consulting the neural language model.
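
The encode-then-decode flow described above can be illustrated with a minimal sketch. It is not the DeepAPI implementation: the toy vocabularies, the hidden size, the plain tanh recurrence and the randomly initialised (untrained) weights are all assumptions made for illustration, so the decoded sequence carries no meaning. The sketch only shows how an annotation is folded into a fixed-length context vector and how an API sequence is then emitted step by step from that vector.

```python
import math
import random

random.seed(0)
HIDDEN = 8  # size of the fixed-length context vector (assumed value)

# Toy vocabularies, hypothetical and for illustration only.
WORDS = ["read", "a", "file", "line", "by"]
APIS = ["<eos>", "FileReader.new", "BufferedReader.new",
        "BufferedReader.readLine", "BufferedReader.close"]


def rand_matrix(rows, cols):
    """Random weights; a real model would learn these from the corpus."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


W_in = rand_matrix(HIDDEN, len(WORDS))    # word -> encoder state
W_enc = rand_matrix(HIDDEN, HIDDEN)       # encoder recurrence
W_prev = rand_matrix(HIDDEN, len(APIS))   # previous API -> decoder state
W_dec = rand_matrix(HIDDEN, HIDDEN)       # decoder recurrence
W_ctx = rand_matrix(HIDDEN, HIDDEN)       # context vector -> decoder state
W_out = rand_matrix(len(APIS), HIDDEN)    # decoder state -> API scores


def encode(words):
    """Fold a sequence of words into one fixed-length context vector."""
    h = [0.0] * HIDDEN
    for word in words:
        x = matvec(W_in, one_hot(WORDS.index(word), len(WORDS)))
        rec = matvec(W_enc, h)
        h = [math.tanh(a + b) for a, b in zip(x, rec)]
    return h


def decode(context, max_len=5):
    """Greedily emit API names conditioned on the context vector."""
    h = list(context)
    prev = one_hot(0, len(APIS))  # "<eos>" doubles as the start symbol here
    sequence = []
    for _ in range(max_len):
        a = matvec(W_prev, prev)
        b = matvec(W_dec, h)
        c = matvec(W_ctx, context)
        h = [math.tanh(x + y + z) for x, y, z in zip(a, b, c)]
        scores = matvec(W_out, h)
        best = max(range(len(APIS)), key=scores.__getitem__)
        if best == 0:  # "<eos>" ends the sequence
            break
        sequence.append(APIS[best])
        prev = one_hot(best, len(APIS))
    return sequence


context = encode(["read", "a", "file"])
print(len(context), decode(context))
```

Greedy decoding is used here only to keep the sketch short; it should not be read as the decoding strategy of the thesis.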
Furthermore, we propose a technique, DeepCodeHow, to generate code examples by searching
an existing code corpus. To bridge the lexical gap between queries and source code, DeepCodeHow jointly embeds code snippets and natural language descriptions into a high-dimensional
vector space. With the unified vector representation, code snippets semantically related to a natural
language query can be retrieved according to their vectors.
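
Retrieval in a shared vector space can be sketched as follows. The three-dimensional vectors, the snippet names and the use of cosine similarity as the ranking measure are assumptions for illustration; in the approach described above the vectors would come from a learned joint embedding, which this sketch does not train.

```python
import math

# Hypothetical embeddings of code snippets in the shared vector space.
SNIPPET_VECTORS = {
    "read_file": [0.9, 0.1, 0.0],
    "sort_list": [0.0, 0.2, 0.9],
    "http_get": [0.1, 0.9, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def search(query_vector, k=2):
    """Return the names of the k snippets closest to the query vector."""
    ranked = sorted(
        SNIPPET_VECTORS,
        key=lambda name: cosine(query_vector, SNIPPET_VECTORS[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical embedding of a query such as "read a text file".
query = [0.8, 0.2, 0.1]
print(search(query))
```

Because queries and snippets live in the same space, no keyword overlap between the query text and the code is needed for a snippet to rank highly.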
Finally, to generate succinct and high-coverage examples, we design a code example selection
technique named CodeKernel. CodeKernel leverages a machine learning technique named Graph
Kernel. It represents code snippets as object usage graphs and embeds graphs into a high-level
vector space. With the graph embedding, CodeKernel clusters similar graphs and selects a typical
graph as the code example.
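
The cluster-then-select idea can be sketched under simplifying assumptions. Each object usage graph is reduced to a list of labelled edges, a basic edge-histogram kernel stands in for whichever graph kernel CodeKernel actually uses, and a greedy threshold clustering with medoid selection stands in for its clustering and selection steps. The graph names, edge labels and the threshold value are hypothetical.

```python
import math
from collections import Counter

# Hypothetical object usage graphs, each given as a list of labelled edges.
E1 = ("FileReader.new", "BufferedReader.new")
E2 = ("BufferedReader.new", "BufferedReader.readLine")
E3 = ("BufferedReader.readLine", "BufferedReader.close")
GRAPHS = {
    "read_full": Counter([E1, E2, E3]),
    "read_no_close": Counter([E1, E2]),
    "read_no_open": Counter([E2, E3]),
    "sort_list": Counter([("ArrayList.new", "ArrayList.add"),
                          ("ArrayList.add", "Collections.sort")]),
}


def kernel(g, h):
    """Edge-histogram kernel: dot product of edge-label counts."""
    return sum(count * h[edge] for edge, count in g.items())


def similarity(g, h):
    """Kernel value normalised to the range [0, 1]."""
    return kernel(g, h) / math.sqrt(kernel(g, g) * kernel(h, h))


def cluster(graphs, threshold=0.6):
    """Greedy clustering: join the first cluster whose seed is similar enough."""
    clusters = []
    for name, graph in graphs.items():
        for members in clusters:
            if similarity(graph, graphs[members[0]]) >= threshold:
                members.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def representative(members, graphs):
    """Pick the medoid: the member most similar to the rest of its cluster."""
    return max(
        members,
        key=lambda n: sum(similarity(graphs[n], graphs[m]) for m in members),
    )


def select_examples(graphs, threshold=0.6):
    return [representative(c, graphs) for c in cluster(graphs, threshold)]


print(select_examples(GRAPHS))
```

In this toy input the three file-reading graphs fall into one cluster and the most complete variant is chosen as its example, while the unrelated sorting graph forms its own cluster.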
We empirically evaluate our techniques on a large-scale code corpus collected from GitHub. The
experimental results show that our proposed techniques effectively generate relevant code examples
and outperform conventional IR-based approaches.