Deep Learning (DL) applications are widely deployed in diverse areas, such as image
classification, natural language processing, and autonomous driving systems. Although these
applications achieve outstanding performance in certain metrics like accuracy, developers
have raised strong concerns about their reliability since the logic of DL applications is
a black box for humans. Specifically, DL applications learn their logic during stochastic
training and encode it in high-dimensional weights of DL models. Unlike source code
in conventional software, such weights are infeasible for humans to directly interpret,
examine, and validate. As a result, the reliability issues in DL applications are not easy
to detect and may cause catastrophic accidents in safety-critical missions. Therefore, it
is critical to adequately assess the reliability of DL applications.
This thesis aims to help software developers assess the reliability of DL applications
from the following three perspectives.
The first study proposes object-relevancy, a property that reliable DL-based image
classifiers should comply with, i.e., the classification results should be made based on
the features relevant to the target object in a given image, instead of irrelevant features
such as the background. This study further proposes an automatic approach based on two
metamorphic relations to assess whether this property is violated in image classification results. The
evaluation shows that the proposed approach can effectively detect unreliable inferences
violating the object-relevancy property, with an average precision of 64.1% and 96.4%
for the two relations, respectively. The subsequent empirical study reveals that such
unreliable inferences are prevalent in the real world and that existing training strategies
cannot effectively tackle this issue.
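To make the object-relevancy property concrete, the following minimal sketch illustrates one plausible metamorphic check of this kind: replacing the background of an image while keeping the target object fixed should not change the predicted label. The classifier interface, the object mask, and the specific relation shown here are illustrative assumptions rather than the exact relations proposed in the thesis.

```python
# Minimal sketch of a background-mutation metamorphic check for
# object-relevancy. The classifier, the object mask, and the relation
# ("replacing the background should not change the predicted label")
# are illustrative assumptions, not the thesis's exact formulation.
import numpy as np

def object_relevancy_check(classifier, image, object_mask, background):
    """Return True if the prediction survives a background replacement.

    classifier(image) -> predicted label (hypothetical interface)
    image, background : H x W x 3 uint8 arrays of the same shape
    object_mask       : H x W boolean array, True on the target object
    """
    original_label = classifier(image)

    # Keep the object pixels, swap everything else for the new background.
    mutated = np.where(object_mask[..., None], image, background)
    mutated_label = classifier(mutated)

    # An object-relevant inference should be unaffected by background
    # changes; a flipped label hints at reliance on irrelevant features.
    return original_label == mutated_label
```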
The second study concentrates on the reliability issues induced by DL model compression.
DL model compression can significantly reduce the sizes of Deep Neural Network
(DNN) models, and thus facilitate the deployment of sophisticated, sizable DNN models.
However, the prediction results of compressed models may deviate from those of their
original models, resulting in unreliably deployed DL applications. To help developers
thoroughly assess the impact of model compression, it is essential to test these models to
find any deviated behaviors before dissemination. This study proposes DFLARE, a novel,
search-based, black-box testing technique. The evaluation shows that DFLARE consistently
outperforms the baseline in both efficacy and efficiency. More importantly, the triggering
inputs found by DFLARE can be used to repair up to 48.48% of deviated behaviors.
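The following minimal sketch illustrates the general idea of search-based, black-box deviation testing: mutate an input, compare the outputs of the original and compressed models, and keep mutants that push the two outputs further apart. The mutation operator, the fitness function, and the model interfaces are illustrative assumptions; they are not DFLARE's actual operators or feedback mechanism.

```python
# Minimal sketch of search-based, black-box deviation testing between an
# original model and its compressed counterpart. The mutation operator
# (bounded pixel noise), the fitness (L1 distance between output vectors),
# and the model interfaces are assumptions made for illustration only.
import numpy as np

def find_deviation(original, compressed, seed_input, budget=1000, eps=8.0):
    """Search for an input on which the two models predict different labels."""
    current = seed_input.astype(np.float32)
    best_fitness = -np.inf

    for _ in range(budget):
        # Mutate the current input with small bounded noise.
        candidate = current + np.random.uniform(-eps, eps, current.shape)
        candidate = np.clip(candidate, 0, 255)

        out_orig = original(candidate)      # probability vectors (assumed)
        out_comp = compressed(candidate)

        if np.argmax(out_orig) != np.argmax(out_comp):
            return candidate                # triggering input found

        # Keep the mutant if it moves the two outputs further apart.
        fitness = np.abs(out_orig - out_comp).sum()
        if fitness > best_fitness:
            best_fitness, current = fitness, candidate

    return None                             # no deviation found within budget
```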
The third study reveals the unreliable assessment of DL-based Program Generators
(DLGs) in compiler testing. To effectively test compilers, DLGs have been proposed to
automatically generate massive numbers of test programs. However, after a thorough analysis
of the characteristics of DLGs, this study finds that the assessment of these DLGs is unfair
and unreliable, since the chosen baselines, i.e., Language-Specific Program Generators
(LSGs), differ from DLGs in many aspects. Furthermore, this study proposes
Kitten, a simple, fair, and non-DL-based baseline for DLGs. The experiments show that
DLGs cannot even compete against such a simple baseline and the claimed advantages
of DLGs are likely due to the biased selection of the baseline. Specifically, Kitten triggers
1,750 hang bugs and 34 distinct crashes in 72 hours of testing on GCC, while the
state-of-the-art DLG triggers only 3 hang bugs and 1 distinct crash. Moreover, the code
coverage achieved by Kitten is at least twice that achieved by the state-of-the-art DLG.
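For illustration, the sketch below shows how even a very simple, non-DL random program generator can emit syntactically valid test programs for a C compiler. The toy grammar (a few integer variables and random arithmetic expressions) is an assumption made for brevity and is far simpler than Kitten's actual generation strategy.

```python
# Minimal sketch of a simple, non-DL random program generator in the spirit
# of a lightweight baseline for compiler testing. The toy grammar here is
# illustrative only and much simpler than Kitten itself.
import random

def gen_expr(depth=0):
    """Recursively build a random C arithmetic expression."""
    if depth > 3 or random.random() < 0.3:
        return random.choice(["a", "b", "c", str(random.randint(0, 100))])
    op = random.choice(["+", "-", "*", "^"])
    return f"({gen_expr(depth + 1)} {op} {gen_expr(depth + 1)})"

def gen_program():
    """Emit a small, self-contained C program to feed to the compiler."""
    body = "\n".join(
        f"    c = {gen_expr()};" for _ in range(random.randint(1, 5))
    )
    return (
        "#include <stdio.h>\n"
        "int main(void) {\n"
        "    int a = 3, b = 7, c = 0;\n"
        f"{body}\n"
        '    printf("%d\\n", c);\n'
        "    return 0;\n"
        "}\n"
    )

if __name__ == "__main__":
    print(gen_program())
```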