THESIS
2015
xiv, 112 pages : illustrations ; 30 cm
Abstract
Defect prediction on new software projects, or on projects with limited historical data, is a
challenging problem in software engineering: it is difficult to collect enough defect information to
label a dataset for training a prediction model. This is known as the problem of defect prediction on
unlabeled datasets. Cross-project defect prediction (CPDP) tries to address it by reusing prediction
models built on other projects that have sufficient historical data. However, CPDP does not always
yield a strong prediction model because of the different distributions among datasets. In addition,
existing approaches to defect prediction that use only unlabeled datasets have one major limitation:
the need for manual effort.
To address these limitations, namely the different distributions among datasets and the need for
manual effort, we propose three techniques for building prediction models on unlabeled datasets.
First, we propose TCA+, which improves the prediction performance of CPDP by adopting transfer
component analysis (TCA). TCA+ extends TCA by selecting the most appropriate normalization
technique to apply before TCA is used for CPDP. Second, we propose heterogeneous defect
prediction (HDP), which enables cross-project defect prediction between projects with different
metric sets: HDP matches metrics that have similar distributions across the datasets used in CPDP.
Lastly, we propose CLAMI, which enables defect prediction using only unlabeled datasets. The key
idea of CLAMI is to label an unlabeled dataset by using the magnitude of metric values.
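To make the metric-matching idea behind HDP concrete, the following Python sketch pairs source
and target metrics whose value distributions look similar. The Kolmogorov-Smirnov p-value as the
matching score, the 0.05 cutoff, and the greedy pairing are illustrative assumptions, not
necessarily the exact procedure evaluated in this thesis.

    from scipy.stats import ks_2samp

    def match_metrics(source, target, cutoff=0.05):
        # source, target: dicts mapping metric name -> list of metric values.
        # Score every source/target metric pair by how similar their
        # distributions are (KS-test p-value, assumed as the matching score).
        scores = []
        for s_name, s_vals in source.items():
            for t_name, t_vals in target.items():
                _, p_value = ks_2samp(s_vals, t_vals)
                scores.append((p_value, s_name, t_name))
        # Greedily keep the best-scoring pairs, using each metric at most once.
        matches, used_source, used_target = {}, set(), set()
        for p_value, s_name, t_name in sorted(scores, reverse=True):
            if (p_value >= cutoff
                    and s_name not in used_source
                    and t_name not in used_target):
                matches[s_name] = t_name
                used_source.add(s_name)
                used_target.add(t_name)
        return matches

The CLAMI labeling idea can likewise be sketched as follows. The median is assumed both as the
per-metric cutoff and as the threshold on the count of higher-than-cutoff metrics; the full approach
also performs metric and instance selection before a classifier is trained on the self-labeled data.

    import numpy as np

    def clami_label(X):
        # X: 2-D array of shape (instances, metrics) from an unlabeled dataset.
        # Per-metric cutoff: the median of each metric's values (assumption).
        cutoffs = np.median(X, axis=0)
        # K value per instance: how many of its metrics exceed their cutoff.
        k = (X > cutoffs).sum(axis=1)
        # Instances with a higher-than-typical K are labeled buggy (1), the
        # rest clean (0); the median of K is an assumed threshold.
        return (k > np.median(k)).astype(int)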
Our proposed techniques, TCA+, HDP, and CLAMI, address these limitations of defect prediction
on unlabeled datasets. The three techniques still leave open challenges, which we discuss as
future work.