Bioinformatics has become an active topic in big data analysis as the community has
collected large amounts of data. It is important to introduce modern data analysis methods
to process such large amounts of bioinformatics data. In data analysis, data representation
recovers the latent structure of data and significantly enhances the performance of subsequent
processing, and thus plays an important role in the main task. Conventional data
representation approaches, such as principal component analysis (PCA, [148]) and Fisher's
linear discriminant analysis (FLDA, [127]), recover the most informative axes
according to the probabilistic distribution of the data. They are built mainly on the statistical
properties of the data. More recently, deep learning [166] has been proposed to learn effective representations
from large-scale datasets. Because the learned representation can capture nonlinear relationships among data points, it is highly effective and is achieving more and more success in bioinformatics.
Since some bioinformatics data are non-negative and such non-negativity has a specific
physical meaning, it is necessary to take this property into account in modern
bioinformatics applications. Non-negative matrix factorization (NMF, [97, 95]) is a recently
proposed data representation method that decomposes a given non-negative data matrix
into the product of two lower-rank non-negative factor matrices. The learned matrices can
be treated as a basis for the data and the corresponding representation coefficients, respectively.
NMF has been shown to be a powerful tool for various practical applications because it is consistent
with the psychological and physical evidence in the human brain [116][105][144]. Owing to its simplicity and effectiveness, NMF has been extended to meet the requirements of various practical tasks.
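As a minimal sketch of the factorization described above, the following Python code approximates a non-negative matrix X by the product W H of two non-negative lower-rank factors, using the classical multiplicative update rules for the Frobenius-norm objective. The function names `nmf_mur` and `relative_error`, the rank, the iteration count, and the synthetic data are illustrative assumptions; this is plain NMF, not any of the variants developed in this thesis.

```python
import numpy as np


def nmf_mur(X, rank, n_iter=500, eps=1e-10, seed=0):
    """Approximate non-negative X (m x n) as W @ H with W >= 0 and H >= 0.

    Uses the classical multiplicative updates for ||X - W H||_F^2.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    # Strictly positive random initialization keeps the updates well defined.
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(n_iter):
        # Element-wise updates; eps guards against division by zero.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def relative_error(X, W, H):
    """Relative Frobenius reconstruction error of the factorization."""
    return np.linalg.norm(X - W @ H) / np.linalg.norm(X)


# Synthetic non-negative data with exact rank 3 (hypothetical example data).
rng = np.random.default_rng(42)
X = rng.random((20, 3)) @ rng.random((3, 15))

W, H = nmf_mur(X, rank=3)
print("relative reconstruction error:", relative_error(X, W, H))
```

Because every update multiplies a non-negative entry by a non-negative ratio, the factors stay non-negative without any explicit projection step.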
In this thesis, we focus on data representation problems in bioinformatics. Our main
work consists of five parts:
(1) We introduce models and algorithms of non-negative matrix factorization to determine
developmental stage ranges of Drosophila embryos based on gene expression pattern
recognition. To this end, we first propose a LogDet divergence based sparse NMF (LDS-NMF)
that handles the rank-deficiency problem and improves the accuracy of stage range
determination. We then develop a multiplicative update rule (MUR) to solve LDS-NMF
and theoretically prove its convergence. Experimental results on gene expression image
datasets of Drosophila embryos show that LDS-NMF is promising and achieves good
classification results in determining the embryonic developmental stage. Interestingly, the LDS-NMF
method can also be applied to other tasks such as face recognition, where its effectiveness
can likewise be evaluated.
(2) We introduce models and algorithms for biomedical data analysis. We first propose a
bi-graph regularized NMF (BIGNMF) to predict potential drug-target interactions on four
biological datasets, and find that BIGNMF outperforms state-of-the-art methods. However,
since the similarities between drugs and targets are expensive to obtain and sometimes degrade
the performance, we apply the matrix completion analysis (MCA) method to drug-target
prediction, and confirm the validity of this method by verifying most of the predicted
drug-target pairs against public databases. To take advantage of both BIGNMF and
MCA, we apply the weighted NMF method (WNMF) to drug-target prediction and show
that it performs the best.
(3) We introduce models and algorithms of non-negative matrix factorization for gene
expression data analysis. We propose a novel Gauss-Seidel based NMF method (GSNMF) to
overcome the imbalance between gene features and tumor samples, and evaluate its
effectiveness on several biological datasets of cancer diseases. GSNMF alternately projects
the data matrix onto the subspace spanned by one factor matrix and updates the other factor
matrix by using the Gauss-Seidel method. Since GSNMF normalizes both factor matrices
in each iteration round, its solution approximately optimizes the original problem while
overcoming the imbalance. The experimental results show its promise.
(4) We introduce a deep learning method for gene expression based cancer diagnosis.
Since it is difficult to collect enough data for a particular cancer, it is often infeasible to train an effective
deep neural network for each cancer separately. Therefore, we propose a multi-task deep learning
method (MTDL) to classify multiple cancers simultaneously and enhance the classification
performance on each cancer by sharing knowledge through shared layers. With the help
of this knowledge transfer, the classification performance on cancers with limited samples is
significantly boosted. The experimental results on twelve cancers demonstrate the effectiveness
of MTDL.
(5) We introduce several NMF methods for pattern recognition tasks, inspired by the
success of LDS-NMF in gene expression pattern recognition and the success of GSNMF in
gene expression clustering. We first propose a correntropy induced metric (CIM) based local
coordinate NMF method (RLC-NMF) to induce sparse representation in NMF and
enhance its robustness. We then propose local coordinate graph regularized NMF
(LCG-NMF) to incorporate the geometric structure of the dataset into NMF. We develop
multiplicative update rules to solve both RLC-NMF and LCG-NMF. The experimental results
on image clustering demonstrate their effectiveness.
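To illustrate how the geometric structure of a dataset can enter NMF, as mentioned in part (5), the sketch below implements the standard graph regularized NMF objective ||X - U V^T||_F^2 + lam * Tr(V^T L V), where L = D - A is the Laplacian of a nearest-neighbour graph over the samples. This is a stand-in under stated assumptions: it omits the local coordinate constraint, so it is not LCG-NMF itself. The helper names `knn_graph`, `objective`, and `graph_nmf`, the 0/1 graph weights, and all parameter values are hypothetical choices for illustration.

```python
import numpy as np


def knn_graph(X, k=3):
    """Symmetric 0/1 k-nearest-neighbour affinity matrix over the columns of X."""
    n = X.shape[1]
    sq = np.sum(X**2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2 * X.T @ X  # squared distances
    np.fill_diagonal(d2, np.inf)                  # exclude self-neighbours
    idx = np.argsort(d2, axis=1)[:, :k]
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    return np.maximum(A, A.T)                     # symmetrize


def objective(X, U, V, A, lam):
    """Reconstruction error plus graph smoothness penalty."""
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.norm(X - U @ V.T) ** 2 + lam * np.trace(V.T @ L @ V)


def graph_nmf(X, A, rank, lam=1.0, n_iter=200, eps=1e-10, seed=0):
    """Multiplicative updates for graph regularized NMF, X ~ U @ V.T."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, rank)) + eps
    V = rng.random((n, rank)) + eps
    D = np.diag(A.sum(axis=1))
    for _ in range(n_iter):
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        # The graph term pulls the codes of neighbouring samples together.
        V *= (X.T @ U + lam * (A @ V)) / (V @ (U.T @ U) + lam * (D @ V) + eps)
    return U, V


# Hypothetical data: 30 samples (columns) with 10 non-negative features each.
rng = np.random.default_rng(1)
X = rng.random((10, 30))
A = knn_graph(X, k=3)
U, V = graph_nmf(X, A, rank=4)
print("objective after 200 iterations:", objective(X, U, V, A, lam=1.0))
```

Each row of V is the low-dimensional code of one sample; the penalty term is small when samples connected in the graph receive similar codes.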