Data representation methods and applications in bioinformatics data analysis

HKUST Electronic Theses

by Qing Liao

THESIS 2016

Ph.D. Computer Science and Engineering

ix, 122 pages : illustrations ; 30 cm

Abstract

Bioinformatics has become an active area of big data analysis as the community has collected large amounts of data, and modern data analysis methods are needed to process data at this scale. In data analysis, data representation recovers the latent structure of data and significantly enhances the performance of subsequent processing, and thus plays an important role in the main task. Conventional data representation approaches, such as principal component analysis (PCA, [148]) and Fisher's linear discriminant analysis (FLDA, [127]), recover the most informative axes according to the probability distribution of the data; they are built mainly on the statistical properties of the data. More recently, deep learning [166] has been proposed to learn effective representations from large-scale datasets. Because the learned representation can capture nonlinear relationships among data points, it is highly effective and has been achieving more and more success in bioinformatics.

Since some bioinformatics data are non-negative and this non-negativity has a specific physical meaning, modern bioinformatics applications need to take the non-negativity property into account. Non-negative matrix factorization (NMF, [97, 95]) is a recently proposed data representation method that decomposes a given non-negative data matrix into the product of two lower-rank non-negative factor matrices. The learned matrices can be treated as a set of basis vectors for the data and the corresponding representation coefficients, respectively. NMF has been shown to be a powerful tool for various practical applications because it is consistent with the psychological and physical evidence in the human brain [116][105][144]. Due to its simplicity and effectiveness, NMF has been extended to meet the requirements of various practical tasks.
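The factorization described above can be illustrated with a minimal sketch. It assumes the Frobenius-norm objective and the classical multiplicative update rules for plain NMF; it is not one of the thesis's extended models, and the function name `nmf`, the parameter defaults, and the synthetic data are hypothetical choices made for illustration.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-10, seed=0):
    """Approximate a non-negative matrix V (m x n) as W @ H with W, H >= 0.

    Uses the classical multiplicative updates for the squared Frobenius
    error ||V - W H||^2. Illustrative sketch only.
    """
    rng = np.random.default_rng(seed)
    m, n = V.shape
    # Strictly positive random initialisation.
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so the
        # factors stay non-negative throughout.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic non-negative data with exact rank 3 (assumed toy example).
data_rng = np.random.default_rng(1)
V = data_rng.random((20, 3)) @ data_rng.random((3, 15))

W, H = nmf(V, rank=3)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(f"relative reconstruction error: {rel_err:.4f}")
```

In this sketch the columns of `W` play the role of the learned basis and the columns of `H` hold the coefficients that represent each data point in that basis.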

In this thesis, we focus on data representation problems in bioinformatics. Our main contributions fall into five parts:

(1) We introduce models and algorithms of non-negative matrix factorization to determine developmental stage ranges of Drosophila embryos based on gene expression pattern recognition. To this end, we first propose a Logdet divergence based sparse NMF (LDS-NMF) that handles the rank-deficiency problem and thereby improves the accuracy of stage range determination. We then develop a multiplicative update rule (MUR) to solve LDS-NMF and theoretically prove its convergence. Experimental results show that LDS-NMF is promising on gene expression image datasets of Drosophila embryos and achieves good classification results in determining the embryonic developmental stage. Interestingly, the LDS-NMF method can also be applied to other tasks such as face recognition, where its effectiveness can likewise be evaluated.

(2) We introduce models and algorithms for biomedical data analysis. We first propose a bi-graph regularized NMF (BIGNMF) to predict potential drug-target interactions on four biological datasets, and find that BIGNMF outperforms the state-of-the-art methods. However, since the similarities between drugs and targets are expensive to obtain and sometimes deteriorate the performance, we apply the matrix completion analysis (MCA) method to drug-target prediction, and confirm the validity of this method by validating most of the predicted drug-target pairs in public databases. To take advantage of both BIGNMF and MCA, we apply the weighted NMF method (WNMF) to drug-target prediction and show that it performs the best.

(3) We introduce models and algorithms of non-negative matrix factorization for gene expression data analysis. We propose a novel Gauss-Seidel based NMF method (GSNMF) to overcome the imbalance deficiency between gene features and tumor samples, and evaluate its effectiveness on several biological datasets of cancer diseases. GSNMF alternately projects the data matrix onto the subspace spanned by one factor matrix and updates the other factor matrix by using the Gauss-Seidel method. Since GSNMF normalizes both factor matrices in each iteration round, its solution approximately optimizes the original problem while overcoming the imbalance deficiency. The experimental results show its promise.

(4) We introduce a deep learning method for gene expression based cancer diagnosis. Since it is difficult to collect data for a particular cancer, it is impossible to train an effective deep neural network for each cancer. Therefore, we propose a multi-task deep learning method (MTDL) to classify multiple cancers simultaneously and enhance the classification performance for each cancer by leveraging knowledge through shared layers. With the help of knowledge transfer, the classification performance for cancers with limited samples is significantly boosted. The experimental results on twelve cancers demonstrate the effectiveness of MTDL.

(5) We introduce several NMF methods for pattern recognition tasks, inspired by the success of LDS-NMF in gene expression pattern recognition and the success of GSNMF in gene expression clustering. We first propose a correntropy induced metric (CIM) based local coordinate NMF method (RLC-NMF) to induce sparse representation in NMF and enhance its robustness. We then propose local coordinate graph regularized NMF (LCG-NMF) to incorporate the geometric structure of the dataset into NMF. We develop multiplicative update rules to solve both RLC-NMF and LCG-NMF. The experimental results on image clustering demonstrate their effectiveness.
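The robustness that part (5) attributes to the correntropy induced metric can be illustrated with a small sketch. It assumes the commonly used Gaussian-kernel definition of CIM, in which the distance is the square root of the kernel value at zero minus the mean kernel value of the element-wise errors; the exact formulation inside RLC-NMF may differ. The function names, the kernel width `sigma`, and the toy vectors are hypothetical.

```python
import numpy as np

def gaussian_kernel(e, sigma):
    """Gaussian kernel evaluated element-wise on the error e."""
    return np.exp(-np.square(e) / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)

def cim(x, y, sigma=1.0):
    """Correntropy induced metric between two vectors (assumed definition).

    CIM(x, y) = sqrt(k(0) - mean_i k(x_i - y_i)), with k a Gaussian kernel.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    correntropy = gaussian_kernel(x - y, sigma).mean()
    # Clamp at zero to guard against tiny negative rounding errors.
    return float(np.sqrt(max(gaussian_kernel(0.0, sigma) - correntropy, 0.0)))

x = np.zeros(10)
y_outlier = x.copy()
y_outlier[0] = 1000.0  # a single grossly corrupted entry

# The Euclidean distance grows with the size of the outlier, while the
# contribution of any one entry to CIM is bounded by k(0) / n.
print("Euclidean:", np.linalg.norm(x - y_outlier))
print("CIM:      ", cim(x, y_outlier))
```

Because each entry's contribution saturates, a few large errors cannot dominate the distance, which is the usual motivation for replacing a squared-error loss with a correntropy-based one.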

Copyrighted to the author. Reproduction is prohibited without the author’s prior written consent.

Details

Collection: HKUST Electronic Theses
Degree: Ph.D.
Department: Computer Science and Engineering
Supervisors: Zhang, Qian
Authors: Liao, Qing
Subjects: Bioinformatics Analysis Data processing Biology
Language: English
Call number: Thesis CSED 2016 Liao
DOI: 10.14711/thesis-b1731771
