THESIS
2015
xiv, 115 pages : illustrations ; 30 cm
Abstract
Latent tree models (LTMs) are tree-structured probabilistic graphical models where
the variables at leaf nodes are observed and the variables at internal nodes are latent. They
represent complex relationships among observed variables and yet are computationally
simple to work with. When used for data analysis, they can identify meaningful groupings
of observed variables, where the correlations among the variables in each group can be
modeled by a single latent variable. As such, LTMs are effective tools for detecting
correlation patterns in data and can be applied to various domains. In this thesis, we
focus on developing new algorithms for learning latent tree models and investigating their
use in two applications.
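To make the structure concrete, the following is a minimal illustrative sketch in Python (not code from the thesis): a toy latent tree with hypothetical variable names, in which latent variables sit at the internal nodes, observed variables sit at the leaves, and each latent variable covers one group of observed variables.

    # Illustrative sketch only; names, fields and the toy tree are hypothetical.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Node:
        name: str            # variable name, e.g. "Y1" or "X1"
        cardinality: int     # number of discrete states
        latent: bool         # True for internal (latent) nodes, False for leaves
        children: List["Node"] = field(default_factory=list)

    # A toy latent tree: a root latent variable with two latent children,
    # each modeling the correlations within its own group of observed variables.
    root = Node("Y0", 2, True, [
        Node("Y1", 2, True, [Node("X1", 2, False), Node("X2", 2, False)]),
        Node("Y2", 3, True, [Node("X3", 2, False), Node("X4", 2, False)]),
    ])

    def leaves(node):
        """Observed variables grouped under a subtree."""
        if not node.children:
            return [node.name]
        return [x for c in node.children for x in leaves(c)]

    print(leaves(root))   # ['X1', 'X2', 'X3', 'X4']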
Firstly, we propose a new method called the Bridged-Islands (BI) algorithm for learning latent tree models. Many algorithms have been proposed for the same task, but previous methods have their limitations. For example, the search-based algorithms can only handle data sets with dozens of attributes, which limits their use on large data sets. The algorithms based on attribute clustering (AC) usually focus on latent tree models with special structures (e.g., binary trees or forests), which may not be appropriate for many real-world applications. The algorithms motivated by phylogenetic tree reconstruction (PTR) place restrictions on both observed and latent variables: they normally require that all observed or latent variables have the same number of states and that this number be known beforehand. Compared with AC-based and PTR-motivated algorithms, the advantage of the BI algorithm is model quality: it produces models of better quality and imposes no restrictions on the cardinalities of the observed or latent variables. Compared with search-based algorithms, BI is more efficient and can be applied to larger data sets.
Secondly, we propose a new LTM-based algorithm called hierarchical latent tree analysis (HLTA) for identifying topics in a collection of documents. The predominant method for topic detection is latent Dirichlet allocation (LDA) (Blei et al., 2003), which assumes a generative process for the documents. HLTA provides an alternative view of topic detection: it identifies topics by simultaneously clustering the documents and the word variables. In this view, a topic is a class of documents with a clear thematic meaning. The new method models word co-occurrence patterns using a hierarchical latent tree model. Each latent variable represents a soft partition of the documents based on some word co-occurrence patterns, and each state of the latent variable, which represents a cluster of documents in the partition, is interpreted as a topic. In the hierarchical structure, variables at higher levels correspond to more general topics, while variables at lower levels correspond to more specific topics. Compared with alternative methods, HLTA integrates topic correlation modeling, topic structure detection, and automatic determination of the number of topics in one framework. Empirical experiments also show that the new algorithm produces more semantically coherent topics and topic hierarchies than LDA-based methods.
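The toy sketch below (hypothetical, not the HLTA implementation) illustrates the interpretation just described: given soft document-to-state assignments for one latent variable, each state is read as a topic and characterized by the words most frequent in its document cluster. The document data and posteriors here are randomly generated stand-ins for what a learned model would provide.

    # Illustrative sketch only; data, vocabulary and posteriors are made up.
    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["market", "stock", "trade", "game", "team", "score"]
    docs = rng.integers(0, 2, size=(100, len(vocab)))        # toy binary doc-word data
    p_state_given_doc = rng.dirichlet([1.0, 1.0], size=100)  # toy posterior, 2 states

    for state in range(p_state_given_doc.shape[1]):
        weights = p_state_given_doc[:, state]                # soft cluster membership
        word_scores = weights @ docs / weights.sum()         # word frequency in the cluster
        top = np.argsort(word_scores)[::-1][:3]
        print(f"topic (state {state}):", [vocab[i] for i in top])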
Thirdly, we explore the use of LTMs for multidimensional clustering. Conventional clustering usually assumes that there is only one true clustering in a data set, which does not hold for many real-world data sets that are multifaceted. Multidimensional clustering aims to identify multiple clusterings from the data. Latent tree models have previously been used for multidimensional clustering, where each latent variable in the model represents one way to cluster the data. In this thesis, we apply the new algorithm to this task and handle much larger data sets than was previously possible. We compare LTM-based methods with several alternative algorithms for multidimensional clustering on both labeled and unlabeled data sets. The empirical results indicate that LTM-based methods can identify a rich set of meaningful clusterings while achieving good performance.
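The toy sketch below (an illustration, not the thesis code) shows the idea that different latent variables induce different, simultaneous partitions of the same records; the posteriors are random stand-ins for what a learned latent tree model would provide.

    # Illustrative sketch only; the two "facets" and their posteriors are hypothetical.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 6
    # Toy posteriors P(latent state | record) for two latent variables.
    p_y1 = rng.dirichlet([1, 1], size=n)      # facet 1: 2 clusters
    p_y2 = rng.dirichlet([1, 1, 1], size=n)   # facet 2: 3 clusters

    clustering_1 = p_y1.argmax(axis=1)        # hard assignment under facet 1
    clustering_2 = p_y2.argmax(axis=1)        # hard assignment under facet 2

    print("facet 1 clusters:", clustering_1)  # one way to group the records
    print("facet 2 clusters:", clustering_2)  # a different, simultaneous grouping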