THESIS
2012
xiii, 134 p. : ill. ; 30 cm
Abstract
Latent tree models are a class of probabilistic graphical models. These models have a tree structure, in which the internal nodes represent latent variables whereas the leaf nodes represent manifest variables. They allow fast inference and have been used for multidimensional clustering, latent structure discovery, density estimation, and classification....[
Read more ]
Latent tree models are a class of probabilistic graphical models. These models have a tree structure, in which the internal nodes represent latent variables whereas the leaf nodes represent manifest variables. They allow fast inference and have been used for multidimensional clustering, latent structure discovery, density estimation, and classification.
This thesis makes two contributions to the research of latent tree models. The first contribution is a new application of latent tree models in spectral clustering. In spectral clustering, one defines a similarity matrix for a collection of data points, transforms the matrix to get the so-called Laplacian matrix, finds the eigenvectors of the Laplacian matrix, and obtains a partition of the data points using the leading eigenvectors. The last step is sometimes referred to as rounding, where one needs to decide how many leading eigenvectors to use, to determine the number of clusters, and to partition the data points.
We propose to use latent tree models for rounding. The method differs from previous rounding methods in three ways. First, we relax the assumption that the number of clusters equals the number of eigenvectors used. Second, when deciding how many leading eigenvectors to use, we not only rely on information contained in the leading eigenvectors themselves, but also make use of the subsequent eigenvectors. Third, our method is model-based and solves all the three subproblems of rounding using latent tree models. We evaluate our method on both synthetic and real-world data. The results show that our method works correctly in the ideal case where between-clusters similarity is 0, and degrades gracefully as one moves away from the ideal case.
The second contribution is an extension to latent tree models. While latent tree models have been shown to be useful in data analysis, they contain only discrete variables and are thus limited to discrete data. One has to resort to discretization if analysis on continuous data is needed. However, this leads to loss of information and may make the resulting models harder to interpret.
We propose an extended class of models, called pouch latent tree models, that allow the leaf nodes to represent continuous variables. This extension allows the models to work on continuous data directly. These models also generalize Gaussian mixture models, which are commonly used for model-based clustering on continuous data. We use pouch latent tree models for facilitating variable selection in clustering, and demonstrate on some benchmark data that it is more appropriate to facilitate variable selection than to perform variable selection traditionally. We further demonstrate the usefulness of the models by performing multidimensional clustering on some real-world basketball data. Our results exhibit multiple meaningful clusterings and interesting relationships between the clusterings.
Post a Comment