THESIS
2015
xiii, 115 pages : illustrations ; 30 cm
Abstract
Cluster Analysis is an important research topic in machine learning and data mining.
The objective is to partition data into meaningful clusters, where the number of clusters
might be known or unknown. A variety of approaches have been proposed for the task.
Examples include distance/similarity-based algorithms such as K-means, kernel K-means
and spectral clustering, as well as model-based methods such as Gaussian mixture models
(GMMs). Most previous clustering algorithms target continuous data, that is,
points in a Euclidean space of some fixed dimension. In this thesis, we focus on discrete
data where each attribute takes a finite number of possible values.
A commonly used method for clustering discrete data is latent class analysis (LCA).
LCA is based on a class of finite mixture models known as latent class models (LCMs).
An LCM consists of a single latent variable and a number of observed attributes, and it
assumes that the attributes are mutually independent given the latent variable. This
assumption, known as local independence, is often too strong in practice and can lead to spurious clusters.
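For reference, an LCM with latent class variable Y taking K values and attributes X_1, ..., X_n assigns the joint probability (the notation here is illustrative, not taken from the thesis)

    P(X_1 = x_1, \dots, X_n = x_n) = \sum_{c=1}^{K} P(Y = c) \prod_{i=1}^{n} P(X_i = x_i \mid Y = c),

and the product over attributes inside the sum is precisely the local independence assumption.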
Another method for clustering discrete data is latent tree analysis (LTA), which is
based on latent tree models (LTMs). LTMs allow multiple latent variables, and hence the
local independence assumption is relaxed. Each latent variable represents a soft partition of the data. As
such, LTA yields multiple partitions of the data and is a multidimensional clustering method.
While multiple partitions are desirable in some applications, there are many situations
where one wants only one partition of the data.
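To make the notion of multiple partitions concrete (again with illustrative notation): if an LTM has latent variables Z_1, \dots, Z_m and attributes X_1, \dots, X_n, then each Z_j defines its own soft partition through its posterior

    P(Z_j = k \mid X_1 = x_1, \dots, X_n = x_n), \quad k = 1, \dots, |Z_j|,

so every data point receives one set of soft cluster memberships per latent variable, giving m partitions of the same data.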
In this thesis, we develop and investigate a novel method for clustering discrete data that yields a single partition of the data and does not make the local independence assumption.
The method learns, from data, a three-level LTM where (1) the attributes are at the
bottom, (2) there is a unique latent variable, the root, at the top level, and (3) there are
multiple latent variables at the middle level that connect the attributes and the root. The
root represents the data partition to be obtained. The local independence assumption
is relaxed by the multiple latent variables at the middle level. The method is called
UC-LTM, which stands for unidimensional clustering based on latent tree models.
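In the illustrative notation used above, with root Y, middle-level latent variables Z_1, \dots, Z_m, attributes X_1, \dots, X_n, and pa(i) denoting the middle-level parent of X_i, the three-level model factorizes along the tree as

    P(y, z_1, \dots, z_m, x_1, \dots, x_n) = P(y) \prod_{j=1}^{m} P(z_j \mid y) \prod_{i=1}^{n} P(x_i \mid z_{pa(i)}).

Attributes attached to the same middle-level variable remain dependent even after conditioning on Y alone, which is how local independence is relaxed, while the single partition is read off from the posterior P(y \mid x_1, \dots, x_n).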
In addition to the general-purpose algorithm UC-LTM, we also propose and investigate
a special-purpose unidimensional clustering method for solving the rounding problem of
spectral clustering. Spectral clustering starts with a similarity matrix for a collection
of data points, transforms the matrix to get the so-called Laplacian matrix, finds the
eigenvectors of the Laplacian matrix, and obtains a partition of the data points using
the leading eigenvectors. The last step is called rounding and it consists of three sub-problems:
(1) decide how many leading eigenvectors to use, (2) determine the number of
clusters, and (3) partition the data points. To solve the rounding problem, we treat the
eigenvectors as features, discretize them, and then learn a three-level LTM to cluster the
data points. Our method differs from previous rounding methods in three ways. First,
we relax the assumption that the number of clusters equals the number of eigenvectors
used. Second, when deciding how many leading eigenvectors to use, we not only rely
on information contained in the leading eigenvectors themselves, but also make use of
the subsequent eigenvectors. Third, our method is model-based and solves all three
sub-problems simultaneously.
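For context, the steps leading up to rounding can be sketched in a few lines of code. The sketch below is a generic illustration rather than the thesis implementation: it builds an RBF similarity matrix (the kernel width sigma is an arbitrary illustrative choice), forms the symmetric normalized Laplacian, keeps the leading eigenvectors, and then rounds with plain k-means as a stand-in for the LTM-based rounding described above, which would instead discretize these eigenvector features and learn a three-level LTM over them.

import numpy as np
from scipy.cluster.vq import kmeans2

def leading_spectral_features(X, n_eig, sigma=1.0):
    # Pairwise squared distances and RBF similarities.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # The eigenvectors of L with the smallest eigenvalues are the
    # "leading" eigenvectors used for rounding.
    _, eigvecs = np.linalg.eigh(L)
    return eigvecs[:, :n_eig]

# Toy usage: two well-separated blobs, two leading eigenvectors,
# rounded here with k-means as a placeholder for LTM-based rounding.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (30, 2)), rng.normal(3.0, 0.3, (30, 2))])
features = leading_spectral_features(X, n_eig=2)
_, labels = kmeans2(features, 2, minit='++', seed=0)
print(labels)

A k-means rounding such as this fixes the number of clusters to the number of eigenvectors used and looks only at the retained eigenvectors; the method proposed in the thesis removes both restrictions.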