THESIS
2004
x, 60 leaves : ill. ; 30 cm
Abstract
Clustering problems are central to many knowledge discovery and data mining tasks. However, most existing clustering methods can only work with fixed-dimensional representations of data patterns. In this thesis, we study the clustering of data patterns that are represented as sequences or time series, possibly of different lengths.
We first propose a model-based approach to this problem using mixtures of autoregressive moving average (ARMA) models. We derive an expectation-maximization (EM) algorithm for learning the mixing coefficients as well as the parameters of the component models. To address the model selection problem, we use the Bayesian information criterion (BIC) to determine the number of clusters in the data. Experiments were conducted on a number of simulated and real datasets. Results show that our method compares favorably with methods previously proposed for similar time series clustering tasks.
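The BIC-based model selection step described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the log-likelihood values and parameter counts below are hypothetical stand-ins for what an EM fit of the ARMA mixture would produce at each candidate number of clusters.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    # BIC = -2 * log L + k * log n; lower is better.
    # Penalizes model complexity (n_params) relative to fit quality.
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

# Hypothetical results of EM fits with K = 1..4 mixture components:
# K -> (maximized log-likelihood, number of free parameters)
fits = {1: (-520.0, 5), 2: (-470.0, 11), 3: (-462.0, 17), 4: (-460.0, 23)}
n_obs = 100  # assumed number of observed sequences

scores = {k: bic(ll, p, n_obs) for k, (ll, p) in fits.items()}
best_k = min(scores, key=scores.get)  # number of clusters chosen by BIC
```

With these illustrative numbers, adding a third or fourth component improves the likelihood only slightly, so the complexity penalty dominates and BIC selects two clusters.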
In the second part of the thesis, we present an experimental comparison of several distance measures for ARMA models in the context of time series clustering. Four types of ARMA distance measures are considered, including measures defined based on linear predictive coding (LPC) cepstral coefficients, subspace angles, Fisher scores, and information theoretic distance measures. These distance measures are used with both partitional and hierarchical clustering algorithms in our experiments involving both simulated and real electroencephalogram (EEG) data. Experimental results show that information theoretic measures, particularly the Bhattacharyya distance measure and the Kullback-Leibler divergence, generally outperform other distance measures in clustering quality as measured by the Rand index.
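The Rand index used to score clustering quality can be sketched directly from its definition: the fraction of point pairs on which two clusterings agree, i.e. both place the pair in the same cluster or both place it in different clusters. The example labels below are hypothetical, not from the thesis's EEG experiments.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    # Fraction of point pairs on which the two clusterings agree:
    # either both put the pair together, or both keep it apart.
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

true_labels = [0, 0, 1, 1]
pred_labels = [1, 1, 0, 0]  # same partition under a relabeling -> index 1.0
```

Note that the Rand index is invariant to cluster label permutations, which is why it is a natural score when the learned clusters have no fixed identity.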