THESIS
2010
xi, 56 p. : ill. ; 30 cm
Abstract
The hierarchical naive Bayes (HNB) model [21] is useful for latent variable discovery and classification. It augments a naive Bayes (NB) model with latent variables that represent potential conditional dependencies among attribute variables not captured by the class variable. Hence HNB models can capture more complex relationships among the attributes and alleviate the weaknesses of NB models for classification.
More formally, HNB models are tree-shaped Bayesian networks (BNs) in which the root represents the class variable, the leaf nodes represent attributes, and the internal nodes represent latent variables.
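To make this structure concrete, here is a minimal sketch of such a tree in Python. The Node class and the variable names (C, H1, A1, ...) are illustrative assumptions, not the thesis's notation.

    from dataclasses import dataclass, field
    from typing import List


    @dataclass
    class Node:
        name: str
        kind: str  # "class" (root), "latent" (internal), or "attribute" (leaf)
        children: List["Node"] = field(default_factory=list)


    # Root is the class variable; a latent variable sits between it and the
    # attribute leaves whose dependencies the class alone cannot explain.
    hnb = Node("C", "class", [
        Node("H1", "latent", [Node("A1", "attribute"), Node("A2", "attribute")]),
        Node("A3", "attribute"),
    ])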
This thesis is concerned with the problem of learning HNB models from data. General learning algorithms for HNB models are computationally expensive and thus cannot be applied to large-scale problems. In this thesis, we propose a new, efficient algorithm that builds the model structure in a bottom-up procedure. At each iteration, it first applies conditional mutual information (CMI) to measure the correlations among variables given the class variable. It then selects a strongly correlated subset of variables using a modified version of the Unidimensionality Test (UT) and introduces a latent variable as their common parent. After these two key steps (i.e., CMI and UT), we name the new algorithm CUT.
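The following is a minimal Python sketch of the first ingredient the abstract names: empirical conditional mutual information given the class variable, used here to pick the most strongly correlated attribute pair as candidates for a new latent parent. The function names and toy data are assumptions for illustration; the thesis's actual UT-based subset selection is more elaborate and is not reproduced here.

    from collections import Counter
    from itertools import combinations
    from math import log


    def conditional_mutual_information(xs, ys, cs):
        """Empirical I(X; Y | C) in nats, from three parallel label lists."""
        n = len(cs)
        n_xyc = Counter(zip(xs, ys, cs))
        n_xc = Counter(zip(xs, cs))
        n_yc = Counter(zip(ys, cs))
        n_c = Counter(cs)
        cmi = 0.0
        for (x, y, c), count in n_xyc.items():
            # I(X;Y|C) = sum p(x,y,c) * log[ p(c) p(x,y,c) / (p(x,c) p(y,c)) ];
            # with empirical counts the 1/n factors cancel inside the log.
            cmi += (count / n) * log((n_c[c] * count) / (n_xc[(x, c)] * n_yc[(y, c)]))
        return cmi


    def most_correlated_pair(data, class_col):
        """Return the attribute pair with the highest CMI given the class."""
        attrs = [col for col in data if col != class_col]
        cs = data[class_col]
        return max(
            combinations(attrs, 2),
            key=lambda pair: conditional_mutual_information(
                data[pair[0]], data[pair[1]], cs
            ),
        )


    # Toy usage: A1 and A2 are perfectly correlated given C, A3 is noise-like,
    # so a bottom-up step would give A1 and A2 a common latent parent first.
    data = {
        "C":  [0, 0, 0, 0, 1, 1, 1, 1],
        "A1": [0, 0, 1, 1, 0, 0, 1, 1],
        "A2": [0, 0, 1, 1, 0, 0, 1, 1],
        "A3": [0, 1, 0, 1, 0, 1, 0, 1],
    }
    print(most_correlated_pair(data, "C"))  # expected: ('A1', 'A2')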
We empirically study CUT on several kinds of data sets: synthetic data, UCI data, and real-world data. The results show that CUT is significantly faster than the previous algorithm, and the gap widens as the sample size of the data grows. Meanwhile, it does not significantly compromise model quality, i.e., its performance in latent variable discovery and classification.