THESIS
2017
xii, 95 pages : illustrations ; 30 cm
Abstract
Detecting topics and topic hierarchies from large archives of documents has been one of the
most active research areas in the last decade. The objective of topic detection is to discover the
thematic structure underlying document collections, based on which the collections can be
organized and summarized. Recently, hierarchical latent tree analysis (HLTA) has been proposed
as a new method for topic detection. It uses a class of graphical models called hierarchical
latent tree models (HLTMs) to build a topic hierarchy. The variables at the bottom level
of an HLTM are binary observed variables that represent the presence/absence of words in
a document. The variables at other levels are binary latent variables that represent word
co-occurrence patterns with different granularities. Each latent variable gives a soft partition
of the documents, and document clusters in the partitions are interpreted as topics.
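To make the model structure concrete, the following is a minimal sketch of a toy HLTM, assuming a hypothetical four-word vocabulary; the variable names and word groupings are illustrative only and are not taken from the thesis.

    # Bottom level: binary observed variables, one per word (1 = word present).
    words = ["learning", "network", "market", "price"]

    # Upper levels: binary latent variables; each is the parent of a group of
    # co-occurring words or of lower-level latent variables, forming a tree.
    hltm = {
        "Z1": {"children": ["learning", "network"]},  # one co-occurrence pattern
        "Z2": {"children": ["market", "price"]},      # another co-occurrence pattern
        "Z3": {"children": ["Z1", "Z2"]},             # top-level latent variable
    }

    # Each latent variable soft-partitions the documents into two clusters
    # (its two states); a cluster in which the child words tend to occur
    # together is interpreted as a topic.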
HLTA has been shown to discover significantly better models, more coherent topics, and
better topic hierarchies than state-of-the-art LDA-based hierarchical topic detection methods.
However, HLTA in its current form can hardly be considered a practical topic detection
tool. First, HLTA has a rather prohibitive computational cost; second, HLTA operates only
on binary data. In this thesis, we propose and investigate methods to overcome these shortcomings.
First, we propose a new learning algorithm, PEM-HLTA, to scale up HLTA. HLTA
consists of two phases: a model construction phase and a parameter estimation phase. The
computational bottleneck of HLTA lies in the use of the EM algorithm to estimate parameters
during the model construction phase, which produces a large number of intermediate models.
Here we propose progressive EM (PEM) as a replacement for EM. PEM carries out parameter
estimation in submodels that involve only three or four observed binary variables, which yields
a substantial speed-up. Combined with acceleration techniques applied to the parameter estimation
phase, PEM-HLTA achieves efficiency comparable to the best LDA-based hierarchical topic
detection methods, and excels in model predictive performance, topic coherence, and topic
hierarchy quality.
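As an illustration of the kind of small-submodel estimation that PEM relies on, the sketch below runs plain EM on a single submodel with one binary latent variable and three observed binary word variables. The data and names are made up, and the actual PEM procedure in the thesis composes many such local estimates rather than calling a routine like this directly.

    import numpy as np

    def em_submodel(X, n_iter=50, seed=0):
        """EM for one binary latent variable Z with three binary observed words.

        X: (n_docs, 3) binary matrix for the three words in the submodel.
        Returns pi = P(Z = 1) and theta[j, z] = P(X_j = 1 | Z = z)."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        pi = 0.5
        theta = rng.uniform(0.3, 0.7, size=(d, 2))

        for _ in range(n_iter):
            # E-step: posterior responsibility P(Z = 1 | x) for every document.
            log_p1 = np.log(pi) + X @ np.log(theta[:, 1]) + (1 - X) @ np.log(1 - theta[:, 1])
            log_p0 = np.log(1 - pi) + X @ np.log(theta[:, 0]) + (1 - X) @ np.log(1 - theta[:, 0])
            r = 1.0 / (1.0 + np.exp(log_p0 - log_p1))

            # M-step: update the mixing weight and conditional word probabilities
            # (with a small smoothing constant to avoid zero probabilities).
            pi = r.mean()
            theta[:, 1] = (r @ X + 1e-6) / (r.sum() + 2e-6)
            theta[:, 0] = ((1 - r) @ X + 1e-6) / ((1 - r).sum() + 2e-6)
        return pi, theta

    # Toy usage: six documents over three words.
    X = np.array([[1, 1, 0], [1, 1, 1], [0, 0, 1], [1, 0, 0], [0, 1, 1], [1, 1, 0]])
    print(em_submodel(X))

Because every submodel involves only a handful of variables, each EM run is cheap, which is the source of the speed-up over running EM on the large intermediate models.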
Second, we propose an extension, HLTA-c, to incorporate word counts into PEM-HLTA.
The inability to deal with count data has always put HLTA at a disadvantage as
a topic detection method. We introduce real-valued continuous variables to replace the
observed binary variables in HLTMs and devise a new document generation process. For
each document, HLTA-c first samples values for the latent variables layer by layer via logic
sampling, then draws relative frequencies for the words conditioned on the values of the latent
variables, and finally generates the words of the document using these relative word frequencies.
Empirical results show that, on count data, HLTA-c achieves drastically better model fit
and yields more meaningful topics and topic hierarchies. It is the new state of the art for
hierarchical topic detection.
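The generation process can be pictured with a small forward-sampling sketch. Everything below is a toy reconstruction under simplifying assumptions: the tree has one root and two lower latent variables over a four-word vocabulary, and the relative word frequencies are drawn from a Dirichlet whose parameters depend on the latent states; the thesis may parameterize this step differently.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["learning", "network", "market", "price"]

    def generate_document(doc_length=20):
        # 1. Logic (forward) sampling of the binary latent variables,
        #    layer by layer from the root downwards.
        z3 = rng.random() < 0.5                    # root: P(Z3 = 1) = 0.5
        z1 = rng.random() < (0.8 if z3 else 0.2)   # P(Z1 = 1 | Z3)
        z2 = rng.random() < (0.2 if z3 else 0.8)   # P(Z2 = 1 | Z3)

        # 2. Draw relative word frequencies conditioned on the latent states
        #    (toy choice: a Dirichlet that boosts the words under "on" latents).
        alpha = np.full(len(vocab), 0.5)
        if z1:
            alpha[0:2] += 4.5   # boost "learning", "network"
        if z2:
            alpha[2:4] += 4.5   # boost "market", "price"
        freqs = rng.dirichlet(alpha)

        # 3. Generate the word counts of the document from the relative frequencies.
        counts = rng.multinomial(doc_length, freqs)
        return dict(zip(vocab, counts))

    print(generate_document())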