THESIS
2017
xii, 95 pages : illustrations ; 30 cm
Abstract
Detecting topics and topic hierarchies from large archives of documents has been one of the
most active research areas in the last decade. The objective of topic detection is to discover the
thematic structure underlying document collections, based on which the collections can be
organized and summarized. Recently, hierarchical latent tree analysis (HLTA) has been proposed
as a new method for topic detection. It uses a class of graphical models called hierarchical
latent tree models (HLTMs) to build a topic hierarchy. The variables at the bottom level
of an HLTM are binary observed variables that represent the presence/absence of words in
a document. The variables at other levels are binary latent variables that represent word
co-occurrence patterns with different granularities. Each latent variable gives a soft partition
of the documents, and document clusters in the partitions are interpreted as topics.
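To make the model structure concrete, the following is a minimal sketch of a toy HLTM, assuming a hypothetical four-word vocabulary; the variable names and word groupings are illustrative only and are not taken from the thesis.

    # Bottom level: binary observed variables, one per word (1 = word present).
    words = ["learning", "network", "market", "price"]

    # Upper levels: binary latent variables; each is the parent of a group of
    # co-occurring words or of lower-level latent variables, forming a tree.
    hltm = {
        "Z1": {"children": ["learning", "network"]},  # one co-occurrence pattern
        "Z2": {"children": ["market", "price"]},      # another co-occurrence pattern
        "Z3": {"children": ["Z1", "Z2"]},             # top-level latent variable
    }

    # Each latent variable soft-partitions the documents into two clusters
    # (its two states); a cluster in which the child words tend to occur
    # together is interpreted as a topic.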
HLTA has been shown to discover significantly better models, more coherent topics, and
better topic hierarchies than state-of-the-art LDA-based hierarchical topic detection methods.
However, HLTA in its current form can hardly be considered a practical topic detection
tool. First, HLTA has a rather prohibitive computational cost; second, HLTA operates only
on binary data. In this thesis, we propose and investigate methods to overcome these shortcomings.
First, we propose a new learning algorithm, PEM-HLTA, to scale up HLTA. HLTA
consists of two phases: a model construction phase and a parameter estimation phase. The
computational bottleneck of HLTA lies in the use of the EM algorithm to estimate parameters
during the model construction phase, which produces a large number of intermediate models.
Here we propose progressive EM (PEM) as a replacement for EM. PEM carries out parameter
estimation in submodels that involve only three or four observed binary variables, which yields
a substantial speed-up. Combined with acceleration techniques applied to the parameter estimation
phase, PEM-HLTA achieves efficiency comparable to the best LDA-based hierarchical topic
detection methods, and excels in model predictive performance, topic coherence, and topic
hierarchy quality.
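As an illustration of the kind of small-submodel estimation that PEM relies on, the sketch below runs plain EM on a single submodel with one binary latent variable and three observed binary word variables. The data and names are made up, and the actual PEM procedure in the thesis composes many such local estimates rather than calling a routine like this directly.

    import numpy as np

    def em_submodel(X, n_iter=50, seed=0):
        """EM for one binary latent variable Z with three binary observed words.

        X: (n_docs, 3) binary matrix for the three words in the submodel.
        Returns pi = P(Z = 1) and theta[j, z] = P(X_j = 1 | Z = z)."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        pi = 0.5
        theta = rng.uniform(0.3, 0.7, size=(d, 2))

        for _ in range(n_iter):
            # E-step: posterior responsibility P(Z = 1 | x) for every document.
            log_p1 = np.log(pi) + X @ np.log(theta[:, 1]) + (1 - X) @ np.log(1 - theta[:, 1])
            log_p0 = np.log(1 - pi) + X @ np.log(theta[:, 0]) + (1 - X) @ np.log(1 - theta[:, 0])
            r = 1.0 / (1.0 + np.exp(log_p0 - log_p1))

            # M-step: update the mixing weight and conditional word probabilities
            # (with a small smoothing constant to avoid zero probabilities).
            pi = r.mean()
            theta[:, 1] = (r @ X + 1e-6) / (r.sum() + 2e-6)
            theta[:, 0] = ((1 - r) @ X + 1e-6) / ((1 - r).sum() + 2e-6)
        return pi, theta

    # Toy usage: six documents over three words.
    X = np.array([[1, 1, 0], [1, 1, 1], [0, 0, 1], [1, 0, 0], [0, 1, 1], [1, 1, 0]])
    print(em_submodel(X))

Because every submodel involves only a handful of variables, each EM run is cheap, which is the source of the speed-up over running EM on the large intermediate models.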
Second, we propose an extension, HLTA-c, to incorporate word counts into PEM-HLTA.
The inability to deal with count data has always put HLTA at a disadvantage as
a topic detection method. We introduce real-valued continuous variables to replace the
observed binary variables in HLTMs and devise a new document generation process. For
each document, HLTA-c first samples values for the latent variables layer by layer via logic
sampling, then draws relative frequencies for the words conditioned on the values of the latent
variables, and finally generates the words of the document using these relative word frequencies.
Empirical results show that, on count data, HLTA-c achieves drastically better model fit
and yields more meaningful topics and topic hierarchies. It is the new state of the art for
hierarchical topic detection.
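The generation process can be pictured with a small forward-sampling sketch. Everything below is a toy reconstruction under simplifying assumptions: the tree has one root and two lower latent variables over a four-word vocabulary, and the relative word frequencies are drawn from a Dirichlet whose parameters depend on the latent states; the thesis may parameterize this step differently.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["learning", "network", "market", "price"]

    def generate_document(doc_length=20):
        # 1. Logic (forward) sampling of the binary latent variables,
        #    layer by layer from the root downwards.
        z3 = rng.random() < 0.5                    # root: P(Z3 = 1) = 0.5
        z1 = rng.random() < (0.8 if z3 else 0.2)   # P(Z1 = 1 | Z3)
        z2 = rng.random() < (0.2 if z3 else 0.8)   # P(Z2 = 1 | Z3)

        # 2. Draw relative word frequencies conditioned on the latent states
        #    (toy choice: a Dirichlet that boosts the words under "on" latents).
        alpha = np.full(len(vocab), 0.5)
        if z1:
            alpha[0:2] += 4.5   # boost "learning", "network"
        if z2:
            alpha[2:4] += 4.5   # boost "market", "price"
        freqs = rng.dirichlet(alpha)

        # 3. Generate the word counts of the document from the relative frequencies.
        counts = rng.multinomial(doc_length, freqs)
        return dict(zip(vocab, counts))

    print(generate_document())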