THESIS
2020
1 online resource (xi, 87 pages) : illustrations (some color)
Abstract
High dimensionality is one of the most challenging problems to have arisen in data analysis over the past
decade. Although data mining technology has advanced greatly, new challenges
continue to emerge for specific data structures. To discover previously
unknown patterns and make predictions, we must overcome these challenges. Moreover,
when interactions among explanatory variables are taken into account, the dimensionality
grows even larger. Feature selection is therefore an active research topic in both supervised and
unsupervised learning.
In Essay 1 of this dissertation, we consider a business data mining problem, using the
Amazon employee access data set as an example, to demonstrate the proposed feature selection
and classification methods. First, we apply Naive Bayes classifiers to the data set and
modify them step by step using ideas from Empirical Bayes, grouping, and
migration. Second, we propose a three-stage Bayesian hierarchical model tailored to
the special data structure. Because of the categorical structure, we also propose a variable
selection method, the Coefficient of Dependence (CoD). Finally, ensemble learning is used
to combine the classifiers into a whole. In carrying out this procedure, we apply a technique
that we refer to as Stringing. The newly developed classifiers outperform most
existing models in terms of competition ranking.
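To make the starting point of Essay 1 concrete, the following is a minimal sketch of the kind of baseline Naive Bayes classifier one could fit to categorical access data before applying the Empirical Bayes, grouping, and migration modifications. The file name, column names, and the use of scikit-learn's CategoricalNB are illustrative assumptions, not the dissertation's actual pipeline.

# Minimal sketch: baseline Naive Bayes on categorical access features.
# The file path and the column name "ACTION" are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.naive_bayes import CategoricalNB
from sklearn.metrics import roc_auc_score

df = pd.read_csv("train.csv")                  # assumed Amazon-style access data
X = df.drop(columns=["ACTION"])                # categorical predictors
y = df["ACTION"]                               # binary access decision

# Encode every categorical column as integer codes, fitting on the full data
# so that all category levels are known to the classifier.
enc = OrdinalEncoder()
X_enc = enc.fit_transform(X)

X_tr, X_te, y_tr, y_te = train_test_split(X_enc, y, test_size=0.2, random_state=0)

# min_categories guards against levels that appear only in the held-out split.
clf = CategoricalNB(alpha=1.0, min_categories=[len(c) for c in enc.categories_])
clf.fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))

Each step-by-step modification described above would then replace or refine pieces of this baseline, for example by shrinking per-category estimates toward group-level priors.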
Essay 2 develops a clustering model that we refer to as the Beta-binomial mixture
model. The idea comes from the classic Gaussian Mixture Model (GMM), a method
of distribution-based clustering, in which objects are clustered
according to how likely they are to arise from the same distribution. An Expectation-Maximization (EM)
algorithm is used to fit this unsupervised model.
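As an illustration of how such a model can be estimated, the sketch below runs EM on a two-component beta-binomial mixture for counts x_i out of n_i trials: the E-step computes responsibilities from the current weights and beta-binomial likelihoods, and the M-step updates the weights in closed form and the (alpha, beta) parameters by weighted numerical likelihood maximization. This is an illustrative reconstruction under those assumptions, not the dissertation's exact algorithm.

# Minimal sketch: EM for a K-component beta-binomial mixture (illustrative).
import numpy as np
from scipy.stats import betabinom
from scipy.optimize import minimize
from scipy.special import logsumexp

def em_betabinom_mixture(x, n, K=2, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                    # mixing weights
    ab = rng.uniform(0.5, 2.0, size=(K, 2))     # (alpha, beta) per component
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point.
        log_r = np.log(pi) + np.stack(
            [betabinom.logpmf(x, n, a, b) for a, b in ab], axis=1)
        log_r -= logsumexp(log_r, axis=1, keepdims=True)
        r = np.exp(log_r)
        # M-step: closed-form weights, numerical update of (alpha, beta).
        pi = r.mean(axis=0)
        for k in range(K):
            def nll(params, w=r[:, k]):
                a, b = params
                return -np.sum(w * betabinom.logpmf(x, n, a, b))
            ab[k] = minimize(nll, ab[k], bounds=[(1e-3, None)] * 2,
                             method="L-BFGS-B").x
    return pi, ab, r

# Toy usage: two groups of counts out of 20 trials with different success rates.
rng = np.random.default_rng(1)
n = np.full(200, 20)
x = np.concatenate([rng.binomial(20, 0.2, 100), rng.binomial(20, 0.7, 100)])
pi, ab, r = em_betabinom_mixture(x, n, K=2)
print("weights:", pi)
print("(alpha, beta) per component:\n", ab)

The responsibilities r serve as soft cluster assignments, in the same way that a GMM assigns each object to the component under which it is most probable.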