THESIS
2020
xiv, 105 pages : illustrations ; 30 cm
Abstract
Traditionally, when doing data analysis, we often assume that the data are complete and
follow a Gaussian distribution. However, in practice, missing values frequently occur during
the data observation or recording process. In addition, the distribution of the data in many
applications could have a heavier tail than the Gaussian distribution, due either to the
intrinsic data generation mechanism or to the existence of outliers. Under such scenarios,
the traditional statistical analysis methods based on complete data and Gaussian distribution
are no longer applicable, and new efficient statistical analysis methods for incomplete and
heavy-tailed data need to be designed. The main purpose of this dissertation is to deal with
this challenge and develop efficient parameter estimation methods for two specific problems
with missing values and heavy tails: robust estimation of the mean and covariance matrix
from incomplete data, and parameter estimation for the heavy-tailed autoregressive (AR)
model from incomplete data.
Robust estimation of the mean and covariance matrix is a fundamental problem in statistical
data analysis. In this dissertation, we consider the robust estimation of the mean and
covariance matrix for incomplete multivariate observations with the monotone missing-data
pattern. First, we develop two efficient numerical algorithms for the existing robust estimator
for the monotone incomplete data, i.e., the maximum likelihood (ML) estimator assuming
Student's t-distributed samples. The proposed algorithms can be more than one order of
magnitude faster than the existing algorithms. Then, to deal with the unreliability and the
inapplicability of the Student's t ML estimator when the number of samples is relatively small
compared with the number of parameters, we propose a regularized robust estimator, which
is defined as the maximizer of a penalized log-likelihood. The penalty term is constructed
with a prior target as its global maximizer, towards which the estimator will shrink the mean
and covariance matrix. Two numerical algorithms are derived for the regularized estimator.
Numerical simulations show the fast convergence rates of the proposed algorithms and the
good estimation accuracy of the proposed regularized estimator.
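
As a point of reference for the Student's t ML estimator mentioned above, the following is a minimal sketch of the classical iteratively reweighted (EM-type) update for complete data with a fixed, known number of degrees of freedom. It is not one of the algorithms proposed in the dissertation: it handles neither missing values nor regularization. The function name `student_t_mle` and the parameters `nu`, `n_iter`, and `tol` are illustrative assumptions.

```python
import numpy as np

def student_t_mle(X, nu=4.0, n_iter=200, tol=1e-8):
    """Illustrative Student's t ML estimate of mean and scatter (complete data, fixed nu)."""
    n, p = X.shape
    mu = X.mean(axis=0)                          # start from the sample mean
    sigma = np.cov(X, rowvar=False, bias=True)   # and the sample covariance
    for _ in range(n_iter):
        diff = X - mu
        # Squared Mahalanobis distance of every sample under the current estimate.
        d = np.einsum("ij,ij->i", diff @ np.linalg.inv(sigma), diff)
        # Samples far from the bulk receive small weights.
        w = (nu + p) / (nu + d)
        mu_new = (w[:, None] * X).sum(axis=0) / w.sum()
        diff = X - mu_new
        sigma_new = (w[:, None] * diff).T @ diff / n
        change = np.linalg.norm(mu_new - mu) + np.linalg.norm(sigma_new - sigma)
        mu, sigma = mu_new, sigma_new
        if change < tol:
            break
    return mu, sigma

# Usage: Gaussian samples contaminated by a small cluster of outliers.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
X[:25] += 50.0
mu_t, sigma_t = student_t_mle(X)
```

In this toy setting the sample mean is pulled towards the outliers, while the weighted update downweights them, which is the sense in which the estimator is robust.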
The autoregressive (AR) model is a widely used model to understand time series data.
Traditionally, the innovation noise of the AR model is modeled as Gaussian. However, time series in
many applications are non-Gaussian and have heavy tails; therefore, the AR model with more
general heavy-tailed innovations, namely Student's t-distributed innovations, has been proposed.
Although there are numerous works about Gaussian AR time series with missing values,
as far as we know, there does not exist any work addressing the issue of missing data for
the heavy-tailed AR model. This dissertation considers this issue for the first time, and
proposes an efficient framework for parameter estimation from incomplete heavy-tailed time
series based on a stochastic approximation expectation maximization (SAEM) algorithm coupled with a
Markov chain Monte Carlo (MCMC) procedure. The proposed algorithm is computationally
cheap and easy to implement. The convergence of the proposed algorithm to the stationary
points of the ML estimation problem is rigorously proved. Extensive simulations and analyses of real
datasets demonstrate the efficacy of the proposed framework.
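
To make the data setting concrete, the following minimal sketch simulates a first-order AR series with Student's t innovations and then hides some entries at random. It illustrates only the kind of incomplete heavy-tailed series the estimation framework takes as input, not the SAEM-MCMC procedure itself. The AR(1) order, the missing-completely-at-random mechanism, and the names `simulate_t_ar1`, `phi0`, `phi1`, `scale`, `nu`, and `miss_rate` are assumptions made for illustration.

```python
import numpy as np

def simulate_t_ar1(n, phi0=0.0, phi1=0.5, scale=1.0, nu=3.0, miss_rate=0.1, seed=0):
    """Simulate y[t] = phi0 + phi1 * y[t-1] + eps[t] with Student's t innovations,
    then mark a random subset of samples as missing (NaN)."""
    rng = np.random.default_rng(seed)
    eps = scale * rng.standard_t(nu, size=n)   # heavy-tailed innovations
    y = np.empty(n)
    y[0] = phi0 / (1.0 - phi1)                 # start at the stationary mean
    for t in range(1, n):
        y[t] = phi0 + phi1 * y[t - 1] + eps[t]
    missing = rng.random(n) < miss_rate        # missing completely at random
    missing[0] = False                         # keep the first sample observed
    y_obs = y.copy()
    y_obs[missing] = np.nan
    return y, y_obs

# Usage: a complete series and its incomplete, observed counterpart.
y_full, y_obs = simulate_t_ar1(1000)
```

An estimator for this setting would see only `y_obs`; `y_full` is kept here so that simulated results can be checked against the complete series.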