THESIS
2019
xvi, 264 pages : illustrations ; 30 cm
Abstract
Dealing with large-scale datasets and increasingly complex models has been a major challenge for
optimization methods in machine learning. The stochastic gradient descent (SGD) method has
been widely viewed as an ideal approach for large-scale machine learning problems, whereas the
conventional batch gradient method typically falters. Despite its flexibility and scalability, the
stochastic gradient has high variance, which impedes training. In this thesis, we
introduce new optimization methods for tackling large-scale machine learning tasks, with a focus
on fundamental challenges such as scalability and computational and communication efficiency.
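To make the variance issue concrete, the following is a minimal NumPy sketch (not from the thesis) contrasting a full-batch gradient step with a single-sample stochastic gradient step on a least-squares ERM problem; the problem size, data, and step size are purely illustrative.

```python
# Illustrative sketch: batch vs. stochastic gradient on least-squares ERM.
# The single-sample gradient is an unbiased but noisy estimate of the batch
# gradient; that noise is the variance the abstract refers to.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10                       # hypothetical problem size
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
w = np.zeros(d)

def batch_grad(w):
    # averages per-sample gradients over all n samples: O(n d) per step
    return X.T @ (X @ w - y) / n

def stochastic_grad(w):
    # gradient of one sample drawn uniformly at random: O(d) per step,
    # unbiased for batch_grad(w) but with high variance
    i = rng.integers(n)
    return X[i] * (X[i] @ w - y[i])

lr = 0.01
w_gd  = w - lr * batch_grad(w)        # expensive, low-noise step
w_sgd = w - lr * stochastic_grad(w)   # cheap, noisy step
```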
The first part of this thesis introduces optimization methods for empirical risk minimization
(ERM) problems, based on the principle of variance reduction. We first propose a
fast and scalable stochastic ADMM method for solving ERM problems with complex nonsmooth
regularizers such as the overlapping graph lasso and group lasso. Using similar principles, we also
introduce a stochastic continuation method for convex problems in which both the loss and the
regularizer are nonsmooth. While existing incremental gradient approaches rely crucially on the
assumption that the dataset is finite, we develop two SGD-like algorithms for finite sums with
infinite data (data augmentation). The proposed algorithms outperform existing methods in terms
of both iteration complexity and storage.
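As a rough illustration of the variance-reduction principle this part builds on, here is a minimal SVRG-style sketch; it is not the proposed stochastic ADMM or continuation method, and the argument names (grad_i, full_grad) and default step sizes are assumptions made for the example.

```python
# SVRG-style variance reduction (illustrative only). The control variate
# grad_i(w_snap, i) - full_grad(w_snap) keeps the stochastic estimate unbiased
# while shrinking its variance as w approaches the snapshot w_snap.
import numpy as np

def svrg(grad_i, full_grad, w0, n, lr=0.1, epochs=10, inner=None, seed=0):
    """grad_i(w, i): gradient of the i-th loss term; full_grad(w): full gradient."""
    rng = np.random.default_rng(seed)
    inner = n if inner is None else inner
    w = np.array(w0, dtype=float)
    for _ in range(epochs):
        w_snap = w.copy()
        mu = full_grad(w_snap)                 # full gradient at the snapshot
        for _ in range(inner):
            i = rng.integers(n)
            # variance-reduced stochastic gradient
            g = grad_i(w, i) - grad_i(w_snap, i) + mu
            w = w - lr * g
    return w

# Hypothetical usage on the least-squares ERM above:
# w_star = svrg(lambda w, i: X[i] * (X[i] @ w - y[i]),
#               lambda w: X.T @ (X @ w - y) / n,
#               np.zeros(d), n, lr=0.01)
```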
The second part of this thesis investigates adaptive gradient methods for solving optimization
problems in deep learning. Inspired by recent advances in adaptive gradient methods
for training deep neural networks, we present a fast and powerful optimization algorithm called
the follow-the-moving-leader (FTML) method. The new algorithm significantly outperforms
existing approaches, thereby advancing the state of the art. As coordinate-wise adaptive
methods, including FTML, can generalize worse than SGD in some cases, we propose
blockwise adaptivity, which balances adaptivity and generalization. Both theoretically and
empirically, it improves convergence and generalization over its coordinate-wise counterpart.
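The contrast between coordinate-wise and blockwise adaptive scaling can be sketched as follows; this is a simplification under an assumed block partition (e.g., one block per network layer), not the exact FTML or blockwise-adaptivity update analyzed in the thesis.

```python
# Coordinate-wise vs. blockwise adaptive step-size scaling (illustrative).
# Coordinate-wise methods keep one scale per parameter; the blockwise variant
# shares a single scale within each block, trading adaptivity for generalization.
import numpy as np

def coordinatewise_scale(v, eps=1e-8):
    # one scale per parameter (Adam/RMSprop style), v = second-moment estimate
    return 1.0 / (np.sqrt(v) + eps)

def blockwise_scale(v, blocks, eps=1e-8):
    # one shared scale per block of parameters
    s = np.empty_like(v)
    for idx in blocks:                        # idx: index array for one block
        s[idx] = 1.0 / (np.sqrt(v[idx].mean()) + eps)
    return s

# Hypothetical update with gradient g and second-moment estimate v:
# v = 0.999 * v + 0.001 * g**2
# w -= lr * blockwise_scale(v, blocks) * g    # vs. coordinatewise_scale(v) * g
```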
The last part of this thesis studies two critical aspects of distributed optimization methods:
asynchronicity and communication efficiency. We develop a distributed asynchronous gradient-based
method with a fast linear convergence rate for strongly convex ERM problems, improving
upon existing distributed machine learning algorithms. On the other hand, the scalability of
large-scale distributed training of deep neural networks is often limited by communication
overhead. Motivated by recent advances in optimization with compressed gradients, we propose
a communication-efficient distributed blockwise momentum SGD with error feedback. The proposed
method provably converges to a stationary point at the same asymptotic rate as distributed
synchronous momentum SGD, while achieving a nearly 32x reduction in communication cost.
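The basic idea of error feedback with 1-bit (sign) compression, which is where a near-32x saving over 32-bit gradients comes from, can be sketched for a single worker as follows; the method in the thesis is distributed and blockwise, and the hyperparameters below are placeholders.

```python
# Single-worker sketch of sign compression with error feedback (illustrative).
# The residual e accumulates what compression dropped and is added back on the
# next step, so the information lost to 1-bit quantization is not discarded.
import numpy as np

def compress_sign(x):
    # 1 bit per coordinate plus one shared magnitude, instead of 32-bit floats
    return np.sign(x) * np.abs(x).mean()

def ef_momentum_step(w, g, m, e, lr=0.01, beta=0.9):
    m = beta * m + g                   # local momentum buffer
    corrected = lr * m + e             # add back the error dropped last step
    delta = compress_sign(corrected)   # this is what would be communicated
    e = corrected - delta              # residual kept locally (error feedback)
    return w - delta, m, e

# Hypothetical usage for one step with gradient g:
# w, m, e = ef_momentum_step(w, g, m, e)
```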