THESIS
2019
xvi, 264 pages : illustrations ; 30 cm
Abstract
Dealing with large-scale datasets and increasingly complex models has been a major challenge for
optimization methods in machine learning. The stochastic gradient descent (SGD) method has
been widely viewed as an ideal approach for large-scale machine learning problems, whereas the
conventional batch gradient method typically falters. Despite its flexibility and scalability, the
stochastic gradient has high variance, which impedes training. In this thesis, we
introduce new optimization methods for tackling large-scale machine learning tasks, with a focus
on fundamental challenges such as scalability and computational and communication efficiency.
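To make the variance issue concrete, the following is a minimal NumPy sketch (not from the thesis) contrasting a full-batch gradient step with a single-sample stochastic gradient step on a least-squares ERM problem; the problem size, data, and step size are purely illustrative.

```python
# Illustrative sketch: batch vs. stochastic gradient on least-squares ERM.
# The single-sample gradient is an unbiased but noisy estimate of the batch
# gradient; that noise is the variance the abstract refers to.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10                       # hypothetical problem size
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
w = np.zeros(d)

def batch_grad(w):
    # averages per-sample gradients over all n samples: O(n d) per step
    return X.T @ (X @ w - y) / n

def stochastic_grad(w):
    # gradient of one sample drawn uniformly at random: O(d) per step,
    # unbiased for batch_grad(w) but with high variance
    i = rng.integers(n)
    return X[i] * (X[i] @ w - y[i])

lr = 0.01
w_gd  = w - lr * batch_grad(w)        # expensive, low-noise step
w_sgd = w - lr * stochastic_grad(w)   # cheap, noisy step
```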
The first part of this thesis introduces optimization methods for empirical risk minimization
(ERM) problems, based on the principle of variance reduction. We first propose a
fast and scalable stochastic ADMM method for solving ERM problems with complex nonsmooth
regularizers such as the overlapping graph lasso and group lasso. Using similar principles, we also
introduce a stochastic continuation method for convex problems in which both the loss and the
regularizer are nonsmooth. While existing incremental gradient approaches rely crucially on the
assumption that the dataset is finite, we develop two SGD-like algorithms for finite sums with
infinite data (data augmentation). The proposed algorithms outperform existing methods in terms
of both iteration complexity and storage.
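As a rough illustration of the variance-reduction principle this part builds on, here is a minimal SVRG-style sketch; it is not the proposed stochastic ADMM or continuation method, and the argument names (grad_i, full_grad) and default step sizes are assumptions made for the example.

```python
# SVRG-style variance reduction (illustrative only). The control variate
# grad_i(w_snap, i) - full_grad(w_snap) keeps the stochastic estimate unbiased
# while shrinking its variance as w approaches the snapshot w_snap.
import numpy as np

def svrg(grad_i, full_grad, w0, n, lr=0.1, epochs=10, inner=None, seed=0):
    """grad_i(w, i): gradient of the i-th loss term; full_grad(w): full gradient."""
    rng = np.random.default_rng(seed)
    inner = n if inner is None else inner
    w = np.array(w0, dtype=float)
    for _ in range(epochs):
        w_snap = w.copy()
        mu = full_grad(w_snap)                 # full gradient at the snapshot
        for _ in range(inner):
            i = rng.integers(n)
            # variance-reduced stochastic gradient
            g = grad_i(w, i) - grad_i(w_snap, i) + mu
            w = w - lr * g
    return w

# Hypothetical usage on the least-squares ERM above:
# w_star = svrg(lambda w, i: X[i] * (X[i] @ w - y[i]),
#               lambda w: X.T @ (X @ w - y) / n,
#               np.zeros(d), n, lr=0.01)
```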
The second part of this thesis investigates adaptive gradient methods for solving optimization
problems in deep learning. Inspired by recent advances in adaptive gradient methods
for training deep neural networks, we present a fast and powerful optimization algorithm called
the follow-the-moving-leader (FTML) method. The new algorithm significantly outperforms
existing approaches, thereby advancing the state of the art. As coordinate-wise adaptive
methods, including FTML, can generalize worse than SGD in some cases, we propose
blockwise adaptivity, which balances adaptivity and generalization. Both theoretically and
empirically, it improves convergence and generalization over its coordinate-wise counterpart.
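The contrast between coordinate-wise and blockwise adaptive scaling can be sketched as follows; this is a simplification under an assumed block partition (e.g., one block per network layer), not the exact FTML or blockwise-adaptivity update analyzed in the thesis.

```python
# Coordinate-wise vs. blockwise adaptive step-size scaling (illustrative).
# Coordinate-wise methods keep one scale per parameter; the blockwise variant
# shares a single scale within each block, trading adaptivity for generalization.
import numpy as np

def coordinatewise_scale(v, eps=1e-8):
    # one scale per parameter (Adam/RMSprop style), v = second-moment estimate
    return 1.0 / (np.sqrt(v) + eps)

def blockwise_scale(v, blocks, eps=1e-8):
    # one shared scale per block of parameters
    s = np.empty_like(v)
    for idx in blocks:                        # idx: index array for one block
        s[idx] = 1.0 / (np.sqrt(v[idx].mean()) + eps)
    return s

# Hypothetical update with gradient g and second-moment estimate v:
# v = 0.999 * v + 0.001 * g**2
# w -= lr * blockwise_scale(v, blocks) * g    # vs. coordinatewise_scale(v) * g
```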
The last part of this thesis studies two critical aspects of distributed optimization methods:
asynchronicity and communication efficiency. We develop a distributed asynchronous gradient-based
method with a fast linear convergence rate for strongly convex ERM problems, improving
upon existing distributed machine learning algorithms. On the other hand, the scalability of
large-scale distributed training of deep neural networks is often limited by communication
overhead. Motivated by recent advances in optimization with compressed gradients, we propose
a communication-efficient distributed blockwise momentum SGD with error feedback. The proposed
method provably converges to a stationary point at the same asymptotic rate as distributed
synchronous momentum SGD, while achieving a nearly 32x reduction in communication cost.
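The basic idea of error feedback with 1-bit (sign) compression, which is where a near-32x saving over 32-bit gradients comes from, can be sketched for a single worker as follows; the method in the thesis is distributed and blockwise, and the hyperparameters below are placeholders.

```python
# Single-worker sketch of sign compression with error feedback (illustrative).
# The residual e accumulates what compression dropped and is added back on the
# next step, so the information lost to 1-bit quantization is not discarded.
import numpy as np

def compress_sign(x):
    # 1 bit per coordinate plus one shared magnitude, instead of 32-bit floats
    return np.sign(x) * np.abs(x).mean()

def ef_momentum_step(w, g, m, e, lr=0.01, beta=0.9):
    m = beta * m + g                   # local momentum buffer
    corrected = lr * m + e             # add back the error dropped last step
    delta = compress_sign(corrected)   # this is what would be communicated
    e = corrected - delta              # residual kept locally (error feedback)
    return w - delta, m, e

# Hypothetical usage for one step with gradient g:
# w, m, e = ef_momentum_step(w, g, m, e)
```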