THESIS
2021
1 online resource (xiii, 93 pages) : illustrations (some color)
Abstract
Machine learning (ML) techniques have advanced in leaps and bounds in the past
decade. Their success critically relies on abundant computing power and the
availability of big data: it is impractical to host ML training on a single machine,
and a sole data source usually does not produce a sufficiently general model. By
distributing the ML workload
across multiple machines and utilizing data across multiple silos, we can substantially
improve the quality of ML training. As large-scale ML training is increasingly
deployed in production systems involving multiple entities, improving efficiency and
ensuring the confidentiality of the participants become pressing needs. First, how
can we efficiently train an ML model in a cluster in the presence of heterogeneity?
Second, in the context of federated learning (FL), where multiple data owners
collaboratively train a model, how can we mitigate the overhead introduced by
privacy-preserving techniques? Lastly, in the nuanced case where organizations
that own data but lack ML expertise would like to pool their data and collaborate
with those who have the expertise (the model owner) to train generalizable models,
how can we protect the model owner's intellectual property (model privacy) while
preserving the data privacy of the data owners?
General ML training solutions prove inadequate under the efficiency and privacy
challenges posed by distributed ML. First, traditional distributed ML systems often
conduct asynchronous training to mitigate the impact of stragglers. While this
maximizes training throughput, the price is degraded training quality due to
parameter inconsistency across workers. Second, although techniques like
Homomorphic Encryption (HE) can be conveniently adopted to preserve data privacy
in FL, they induce prohibitively high computation and communication overheads.
Third, there is yet to be a practical solution that protects the model owner's
intellectual property without compromising the data owners' privacy.
To fill these gaps, we profile, analyze, and propose new strategies to improve
training efficiency and strengthen privacy guarantees.
To improve the efficiency of distributed asynchronous training, we first propose a
new distributed synchronization scheme, termed speculative synchronization. Our
scheme allows workers to speculate on the fly about recent parameter updates from
others; when necessary, a worker aborts its ongoing computation, pulls fresher
parameters, and starts over to improve the quality of training. We implement our
scheme and demonstrate that
speculative synchronization achieves substantial speedups over the asynchronous parallel
scheme with minimal communication overhead.
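
To make the idea concrete, the following Python sketch shows how a worker might
speculate mid-computation. The parameter-server API (pull, peek_version, push), the
gradient_steps generator, and the staleness threshold are all illustrative
assumptions, not names from our actual implementation.

    # A minimal sketch of speculative synchronization on one worker.
    # All APIs here (ps.pull/peek_version/push, model.gradient_steps)
    # are hypothetical stand-ins for a parameter-server interface.
    def run_iteration(ps, model, batch, staleness_threshold=8):
        while True:
            params, version = ps.pull()          # fetch fresh parameters
            aborted = False
            for step in model.gradient_steps(params, batch):
                # Speculate mid-computation: has the global model moved on?
                if ps.peek_version() - version > staleness_threshold:
                    aborted = True               # parameters are too stale
                    break
            if not aborted:
                ps.push(step.gradient)           # finished; commit the update
                return
            # Aborted: loop back, pull fresher parameters, and start over.

The trade-off is that an abort discards partial computation, but pushing a gradient
computed on overly stale parameters would hurt convergence even more; since peeking
at the parameter version is a cheap metadata query, speculation itself adds little
communication.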
Second, we present BatchCrypt, a system solution for cross-silo FL that significantly
reduces the encryption and communication overhead caused by HE. Instead of encrypting
individual gradients with full precision, we encode a batch of quantized gradients into a long
integer and encrypt it in one go. To allow gradient-wise aggregation to be performed on
ciphertexts of the encoded batches, we develop new quantization and encoding schemes
along with a novel gradient clipping technique. Our evaluations confirm that BatchCrypt
can effectively reduce the computation and communication overhead.
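
The sketch below illustrates the batching idea in deliberately simplified form: it
uses a plain offset encoding instead of BatchCrypt's actual two's-complement
encoding with sign and padding bits, and omits its analytical clipping. It relies on
the python-paillier (phe) library for additively homomorphic encryption.

    # Simplified BatchCrypt-style batching: quantize gradients, pack many
    # of them into one big integer, and encrypt that integer once. The
    # offset encoding is an illustrative stand-in for BatchCrypt's scheme.
    import numpy as np
    from phe import paillier

    SLOT_BITS = 16            # bits per gradient slot (headroom for sums)
    OFFSET = 1 << 8           # shifts signed 8-bit values into unsigned range

    def quantize(grads, clip=1.0):
        q = np.clip(grads, -clip, clip)
        return np.round(q / clip * 127).astype(np.int64)

    def pack(qgrads):
        packed = 0
        for q in qgrads:                     # one slot per gradient
            packed = (packed << SLOT_BITS) | int(q + OFFSET)
        return packed

    def unpack(packed, n, n_parties):
        out = []
        for _ in range(n):                   # lowest slot holds the last gradient
            out.append((packed & ((1 << SLOT_BITS) - 1)) - n_parties * OFFSET)
            packed >>= SLOT_BITS
        return out[::-1]

    pub, priv = paillier.generate_paillier_keypair(n_length=2048)
    g1, g2 = np.array([0.5, -0.2, 0.9]), np.array([-0.4, 0.1, 0.3])
    c1 = pub.encrypt(pack(quantize(g1)))     # one encryption per batch
    c2 = pub.encrypt(pack(quantize(g2)))
    agg = priv.decrypt(c1 + c2)              # homomorphic gradient-wise sum
    print(unpack(agg, n=3, n_parties=2))     # == quantize(g1) + quantize(g2)

Because each slot reserves more bits than a single quantized gradient needs, the
slot-wise sums contributed by multiple parties fit without carries spilling into
neighboring slots, which is what makes gradient-wise aggregation directly on the
packed ciphertext valid.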
Lastly, to address collaborative learning cases where model privacy is also a
concern, we devise Citadel, a scalable system that protects the privacy of both data
owners and the model owner in untrusted infrastructures with the help of Intel SGX.
Citadel performs training across multiple training enclaves running on behalf of
data owners and an aggregator enclave running on behalf of the model owner. Citadel
further establishes a strong information barrier between these enclaves using
zero-sum masking or hierarchical aggregation to prevent data/model leakage during
collaborative training. We deploy Citadel in the cloud to train various ML models
and show that it scales well while providing strong privacy guarantees.
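
To give a flavor of the information barrier, here is a minimal sketch of zero-sum
masking. For simplicity, a single trusted setup generates all masks; Citadel itself
derives them inside attested SGX enclaves so that no single party ever sees them
all.

    # Zero-sum masking sketch: each training enclave reveals only a masked
    # update; the masks cancel in aggregate, so the aggregator learns the
    # sum but no individual update. The trusted mask setup below is a
    # simplifying assumption for illustration.
    import numpy as np

    def make_zero_sum_masks(n_parties, dim, rng):
        masks = [rng.integers(-1 << 20, 1 << 20, size=dim)
                 for _ in range(n_parties - 1)]
        masks.append(-np.sum(masks, axis=0))   # last mask cancels the rest
        return masks

    rng = np.random.default_rng(42)
    updates = [np.array([3, -1, 7]), np.array([0, 4, -2]), np.array([5, 5, 5])]
    masks = make_zero_sum_masks(len(updates), dim=3, rng=rng)

    masked = [u + m for u, m in zip(updates, masks)]   # what enclaves reveal
    aggregate = np.sum(masked, axis=0)                 # masks sum to zero
    print(aggregate)                                   # [ 8  8 10 ]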