THESIS
2021
1 online resource (xiii, 93 pages) : illustrations (some color)
Abstract
Machine learning (ML) techniques have advanced in leaps and bounds in the past
decade. Their success critically relies on abundant computing power and the
availability of big data: it is impractical to host ML training on a single machine,
and a sole data source usually does not produce a sufficiently general model. By
distributing the ML workload
across multiple machines and utilizing data across multiple silos, we can substantially
improve the quality of ML training. As large-scale ML training is increasingly
deployed in production systems involving multiple entities, improving efficiency and
ensuring the confidentiality of the participants become pressing needs. First, how
can we efficiently train an ML model in a cluster in the presence of heterogeneity?
Second, in the context of federated learning (FL), where multiple data owners
collaboratively train a model, how can we mitigate the overhead introduced by
privacy-preserving techniques? Lastly, in the nuanced case where organizations
that own data but lack ML expertise would like to pool their data and collaborate
with those who have the expertise (the model owner) to train generalizable models,
how can we protect the model owner's intellectual property (model privacy) while
preserving the data privacy of the data owners?
General ML training solutions prove inadequate under the efficiency and privacy
challenges posed by distributed ML. First, traditional distributed ML systems often
conduct asynchronous training to mitigate the impact of stragglers. While this
maximizes training throughput, the price is degraded training quality due to
parameter inconsistency across workers. Second, although techniques like
Homomorphic Encryption (HE) can be conveniently adopted to preserve data privacy
in FL, they induce prohibitively high computation and communication overheads.
Third, there is yet to be a practical solution that protects the model owner's
intellectual property without compromising the data owners' privacy.
To fill these gaps, we profile, analyze, and propose new strategies to improve
training efficiency and strengthen privacy guarantees.
To improve the efficiency of distributed asynchronous training, we first propose a
new distributed synchronization scheme, termed speculative synchronization. Our
scheme allows workers to speculate on the fly about recent parameter updates from
others; when necessary, a worker aborts its ongoing computation, pulls fresher
parameters, and starts over to improve the quality of training. We implement our
scheme and demonstrate that
speculative synchronization achieves substantial speedups over the asynchronous parallel
scheme with minimal communication overhead.
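
To make the idea concrete, the following Python sketch shows how a worker might
speculate mid-computation. The parameter-server API (pull, peek_version, push), the
gradient_steps generator, and the staleness threshold are all illustrative
assumptions, not names from our actual implementation.

    # A minimal sketch of speculative synchronization on one worker.
    # All APIs here (ps.pull/peek_version/push, model.gradient_steps)
    # are hypothetical stand-ins for a parameter-server interface.
    def run_iteration(ps, model, batch, staleness_threshold=8):
        while True:
            params, version = ps.pull()          # fetch fresh parameters
            aborted = False
            for step in model.gradient_steps(params, batch):
                # Speculate mid-computation: has the global model moved on?
                if ps.peek_version() - version > staleness_threshold:
                    aborted = True               # parameters are too stale
                    break
            if not aborted:
                ps.push(step.gradient)           # finished; commit the update
                return
            # Aborted: loop back, pull fresher parameters, and start over.

The trade-off is that an abort discards partial computation, but pushing a gradient
computed on overly stale parameters would hurt convergence even more; since peeking
at the parameter version is a cheap metadata query, speculation itself adds little
communication.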
Second, we present BatchCrypt, a system solution for cross-silo FL that significantly
reduces the encryption and communication overhead caused by HE. Instead of encrypting
individual gradients with full precision, we encode a batch of quantized gradients into a long
integer and encrypt it in one go. To allow gradient-wise aggregation to be performed on
ciphertexts of the encoded batches, we develop new quantization and encoding schemes
along with a novel gradient clipping technique. Our evaluations confirm that BatchCrypt
can effectively reduce the computation and communication overhead.
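
The sketch below illustrates the batching idea in deliberately simplified form: it
uses a plain offset encoding instead of BatchCrypt's actual two's-complement
encoding with sign and padding bits, and omits its analytical clipping. It relies on
the python-paillier (phe) library for additively homomorphic encryption.

    # Simplified BatchCrypt-style batching: quantize gradients, pack many
    # of them into one big integer, and encrypt that integer once. The
    # offset encoding is an illustrative stand-in for BatchCrypt's scheme.
    import numpy as np
    from phe import paillier

    SLOT_BITS = 16            # bits per gradient slot (headroom for sums)
    OFFSET = 1 << 8           # shifts signed 8-bit values into unsigned range

    def quantize(grads, clip=1.0):
        q = np.clip(grads, -clip, clip)
        return np.round(q / clip * 127).astype(np.int64)

    def pack(qgrads):
        packed = 0
        for q in qgrads:                     # one slot per gradient
            packed = (packed << SLOT_BITS) | int(q + OFFSET)
        return packed

    def unpack(packed, n, n_parties):
        out = []
        for _ in range(n):                   # lowest slot holds the last gradient
            out.append((packed & ((1 << SLOT_BITS) - 1)) - n_parties * OFFSET)
            packed >>= SLOT_BITS
        return out[::-1]

    pub, priv = paillier.generate_paillier_keypair(n_length=2048)
    g1, g2 = np.array([0.5, -0.2, 0.9]), np.array([-0.4, 0.1, 0.3])
    c1 = pub.encrypt(pack(quantize(g1)))     # one encryption per batch
    c2 = pub.encrypt(pack(quantize(g2)))
    agg = priv.decrypt(c1 + c2)              # homomorphic gradient-wise sum
    print(unpack(agg, n=3, n_parties=2))     # == quantize(g1) + quantize(g2)

Because each slot reserves more bits than a single quantized gradient needs, the
slot-wise sums contributed by multiple parties fit without carries spilling into
neighboring slots, which is what makes gradient-wise aggregation directly on the
packed ciphertext valid.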
Lastly, to address collaborative learning cases where model privacy is also a
concern, we devise Citadel, a scalable system that protects the privacy of both data
owners and the model owner in untrusted infrastructures with the help of Intel SGX.
Citadel performs training across multiple training enclaves running on behalf of
data owners and an aggregator enclave running on behalf of the model owner. Citadel
further establishes a strong information barrier between these enclaves using
zero-sum masking or hierarchical aggregation to prevent data/model leakage during
collaborative training. We deploy Citadel in the cloud to train various ML models
and show that it scales well while providing strong privacy guarantees.
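
To give a flavor of the information barrier, here is a minimal sketch of zero-sum
masking. For simplicity, a single trusted setup generates all masks; Citadel itself
derives them inside attested SGX enclaves so that no single party ever sees them
all.

    # Zero-sum masking sketch: each training enclave reveals only a masked
    # update; the masks cancel in aggregate, so the aggregator learns the
    # sum but no individual update. The trusted mask setup below is a
    # simplifying assumption for illustration.
    import numpy as np

    def make_zero_sum_masks(n_parties, dim, rng):
        masks = [rng.integers(-1 << 20, 1 << 20, size=dim)
                 for _ in range(n_parties - 1)]
        masks.append(-np.sum(masks, axis=0))   # last mask cancels the rest
        return masks

    rng = np.random.default_rng(42)
    updates = [np.array([3, -1, 7]), np.array([0, 4, -2]), np.array([5, 5, 5])]
    masks = make_zero_sum_masks(len(updates), dim=3, rng=rng)

    masked = [u + m for u, m in zip(updates, masks)]   # what enclaves reveal
    aggregate = np.sum(masked, axis=0)                 # masks sum to zero
    print(aggregate)                                   # [ 8  8 10 ]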