Deep learning has demonstrated remarkable performance in various computer vision and natural language processing tasks over the past decade. Deep learning models are now fundamental building blocks for applications including autonomous driving, cloud video analytics, sentiment analysis, and natural language inference. To achieve high accuracy on demanding tasks, deep neural networks grow rapidly in size and computational complexity and require large volumes of high-fidelity data, making training and inference time-consuming and costly. These challenges have become salient and motivate practitioners to focus on building machine learning systems.
In recent years, the intersection of traditional computer systems and machine learning has attracted considerable research attention, including applying machine learning techniques or learned policies in system designs (i.e., machine learning for systems) and optimizing systems specifically for machine learning pipelines and workloads (i.e., systems for machine learning). Combining the two, research on using machine learning techniques to optimize machine learning systems has shown significant efficiency improvements by exploiting the inherent mechanisms of learning tasks.
This dissertation proposes and discusses new directions for optimizing the speed, accuracy, and system overhead of machine learning training and inference systems in different applications using learning-based techniques. We find that aligning system designs with machine learning workloads lets systems prioritize the data, neural network parameters, and computation that machine learning tasks actually need, improving performance: for example, achieving high-quality edge-cloud video analytics with low bandwidth consumption using optimized video data that preserves the necessary information, reducing models' training computation by focusing on under-trained parameters, and adaptively assigning less model capacity to simpler natural language queries with real-time semantic understanding. With three case studies ranging from training to inference and from computer vision to natural language processing, we show that using learning-based techniques to optimize the design of machine learning systems can directly benefit the efficiency of machine learning applications.
First, we propose and analyze Runespoor, an edge-cloud video analytics system that uses super-resolution to mitigate the accuracy loss caused by compressing data sent over the network. Emerging deep learning-based video analytics tasks, e.g., object detection and semantic segmentation, demand computation-intensive neural networks and powerful computing resources on the cloud to achieve high inference accuracy. Due to latency requirements and limited network bandwidth, edge-cloud systems adaptively compress the data to strike a balance between overall analytics accuracy and bandwidth consumption. However, the degraded data leads to another issue: poor tail accuracy, i.e., extremely low accuracy on a few semantic classes and video frames. Modern applications like autonomous robotics especially value tail accuracy, but suffer when using prior edge-cloud systems. Our analytics-aware super-resolution extends super-resolution, an effective technique that learns a mapping from low-resolution frames to high-resolution frames. On the server, Runespoor reconstructs high-resolution frames with augmented details from compressed low-resolution data, tailored to the tail accuracy of video analytics tasks. Our evaluation shows that Runespoor improves class-wise tail accuracy by up to 300%, improves frame-wise 90%/99% tail accuracy by up to 22%/54%, and greatly improves the overall accuracy-bandwidth trade-off.
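To illustrate the core building block, the sketch below is a minimal SRCNN-style super-resolution model in PyTorch that learns a mapping from low-resolution to high-resolution frames. It is an illustrative example only, not Runespoor's analytics-aware model; the layer sizes, upscale factor, and frame dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySR(nn.Module):
    """Minimal super-resolution sketch (SRCNN-style), for illustration only."""
    def __init__(self, upscale: int = 2):
        super().__init__()
        self.upscale = upscale
        self.feat = nn.Conv2d(3, 64, kernel_size=9, padding=4)    # feature extraction
        self.map = nn.Conv2d(64, 32, kernel_size=5, padding=2)    # non-linear mapping
        self.recon = nn.Conv2d(32, 3, kernel_size=5, padding=2)   # reconstruction

    def forward(self, lr_frame: torch.Tensor) -> torch.Tensor:
        # Upsample the compressed low-resolution frame, then refine its details.
        x = F.interpolate(lr_frame, scale_factor=self.upscale,
                          mode="bicubic", align_corners=False)
        x = F.relu(self.feat(x))
        x = F.relu(self.map(x))
        return self.recon(x)

# Example: reconstruct a higher-resolution frame from a compressed 240x426 input.
sr = TinySR(upscale=2)
low_res = torch.rand(1, 3, 240, 426)   # one compressed low-resolution frame
high_res = sr(low_res)                 # shape: (1, 3, 480, 852)
```

An analytics-aware variant would train such a network against the downstream task's loss rather than pixel fidelity alone, which is the direction Runespoor takes.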
Next, we explore Egeria, a knowledge-guided deep learning training system that employs semantic knowledge from a reference model and knowledge distillation techniques to accelerate model training by accurately evaluating individual layers' training progress, safely freezing the converged ones, and saving their corresponding backward computation and communication. Training deep neural networks is time-consuming. While most existing efficient training solutions try to overlap or schedule computation and communication, Egeria goes one step further by skipping them through layer freezing. The key insight is that the training progress of internal neural network layers differs significantly, and front layers often become well-trained much earlier than deeper layers. To exploit this, we introduce the notion of training plasticity to quantify the training progress of layers. Informed by the latest knowledge distillation research, we use a reference model that is generated on the fly with quantization techniques and runs forward operations asynchronously on available CPUs to minimize the overhead. Our experiments with popular vision and language models show that Egeria achieves a 19%-43% training speedup over the state-of-the-art without sacrificing accuracy.
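The sketch below illustrates the freezing mechanism in PyTorch: a layer whose output has barely changed relative to an earlier reference snapshot is treated as converged, and its gradients are no longer computed or communicated. The plasticity proxy, threshold, and reference snapshot are simplified assumptions; Egeria's actual plasticity metric, quantized reference model, and asynchronous CPU evaluation are not reproduced here.

```python
import copy
import torch
import torch.nn as nn

def layer_plasticity(layer: nn.Module, ref_layer: nn.Module,
                     sample: torch.Tensor) -> float:
    """Illustrative proxy: relative change of a layer's output versus an
    earlier reference snapshot of that layer, on the same input."""
    with torch.no_grad():
        out, ref_out = layer(sample), ref_layer(sample)
        return ((out - ref_out).norm() / (ref_out.norm() + 1e-8)).item()

def maybe_freeze(layer: nn.Module, ref_layer: nn.Module,
                 sample: torch.Tensor, threshold: float = 1e-2) -> bool:
    """Freeze a layer once its plasticity drops below the threshold, so its
    backward computation and gradient communication are skipped."""
    if layer_plasticity(layer, ref_layer, sample) < threshold:
        for p in layer.parameters():
            p.requires_grad_(False)
        return True
    return False

# Usage sketch: snapshot a reference copy, train for a while, then test whether
# the front layer has stopped changing and can be frozen.
model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 10))
reference = copy.deepcopy(model).eval()   # stands in for the quantized CPU reference
# ... run several training iterations on `model` here ...
sample = torch.rand(32, 128)
frozen = maybe_freeze(model[0], reference[0], sample)
```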
Finally, we present Tabi, an inference system with a multi-level inference engine optimized for large language models and diverse workloads by exploiting the prediction confidence of neural networks and the Transformer's attention mechanism. Today's trend of building ever-larger language models, while advancing the performance of natural language processing, adds significant latency to the inference stage. We observe that, due to the diminishing returns of adding model parameters, a smaller model could make the same prediction as a costly large language model for a majority of queries. Based on this observation, we design Tabi, which can serve queries using both small models and optional large ones, unlike the traditional one-model-for-all pattern. Tabi is optimized for discriminative models (i.e., not generative LLMs) in a serving framework. Tabi uses calibrated confidence scores to decide whether to return the small models' results immediately or re-route the queries to large models. For re-routed queries, it uses attention-based word pruning and weighted ensemble techniques to offset the system overhead and accuracy loss. Tabi achieves a 21%-40% average latency reduction (with comparable tail latency) over the state-of-the-art while meeting high accuracy targets.
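As a rough illustration of the routing logic, the sketch below serves a single query with a small model first and falls back to a large model only when the calibrated confidence is low. The temperature, threshold, and ensemble weight are illustrative assumptions, and Tabi's attention-based word pruning is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def route_and_predict(query_feats: torch.Tensor,
                      small_model: nn.Module,
                      large_model: nn.Module,
                      temperature: float = 1.5,   # from post-hoc calibration (assumed)
                      threshold: float = 0.9,     # confidence cutoff for early return
                      ensemble_w: float = 0.3):   # weight of the small model's vote
    """Serve one query: answer with the small model when it is confident,
    otherwise re-route to the large model and ensemble both predictions."""
    with torch.no_grad():
        small_probs = F.softmax(small_model(query_feats) / temperature, dim=-1)
        confidence, pred = small_probs.max(dim=-1)
        if confidence.item() >= threshold:      # assumes a single query (batch size 1)
            return pred, "small"                # fast path: return immediately

        large_probs = F.softmax(large_model(query_feats), dim=-1)
        mixed = ensemble_w * small_probs + (1 - ensemble_w) * large_probs
        return mixed.argmax(dim=-1), "large"    # slow path: weighted ensemble

# Example with stand-in classifiers over 768-dimensional query features (2 classes).
small = nn.Linear(768, 2)
large = nn.Sequential(nn.Linear(768, 2048), nn.ReLU(), nn.Linear(2048, 2))
label, path = route_and_predict(torch.rand(1, 768), small, large)
```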