THESIS
2022
1 online resource (xii, 117 pages) : illustrations (some color)
Abstract
With the proliferation of data emerges a myriad of dataflow frameworks. When they
are deployed in a datacenter and productized as a service, their performance and cost
become two primary concerns. However, performance issues prevail in dataflow computation.
Their diagnosis is complicated by the heterogeneity of dataflow frameworks
because the frameworks differ in underlying design, application domain, and computation
complexity. This heterogeneity makes it hard for service providers and users to debug and localize problems. Performance issues also drive up resource costs: the datacenter operator cannot easily determine an allocation that guarantees stable performance, leading to unwanted resource waste.
To tackle the challenges of performance and cost, the dissertation first characterizes
dataflow computation in a large datacenter by analyzing a recently released workload
trace. It examines the static properties of job DAGs and the runtime characteristics of
their task execution. Statically, the DAGs are discovered to exhibit high artificiality when
compared with random graphs. The dependent tasks may have significant variability in
resource usage and duration—–even for recurring tasks. The results confirm the challenge of performance debugging and resource allocation.
To diagnose performance issues, the dissertation enables resource observability by proposing CrystalPerf, a new approach that learns to characterize the performance of dataflow computation through code analysis. It requires no code instrumentation
and applies to a wide variety of dataflow frameworks. Our key insight is that the
source code of an operation contains learnable syntactic and semantic patterns that reveal
how it uses resources. Our approach establishes a performance-resource model that,
given a dataflow program, infers automatically how much time each operation has spent
on each resource (e.g., CPU, network, disk) from past execution traces and the program
source code, using machine learning techniques. Extensive evaluations and real-world
case studies show that CrystalPerf can predict job performance and accurately detect runtime
bottlenecks of DAG jobs.
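As a toy illustration of the intuition, not CrystalPerf's actual learned model, one can imagine splitting an operation's observed wall time across resources in proportion to resource-suggestive keywords found in its source; the keyword lists and the example snippet below are invented for this sketch:

```python
import re

# Hypothetical keyword cues: tokens in an operation's source that hint at
# which resource its time is spent on. The real system *learns* syntactic
# and semantic patterns; this fixed table is purely illustrative.
RESOURCE_CUES = {
    "cpu":     ["map", "hash", "sort", "compute"],
    "network": ["fetch", "shuffle", "send", "recv"],
    "disk":    ["read", "write", "spill", "flush"],
}

def resource_breakdown(source, total_time):
    """Split an operation's observed time across resources in proportion
    to cue-keyword counts in its source (toy heuristic)."""
    tokens = re.findall(r"\w+", source.lower())
    counts = {r: sum(tokens.count(k) for k in cues)
              for r, cues in RESOURCE_CUES.items()}
    total = sum(counts.values()) or 1  # avoid dividing by zero
    return {r: total_time * c / total for r, c in counts.items()}

src = "data = read(path)\nsend(shuffle(data))"
print(resource_breakdown(src, 9.0))
# network gets 6.0s, disk 3.0s, cpu 0.0s of the 9.0s observed
```

The point of the sketch is only the shape of the output: a per-resource time attribution inferred from source text plus an execution trace, which is what lets a bottleneck (here, network) be flagged without instrumenting the job.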
To reduce resource costs, the dissertation proposes Owl, an overcommitted scheduler for executing dataflow computation on serverless platforms. It achieves high utilization without compromising performance through a dual approach. (1) For infrequently invoked functions, it allocates resources to sandboxes with a usage-based heuristic, continuously monitors their performance, and remedies any detected degradation. (2) For frequently invoked functions, Owl profiles the interference patterns among collocated functions and places sandboxes under the guidance of these profiles. Owl further consolidates idle sandboxes to reduce resource waste. We prototype Owl in our production system and implement a representative benchmark suite to evaluate it. The results demonstrate that the prototype reduces VM cost by 43.80% and effectively mitigates latency degradation with negligible overhead.
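To give a flavor of profile-guided overcommitted placement, here is a greedy sketch under assumed inputs; the abstract does not describe Owl's scheduler at this level, so the overcommit ratio, the conflict-set representation, and all names below are hypothetical. Hosts may be packed beyond nominal capacity, but function pairs that the interference profile flags are never collocated:

```python
def place(sandboxes, hosts, capacity, overcommit=1.5, conflicts=frozenset()):
    """Greedy overcommitted placement: a host may be filled up to
    overcommit * capacity, but two functions whose pair appears in
    `conflicts` (an interference profile) are never collocated."""
    placement = {h: [] for h in hosts}
    for fn, demand in sandboxes:
        for h in hosts:
            used = sum(d for _, d in placement[h])
            neighbors = {f for f, _ in placement[h]}
            if (used + demand <= overcommit * capacity
                    and all(frozenset((fn, f)) not in conflicts
                            for f in neighbors)):
                placement[h].append((fn, demand))
                break
        else:
            raise RuntimeError(f"no feasible host for {fn}")
    return placement

# Two interfering functions ("a", "b") land on different hosts even though
# a single overcommitted host could fit both by capacity alone.
result = place(
    sandboxes=[("a", 1.0), ("b", 1.0)],
    hosts=["h1", "h2"],
    capacity=1.5,
    conflicts={frozenset(("a", "b"))},
)
print(result)  # {'h1': [('a', 1.0)], 'h2': [('b', 1.0)]}
```

The design point the sketch captures is the trade-off in the paragraph above: overcommitment (here, `overcommit * capacity`) recovers utilization, while the profiled conflict set is what prevents that overcommitment from degrading latency.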