THESIS
2022
1 online resource (xii, 100 pages) : illustrations (some color)
Abstract
Performance optimizations on GPUs are not yet well understood. This thesis
discusses the principles and automation of performance optimizations on NVIDIA
GPUs, with a special focus on compute-bound kernels, and concentrates on the
abstraction layer between portable virtual instruction sets (e.g., LLVM IR,
NVIDIA PTX) and native hardware assembly.
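To make the layering concrete, the same scalar addition might appear as follows at each level; the kernel is a hypothetical example, and the PTX and SASS lines are illustrative sketches rather than actual compiler output:

    // CUDA C++ source: one level above the portable virtual ISA.
    __global__ void add(int *out, int a, int b) {
        *out = a + b;
    }

    // PTX (portable virtual ISA), sketch of the addition itself:
    //     add.s32 %r3, %r1, %r2;
    //
    // SASS (native hardware assembly, Turing-style), sketch:
    //     IADD3 R0, R1, R2, RZ;
    //
    // The PTX -> SASS step (register allocation, instruction
    // scheduling) is performed by ptxas and is the focus here.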
We first introduce the native GPU instruction set, Shader ASSembly (SASS).
Previously, the public could not customize SASS generation, as the only way to
generate SASS was through the closed-source proprietary compiler ptxas, which
hides many important optimizations, including instruction scheduling. We
develop an open-source assembler, TuringAs, that lets the public manipulate
SASS, and with it we identify new optimization opportunities at the SASS level.
For instance, using certain native SASS instructions helps reduce register
pressure, and reordering SASS instructions yields better instruction-level
parallelism and thus higher throughput. We evaluate the effectiveness of our
optimizations on Winograd convolution (a fast convolution algorithm) and
Tensor Core matrix multiplication.
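As a minimal sketch of the instruction-level-parallelism idea (a hypothetical kernel, not taken from the thesis): splitting one dependent accumulation chain into several independent chains gives the scheduler independent instructions to interleave, hiding instruction latency.

    // CUDA sketch of the ILP principle (hypothetical example).
    // A single accumulator forms one long dependency chain; four
    // independent accumulators let consecutive multiply-adds issue
    // back-to-back instead of stalling on the previous result.
    // One block assumed for brevity; tail elements are ignored.
    __global__ void dot_ilp(const float *a, const float *b,
                            float *out, int n) {
        float acc0 = 0.f, acc1 = 0.f, acc2 = 0.f, acc3 = 0.f;
        for (int i = 4 * threadIdx.x; i + 3 < n; i += 4 * blockDim.x) {
            acc0 += a[i]     * b[i];      // these four multiply-adds
            acc1 += a[i + 1] * b[i + 1];  // are mutually independent,
            acc2 += a[i + 2] * b[i + 2];  // so the scheduler is free
            acc3 += a[i + 3] * b[i + 3];  // to interleave them
        }
        atomicAdd(out, acc0 + acc1 + acc2 + acc3);
    }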
Next, we introduce our effort to automate SASS optimizations to improve
productivity. Programming directly in SASS does not scale to large numbers of
kernels or to new GPU architectures. We develop GASS, an LLVM-based compiler
that automatically translates a high-level virtual representation (i.e., LLVM
IR) into optimized SASS. We highlight our newly proposed instruction scheduler
for compute-bound deep learning kernels, our customization of the if-conversion
pass, and our algorithms for resolving data dependencies. Our evaluation shows
that the algorithms in GASS outperform LLVM's by a considerable margin and that
GASS is on par with the highly optimized proprietary compiler ptxas.
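As a sketch of what if-conversion does in general (a hypothetical example, not GASS's actual pass): a short branch is flattened into straight-line, predicated code, removing the jump from the instruction stream.

    // Hypothetical before/after view of if-conversion.
    __global__ void relu_branch(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            if (x[i] < 0.f)   // short, potentially divergent branch
                x[i] = 0.f;
        }
    }

    // After if-conversion the inner branch becomes a branch-free
    // select; in SASS this roughly corresponds to a predicate-setting
    // compare (e.g., FSETP) followed by predicated instructions
    // rather than a jump.
    __global__ void relu_flat(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = x[i];
            x[i] = (v < 0.f) ? 0.f : v;
        }
    }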