THESIS
2019
xiv, 111 pages : illustrations ; 30 cm
Abstract
The default method to study application-architecture interactions is cycle-accurate simulation.
Statistical simulation is an alternative method that approaches these interactions from a
different angle than time. It has been demonstrated that statistical simulation offers new possibilities
to substantially speed up simulation. The common way to build a statistical simulator
is to use the reuse distance (RD) memory locality model. Unfortunately, the RD model can capture
only a single locality granularity, such as cache-line locality. This limitation leads to
considerable error when evaluating multi-level caches. In addition, RD alone is only suitable
to model single-core applications. Therefore, existing statistical simulators lack effective
memory locality models for multiprocessor applications and often neglect data-sharing between
threads. Moreover, the typical method to speed up statistical simulations is to blindly reduce
the trace length to be synthesized. While this gives good control over the speedup, it leaves
the simulation error unbounded. In this thesis, we address these issues. We first introduce a
generalization to the RD that can capture the locality seen at multiple granularities. We refer to
it as hierarchical reuse distance (HRD). Our results show that HRD is 4X more accurate than
RD when simulating single-core systems with multi-level caches. HRD also converges three
orders of magnitude faster than RD. The second contribution is a novel s̲h̲a̲ring-l̲o̲cality m̲odel
(Shalom). Shalom can capture and reproduce data sharing in multithreaded applications.
The third contribution is a method to bound the statistical simulation error for a particular metric
while maximizing the speedup. We achieve this by monitoring the convergence of the statistical
synthesis. We name it c̲o̲n̲vergence-d̲e̲termin̲istic s̲imulation (Condens). In a set of experiments,
the combination of Shalom and Condens is on average 234X faster than cycle-accurate simulation,
with a simulation error of 15.4%. Our approach is also 48X faster than state-of-the-art
sampling simulation at the same accuracy level. Compared to statistical simulators that ignore
sharing, our technique is 3X more accurate for performance metrics and 5X more accurate for
cache miss estimations.
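
As an illustration of the single-granularity limitation mentioned in the abstract, the following minimal sketch computes classic reuse distances (the number of distinct blocks touched between two accesses to the same block) for one address trace at two block sizes. It is not the thesis's HRD, Shalom, or Condens; the function name `reuse_distances`, the `block_size` parameter, and the sample trace are assumptions made only for this example.

```python
def reuse_distances(addresses, block_size=64):
    """Return one reuse distance per access, measured at the given granularity.

    The distance is the number of distinct blocks touched since the previous
    access to the same block; None marks a first (cold) access.
    This is a simple O(N*M) LRU-stack sketch, not an optimized profiler.
    """
    stack = []       # LRU stack of blocks, most recently used at the end
    distances = []
    for addr in addresses:
        block = addr // block_size
        if block in stack:
            idx = stack.index(block)
            # Blocks above idx were touched since the last access to `block`.
            distances.append(len(stack) - 1 - idx)
            stack.pop(idx)
        else:
            distances.append(None)
        stack.append(block)
    return distances


# Hypothetical byte-address trace, chosen only for illustration.
trace = [0, 64, 128, 0, 64, 4096, 0]

line_rd = reuse_distances(trace, block_size=64)    # cache-line granularity
page_rd = reuse_distances(trace, block_size=4096)  # page granularity

print(line_rd)
print(page_rd)
```

The same trace produces different distance profiles at the two granularities, so a profile collected at one block size does not by itself describe locality at another.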