Brad Beckmann, from AMD research
March 6, 2015
Motivation: CPU frequency basically plateaued around 2005, when we hit the power wall. You can see this by plotting power density against time: it peaked around 1 Watt per square millimeter. Power and energy have become the limiting factors for all of our computing, not just because of heat, but because of battery life and such as well.
We have two approaches to improve energy efficiency.
* Increase parallelism. Normalized energy per operation is nonlinear (in a bad way) for single-core systems.
* Reduce overhead. I-cache, register file and such use a lot more energy than we actually use to perform an addition, for example.
So: heterogeneous processors. In particular, GPUs, which are very efficient for massively parallel operations: tons of threads, and reduced overhead via a simple core architecture.
GPUs are kind of hard to use though.
We still need CPUs, because GPUs are bad for plenty of use cases. HSA addresses the second and third points, which are artifacts of how we've traditionally used GPUs.
AMD Kaveri and HSA.
HSA tightly couples a CPU and GPU. Main points:
The traditional discrete GPU has its own memory which we talk to via PCIe. The requisite explicit copy is high-latency and low-bandwidth. We need to run big computations to amortize that copy cost. Further, GPU memory tends to be small.
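The amortization argument can be put as rough arithmetic. The sketch below is only a back-of-the-envelope model: the function names, the bandwidth and latency figures, and the assumption that inputs and results are the same size are all made up for illustration, not measurements from the talk.

```python
def offload_time(n_bytes, gpu_compute_s, pcie_bytes_per_s, pcie_latency_s):
    """Time to run on a discrete GPU, counting one copy in and one copy out.

    Assumes (hypothetically) that input and output are both n_bytes.
    """
    one_copy_s = pcie_latency_s + n_bytes / pcie_bytes_per_s
    return 2 * one_copy_s + gpu_compute_s


def worth_offloading(n_bytes, cpu_s, gpu_compute_s, pcie_bytes_per_s, pcie_latency_s):
    """True if the GPU wins even after paying for the explicit copies."""
    total = offload_time(n_bytes, gpu_compute_s, pcie_bytes_per_s, pcie_latency_s)
    return total < cpu_s


# Illustrative numbers only: 8 GB/s link, 10 microseconds of latency per copy.
BANDWIDTH = 8e9
LATENCY = 10e-6

# Small job: GPU compute is 10x faster, but the copies dominate.
small = worth_offloading(1_000_000, cpu_s=100e-6, gpu_compute_s=10e-6,
                         pcie_bytes_per_s=BANDWIDTH, pcie_latency_s=LATENCY)

# Big job on the same data size: the copy cost is amortized.
big = worth_offloading(1_000_000, cpu_s=10e-3, gpu_compute_s=1e-3,
                       pcie_bytes_per_s=BANDWIDTH, pcie_latency_s=LATENCY)
```

With these made-up numbers the small job loses on the GPU despite the 10x compute speedup, while the big job wins, which is the "need big computations" point.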
So we rip out the PCIe link and give the CPU and GPU a shared coherent memory. The GPU uses the same virtual addresses as the CPU, and we don't need to explicitly copy data between them.
(Aside: in this implementation the GPU has its own TLBs, and an IOMMU is used to handle GPU page faults on the CPU.)
Traditionally, running tasks on the GPU involves going through a number of software layers, and it's very expensive. With HSA, user-level software talks directly to the hardware via HSA AQL. No drivers involved, because the hardware understands AQL (which is vendor-independent).
This allows the hardware to schedule tasks itself, which hugely reduces dispatch overhead. Interesting implication: GPU kernels may queue new tasks themselves.
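A toy model of user-level queuing, to make the "kernels may queue new tasks" point concrete. This is not the AQL packet format or any real HSA runtime API: `UserQueue`, `enqueue`, and `drain` are hypothetical names, and a packet here is just a `(function, argument)` pair.

```python
class UserQueue:
    """Toy ring buffer standing in for a user-mode dispatch queue."""

    def __init__(self, size):
        self.slots = [None] * size
        self.write_index = 0  # advanced by whoever enqueues work
        self.read_index = 0   # advanced by the simulated hardware

    def enqueue(self, packet):
        """Write a packet into the next slot; no driver call involved."""
        if self.write_index - self.read_index == len(self.slots):
            return False  # queue is full
        self.slots[self.write_index % len(self.slots)] = packet
        self.write_index += 1  # stands in for notifying the hardware
        return True

    def drain(self):
        """Stand-in for the hardware scheduler: run packets in order."""
        results = []
        while self.read_index < self.write_index:
            kernel, arg = self.slots[self.read_index % len(self.slots)]
            self.read_index += 1
            # Each kernel receives the queue, so it can enqueue more work.
            results.append(kernel(self, arg))
        return results


def child(queue, x):
    return x * 2


def parent(queue, x):
    # A running "kernel" queues a follow-up task itself.
    queue.enqueue((child, x + 1))
    return x


q = UserQueue(4)
q.enqueue((parent, 10))
results = q.drain()  # [10, 22]
```

The host enqueues one packet, yet two tasks run: the second was queued by the first while it was executing.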
There are some vendor-specific libraries supporting use of HSA. You've got a helper library that the user interacts with, which feeds the core runtime and eventually feeds things into the finalizer which generates GPU code. HSAIL is the intermediate representation which users feed into the runtime libraries.
There are multiple high level compilers which can create and consume HSAIL today, including Clang/LLVM.
Kaveri implements HSA.
A quick benchmark: Kaveri consistently gets 3-5x as much performance as a 4-core CPU in a binary search application. This application on a legacy (non-HSA) APU is really bad because the overheads kill it.
That is, how we enable a shared coherent memory model efficiently.
We're concerned with the second point here.
Parallel synchronization semantics:
Scopes bound synchronization. A smaller scope (e.g. thread-local) requires less synchronization overhead. Conveniently, scoping like this is already baked into today's GPU programming models.
These levels map quite conveniently to levels of the memory hierarchy. Wavefront is L1, component is L2, etc.
The first step of a coherent operation on today's GPUs is to invalidate the L1, since there's no fine-grained coherence in L1. Say we're doing this at component scope: invalidate the cache to force the read to happen at component scope (from L2). Caches are write-through, so we know we'll read the latest data in and that our writes get pushed out correctly.
The same thing in workgroup scope is even easier: just hit L1. So hitting workgroup-private memory is super-efficient. (Writes still get written through, but a workgroup-scope operation need not wait for the write-through to complete before continuing.)
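The difference between the two scopes can be seen in a toy cache model. This is a sketch under simplifying assumptions, not how any real GPU is built: the `L1` class and its `acquire` method are hypothetical, write-through completes instantly, and there is one L1 per workgroup in front of a shared L2 dictionary.

```python
class L1:
    """Toy write-through L1 with no fine-grained coherence."""

    def __init__(self, l2):
        self.l2 = l2      # shared dict standing in for the L2
        self.lines = {}   # this cache's private copies

    def load(self, addr):
        if addr not in self.lines:  # miss: fill from L2
            self.lines[addr] = self.l2.get(addr, 0)
        return self.lines[addr]

    def store(self, addr, value):
        self.lines[addr] = value
        self.l2[addr] = value       # write-through (instant in this toy)

    def acquire(self, scope):
        if scope == "component":
            self.lines.clear()      # invalidate so later loads go to L2
        # "workgroup": nothing to do, the L1 is already the sharing point


l2 = {}
wg0, wg1 = L1(l2), L1(l2)

first = wg1.load(0)            # 0, and now cached in wg1's L1
wg0.store(0, 42)               # written through to L2

wg1.acquire("workgroup")
stale = wg1.load(0)            # still 0: L1 hit on the old copy

wg1.acquire("component")
fresh = wg1.load(0)            # 42: invalidation forced a read from L2

own = wg0.load(0)              # 42: the writer simply hits its own L1
```

The workgroup-scope acquire costs nothing but only sees data within the same L1; the component-scope acquire pays for an invalidation and the refills that follow.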
Static local sharing with this kind of scoping is really great. Dynamic sharing between workgroups is less good. Workgroup sync on current hardware provides an easy ~20% performance boost.
Formalization: "Heterogeneous-Race-Free Memory Models", Hower et al, ASPLOS 2014.
Transitivity matters. In an HRF-direct memory model, synchronization is not transitive: if A syncs with B and B syncs with C, A does not necessarily sync with C. This can be more efficient. Must make a tradeoff between simplicity and performance when doing this, though.
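One way to picture the transitivity question is as reachability over scoped synchronization edges. The sketch below is a heavy simplification and should not be read as the formal definitions from the paper: it ignores program order and individual memory operations, treats each node as a whole thread, and assumes the "direct" model only follows chains whose links all use the same scope while an "indirect" model follows any chain. All function names are hypothetical.

```python
def reachable(edges, src, dst):
    """Depth-first search over directed (a, b) pairs."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, []).append(b)
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(adjacency.get(node, []))
    return False


def ordered_direct(sync_edges, a, c):
    """Toy 'direct' rule: a chain counts only if every link uses one scope."""
    scopes = {scope for _, _, scope in sync_edges}
    return any(
        reachable([(x, y) for x, y, s in sync_edges if s == scope], a, c)
        for scope in scopes
    )


def ordered_indirect(sync_edges, a, c):
    """Toy 'indirect' rule: links with different scopes may be chained."""
    return reachable([(x, y) for x, y, _ in sync_edges], a, c)


# A syncs with B at workgroup scope, B syncs with C at component scope.
mixed = [("A", "B", "workgroup"), ("B", "C", "component")]
direct_mixed = ordered_direct(mixed, "A", "C")      # False
indirect_mixed = ordered_indirect(mixed, "A", "C")  # True

# Same chain, but every link at component scope.
uniform = [("A", "B", "component"), ("B", "C", "component")]
direct_uniform = ordered_direct(uniform, "A", "C")  # True
```

Under the toy direct rule the mixed-scope chain does not order A before C, which is what lets an implementation skip work, at the price of a rule that is harder to reason about.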
Not so good if there is dynamic local sharing: some threads access shared data less frequently and dynamically. Things like work-stealing.
Insight: if wg1 wants to access wg0's data, it needs to trigger scope promotion of wg0's scope. Essentially, ask wg0 to flush its cache into shared cache. To support this, introduce three new memory orders:
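The promotion idea can be sketched as a toy simulation. Everything here is an illustrative assumption: `Workgroup`, `remote_acquire`, and the `pending` set are hypothetical names, not the memory orders from the talk, and writes that have not yet reached the shared cache are modeled as a simple set of pending addresses.

```python
class Workgroup:
    """Toy workgroup with a private L1 in front of a shared L2 dict."""

    def __init__(self, l2):
        self.l2 = l2
        self.l1 = {}
        self.pending = set()  # locally written, not yet visible in L2

    def store_local(self, addr, value):
        # Workgroup-scope store: cheap, does not wait for L2.
        self.l1[addr] = value
        self.pending.add(addr)

    def load(self, addr):
        if addr not in self.l1:
            self.l1[addr] = self.l2.get(addr, 0)
        return self.l1[addr]

    def flush(self):
        for addr in self.pending:
            self.l2[addr] = self.l1[addr]
        self.pending.clear()

    def invalidate(self):
        self.flush()      # do not lose our own pending writes
        self.l1.clear()


def remote_acquire(thief, victim):
    """Promote the victim's local-scope writes so the thief can see them."""
    victim.flush()        # victim's data reaches the shared cache
    thief.invalidate()    # thief drops stale copies and refills from L2


l2 = {}
wg0, wg1 = Workgroup(l2), Workgroup(l2)

before = wg1.load(0)          # 0, now cached in wg1's L1
wg0.store_local(0, 7)         # fast path: stays local to wg0
unseen = wg1.load(0)          # still 0

remote_acquire(wg1, wg0)      # e.g. wg1 steals work from wg0
after = wg1.load(0)           # 7
```

The common case (wg0 working on its own data) stays on the cheap local path; only the rare stealer pays for the flush and invalidation.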
Evaluation (in a simulator): baseline is global sync without work stealing. 7% improvement with local-only scope sync (still no work stealing). Global sync and work stealing is about 18% better. Remote sync (local sync + work stealing) helps the cases that work stealing doesn't help, for about 25% improvement.