Enabling Efficient Communication in Large Heterogeneous Processors

Brad Beckmann, from AMD research

March 6, 2015

Heterogeneous computing

Motivation: CPU frequency basically plateaued around 2005, when we hit the power wall. A plot of power density against time shows it: we peaked around 1 watt per square millimeter. Power and energy have become the limiting factors for all of our computing, not just because of heat but because of battery life and the like as well.

We have two approaches to improve energy efficiency.

Increase parallelism. Normalized energy per operation is nonlinear (in a bad way) for single-core systems.
Reduce overhead. I-cache, register file and such use a lot more energy than we actually use to perform an addition, for example.

Hence heterogeneous processors. In particular GPUs, which are very efficient for massively parallel operations: tons of threads, and reduced overhead via a simple core architecture.

GPUs are kind of hard to use though.

Need data parallelism.
Generally programmed in specialized languages.
Need large tasks to overcome large overhead (e.g. moving data to GPU memory).

We still need CPUs, because GPUs are bad for plenty of use cases. HSA (Heterogeneous System Architecture) addresses the second and third points, which are artifacts of how we've traditionally used GPUs.

Current heterogeneous processors

AMD Kaveri and HSA.

HSA tightly couples a CPU and GPU. Main points:

hUMA: unified memory architecture
hQ: heterogeneous queuing
HSAIL: intermediate language

The traditional discrete GPU has its own memory which we talk to via PCIe. The requisite explicit copy is high-latency and low-bandwidth. We need to run big computations to amortize that copy cost. Further, GPU memory tends to be small.
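The amortization argument can be made concrete with a back-of-the-envelope model. The sketch below is illustrative only: the function names are made up for this note, and every number in it (link latency, link bandwidth, GPU speedup, task sizes) is a placeholder, not a measurement of any real system.

```python
def offload_time(cpu_time_s, bytes_moved, speedup, link_latency_s, link_bandwidth_bps):
    """Time to run a task on a discrete GPU, including the explicit copy."""
    copy_time = link_latency_s + bytes_moved / link_bandwidth_bps
    return copy_time + cpu_time_s / speedup


def worth_offloading(cpu_time_s, bytes_moved, speedup, link_latency_s, link_bandwidth_bps):
    """True if copying the data and running on the GPU beats staying on the CPU."""
    gpu_total = offload_time(cpu_time_s, bytes_moved, speedup,
                             link_latency_s, link_bandwidth_bps)
    return gpu_total < cpu_time_s


# Placeholder parameters, chosen only to show the shape of the tradeoff.
LINK = dict(speedup=10.0, link_latency_s=10e-6, link_bandwidth_bps=8e9)

# A tiny task: the fixed copy latency alone exceeds the CPU run time.
small_task_pays_off = worth_offloading(cpu_time_s=5e-6, bytes_moved=4096, **LINK)

# A big task: the copy is amortized over a lot of compute.
large_task_pays_off = worth_offloading(cpu_time_s=50e-3, bytes_moved=64 * 2**20, **LINK)
```

With these placeholder numbers the small task is not worth offloading and the large one is. Removing the copy term is what lets much smaller tasks pay off.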

So we rip out the PCIe link and give the CPU and GPU a shared coherent memory. The GPU uses the same virtual addresses as the CPU, and we don't need to explicitly copy data between them.

(Aside: in this implementation the GPU has its own TLBs, and an IOMMU is used to handle GPU page faults on the CPU.)

Queueing

Traditionally, running tasks on the GPU involves going through a number of software layers, and it's very expensive. With HSA, user-level software talks directly to the hardware via AQL (the HSA Architected Queuing Language). No drivers involved, because the hardware understands AQL (which is vendor-independent).

This allows the hardware to schedule tasks itself, and hugely reduces dispatch overhead. Interesting implication: GPU kernels may queue new tasks themselves.
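A toy model of that last implication, in Python. This is not the HSA runtime API and not the real AQL packet layout: Packet, UserModeQueue and run_device are hypothetical names invented for this sketch, and the "device" is just a loop. It only shows the shape of the idea, that a running kernel can put more work on the same queue without a round trip through the host.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Packet:
    """Hypothetical stand-in for a dispatch packet: a kernel plus its arguments."""
    kernel: Callable
    args: Tuple = ()


class UserModeQueue:
    """Toy queue written by producers and read by the 'device'."""

    def __init__(self):
        self._packets = deque()

    def enqueue(self, packet):
        self._packets.append(packet)

    def pop(self):
        return self._packets.popleft() if self._packets else None


def run_device(queue):
    """Toy 'hardware scheduler': drain the queue, giving each kernel access to it."""
    executed = 0
    while (packet := queue.pop()) is not None:
        packet.kernel(queue, *packet.args)
        executed += 1
    return executed


results = []


def leaf(queue, value):
    results.append(value)


def parent(queue, n):
    # A kernel queuing new tasks itself, with no host involvement.
    for i in range(n):
        queue.enqueue(Packet(leaf, (i,)))


q = UserModeQueue()
q.enqueue(Packet(parent, (3,)))  # the host submits a single packet
executed = run_device(q)         # the device ends up running four kernels
```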

HSA interfaces

There are some vendor-specific libraries supporting use of HSA. You've got a helper library that the user interacts with, which feeds the core runtime and eventually feeds things into the finalizer which generates GPU code. HSAIL is the intermediate representation which users feed into the runtime libraries.

http://hsafoundation.com http://github.com/HSAFoundation

There are multiple high-level compilers which can create and consume HSAIL today, including Clang/LLVM.

Kaveri's HSA

Kaveri implements HSA.

A quick benchmark: Kaveri consistently gets 3-5x as much performance as a 4-core CPU in a binary search application. This application on a legacy (non-HSA) APU is really bad because the overheads kill it.

Scalable communication and synchronization

That is, how we enable a shared coherent memory model efficiently.

Key challenges:

Avoid unnecessary communication and coherence overhead. Perform hardware coherence at a granularity coarser than a cache line (regions of ~1 KB reduce bandwidth needs by about 95%).
Reduce synchronization penalties. With a lot of threads, synchronization becomes more expensive.

We're concerned with the second point here.

Background

Parallel synchronization semantics:

Acquire: pull the latest data in
Release: push latest data out

Scopes bound synchronization. A smaller scope (e.g. thread-local) requires less synchronization overhead. Conveniently, scoping like this is already baked into today's GPU programming models.

Work item
Wavefront (items in a single SIMD unit)
Work group (Wavefronts on a single compute unit)
Component (a single GPU)

These levels map quite conveniently to levels of the memory hierarchy. Work group is L1, component is L2, etc.

The first step of a coherent operation in today's GPUs is to invalidate the L1, since there's no fine-grained coherence in L1. Say we're doing this at component scope: we invalidate the cache to force the read to be served at component scope (from L2). Caches are write-through, so we know we'll get the latest data in and push it out correctly.

The same thing in workgroup scope is even easier: just hit L1. So hitting workgroup-private memory is super-efficient. (Writes still get written through, but the workgroup-private write need not wait for the write to complete before continuing).
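A minimal sketch of the two cases, assuming a drastically simplified machine: one L1 per workgroup, one shared L2 modeled as a dict, and write-through that completes instantly. The class name L1Cache and the scope strings are made up for this note; real hardware is far more involved.

```python
class L1Cache:
    """Toy per-workgroup L1 in front of a shared L2 (a plain dict)."""

    def __init__(self, l2):
        self.l2 = l2
        self.lines = {}
        self.l2_reads = 0  # how often a load had to go to L2

    def load(self, addr):
        if addr not in self.lines:  # miss: fetch from L2
            self.lines[addr] = self.l2[addr]
            self.l2_reads += 1
        return self.lines[addr]     # hit: may be stale with respect to L2

    def store(self, addr, value):
        self.lines[addr] = value
        self.l2[addr] = value       # write-through (instant in this toy)

    def acquire(self, scope):
        if scope == "component":
            self.lines.clear()      # invalidate so the next load reads L2
        elif scope != "workgroup":  # workgroup scope: L1 is already coherent
            raise ValueError(f"unknown scope: {scope}")


l2 = {"x": 0}
wg0, wg1 = L1Cache(l2), L1Cache(l2)

wg1.load("x")                  # wg1 caches the old value
wg0.store("x", 1)              # wg0 writes; L2 is updated immediately

stale = wg1.load("x")          # no acquire: wg1 still sees its cached copy
wg1.acquire("workgroup")
after_wg = wg1.load("x")       # workgroup scope does not look past L1
wg1.acquire("component")
after_component = wg1.load("x")  # L1 invalidated, value re-read from L2

local = wg0.load("x")          # the writer's own workgroup just hits L1
```

The workgroup-scope path never touches L2 for data the workgroup wrote itself, which is where the cheapness comes from.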

Static local sharing with this kind of scoping is really great. Dynamic sharing between workgroups is less good. Workgroup sync on current hardware provides an easy ~20% performance boost.

HRF memory

Formalization: "Heterogeneous-Race-Free Memory Models", Hower et al, ASPLOS 2014.

Transitivity matters. In an HRF-direct memory model, if A syncs with B and B syncs with C, A does not sync with C. This can be more efficient. There is a tradeoff between simplicity and performance here, though.
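One way to picture the difference, as a sketch rather than the formal definition from the paper. It assumes my reading of the paper's two variants: under HRF-direct a chain of synchronization only counts if every link uses the same scope, while HRF-indirect also lets chains cross scopes. The function names and the edge representation (releaser, acquirer, scope) are made up for illustration.

```python
def reachable(pairs, src, dst):
    """Depth-first search over directed (releaser, acquirer) pairs."""
    stack, seen = [src], {src}
    while stack:
        node = stack.pop()
        for a, b in pairs:
            if a == node and b not in seen:
                if b == dst:
                    return True
                seen.add(b)
                stack.append(b)
    return False


def syncs_direct(edges, src, dst):
    """HRF-direct-style: a chain must stay within one scope."""
    scopes = {scope for _, _, scope in edges}
    return any(
        reachable({(a, b) for a, b, s in edges if s == scope}, src, dst)
        for scope in scopes
    )


def syncs_indirect(edges, src, dst):
    """HRF-indirect-style: chains may mix scopes."""
    return reachable({(a, b) for a, b, _ in edges}, src, dst)


# A syncs with B at workgroup scope, B syncs with C at component scope.
mixed = {("A", "B", "workgroup"), ("B", "C", "component")}

a_to_b_direct = syncs_direct(mixed, "A", "B")      # one link, one scope
a_to_c_direct = syncs_direct(mixed, "A", "C")      # chain crosses scopes
a_to_c_indirect = syncs_indirect(mixed, "A", "C")  # transitive across scopes
```

Under these assumptions A reaches B in both models, but A reaches C only in the indirect one.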

Scoped synchronization is not so good if there is dynamic local sharing: data is mostly accessed locally, but some other threads access it occasionally, and which ones is only known at run time. Things like work-stealing.

Insight: if wg1 wants to access wg0's data, it needs to trigger scope promotion of wg0's scope. Essentially, ask wg0 to flush its cache into the shared cache. To support this, introduce three new memory orders:

remote acquire
remote release
remote acqrel
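A toy model of scope promotion for the remote acquire case only (remote release and remote acqrel are left out). Everything here is an assumption made for illustration: the Workgroup class, the method names, and in particular the "pending" buffer, which stands in for write-through traffic that a workgroup-scope release did not wait for. It is not how any real GPU implements these orders.

```python
class Workgroup:
    """Toy workgroup with a private L1 and not-yet-completed write-throughs."""

    def __init__(self, l2):
        self.l2 = l2
        self.l1 = {}
        self.pending = {}  # writes not yet visible in the shared L2

    def store(self, addr, value):
        self.l1[addr] = value
        self.pending[addr] = value

    def load(self, addr):
        if addr not in self.l1:
            self.l1[addr] = self.l2[addr]
        return self.l1[addr]

    def flush(self):
        """Push pending writes out to the shared L2."""
        self.l2.update(self.pending)
        self.pending.clear()

    def release(self, scope):
        if scope == "component":
            self.flush()   # wait for writes to reach L2
        # workgroup scope: cheap, nothing to wait for

    def acquire(self, scope):
        if scope == "component":
            self.flush()   # keep our own writes before invalidating
            self.l1.clear()

    def remote_acquire(self, owner):
        owner.flush()      # scope promotion: the owner's data goes to L2
        self.flush()
        self.l1.clear()    # then re-read from L2


l2 = {"queue_head": 0}
wg0, wg1 = Workgroup(l2), Workgroup(l2)

wg1.load("queue_head")           # the thief has an old copy cached
wg0.store("queue_head", 5)
wg0.release("workgroup")         # owner only syncs locally (the fast path)
owner_view = wg0.load("queue_head")

wg1.acquire("component")         # an ordinary acquire is not enough here
plain_view = wg1.load("queue_head")

wg1.remote_acquire(wg0)          # promotion forces wg0's write out to L2
stolen_view = wg1.load("queue_head")
```

The point of the design is that the owner keeps paying only for cheap local synchronization, and the rare thief pays for the promotion.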

Evaluation (in a simulator): baseline is global sync without work stealing. 7% improvement with local-only scope sync (still no work stealing). Global sync and work stealing is about 18% better. Remote sync (local sync + work stealing) helps the cases that work stealing doesn't help, for about 25% improvement.