Coherence Traffic in Manycore Processors with Opaque Distributed Directories
Manycore processors feature a high number of general-purpose cores designed to work in a multithreaded fashion. Recent manycore processors are kept coherent using scalable distributed directories. A prominent example is the Intel Mesh interconnect, a network-on-chip interconnecting "tiles", each of which contains computation cores, local caches, and coherence masters. The distributed coherence subsystem must be queried for every out-of-tile access, imposing an overhead on memory latency. This paper studies the physical layout of an Intel Knights Landing processor, with a particular focus on the coherence subsystem, and uncovers the pseudo-random mapping function of physical memory blocks across the pieces of the distributed directory. Leveraging this knowledge, candidate optimizations to improve memory latency through the minimization of coherence traffic are studied. Although these optimizations do improve memory throughput, this ultimately does not translate into performance gains, due to inherent overheads stemming from the computational complexity of the mapping functions.
💡 Research Summary
The paper investigates the coherence subsystem of Intel’s many‑core Knights Landing (KNL) processor, which uses the Intel Mesh interconnect (IM) to connect 38 tiles in a 2‑D mesh. Each tile contains two cores, a shared 1 MiB L2 cache, and a Caching/Home Agent (CHA) that holds a portion of the distributed MESIF directory. When a core accesses a memory block that is not present locally, it must query the appropriate CHA; the distance between the requesting tile and the owning CHA directly adds to memory latency.
The authors build on prior work that measured the latency from every tile to a fixed memory block in the high‑bandwidth MCDRAM, revealing up to a 27 % latency variation. They also leveraged performance‑counter‑derived maps that associate each of the 256 million 64‑byte cache lines in MCDRAM with a specific CHA. By analysing this data they reverse‑engineered the address‑to‑CHA mapping. The mapping is not a simple modulo operation; instead it is a pseudo‑random hash built from a combination of XOR, OR, and AND operations on address bits. The six bits needed to encode the 38 CHA identifiers are computed independently, with bits 0‑1 being simple XORs of a subset of address bits, while bits 2‑5 involve more intricate Boolean expressions. This design resembles hardware‑friendly CRC or LFSR hash functions, providing a uniform distribution of data across the four quadrants of the die but making the mapping opaque to software.
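The XOR-based per-bit construction described above can be sketched in C. This is a minimal illustration under stated assumptions, not the reverse-engineered KNL function: the bit-selection masks below are arbitrary placeholders, each CHA-id bit is modeled as a pure parity (XOR) of masked address bits even though the paper notes that bits 2‑5 also involve OR and AND terms, and the final fold onto 38 CHA identifiers is a simplification.

```c
#include <stdint.h>

/* Illustrative sketch of an XOR-fold directory hash in the style the paper
 * describes: each of the 6 CHA-id bits is the parity (XOR) of a subset of
 * physical-address bits selected by a per-bit mask. The masks here are
 * PLACEHOLDERS, not the real KNL constants; all of them ignore the low
 * 6 address bits, so every byte of a 64-byte cache line maps to the same
 * CHA, as it must. */
static const uint64_t CHA_BIT_MASK[6] = {
    0x5A5A5A40ULL,   /* placeholder mask for CHA-id bit 0 */
    0x3C3C3C80ULL,   /* placeholder mask for CHA-id bit 1 */
    0x66666600ULL,   /* bits 2-5 really use OR/AND terms too; */
    0x0F0F0FC0ULL,   /* pure XOR is a simplification here     */
    0x78787800ULL,
    0x55555540ULL,
};

static inline int parity64(uint64_t x) {
    return __builtin_parityll(x);   /* GCC/Clang builtin: XOR of all bits */
}

/* Map a physical address to a CHA id in [0, 38). The "% 38" fold of the
 * 6-bit hash is a placeholder for however the hardware avoids the unused
 * codes of a 6-bit space with only 38 CHAs. */
static unsigned cha_of(uint64_t paddr) {
    unsigned id = 0;
    for (int b = 0; b < 6; b++)
        id |= (unsigned)parity64(paddr & CHA_BIT_MASK[b]) << b;
    return id % 38;
}
```

Because every mask zeroes the low 6 bits, all addresses within one 64-byte line hash to the same CHA; the XOR structure is what makes such a function cheap in hardware (a tree of XOR gates per bit) yet opaque to software.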
Armed with the closed‑form mapping, the authors explore two optimization strategies aimed at reducing coherence traffic and thus memory latency.
- Dynamic (runtime) scheduling – The costly performance-counter inspection phase of a prior inspector‑executor scheme is replaced by a lightweight runtime that computes the target CHA for each memory access directly, using the derived Boolean functions. Tasks are then scheduled on cores whose local CHA is close to the data's CHA. Skipping the inspection phase reduces overhead by roughly 30 % compared with the original inspector‑executor. However, evaluating the complex Boolean functions adds 5‑10 % runtime overhead, and while memory throughput improves, overall application execution time shows little or no gain; in some cases the extra computation outweighs the latency savings.
- Static (compile‑time) scheduling – The compiler uses the mapping to place arrays and data structures so that elements that are frequently accessed together reside in the same or neighboring CHAs. This can improve bandwidth for regular, predictable access patterns (up to ~12 % increase in synthetic benchmarks). For irregular workloads, however, the static placement may cause suboptimal cache line usage and increased contention, leading to performance degradation.
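Both strategies above reduce to the same placement primitive: given the CHA that owns a task's hot data, pick the tile with the fewest mesh hops to it. The sketch below illustrates that primitive under simplifying assumptions: a regular 6-column grid of 38 tiles and a linear `tile_of()` floorplan are hypothetical (the real KNL die has an irregular layout with disabled tiles), and `best_tile()` is an invented helper, not the authors' scheduler.

```c
#include <stdlib.h>   /* abs */

#define MESH_COLS 6    /* hypothetical regular floorplan */
#define NUM_TILES 38

typedef struct { int row, col; } Tile;

/* Hypothetical tile-id -> (row, col) mapping; the real die layout differs. */
static Tile tile_of(int id) {
    Tile t = { id / MESH_COLS, id % MESH_COLS };
    return t;
}

/* Manhattan hop count between two tiles on the 2-D mesh. */
static int mesh_dist(int a, int b) {
    Tile ta = tile_of(a), tb = tile_of(b);
    return abs(ta.row - tb.row) + abs(ta.col - tb.col);
}

/* Among tiles with a free core (avail[t] != 0), return the one closest to
 * the CHA that owns the task's data, or -1 if none is free. */
static int best_tile(int data_cha, const int *avail) {
    int best = -1, best_d = 1 << 30;
    for (int t = 0; t < NUM_TILES; t++) {
        if (!avail[t]) continue;
        int d = mesh_dist(t, data_cha);
        if (d < best_d) { best_d = d; best = t; }
    }
    return best;
}
```

The runtime variant evaluates `cha_of`-style Boolean functions per access to obtain `data_cha` (the 5‑10 % overhead the authors report), while the compile-time variant bakes the resulting placement into the data layout; the distance computation itself is trivial in both cases.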
The experimental evaluation confirms that both approaches can lower coherence‑induced latency and raise raw memory bandwidth, but the gains are largely masked by the overhead of evaluating the pseudo‑random mapping and by secondary effects such as increased instruction count or cache pressure. The authors conclude that, as long as the directory mapping remains a complex, non‑linear hash, software‑level optimizations will face diminishing returns. They suggest future work could focus on simplifying the hardware mapping, exposing CHA proximity information to the OS, or designing new scheduling algorithms that account for the mapping's pseudo‑random structure without incurring prohibitive computational cost.