September 20, 2020

New Deep Dive Reveals Secrets of AMD’s Zen 2 Architecture

It’s been a minute since I’ve referenced his work, but CPU software architect and low-level feature researcher Agner Fog is still publishing periodic updates to his CPU manuals comparing the various AMD and Intel architectures. A recent update of his sheds light on a feature of AMD’s Zen 2 chip that’s gone previously unremarked.

Disclosure: I’ve worked with Agner Fog in the past on collecting data for his ongoing project, though not for several years.

Agner runs each platform through a laundry list of micro-targeted benchmarks in order to suss out details of how they operate. The officially published instruction latency charts from AMD and Intel aren’t always accurate, and Agner has found undisclosed bugs in x86 CPUs before, including issues with how Piledriver executes AVX code and problems in the original Atom’s FPU pipeline.

For the most part, the low-level details will be familiar to anyone who has studied the evolution of the Zen and Zen 2 architectures. Maximum measured fetch throughput per thread is still 16 bytes per clock cycle, even though the CPU can theoretically support an aligned fetch of up to 32 bytes per cycle. The CPU is limited to a steady decode rate of four instructions per clock cycle, but it can burst up to six instructions in a single cycle if half of the instructions generate two micro-ops (uops) each. This doesn’t happen very often.
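To make the interaction between those two limits concrete, here is a rough back-of-the-envelope sketch. It is my own simplification, not a model taken from Agner's manual: it assumes the front end is bounded only by the 16-byte measured fetch rate and the four-instruction steady decode rate quoted above, and it ignores the six-instruction burst case, alignment effects, branches, and the uop cache. The function name and parameters are hypothetical.

```python
def frontend_instructions_per_clock(avg_instruction_bytes,
                                    fetch_bytes_per_clock=16,
                                    decode_width=4):
    """Rough upper bound on instructions per clock from fetch and decode.

    Assumption (not from the manual): throughput is simply the smaller of
    the decode width and the number of average-sized instructions that fit
    in one cycle's worth of fetched bytes.
    """
    if avg_instruction_bytes <= 0:
        raise ValueError("average instruction length must be positive")
    # How many instructions of this average length one fetch cycle supplies.
    fetch_limit = fetch_bytes_per_clock / avg_instruction_bytes
    # The decoders cannot sustain more than decode_width per cycle.
    return min(decode_width, fetch_limit)


if __name__ == "__main__":
    for length in (2, 4, 6, 8):
        bound = frontend_instructions_per_clock(length)
        print(f"avg length {length} bytes: at most {bound:.2f} instructions/clock")
```

Under these assumptions, code whose instructions average more than four bytes becomes fetch-limited before it becomes decode-limited, which is one reason long vector encodings can run below the nominal decode rate when they are not served from the uop cache.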

The theoretical size of the uop cache is 4096 uops, but the effective single-thread size, according to Agner, is about 2500 uops. With two threads, the effective size is nearly 2x larger. Loops that fit into the cache can execute at 5 instructions/clock cycle, with 6 again possible under certain circumstances. Low-level testing also confirmed some specific advances from Zen to Zen 2 — Zen can perform either two reads or a read and a write in the same cycle, while Zen 2 can perform two reads and a write, for example. The chart below shows how floating-point instructions are handled in different execution pipes depending on the task:

One previously undisclosed difference AMD introduced with Zen 2 is the ability to mirror memory operands. In some cases, this can significantly reduce the number of clock cycles needed to perform operations, from 15 down to 2. There are multiple preconditions for the mirroring to happen successfully: the instructions have to use general-purpose registers, the memory operands must have the same address, the operand size must be either 32 or 64 bits, and you may perform a 32-bit read after a 64-bit write to the same address, “but not vice versa.” A full list of required conditions is on page 221 of the manual, with discussion continuing on to page 222.
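The preconditions listed above can be written down as a simple predicate. This is only a sketch of the four conditions summarized in this article, not the full list from the manual, and it says nothing about how the hardware actually decides. The `MemAccess` type and `may_mirror` function are hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemAccess:
    """One memory operand, described only by the properties named above."""
    address: int             # address of the memory operand
    size_bits: int           # operand size in bits
    uses_gp_register: bool   # True if the instruction uses a general-purpose register


def may_mirror(write: MemAccess, read: MemAccess) -> bool:
    """Return True if a write followed by a read meets the listed conditions.

    Assumption: only the conditions summarized in the article are modeled;
    the manual lists further requirements that are not checked here.
    """
    # Both instructions have to use general-purpose registers.
    if not (write.uses_gp_register and read.uses_gp_register):
        return False
    # The memory operands must have the same address.
    if write.address != read.address:
        return False
    # The operand size must be either 32 or 64 bits.
    if write.size_bits not in (32, 64) or read.size_bits not in (32, 64):
        return False
    # A 32-bit read after a 64-bit write is allowed, but not vice versa.
    if read.size_bits > write.size_bits:
        return False
    return True


if __name__ == "__main__":
    wide_write = MemAccess(address=0x1000, size_bits=64, uses_gp_register=True)
    narrow_read = MemAccess(address=0x1000, size_bits=32, uses_gp_register=True)
    print(may_mirror(wide_write, narrow_read))   # 32-bit read after 64-bit write
    print(may_mirror(narrow_read, wide_write))   # 64-bit read after 32-bit write
```

Passing this check does not guarantee the fast path. It only rules out pairs that the conditions above already exclude.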

Since the feature is undocumented, it’s not clear if anyone has used it for anything practical in shipping code. Agner notes that it’s more useful in 32-bit mode, “where function parameters are normally transferred on the stack.” He also notes that the CPU can take a performance hit if it makes certain incorrect assumptions. This may explain why the capability is undocumented: AMD might not have wanted to encourage developers to adopt a feature that was likely to cause performance problems if used improperly. This last, to be clear, is supposition on my part.

Of Zen as a whole, Fog writes: “The conclusion for the Zen microarchitecture is that this is a quite efficient design with big caches, a big µop cache, and large execution units with a high throughput and low latencies.” I recommend both this manual and his other resources on x86 programming if you’re interested in the topic — you can learn a lot about the subtleties of how x86 CPUs perform this way, including the corner cases where what the instruction manual says should happen and what actually happens wind up being two different things.
