We have long been planning to cover the caching mechanisms in CPUs. As a shared knowledge base for the discussions in this session we chose the following two articles by Martin Thompson, among other things known for his work on the LMAX Disruptor:
CPU Cache Flushing Fallacy, which includes a good overview of the different caches in modern Intel CPUs.
Write Combining, which exemplifies the advanced mechanisms one can find in today's CPUs and how one can make use of them.
Below I am listing a couple of take-aways from the session as well as further learning resources:
As a rule of thumb, the access latency roughly quadruples with each cache level (L1 1 ns, L2 3 ns, L3 12 ns, DRAM 65 ns).
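As an illustration, here is a minimal pointer-chasing sketch (my own, not one of Tyler's tools) that makes these latency jumps visible once the working set outgrows a cache level. The buffer sizes and the xorshift shuffle are arbitrary choices; the random cycle defeats the hardware prefetcher so each access pays the full latency of the level it lands in:

```rust
use std::time::Instant;

/// Builds a `next` array forming one random cycle over `n` slots, so that
/// following the links visits every slot in unpredictable order.
fn build_cycle(n: usize) -> Vec<usize> {
    let mut order: Vec<usize> = (0..n).collect();
    // Deterministic xorshift shuffle to avoid pulling in a rand crate.
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    for i in (1..n).rev() {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        order.swap(i, (state as usize) % (i + 1));
    }
    let mut next = vec![0usize; n];
    for w in 0..n {
        next[order[w]] = order[(w + 1) % n];
    }
    next
}

fn main() {
    // Working sets chosen to land in L1, L2, L3 and DRAM on a typical part.
    for size_kib in [16usize, 256, 8 * 1024, 64 * 1024] {
        let n = size_kib * 1024 / std::mem::size_of::<usize>();
        let next = build_cycle(n);
        let iters = 5_000_000u64;
        let mut p = 0usize;
        let start = Instant::now();
        for _ in 0..iters {
            p = next[p]; // dependent load: latency, not bandwidth, dominates
        }
        let ns = start.elapsed().as_nanos() as f64 / iters as f64;
        println!("{:>6} KiB: {:5.2} ns/access (sink {})", size_kib, ns, p);
    }
}
```

Run with optimizations enabled (cargo run --release), otherwise loop overhead drowns out the memory latency.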
Tyler wrote two great tools to show different CPU optimizations and the impact of synchronization on your hardware. Make sure you have a Rust environment set up. You can run each of them via cargo run --bin <name> --release.
increment: Compares different ways (volatile, volatile + release fence, volatile + seqcst fence, seqcst CAS, …) of incrementing a counter 500 million times, thus showing the impact of the cache coherence protocol, e.g. load & store buffer flushes.
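Not Tyler's actual code, but a single-threaded sketch of the kinds of variants being compared. The iteration count is scaled down from the tool's 500 million, and the differences only become pronounced with --release and, ideally, contention from a second core; the point is which variant drains the store buffer or issues a locked instruction:

```rust
use std::sync::atomic::{fence, AtomicU64, Ordering};
use std::time::Instant;

const ITERS: u64 = 5_000_000; // scaled down from the 500 million in the tool

fn bench(name: &str, f: impl Fn(&AtomicU64)) -> u64 {
    let counter = AtomicU64::new(0);
    let start = Instant::now();
    for _ in 0..ITERS {
        f(&counter);
    }
    println!("{:<24} {:>8} µs", name, start.elapsed().as_micros());
    counter.load(Ordering::Relaxed)
}

fn main() {
    // Plain load/store: the line stays exclusive, cheapest case.
    bench("relaxed", |c| {
        let v = c.load(Ordering::Relaxed);
        c.store(v + 1, Ordering::Relaxed);
    });
    // Release fence: orders prior writes, still no full barrier on x86.
    bench("store + release fence", |c| {
        let v = c.load(Ordering::Relaxed);
        c.store(v + 1, Ordering::Relaxed);
        fence(Ordering::Release);
    });
    // SeqCst fence: emits a full barrier, draining the store buffer.
    bench("store + seqcst fence", |c| {
        let v = c.load(Ordering::Relaxed);
        c.store(v + 1, Ordering::Relaxed);
        fence(Ordering::SeqCst);
    });
    // Atomic read-modify-write: a locked instruction on x86.
    bench("fetch_add seqcst", |c| {
        c.fetch_add(1, Ordering::SeqCst);
    });
}
```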
write_combining: Lets one see the write combining optimization described by Martin Thompson in the blog post mentioned above in action. Looking at the latency jumps, one can estimate the number of line fill buffers of one's CPU.
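A rough sketch of the experiment from the blog post (not the write_combining binary itself): storing into six arrays in a single pass keeps six distinct cache lines in flight per iteration, which can exhaust the line fill buffers, while two passes of three arrays each may run faster overall despite doubling the loop bookkeeping. The array count follows the post's Java example; the sizes are my guesses:

```rust
use std::time::Instant;

const ITEMS: usize = 1 << 20;
const MASK: usize = ITEMS - 1;
const ARRAYS: usize = 6; // same count as in the blog post's Java example
const ITERATIONS: usize = ITEMS * 4;

/// One pass: every iteration stores into all six arrays, i.e. six distinct
/// cache lines in flight at once.
fn run_one_pass(arrays: &mut [Vec<u8>]) -> u128 {
    let start = Instant::now();
    for i in 0..ITERATIONS {
        let slot = i & MASK;
        let b = i as u8;
        for a in arrays.iter_mut() {
            a[slot] = b;
        }
    }
    start.elapsed().as_millis()
}

/// Two passes of three arrays each: fewer distinct lines in flight per
/// iteration, which can fit within the line fill buffers.
fn run_two_passes(arrays: &mut [Vec<u8>]) -> u128 {
    let start = Instant::now();
    for chunk in arrays.chunks_mut(ARRAYS / 2) {
        for i in 0..ITERATIONS {
            let slot = i & MASK;
            let b = i as u8;
            for a in chunk.iter_mut() {
                a[slot] = b;
            }
        }
    }
    start.elapsed().as_millis()
}

fn main() {
    let mut arrays: Vec<Vec<u8>> = (0..ARRAYS).map(|_| vec![0u8; ITEMS]).collect();
    println!("one pass:   {} ms", run_one_pass(&mut arrays));
    println!("two passes: {} ms", run_two_passes(&mut arrays));
}
```

Both functions leave the arrays in the same final state; only the ordering of the stores, and hence the pressure on the line fill buffers, differs.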
During the session Tyler often showed excerpts from the Intel manuals, in particular:
Chapters 8 and 11 of the Software Developer's Manual. It is worth taking a look at the memory ordering section in 8.2.2.
CPUs expose a lot of performance metrics. On Linux with Intel Sandy Bridge CPUs one can take a look at the corresponding kernel subsystem.