We have long been planning to cover the caching mechanisms in CPUs. As a shared knowledge base for the discussions in this session we chose the following two articles by Martin Thompson, known among other things for his work on the LMAX Disruptor:
- "CPU Cache Flushing Fallacy", which includes a good overview of the different caches in modern Intel CPUs.
- "Write Combining", which exemplifies the advanced mechanisms one can find in today's CPUs and how one can make use of them.
Below I am listing a couple of take-aways from the session as well as further learning resources:
- As a rule of thumb, the access latency roughly quadruples with each cache level (L1 ~1 ns, L2 ~3 ns, L3 ~12 ns, DRAM ~65 ns).
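The jump at each level is easy to reproduce with a pointer-chasing microbenchmark: chasing dependent loads through a randomly permuted working set defeats the prefetcher and exposes the raw load latency of whichever cache the working set fits into. Below is a minimal, self-contained Rust sketch (not one of the tools mentioned later); the working-set sizes and latency figures are illustrative and should be adjusted to your CPU's actual cache sizes.

```rust
use std::time::Instant;

/// Chase `steps` dependent loads through a random cyclic permutation of
/// `size` slots and return the average nanoseconds per load.
fn chase(size: usize, steps: usize) -> f64 {
    // Build a single-cycle permutation (Sattolo's algorithm) driven by a
    // simple LCG, so every load depends on the previous one and the
    // hardware prefetcher cannot hide the latency.
    let mut next: Vec<usize> = (0..size).collect();
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    for i in (1..size).rev() {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        let j = (state >> 33) as usize % i; // j strictly below i
        next.swap(i, j);
    }

    let mut idx = 0usize;
    let start = Instant::now();
    for _ in 0..steps {
        idx = next[idx]; // each load depends on the previous result
    }
    let per_load = start.elapsed().as_nanos() as f64 / steps as f64;
    assert!(idx < size); // keep `idx` observable so the loop is not elided
    per_load
}

fn main() {
    // Working sets chosen to land roughly in L1, L2, L3 and DRAM on a
    // typical Intel part; adjust for your cache sizes.
    for &kib in &[16usize, 256, 4096, 65536] {
        let slots = kib * 1024 / std::mem::size_of::<usize>();
        println!("{:>6} KiB: {:.1} ns/load", kib, chase(slots, 5_000_000));
    }
}
```

Run it in release mode; in a debug build the loop overhead dwarfs the L1 latency, though the relative jump to DRAM remains visible.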
-
- Tyler wrote two great tools to demonstrate different CPU optimizations and the impact of synchronization on your hardware. Make sure you have a Rust environment set up; you can run each of them via `cargo run --bin <name> --release`:
  - `increment`: Compares different ways (volatile, volatile + release fence, volatile + seqcst fence, seqcst CAS, …) of incrementing a counter 500 million times, thus showing the impact of the cache coherence protocol, e.g. load & store buffer flushes.
  - `write_combining`: Lets you see the write combining optimization described by Martin Thompson in the blog post mentioned above in action. Looking at the latency jumps, one can estimate the number of line fill buffers of one's CPU.
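The core of the `increment` comparison can be sketched in a few lines of `std::sync::atomic`. This is my own single-threaded sketch, not Tyler's tool, with the iteration count reduced from 500 million: a `Relaxed` store stands in for a volatile write, and the `SeqCst` fence and CAS pay for draining or locking the store buffer on x86.

```rust
use std::sync::atomic::{fence, AtomicU64, Ordering};
use std::time::Instant;

const N: u64 = 1_000_000; // reduced from the tool's 500 million

fn time<F: FnMut()>(label: &str, mut f: F) {
    let start = Instant::now();
    f();
    println!("{label:<30} {:?}", start.elapsed());
}

fn main() {
    let counter = AtomicU64::new(0);

    time("relaxed store", || {
        for i in 0..N {
            counter.store(i + 1, Ordering::Relaxed); // ~ a volatile write
        }
    });

    counter.store(0, Ordering::Relaxed);
    time("relaxed store + Release fence", || {
        for i in 0..N {
            counter.store(i + 1, Ordering::Relaxed);
            fence(Ordering::Release); // free on x86: stores are already ordered
        }
    });

    counter.store(0, Ordering::Relaxed);
    time("relaxed store + SeqCst fence", || {
        for i in 0..N {
            counter.store(i + 1, Ordering::Relaxed);
            fence(Ordering::SeqCst); // forces the store buffer to drain
        }
    });

    counter.store(0, Ordering::Relaxed);
    time("SeqCst CAS", || {
        for _ in 0..N {
            // Single-threaded, so the CAS never fails; it still pays for
            // the locked read-modify-write.
            let cur = counter.load(Ordering::Relaxed);
            counter
                .compare_exchange(cur, cur + 1, Ordering::SeqCst, Ordering::SeqCst)
                .unwrap();
        }
    });

    assert_eq!(counter.load(Ordering::Relaxed), N);
}
```

Run in release mode; the stronger variants are typically an order of magnitude slower than the plain store.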
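The write combining experiment can be sketched along the lines of Thompson's original Java test: write one byte per iteration into each of several streams, then compare one pass over six streams with two passes over three. The Rust sketch below is not the repo's `write_combining` tool; the stream count and sizes are illustrative, and on modern cores with ten or more line fill buffers you may need more streams before the one-pass variant slows down.

```rust
use std::time::{Duration, Instant};

const ITEMS: usize = 1 << 21; // 2 MiB per stream; illustrative size

/// Write one byte into each stream per iteration. Every stream is a
/// separate sequence of cache lines, so each one competes for a line
/// fill buffer while its partially written lines are in flight.
fn run(streams: &mut [Vec<u8>]) -> Duration {
    let start = Instant::now();
    for i in 0..ITEMS {
        let b = i as u8;
        for s in streams.iter_mut() {
            s[i] = b;
        }
    }
    start.elapsed()
}

fn main() {
    let mut streams: Vec<Vec<u8>> = (0..6).map(|_| vec![0u8; ITEMS]).collect();
    let one_pass = run(&mut streams);

    // Same total work, but only three concurrent streams at a time.
    let (first, second) = streams.split_at_mut(3);
    let two_passes = run(first) + run(second);

    println!("one pass over 6 streams:   {one_pass:?}");
    println!("two passes over 3 streams: {two_passes:?}");
}
```

If the two-pass variant wins even though it touches every element twice, you have exceeded the buffer count with six streams; raising the stream count until the crossover appears gives the estimate mentioned above.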
- During the session Tyler often showed excerpts from the Intel manuals, in particular:
  - Chapters 8 and 11 of the Software Developer's Manual. The memory ordering section in 8.2.2 is worth a look.
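One rule from that memory ordering section, that reads may be reordered with older writes to different locations, can be made visible with the classic store-buffer litmus test. The sketch below is my own, not from the manual: it uses `Relaxed` atomics, which compile to plain loads and stores on x86, and counts rounds in which both threads read 0, meaning both loads overtook the other core's store. How often this happens in a short run (if at all) depends heavily on the machine.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Barrier};
use std::thread;

/// Run the store-buffer litmus test `rounds` times and count how often
/// both threads observed 0, an outcome impossible under sequential
/// consistency but permitted by the x86 memory model.
fn litmus(rounds: usize) -> usize {
    let x = Arc::new(AtomicUsize::new(0));
    let y = Arc::new(AtomicUsize::new(0));
    let barrier = Arc::new(Barrier::new(2));
    let mut reordered = 0;

    for _ in 0..rounds {
        x.store(0, Ordering::Relaxed);
        y.store(0, Ordering::Relaxed);

        let (xa, ya, ba) = (Arc::clone(&x), Arc::clone(&y), Arc::clone(&barrier));
        let t1 = thread::spawn(move || {
            ba.wait();
            xa.store(1, Ordering::Relaxed); // plain `mov` on x86
            ya.load(Ordering::Relaxed) // may complete before the store drains
        });

        let (xb, yb, bb) = (Arc::clone(&x), Arc::clone(&y), Arc::clone(&barrier));
        let t2 = thread::spawn(move || {
            bb.wait();
            yb.store(1, Ordering::Relaxed);
            xb.load(Ordering::Relaxed)
        });

        if t1.join().unwrap() == 0 && t2.join().unwrap() == 0 {
            reordered += 1; // both loads ran ahead of the other store
        }
    }
    reordered
}

fn main() {
    let rounds = 10_000;
    println!(
        "store->load reordering observed in {}/{} rounds",
        litmus(rounds),
        rounds
    );
}
```

Adding a `SeqCst` fence between each store and load makes the count drop to zero, which is exactly the cost the `increment` tool above measures.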
- CPUs expose a lot of performance metrics. On Linux with Intel Sandy Bridge CPUs, one can take a look at the corresponding kernel subsystem.