I recently stumbled upon the idea of an Elimination Back-off Stack promising to be a parallel, linearizable and lock-free stack. In case you are not familiar with it, I would suggest either reading my previous post or the corresponding paper  itself. Being quite intrigued by the ideas of the above stack I wrote my own implementation in Rust with a little help from crossbeam.
In this post I will compare my implementation to other stack implementations. In particular I want to dive into some performance characteristics and their behavior in regards to the amount of threads (≤ 128) accessing the datastructure.
The code of the implementation itself, as well as all benchmarking logic is public at https://github.com/mxinden/elimination-backoff-stack/ .
Take this with a grain of salt. This is a hobby project, thus in no way sophisticated let alone ready for production.
Stacks of the day
(1) Simplest stack in the world
pop, wrapping it in a
Mutex for mutual exclusion and an
Arc for shared ownership makes it a concurrent stack. Let’s see whether an
Arc<Mutex<Vec<_>>> outperforms any of the more sophisticated implementations
below. Rephrased: At which point is lock-free programming worth the complexity
coming alongside with it?
(2) Plain lock-free Treiber stack
A lock-free stack, also often referred to as a Treiber
stack  due to Kent Treiber,
operates on top of a lock-free linked list. The entry point of the stack is an
atomic pointer which is either pointing to the node on the very top of the stack
null in case of an empty stack.
The stack linearizes concurrent access through that single atomic
pop operations loop trying to
compare_and_set it. Let’s
see how many threads it takes for that single serialization point (
Elimination back-off stack
A lock-free elimination back-off stack wraps a lock-free Treiber stack, but
instead of simply exponentially backing off on
compare_and_set failures, it
uses something called an elimination array to back-off in space instead of time.
Again for details please either read my previous
post or the corresponding paper
 itself. I will be waiting here.
(3) Switching back and forth
Each operation failing on the lock-free Treiber stack will try to exchange its value on the elimination array. A failure on the elimination array makes an operation try the stack again. This goes back and forth until the operation succeeds.
(4) Exponentially backing-off in space
If you are not familiar with the concept of backing-off in space instead of
time, please read the corresponding
This elimination back-off stack variation keeps track of a contention score.
Whenever an operation fails on the stack or the elimination array due to
contention the score is increased by one. Whenever an operation fails due to
missing contention, e.g. as a
push operation waiting for a
pop operation to
exchange with, the contention score is subtracted by two.
An operation failing on the stack will try to exchange with the opposite
operation type on the elimination array. It will consider
exchangers of the elimination array. See how each operation considers
exponentially more exchangers the more contention it experiences? That is the
idea of backing-off in space.
The benchmark described has been executed on a dual-socket machine with an AMD
EPYC 7601 32-Core Processors making up a total of 128 hyper-threads. The
benchmark is using criterion as a
benchmark library. Each test run takes as parameters a stack implementation and
the amount of threads to test with. Each thread executes 1_000 operations
pop) on the shared stack.
Again, take these numbers with a grain of salt. The benchmark above is highly artificial. One would need to test this under some real-world load in order to derive proper conclusions.
A single thread
While pretty useless in itself for a concurrent datastructure, having decent
single-threaded performance is still a nice-to have. The clear winner, who
would have thought, is the
Arc<Mutex<Vec<_>>> (1) with an average of
77.655 ns. While some outliers took longer, all stayed below
100 ns. Below you see a
graph depicting each result as well as the mean of the distribution.
Just as a comparison the lock-free stacks (2), (3) and (4) all took somewhere
4.6768 us. The similarities come at no suppries, given that a single
threaded elimination back-off stack is just a lock-free Treiber stack as there
is no contention.
Introducing a little contention with 8 threads
Earlier I asked the question at what point the complexity introduced through
lock-free programming is worth the gain. Having 8 threads hammering at the same
datastructure is definitely beyond that point. While the
(1) can offer a mean test execution time of
3.3933 ms, the lock-free Treiber
stack (2) runs through the test at a mean of
2.4423 ms. But one should not
just pay attention to averages. Looking at the best and worst cases depicted
below in the violin graph, the lock-free Treiber stack (2) outperforms all other
stacks by far. Just as a reminder, this is not the time per operation, but the
time it takes 8 threads to execute 1_000 operations each.
Operating under heavy load with 128 threads
While implementing and testing my elimination back-off stack (3) and (4) on my local laptop with an i7-8550U I got constantly disappointed as I couldn’t get it to outperform a lock-free Treiber stack (2). This would imply that the whole overhead and complexity of an elimination back-off stack (and at least one of my weekends) would go down the drain. Turns out, things change if you just throw more threads at it.
As we have noticed in the previous section the plain
falls behind once one operates with high contention. This has not changed now
that we are using 128 threads. The thing that we should definitely direct our
attention at though, is that the lock-free Treiber stack (2) and the elimination
back-off stack with exponential back-off in space (4) switched spots. (2)
executes a test run within an average of
106.11 ms, while (4) gets through a
test in nearly halve the time with
Also the best and worst cases spread across a smaller range (see violin chat above).
Open questions (suggestions welcome)
Testing, testing, testing. In particular my use of atomics, in addition any randomized linearization testing tools.
Collecting more data to optimize my current static assumptions. E.g. how long should a
pushoperation wait for a
popoperation on an Exchanger.
And of course the most important one: Why would one use a stack with 128 concurrent threads?
 Hendler, Danny, Nir Shavit, and Lena Yerushalmi. “A scalable lock-free stack algorithm.” Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures. 2004.
 Treiber, R. Kent. Systems programming: Coping with parallelism. New York: International Business Machines Incorporated, Thomas J. Watson Research Center, 1986.
Author Max Inden