mxinden

Homepage of Max Inden

Elimination back-off stack

Posted at — Mar 28, 2020

Reading The Art of Multiprocessor Programming [1] I came across the Elimination Back-off Stack [2] datastructure introduced in 2004 by Danny Hendler, Nir Shavit, and Lena Yerushalmi. It promises to be a parallel lock-free stack.

How can a stack allow parallel operations without going through a single serialization point, e.g. a Mutex or an Atomic? Let’s dive into it.

A lock-free stack

A lock-free stack, also often referred to as a Treiber stack [3] due to Kent Treiber, operates on top of a lock-free linked list. The entry point of the stack is an atomic pointer which is either pointing to the node on the very top of the stack or is null in case of an empty stack.

pub struct Stack<T> {
    head: Atomic<Node<T>>,
}

Each node on the stack contains its data as well as an atomic next pointer which is either null in the case where it is the last node of the stack (bottom plate) or a pointer to the next node.

struct Node<T> {
    data: ManuallyDrop<T>,
    next: Atomic<Node<T>>,
}

Push and pop operation

A push operation creates a new Node, reads the current head, sets the next pointer of its new node to head and then tries to compare_and_set head to point to its new node. It loops until the final compare_and_set succeeds. A pop operation reads the current head. If the stack is empty head will be null thus pop can return. If the stack is not empty head will point to the first node. It reads that node and tries to compare_and_set head to point to the second node instead of the first one, again looping until the compare_and_set succeeds. On success it returns the data field of the first node.

Empty stack:

+----+
|head| --> null
+----+

Non-empty stack:

+----+     +----+     +----+     +----+
|head| --> |next| --> |next| --> |next| --> null
+----+     |data|     |data|     |data|
           +----+     +----+     +----+

Performance

The lock-free stack implementation above linearizes concurrent access through a single atomic head pointer on which push and pop operations loop trying to compare_and_set it. This single sequential execution point introduces contention when multiple threads try to push or pop at the same time thus making the entire stack implementation scale poorly. One can use exponential back-off by having a thread wait for a bit on compare_and_set failure before retrying.

Can we do better?

A lock-free elimination back-off stack

A lock-free elimination back-off stack wraps a lock-free Treiber stack, but instead of simply exponentially backing off on compare_and_set failures, it uses something called an elimination array to back-off in space instead of time.

Elimination Array

The stack datastructure allows two operations: push and pop. A push followed by a pop leaves a given stack in the same state as it was before the two operations. Thus the two operations cancel out. An elimination array is a fixed size array where each slot enables a thread executing a push operation to hand its item over to a thread executing a pop operation canceling the two out. A reasonable size for the elimination array would be the amount of threads operating on the datastructure.

Initial state

+----+     +----+     +----+     +----+
|head| --> |next| --> |next| --> |next| --> null
+----+     |abcd|     |efgh|     |ijkl|
           +----+     +----+     +----+

After push operation

+----+     +----+     +----+     +----+     +----+
|head| --> |next| --> |next| --> |next| --> |next| --> null
+----+     |push|     |abcd|     |efgh|     |ijkl|
           +----+     +----+     +----+     +----+

After pop operation == initial state

+----+     +----+     +----+     +----+
|head| --> |next| --> |next| --> |next| --> null
+----+     |abcd|     |efgh|     |ijkl|
           +----+     +----+     +----+

Both push and pop try to succeed on the lock-free stack first. On contention an operation’s compare_and_set is likely to fail. Instead of backing-off by waiting a bit and retrying on the stack, an operation would try to exchange with its counterpart on the elimination array instead.

pub struct EliminationArray<T> {
    exchangers: Vec<Exchanger<T>>,
}

pub struct Exchanger<T> {
    item: Atomic<Item<T>>,
}

enum Item<T> {
    Empty,
    Waiting(ManuallyDrop<T>),
    Busy,
}

In order for a push operation to exchange its data with a pop operation, the push operation first selects an Exchanger within the elimination array at random. It then checks the item field of the Exchanger.

In order for a pop operation to eliminate with a push operation and receive its data it picks, just like a push operation, an Exchanger within the elimination array at random. It then checks the item field of that Exchanger.

Backing-off in space instead of time

Steps of push and pop operations can fail on the elimination array due to too much contention or no contention. An example for the former would be a pop operation failing to compare_and_set an Item from Waiting to Busy due to another pop operation getting there first. On such contention one could back-off by exponentially waiting a bit. Instead operations on the elimination array back-off in space and not time. Backing-off in space instead of time enables the elimination back-off stack to be used by multiple operations in parallel. They do so by increasing the amount of Exchangers they consider when randomly selecting one from the elimination array.

Backing-off in space instead of time enables the elimination back-off stack to be used by multiple operations in parallel.

For example at first a pop operation that failed on the lock-free stack selects one out of the two first exchangers of the elimination array. If it fails on that Exchanger due to contention it selects one out of the four first exchangers of the elimination array.

Liveness

To ensure a push operation does not starve waiting for a pop operation, or a pop operation unsuccessfully looking for a push operation on different Exchangers, they should first of all decrease the amount of Exchangers they consider out of all the Exchangers when witnessing missing contention. Second of all operations should eventually try the lock-free stack again, given that with low contention they will likely succeed on it directly.

Conclusion

Combining a lock-free stack with an elimination array results in a lock-free linearizable and parallel stack. Linearizable as push and pop operations appear to take effect instantaneously for all threads at some moment between their invocation and response. Parallel due to the fact that multiple push and pop operations can cancel each other out on the elimination array.


References

[1] Herlihy, Maurice, and Nir Shavit. The art of multiprocessor programming. Morgan Kaufmann, 2011.

[2] Hendler, Danny, Nir Shavit, and Lena Yerushalmi. “A scalable lock-free stack algorithm.” Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures. 2004.

[3] Treiber, R. Kent. Systems programming: Coping with parallelism. New York: International Business Machines Incorporated, Thomas J. Watson Research Center, 1986.