It is not just a way of writing ring buffers. It's a way of implementing concurrent non-blocking single-reader single-writer atomic ring buffers with only atomic load and store (and memory barriers).
The author says that non-power-of-two is not possible, but I'm pretty sure it is if you use a conditional instead of integer modulus.
I first learnt of this technique from Phil Burk; we've been using it in PortAudio forever. The technique is also widely known in FPGA/hardware circles, see:
"Simulation and Synthesis Techniques for Asynchronous FIFO Design", Clifford E. Cummings, Sunburst Design, Inc.
https://twins.ee.nctu.edu.tw/courses/ip_core_04/resource_pdf...
A couple of the comments to the article suggest using 64-bit numbers, which is exactly the right solution. 2^64 nanoseconds = 584.5 years - overflow is implausible for any realistic use case. Even pathological cases will struggle to induce wraparound at a human timescale.
(People will probably moan at the idea of restarting the process periodically rather than fixing the issue properly, but when the period would be something like 50 years I don't think it's actually a problem.)
I think unfortunately we sometimes ascribe supernatural powers to powers of two, when the benefit is really about caches being built in powers of two.
Intel still uses 64-byte cache lines, as it has for quite a long time, but it also does some shenanigans on the bus where it tries to fetch two lines when you ask for one. So there's ostensibly some benefit to aligning data to 128 bytes, particularly on linear scans with a cold cache.
But there's a reason that caches are always sized in powers of two as well, and that same reason applies to high-performance ring buffers: division by a power of two is easy, and easy is fast. It's reliably a single cycle, compared to division by an arbitrary 32-bit integer, which can take 8-30 cycles depending on the CPU.
Also, there's another benefit downstream of that one: powers of two work as a Schelling point for allocations. Picking powers of two for resizable vectors maximizes "good luck" when you malloc/realloc in most allocators, in part because e.g. a buddy allocator is probably also implemented using power-of-two allocations for the above reason, but also for the plain reason that other users of the same allocator are more likely to have requested power-of-two allocations. Spontaneous coordination is a benefit all its own. Almost supernatural! :)
Powers of two are problematic with growable arrays on small heaps. You risk ending up with fragmented space you can't reuse unless you keep the growth factor below ~1.61x (the golden ratio), which would necessitate data structures that can deal with arbitrary sizes.
Non-power-of-two is only really feasible if the total number of inserts will fit in your post/ack counters. Otherwise you have to implement overflow manually, which may or may not be possible to do with the available atomic primitives on your architecture.
I first encountered this structure at a summer internship at a company making data switches.
Your link has an invalid cert FYI, but do appreciate the knowledge drop. Ring buffers are some of the cooler data structures out there.
> So there I was, implementing a one element ring buffer. Which, I'm sure you'll agree, is a perfectly reasonable data structure.
It is, but, IMO, it shouldn't use the code for "an n-element ring buffer, with n set to 1", similarly to how an array of booleans in many languages shouldn't be implemented as "an array of Foos, with Foo set to bool".
C++ has std::bitset and std::vector<bool> and Java similarly has BitSet and Array because using the generic code for arrays of bits is too wasteful.
Similarly, a one-element ring buffer is either full or it is empty. Why use two indexes to encode a single boolean?
> C++ has std::bitset and std::vector<bool> and Java similarly has BitSet and Array because using the generic code for arrays of bits is too wasteful.
Rather infamously, C++ tried to be clever here and std::vector<bool> is not just a vector-of-bools but instead a totally different vector-ish type that lacks many of the important properties of every other instantiation of std::vector. Yes, a lot of the time you want the space efficiency of a dynamic bitset, rather than wasting an extra 7 bits per element. But also quite often you do want the behavior of a "real" std::vector for true/false values, and then you have to work around it manually (usually via std::vector<uint8_t> or similar) to get the expected behavior.
It was for a dynamically growing ring buffer that also did small-object optimization. The natural implementation was to have the capacity and the offsets stored in fixed locations and with a fixed width, and have the variable part be a union of a pointer or an inline byte buffer.
Depending on the element width, you'd have space for different amounts of data in the inline buffer. Sometimes 1, sometimes a few more. Specializing for a one-element inline buffer would be quite complex with limited gains.
In retrospect trying to use that as a running gag for the blog post did not work well without actually giving the full context, but the full context would have been a distraction.
> C++ has std::bitset and std::vector<bool>
Notably, this is not quite the case: std::vector<bool> is a specialisation that packs bits into words, causing an untold array (heh) of headaches.
And "wasteful" is doing a lot of lifting here. In terms of memory usage? Yes. In terms of CPU? The other way around.
> In terms of CPU? The other way around.
That depends on your architecture and access pattern. In the case of sequential access, packed bools may perform better, since arithmetic is usually far cheaper than memory operations.
As far as I know, the last approach is the only way to implement an efficient lock-free ring buffer.
There is one more way that is truly lock free. Most lock-free implementations relying on atomic compare-and-swap instructions are not lock free, afaik; they take a lock on the cache line in the CPU (in a way, you go from one global lock to many distributed locks).
There is one more mechanism that allows implementing ring buffers without having to compare head and tail indices at all (and doesn't rely on counters or empty/full flags etc.) that piggybacks on the cache coherence protocol.
That's not how "lock free" is defined/used. If you are considering the MESI M state to be a "lock" then you also have to grant that any write instruction is a "lock".
Those hardware-level locks are typically not considered because they work quite differently. A standard software mutex can cause other threads to block indefinitely if, for example, the thread holding the mutex gets preempted for a long time. "Lock free" isn't really about the locks, it's about a guarantee that the system makes progress.
In this sense, the hardware locks used for atomic instructions don't really count, because they're implemented such that they can only be held for a brief, well defined time. There's no equivalent to suspending a thread while it holds a lock, causing all other threads to wait for an arbitrary amount of time.
Interesting! Do you know of an example implementation of this?
I'm jealous of people who get to write ring buffers for work.
It feels like 90% of SWE jobs these days are about writing CRUD wrappers.
Sorry.
Mostly Type 1 and overflow is a diagnostic log at most. Losing all stale unprocessed data and leaving a ready empty buffer behind is often the desired outcome.
Type 3 is probably banned in most codebases because of the integer overflow.
Joke's on me: when I need them, I don't feel like writing them, so I just pick up an old one and tweak it. Or just tell Claude to build me one, and it one-shots it.
Related. Others?
I've been writing ring buffers wrong all these years - https://news.ycombinator.com/item?id=13175832 - Dec 2016 (167 comments)
Every implementation of the LMAX Disruptor I've come across uses this trick.