<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://joey-forever.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://joey-forever.github.io/" rel="alternate" type="text/html" /><updated>2026-04-19T13:43:21+00:00</updated><id>https://joey-forever.github.io/feed.xml</id><title type="html">Joey’s Notes on Concurrency and Performance</title><subtitle>Writings on concurrent data structures, C++ performance engineering, and systems programming.</subtitle><author><name>Joey</name></author><entry><title type="html">A Concurrent Red-Black Tree That Beats folly::ConcurrentSkipList on Read-Heavy Workloads</title><link href="https://joey-forever.github.io/2026/04/19/concurrent-rbtree/" rel="alternate" type="text/html" title="A Concurrent Red-Black Tree That Beats folly::ConcurrentSkipList on Read-Heavy Workloads" /><published>2026-04-19T00:00:00+00:00</published><updated>2026-04-19T00:00:00+00:00</updated><id>https://joey-forever.github.io/2026/04/19/concurrent-rbtree</id><content type="html" xml:base="https://joey-forever.github.io/2026/04/19/concurrent-rbtree/"><![CDATA[<p><strong>TL;DR:</strong> I built a header-only C++17 concurrent ordered set — <a href="https://github.com/Joey-Forever/ConcurrentRBTree"><code class="language-plaintext highlighter-rouge">gipsy_danger::ConcurrentRBTree</code></a>. On an Intel i9-13900K, it delivers <strong>1.52× aggregate throughput</strong> and <strong>up to 1.55× read throughput</strong> over <code class="language-plaintext highlighter-rouge">folly::ConcurrentSkipList</code> on 16-thread, 8-million-entry read-heavy workloads. The API is a drop-in replacement. The trick isn’t a new algorithm — it’s a dual-indexed layout that keeps reads fully lock-free while letting the tree itself stay dead-simple inside a single-writer critical section. 
This post explains how, why it actually wins on real silicon (hint: it’s <em>not</em> the cache hit rate), and where it loses.</p>

<p align="center">
  <img src="/assets/throughput_16threads_8000000init.jpg" alt="ConcurrentRBTree vs folly::ConcurrentSkipList — 16 threads, 8M entries, Intel i9-13900K" width="900" />
  <br />
  <em>16 threads · 8M <code>int32_t</code> entries · Intel i9-13900K</em>
</p>

<hr />

<h2 id="the-question-i-was-trying-to-answer">The question I was trying to answer</h2>

<p><code class="language-plaintext highlighter-rouge">folly::ConcurrentSkipList</code> is the de-facto C++ concurrent ordered set. It scales writes beautifully via CAS-based splicing and has been battle-tested inside Meta for years. For most workloads, it is the right answer.</p>

<p>But the workloads I care about — in-memory caches, real-time indexes, feature stores — share one property: <strong>reads dominate, usually more than 90% of all operations.</strong> And on those workloads, I had a nagging suspicion that a skip list’s constant factor was leaving performance on the table.</p>

<p>Skip lists are probabilistically balanced. A lookup on an <em>N</em>-entry skip list touches roughly <code class="language-plaintext highlighter-rouge">1.44 × log₂(N)</code> forward pointers — each potentially on a distinct cache line. A red-black tree lookup touches <code class="language-plaintext highlighter-rouge">log₂(N)</code> child pointers, and the top few tree levels stay hot in L1 across all threads. The Big-O is identical. The constants should not be.</p>
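<p>To put rough numbers on those constants for the 8M-entry working set used later in this post (napkin math from the formulas above, not measured data — the helper names are mine):</p>

```cpp
#include <cmath>
#include <cstdio>

// Expected pointer touches per lookup, using the constants quoted above.
double skipListTouches(double n) { return 1.44 * std::log2(n); }  // ~33 at n = 8e6
double rbTreeTouches(double n)   { return std::log2(n); }         // ~23 at n = 8e6

void compare(double n) {
    std::printf("n = %.0f: skip list ~%.1f touches, rb tree ~%.1f touches\n",
                n, skipListTouches(n), rbTreeTouches(n));
}
```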

<p>So I asked: <strong>can a concurrent red-black tree compete with <code class="language-plaintext highlighter-rouge">folly::ConcurrentSkipList</code>, given all the hand-over-hand locking nightmares that usually come with tree-based concurrency?</strong></p>

<p>Turns out: yes — if you are willing to make one specific trade-off.</p>

<hr />

<h2 id="the-key-insight-the-tree-is-a-hint-the-list-is-the-truth">The key insight: the tree is a hint, the list is the truth</h2>

<p>The traditional problem with concurrent balanced trees is that rotations perturb search paths. A reader mid-descent can land on a subtree that just rotated away from under it — and now it is looking at a node whose subtree contains the wrong keys. Classical solutions (Bronson et al.’s relaxed-AVL, hand-over-hand locking, optimistic versioning) all try to make the tree itself tolerate concurrent writes.</p>

<p>I went the other way.</p>

<p>The tree in <code class="language-plaintext highlighter-rouge">ConcurrentRBTree</code> is <strong>an approximate index, not the source of truth.</strong> Alongside it, I maintain a <strong>sorted singly-linked list</strong> threaded through all data nodes in key order, with acquire/release semantics on the <code class="language-plaintext highlighter-rouge">next_</code> pointers. Both structures point at the same underlying nodes:</p>

<p align="center">
  <img src="/assets/rbtree.svg" alt="Red-Black Tree — approximate index with relaxed-atomic child pointers" width="720" />
</p>

<p align="center">
  <img src="/assets/sorted_list.svg" alt="Sorted Linked List — authoritative order with acquire/release next_ pointers" width="800" />
</p>

<p align="center">
  <em>Two indexes, one set of nodes: every blue box in the list and every colored circle in the tree is the same physical <code>Node*</code> in memory.</em>
</p>

<p>Why this works: <strong>rotations do not change sorted order.</strong> If a reader’s tree descent lands on a predecessor that is <em>close</em> to the right position but off by a few slots (because a rotation just happened), walking the linked list forward from that predecessor will still find the correct target in <em>O(1)</em> expected steps.</p>

<p>Reads therefore proceed like this:</p>

<ol>
  <li>Descend the tree with <strong>relaxed-atomic loads</strong> to find an approximate <code class="language-plaintext highlighter-rouge">less_bound</code>.</li>
  <li>Walk the linked list forward <strong>up to 3 steps</strong> looking for the exact target.</li>
  <li>If 3 steps is not enough, assume rotation interference and retry from the root.</li>
</ol>
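<p>Sketched in code, the read path looks roughly like this. This is a simplified reconstruction from the three steps above; the <code class="language-plaintext highlighter-rouge">Node</code> fields and function names are illustrative assumptions, not the library's actual API:</p>

```cpp
#include <atomic>

// Minimal node sketch: one set of nodes, two indexes threaded through them.
struct Node {
    int value;
    std::atomic<Node*> left{nullptr}, right{nullptr};  // tree index (the hint)
    std::atomic<Node*> next{nullptr};                  // sorted list (the truth)
    std::atomic<bool>  accessible{true};               // published to readers?
};

// Step 1: lock-free descent with relaxed loads. The result is only a hint,
// so a stale view caused by a concurrent rotation is harmless; step 2
// validates it against the list.
Node* lessBoundHint(Node* root, int key) {
    Node *cur = root, *bound = nullptr;
    while (cur) {
        if (cur->value < key) {
            bound = cur;
            cur = cur->right.load(std::memory_order_relaxed);
        } else {
            cur = cur->left.load(std::memory_order_relaxed);
        }
    }
    return bound;  // strictly less than key, or nullptr
}

// Steps 2-3: walk the list at most kMaxListSteps nodes; if the key has not
// been passed by then, assume rotation interference and retry from the root.
constexpr int kMaxListSteps = 3;

bool contains(Node* root, Node* head, int key) {
    while (true) {
        Node* hint = lessBoundHint(root, key);
        Node* cur  = hint ? hint : head;   // no predecessor: start at list head
        for (int step = 0; cur && step <= kMaxListSteps;
             cur = cur->next.load(std::memory_order_acquire), ++step) {
            if (cur->value == key)
                return cur->accessible.load(std::memory_order_acquire);
            if (cur->value > key)
                return false;              // passed the key: definitely absent
        }
        if (!cur)
            return false;                  // ran off the end: key too large
        // budget exhausted without passing the key: retry from the root
    }
}
```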

<p>No locks on the read path. Ever.</p>

<hr />

<h2 id="the-write-path-serialize-only-the-commit">The write path: serialize only the commit</h2>

<p>Writers also use the tree as a hint. Every writer thread independently descends the tree lock-free, finding an <code class="language-plaintext highlighter-rouge">estimated_less_bound</code>. Only then does it acquire a single cache-line-aligned <code class="language-plaintext highlighter-rouge">std::atomic_flag</code> spinlock. Inside the critical section, it:</p>

<ol>
  <li>Refines the bound by walking the linked list (same 3-step check as readers).</li>
  <li>If the walk fails, releases the lock and retries from the top.</li>
  <li>Otherwise: splices the new node into the linked list (acquire/release <code class="language-plaintext highlighter-rouge">next_</code>), attaches or detaches it in the tree (relaxed stores — we are the only writer), performs rebalancing, and finally toggles the node’s <code class="language-plaintext highlighter-rouge">accessible_</code> flag (acquire/release) to make it visible to readers.</li>
</ol>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">while</span> <span class="p">(</span><span class="nb">true</span><span class="p">)</span> <span class="p">{</span>
    <span class="k">auto</span> <span class="n">estimated</span> <span class="o">=</span> <span class="n">findEstimatedLessBoundForWrite</span><span class="p">(</span><span class="n">value</span><span class="p">);</span>           <span class="c1">// lock-free descent</span>
    <span class="k">while</span> <span class="p">(</span><span class="n">write_leader_flag_</span><span class="p">.</span><span class="n">test_and_set</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">memory_order_acquire</span><span class="p">))</span>
        <span class="n">std</span><span class="o">::</span><span class="n">this_thread</span><span class="o">::</span><span class="n">yield</span><span class="p">();</span>

    <span class="k">auto</span> <span class="n">exact</span> <span class="o">=</span> <span class="n">findExactLessBoundForWrite</span><span class="p">(</span><span class="n">estimated</span><span class="p">,</span> <span class="n">value</span><span class="p">);</span>        <span class="c1">// validate via list</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">exact</span> <span class="o">==</span> <span class="nb">nullptr</span><span class="p">)</span> <span class="p">{</span> <span class="n">status</span> <span class="o">=</span> <span class="n">RETRY</span><span class="p">;</span> <span class="p">}</span>
    <span class="k">else</span> <span class="p">{</span> <span class="n">internalInsert</span><span class="p">(</span><span class="n">new_node</span><span class="p">,</span> <span class="n">exact</span><span class="p">);</span> <span class="n">status</span> <span class="o">=</span> <span class="n">SUCCESS</span><span class="p">;</span> <span class="p">}</span>

    <span class="n">write_leader_flag_</span><span class="p">.</span><span class="n">clear</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">memory_order_release</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">status</span> <span class="o">!=</span> <span class="n">RETRY</span><span class="p">)</span> <span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The structural mutation is serialized across writers. <strong>This is the trade-off.</strong> I gave up fine-grained write concurrency in exchange for:</p>

<ul>
  <li>A tree that never needs per-node locks.</li>
  <li>A commit section that runs in tens of cycles — short enough not to matter in read-heavy workloads.</li>
  <li>Implementation simplicity: the whole library is ~1000 lines of C++17 in a single header.</li>
</ul>

<p>Because only one writer ever mutates the tree at a time, every child-pointer store inside the critical section uses <code class="language-plaintext highlighter-rouge">memory_order_relaxed</code>. Acquire/release fencing is needed on exactly two edges: the <code class="language-plaintext highlighter-rouge">next_</code> pointer of the sorted list, and the per-node <code class="language-plaintext highlighter-rouge">accessible_</code> flag — the two things readers actually follow.</p>
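<p>A self-contained sketch of that split, and of the publish step at the end of an insert. The field and function names here are my own, not the library's, but the ordering discipline is the one described above:</p>

```cpp
#include <atomic>

// Node fields split by ordering requirement.
struct Node {
    int value;
    // Tree index: mutated only inside the single-writer critical section and
    // treated as a hint by readers, so relaxed stores and loads suffice.
    std::atomic<Node*> parent{nullptr}, left{nullptr}, right{nullptr};
    unsigned char color{0};  // red/black: read only by the (single) writer

    // The two reader-facing edges that need acquire/release:
    std::atomic<Node*> next{nullptr};     // sorted-list order
    std::atomic<bool>  accessible{false}; // publication flag
};

// Writer-side publish, inside the critical section. The release store to
// pred->next makes the new node's initialization visible to any reader that
// acquires it; accessible flips last, so readers never treat a half-linked
// node as a member of the set.
void spliceAfter(Node* pred, Node* node) {
    node->next.store(pred->next.load(std::memory_order_relaxed),
                     std::memory_order_relaxed);        // node not yet reachable
    pred->next.store(node, std::memory_order_release);  // reachable in list order
    node->accessible.store(true, std::memory_order_release);  // visible to reads
}
```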

<hr />

<h2 id="why-its-actually-faster-and-its-not-what-youd-guess">Why it’s actually faster (and it’s not what you’d guess)</h2>

<p>Here is what surprised me the first time I ran <code class="language-plaintext highlighter-rouge">perf stat</code>. Workload: 10% writes, 16 threads, 8M entries.</p>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>ConcurrentRBTree</th>
      <th>ConcurrentSkipList</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Data references</td>
      <td>8.08 × 10⁹</td>
      <td>15.19 × 10⁹</td>
    </tr>
    <tr>
      <td>Instructions</td>
      <td>22.0 × 10⁹</td>
      <td>42.0 × 10⁹</td>
    </tr>
    <tr>
      <td>Branches</td>
      <td>4.41 × 10⁹</td>
      <td>8.74 × 10⁹</td>
    </tr>
    <tr>
      <td>L1d miss rate</td>
      <td>11.0%</td>
      <td><strong>9.2%</strong></td>
    </tr>
    <tr>
      <td>LLd miss rate</td>
      <td>4.1%</td>
      <td><strong>2.4%</strong></td>
    </tr>
    <tr>
      <td>Branch mispredict rate</td>
      <td>13.1%</td>
      <td><strong>9.1%</strong></td>
    </tr>
  </tbody>
</table>

<p align="center">
  <img src="/assets/cpu_cache_usage.jpg" alt="perf stat — CPU cache and branch behavior" width="700" />
</p>

<p>Read the last three rows twice. <strong>The skip list has better cache hit rates and better branch prediction than my tree.</strong> Its nodes are smaller, and its pointer-chase pattern is more predictable.</p>

<p>And yet — the tree wins by 1.5×. How?</p>

<p>Look at the first three rows. The tree does <strong>roughly half</strong> as many data references, half as many instructions, and half as many branches per operation. Even with a worse <em>per-access</em> miss rate, the <em>absolute</em> number of L1d misses is lower (892M vs. 1.40B), because there are far fewer accesses in total. The same holds for absolute branch mispredicts (578M vs. 794M).</p>
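<p>The arithmetic is easy to verify from the table; the small discrepancies against the quoted absolute counts are rounding in the table's rates:</p>

```cpp
#include <cstdio>

// Absolute event counts implied by the perf stat table above:
// per-access rates lose, absolute counts win.
double absoluteEvents(double references, double missRate) {
    return references * missRate;
}

void report() {
    std::printf("L1d misses:  tree %.3g vs skip list %.3g\n",
                absoluteEvents(8.08e9, 0.110),    // ~0.89e9
                absoluteEvents(15.19e9, 0.092));  // ~1.40e9
    std::printf("mispredicts: tree %.3g vs skip list %.3g\n",
                absoluteEvents(4.41e9, 0.131),    // ~0.58e9
                absoluteEvents(8.74e9, 0.091));   // ~0.80e9
}
```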

<p>The mental model I landed on: <strong>a red-black tree operation is more expensive per step, but does half as many steps.</strong> A skip list operation is cheaper per step, but does twice as many. In read-heavy regimes, the second effect dominates. Flip the workload to write-heavy with a tight working set and the story reverses.</p>

<p>This is, retroactively, the boring explanation from any data-structures textbook: both structures are <em>O(log n)</em>, but constant factors matter, and the red-black tree’s constant is about 1/2 that of the skip list. I just had not seen it measured cleanly in a concurrent C++ setting before.</p>

<hr />

<h2 id="where-it-loses-and-you-should-care">Where it loses (and you should care)</h2>

<p>Every benchmark post over-claims. I want to be honest about this.</p>

<p><strong>1. Heavy writes on tiny working sets.</strong> At 27 threads and 100K entries (comfortably L2-resident) with write probability 0.5, <code class="language-plaintext highlighter-rouge">ConcurrentRBTree</code> runs at 0.73× <code class="language-plaintext highlighter-rouge">ConcurrentSkipList</code>’s throughput. The single writer-side spinlock serializes all mutations, and on workloads where the lock-free descent is short, the critical section dominates total time. <code class="language-plaintext highlighter-rouge">folly::ConcurrentSkipList</code>’s CAS-based writer scales linearly here.</p>

<p><strong>2. Memory overhead.</strong> Each tree node carries a parent pointer, two child pointers, a <code class="language-plaintext highlighter-rouge">next_</code> pointer, a color byte, and an <code class="language-plaintext highlighter-rouge">accessible_</code> atomic bool. At 4-byte values, nodes are about 30% larger than a skip-list node of expected height close to 1. The overhead shrinks linearly as value size grows — by 64-byte values, it is under 5%.</p>

<p><strong>3. No custom comparator or <code class="language-plaintext highlighter-rouge">const_iterator</code> yet.</strong> Both are on the roadmap.</p>

<p>The decision rule I actually use:</p>

<table>
  <thead>
    <tr>
      <th>Your workload</th>
      <th>Pick</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Reads &gt; 80%, working set &gt; L2</td>
      <td><strong>ConcurrentRBTree</strong></td>
    </tr>
    <tr>
      <td>Writes &gt; 20%</td>
      <td>ConcurrentSkipList</td>
    </tr>
    <tr>
      <td>Tiny set, extreme thread count, mixed workload</td>
      <td>ConcurrentSkipList</td>
    </tr>
    <tr>
      <td>Range scans matter a lot</td>
      <td><strong>ConcurrentRBTree</strong></td>
    </tr>
    <tr>
      <td>Memory-constrained, tiny values</td>
      <td>ConcurrentSkipList</td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="a-note-on-range-scans">A note on range scans</h2>

<p>One payoff I did not expect: <strong>range scans are nearly free.</strong></p>

<p>A classical red-black tree range scan has to do an in-order traversal — explicit stack or parent-pointer walks, with a per-edge cost roughly 3× that of a simple list <code class="language-plaintext highlighter-rouge">next</code>. Here, <code class="language-plaintext highlighter-rouge">iterator::operator++</code> just calls <code class="language-plaintext highlighter-rouge">accessibleNext()</code> on the underlying linked list. Cost is the same as iterating a <code class="language-plaintext highlighter-rouge">std::list</code>.</p>
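<p>A minimal sketch of what that increment amounts to (illustrative names, not the library's code): the iterator never re-enters the tree, it just chases acquire-loaded next pointers and skips nodes that are not yet published or already retired:</p>

```cpp
#include <atomic>

// Minimal list node; names mirror the post, not the actual implementation.
struct Node {
    int value;
    std::atomic<Node*> next{nullptr};
    std::atomic<bool>  accessible{true};
};

// What iterator::operator++ boils down to: advance past any node that is
// spliced into the list but not published (or retired but not yet unlinked).
Node* accessibleNext(const Node* n) {
    Node* cur = n->next.load(std::memory_order_acquire);
    while (cur && !cur->accessible.load(std::memory_order_acquire))
        cur = cur->next.load(std::memory_order_acquire);
    return cur;
}
```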

<p>If your workload does <code class="language-plaintext highlighter-rouge">lower_bound</code> + sequential scan (time-series buckets, log queries, range indexes), this is probably the biggest unspoken win — and it comes directly out of the dual-index design rather than being a separate feature.</p>

<hr />

<h2 id="api-its-a-drop-in-for-follyconcurrentskiplist">API: it’s a drop-in for <code class="language-plaintext highlighter-rouge">folly::ConcurrentSkipList</code></h2>

<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;ConcurrentRBTree.h&gt;</span><span class="cp">
</span><span class="k">using</span> <span class="k">namespace</span> <span class="n">gipsy_danger</span><span class="p">;</span>

<span class="k">auto</span> <span class="n">tree</span> <span class="o">=</span> <span class="n">ConcurrentRBTree</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;::</span><span class="n">createInstance</span><span class="p">();</span>
<span class="n">ConcurrentRBTree</span><span class="o">&lt;</span><span class="kt">int</span><span class="o">&gt;::</span><span class="n">Accessor</span> <span class="nf">accessor</span><span class="p">(</span><span class="n">tree</span><span class="p">);</span>

<span class="n">accessor</span><span class="p">.</span><span class="n">insert</span><span class="p">(</span><span class="mi">42</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">accessor</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span> <span class="o">!=</span> <span class="n">accessor</span><span class="p">.</span><span class="n">end</span><span class="p">())</span> <span class="p">{</span> <span class="cm">/* ... */</span> <span class="p">}</span>
<span class="k">for</span> <span class="p">(</span><span class="k">auto</span> <span class="n">it</span> <span class="o">=</span> <span class="n">accessor</span><span class="p">.</span><span class="n">lower_bound</span><span class="p">(</span><span class="mi">10</span><span class="p">);</span> <span class="n">it</span> <span class="o">!=</span> <span class="n">accessor</span><span class="p">.</span><span class="n">end</span><span class="p">();</span> <span class="o">++</span><span class="n">it</span><span class="p">)</span> <span class="p">{</span> <span class="cm">/* ... */</span> <span class="p">}</span>
<span class="n">accessor</span><span class="p">.</span><span class="n">erase</span><span class="p">(</span><span class="mi">42</span><span class="p">);</span>
</code></pre></div></div>

<p>Compile with <code class="language-plaintext highlighter-rouge">-std=c++17 -DNDEBUG -I/path/to/include</code>. Nothing else.</p>

<p>The <code class="language-plaintext highlighter-rouge">Accessor</code> doubles as an epoch guard for an internal <code class="language-plaintext highlighter-rouge">NodeRecycler</code> that is directly modeled on <code class="language-plaintext highlighter-rouge">folly::ConcurrentSkipList::NodeRecycler</code>: erased nodes stay alive until the last live <code class="language-plaintext highlighter-rouge">Accessor</code> releases its reference. Same weak-consistency guarantees as Folly — readers may see stale data but never corrupted state.</p>

<hr />

<h2 id="where-to-go-next">Where to go next</h2>

<ul>
  <li><strong>Code:</strong> <a href="https://github.com/Joey-Forever/ConcurrentRBTree">github.com/Joey-Forever/ConcurrentRBTree</a>. Header-only, MIT-licensed, ~1000 lines of C++17.</li>
  <li><strong>Full benchmark matrix</strong> (3 thread counts × 4 working-set sizes × 16 write probabilities, 12 plots): <a href="https://github.com/Joey-Forever/ConcurrentRBTree/tree/main/src/test/comparision_test/x86_result">x86_result/</a></li>
  <li><strong>Full technical write-up</strong> (dual-index correctness sketch, deeper hardware-counter analysis, epoch reclamation details): <em>arXiv preprint coming — link to be added.</em></li>
  <li><strong>Folly upstream proposal:</strong> <a href="https://github.com/facebook/folly/issues/2638">facebook/folly#2638</a></li>
</ul>

<p>If you try it on your workload, I would love to hear about it — especially results from people running large read-heavy caches on ARM or POWER. Those architectures are on my roadmap but I have no hardware access yet.</p>

<p>And if you are from the Folly team: <code class="language-plaintext highlighter-rouge">ConcurrentSkipList::NodeRecycler</code> is what the reclamation here is directly modeled on. This work would not exist without it. Thank you.</p>

<hr />

<p><em>Questions, bug reports, corrections — file an issue on <a href="https://github.com/Joey-Forever/ConcurrentRBTree/issues">GitHub</a>.</em></p>]]></content><author><name>Joey</name></author><category term="cpp" /><category term="concurrency" /><category term="data-structures" /><category term="performance" /><category term="folly" /><summary type="html"><![CDATA[A header-only C++17 concurrent ordered set that outperforms folly::ConcurrentSkipList by 1.5× on read-heavy workloads — why the red-black tree, and why it wins despite having worse cache hit rates.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://joey-forever.github.io/assets/throughput_16threads_8000000init.jpg" /><media:content medium="image" url="https://joey-forever.github.io/assets/throughput_16threads_8000000init.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>