Safety and Performance – Threadsafe Datastructures in C++ (sheep.horse)
91 points by todsacerdoti on May 23, 2022 | 51 comments



I've spent a lot of time studying parallel compute over the past few years, trying to understand how GPU-algorithms are so fast.

Use barriers. Many data-structures are safe if:

1. Everyone is reading at the same time.

2. Only one thread is writing to any particular location.

How do you enforce these two conditions? Again, use barriers. Take the GPU Merge Path algorithm as an example. (Merge Path is just a parallel implementation of the merge pass of merge sort, which normally takes O(N) time on a sequential computer):

1. All threads read their relevant values from the array and perform a binary search along the diagonal (read-only over the data-structure). Binary search is O(lg(n)).

2. Barrier, all threads wait for all other threads to be done reading.

3. Write the "merge path".

4. Barrier, wait for everyone to finish writing.

5. Everyone reads the "merge path", which is all you need to figure out the "final location" of any particular value in your merge-sort in O(lg(n)) time.

6. Everyone writes their value "magically" to the correct, sorted, position of the array. Because everyone is writing to a different array location, there's no contention or race-conditions.

Since every step is O(lg(n)), your Merge Path algorithm executes in O(lg(n)) time when the parallelism matches the data size (ex: a 10,000-element array gets 10,000 processors), which is feasible on modern GPUs.

Bonus points: all SIMD computation is innately barrier based. That's what SIMD means after all: all simd-lanes execute the same instruction at any given time. If your processor is on a "load" instruction, everyone's reading. If the processor is on a "store" instruction, everyone is writing. (Modern GPUs are MIMD though, and require explicit "barrier" instructions and/or kernel invoke calls if the SIMD-units from other compute-units are cooperating)

Your "only" job as the programmer, is to therefore, just ensure that those writes are to all different memory locations. A difficult job for sure, but doable on a wide variety of algorithms.

------------

Very, very few GPU-algorithms use mutexes or even atomics (!!!). The "bread and butter" is thread-barriers plus concurrent writes to disjoint memory locations, with barriers ensuring that the other threads are done reading or writing before the next phase begins.


> Use barriers. Many data-structures are safe if:

> 1. Everyone is reading at the same time.

> 2. Only one thread is writing to any particular location.

I think there is a missing point here:

3. Reads and writes do not happen at the same time.


If rules 1. and 2. are followed, 3. is implicitly true


Oh, right. If everyone is reading there is nobody left to write :)

It seems like quite a restriction, though.


See: https://news.ycombinator.com/item?id=31480323

> The easiest way to understand 100,000 concurrent threads is by making all 100,000 threads do the same thing and execute the same code.

If all your threads are off doing different things at different times, and executing different sets of code, you might have problems understanding your algorithm.

It's a restriction for sure, but it does lead to far simpler "thinking". There's still a very rich space of parallel algorithms that can be explored within this restricted space.

In effect: you probably *don't* want to write general-purpose "desynchronized" code unless you really have to. Keeping all 100,000 threads synchronized and in roughly the same spot of the code makes the whole thing much easier to reason about.


How can reads and writes happen at the same time if everyone is reading at the same time?


Doesn't this only apply when all the threads are doing the same thing but on different data?


That's not a bad restriction in practice.

The easiest way to understand 100,000 concurrent threads is by making all 100,000 threads do the same thing and execute the same code.

EDIT: Works on the small scale too. If you're using a CPU, then you can assign exactly 128-threads to be your data-structure's "workers". You access the data-structure sequentially, but all manipulations are done in parallel by the 128x "worker" threads in the background.

Assuming a 64-core / 128-thread CPU like an AMD EPYC or whatever.

Normally, your data-structure's "commands / frontend" aren't the bottleneck; instead it's the data-structure's "work" that bottlenecks. As long as that work happens in parallel, you're probably fine.

------

Bonus points: this leads to "obvious" NUMA-locality. If your NUMA-cluster is just 8c/16-threads, you can have a NUMA-local set of 16-workers with affinity set. You scale to the level of your L3 cache (or whatever arbitrary NUMA-node you want), and different threads can have their "local workers" to choose from.
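
A minimal sketch of that "sequential frontend, parallel workers" shape, assuming C++20 std::barrier. WorkerPool and parallel_apply are made-up names, and a real version would add per-worker affinity for the NUMA point:

    #include <barrier>
    #include <functional>
    #include <thread>
    #include <vector>

    class WorkerPool {
    public:
        explicit WorkerPool(unsigned n) : n_(n), start_(n + 1), done_(n + 1) {
            for (unsigned i = 0; i < n_; ++i)
                workers_.emplace_back([this, i] {
                    for (;;) {
                        start_.arrive_and_wait();   // sleep until the frontend issues a command
                        if (stop_) return;
                        task_(i, n_);               // each worker handles its own slice
                        done_.arrive_and_wait();    // report completion
                    }
                });
        }

        // Called sequentially by the data structure's "frontend".
        void parallel_apply(std::function<void(unsigned worker, unsigned total)> task) {
            task_ = std::move(task);
            start_.arrive_and_wait();   // release the workers
            done_.arrive_and_wait();    // wait until every slice is processed
        }

        ~WorkerPool() {
            stop_ = true;
            start_.arrive_and_wait();
            for (auto& t : workers_) t.join();
        }

    private:
        unsigned n_;
        std::barrier<> start_, done_;
        bool stop_ = false;
        std::function<void(unsigned, unsigned)> task_;
        std::vector<std::thread> workers_;
    };

Usage would look like WorkerPool pool(128); pool.parallel_apply([&](unsigned w, unsigned total) { /* process elements w, w+total, w+2*total, ... */ });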


> That's not a bad restriction in practice.

Maybe, but it surely is inconvenient!

When our program can use a pipeline (or several pipelines) for concurrency, the way GPU shaders work, sure, that's a good way to go. However, the majority of what threads do is not pipeline-friendly work.

When we want our program to do different things at the same time (which is what we mostly use threads for), trying to do it in a pipeline-like manner won't work.

For example, serving multiple http requests at the same time - some will be a POST to an API endpoint, some will be a GET for a static file, some controller code will make multiple requests to different backend microservices (each request in a different thread), etc.

Or maybe some client code, that makes several concurrent requests for different resources to different destinations, while each thread still needs to update the UI (progress meter, for example).

Or a parser for input from some upstream processes: you want to parse in parallel if at all possible.

Or the program in a microcontroller that cannot miss bytes coming in on one bus while it is sending bytes out on another bus.


Concurrent =/= parallel. CPU thread safety mostly manifests [or let's say is much more challenging] in the context of concurrent, not parallel, algorithms. I'm not aware of anyone doing concurrent programming on the GPU (because it is a poor fit).


> I'm not aware of anyone doing concurrent programming on the GPU (because it is a poor fit).

It's routine in good Vulkan-based games. One or more CPU threads are telling the GPU what to render, while other CPU threads are concurrently loading new content into the GPU for rendering on later frames. (You're driving down a road. Ahead of you, new content has to be loaded before you get there. Behind you, the memory containing content you can no longer see is released.)

This is complicated to make work. Unreal Engine does it, I think. Rend3->WGPU->Vulkan is supposed to do it, but right now there are locking delays which cause content updating to impact rendering. That's being worked on at the WGPU level.

This is one of those little problems being solved on the way to a big-world Metaverse.


Interesting, thanks.


Any concurrent algorithm, by definition, has branches of computation that are independent of each other, and is thus parallel.

Any parallel algorithm, by definition, is a (trivially) concurrent algorithm.


The usual way that the difference is presented is that concurrency is a semantic property of your program independent of its runtime performance while parallelism is a runtime property of your program independent of its semantics.

In particular, a basic property of concurrent programs is non-determinism. It must always be available (can't be offline while serving one response) to service multiple requests in non-deterministic order (and often will give responses in a non-deterministic order as well even for a deterministic order of requests, or at least leave such an option available in its semantics). There are almost always other invariants that must be preserved (which is what it means for a concurrent program to be correct), but generally the invariants are lax enough that a concurrent program does not make guarantees about being deterministic. This has no direct relationship with its runtime implementation. For example you could implement a concurrent program in a single-threaded fashion just by interleaving and jumping back and forth between servicing different requests. This is effectively how something like green threads or async-await works.

Parallelism on the other hand is a property of the runtime. Indeed, often parallel programs are meant to be entirely deterministic in their semantics (e.g. map-reduce-type computations), where the answer will always be the same no matter what. You just might have faster runtime. But parallelism has no inherent link to non-determinism in the way that concurrency does, because after all determinism/non-determinism is a semantic property.


> But parallelism has no inherent link to non-determinism in the way that concurrency does, because after all determinism/non-determinism is a semantic property.

I don't know if I agree with this.

If we consider the map example, the order of elements to which the function is applied must be irrelevant, so you must have non-determinism at the most fundamental level. So map is still a concurrent algorithm, isn't it?

I mean, I get it, the meaning of the words is different, but I can't find any difference beyond "concurrency is a semantic property, parallelism is runtime manifestation of concurrency", which only makes them even more equivalent-ish.


> If we consider the map example, the order of elements to which the function is applied must be irrelevant, so you must have non-determinism at the most fundamental level.

This is not true. Parallelism does not require non-determinism (you can e.g. require as part of your interface that the list you are map-reducing be pre-sorted, and then, using that sort key, the order of operations is entirely determined; you could even pin certain sort keys to certain processor cores. This is in fact how some parallel interfaces for certain parallelism libraries operate, minus the pinning).

But that's only partially related to the larger point, which is that any discussion about the semantics vs the implementation of a program depends on where you draw the line between the semantics of your program and its runtime behavior. For different environments that line will be at different places. For most applications, runtime and memory usage are usually not considered part of its semantics (otherwise e.g. the notion of porting an application or refactoring wouldn't exist, since both of those would be changing the application's performance and therefore its semantics and would be essentially the same as changing the application's interface). In hard real-time environments, on the other hand, usually runtime and memory usage are considered part of its semantics, so the division between parallelism and concurrency is less relevant there.

> I can't find any difference beyond "concurrency is a semantic property, parallelism is runtime manifestation of concurrency", which only makes them even more equivalent-ish.

A very practical difference is that if you tell me a program is concurrent but non-parallel, I have a good idea of the outline of what that program does. If you tell me that a program is parallel but not concurrent, I also have a good idea of what that program is. And the two are very different and require very different engineering trade-offs.

Examples of the former include things such as chatbots, web servers, databases (usually the latter two will also sprinkle in parallelism), etc. Examples of the latter include compilers, physics simulations, rendering engines, etc.

And different tools are tuned for different use cases. If you use a tool meant for building concurrent, but not necessarily parallel, programs, you'll have a bad time building a parallel-but-not-concurrent program with it, and vice versa. E.g. Node.js is great for concurrent, but not parallel, programs. You will have a horrible time building a parallel, but not concurrent, program with it. None of its trade-offs will make sense. A similar case holds for something like Erlang (it at least has some support for parallelism, but it's not its strong suit). On the flip side, Fortran is great for parallel but not concurrent programs, but you're not going to find a lot of support for concurrent programs.


Yet we do distinguish the two types by distinct names (and algorithmic and data structure treatment) for a reason, precisely because your proposed equivalence offers little analytical and practical benefit.


> equivalence offers little analytical and practical benefit

I disagree, but that's irrelevant.

Considering things equivalent is the default if their properties are the same - and in the case of concurrency/parallelism, I think I have clearly demonstrated how every concurrent algorithm is parallel, and vice versa.

If you're claiming that a distinction should be drawn, the burden of proof is on you. What is the fundamental difference between concurrency and parallelism?


> I think I have clearly demonstrated how every concurrent algorithm is parallel, and vice versa.

You have not "clearly demonstrated" anything. You are merely pointing out a weak truism that holds in parts, which is the banal truth that all algorithms feature non-contended segments for a given executing thread. The fact that the distinction between these two types is not 'crisp' and precise does not mean a meaningful distinction does not exist.

If your position was valid, we would call e.g. a multi-producer multi-consumer queue a "parallel" data structure but we certainly don't. Sure, the producers are all producing in parallel before contending to enqueue, and the consumers do their thing in parallel after they contend to dequeue. That parallelism is a banal fact in this case, given that what we really need to address are the concurrent attempts to update the queue. Conversely, applying a filter to an image by tiling and performing the application in parallel has some minor contention bits but the main business is doing something in parallel.


Aside from the appeal to authority ("we" say X, therefore X is right), you bring up some good points (although in a fairly unpleasant tone, which I don't appreciate in the slightest). But you still haven't answered my question - what is the fundamental difference between concurrency and parallelism?


Took a short walk and was thinking about this. This is my take on the matter.

The most fundamental difference between parallel and concurrent processing is that for the latter we need not have the same number of physical actors (threads) as the required logical actors [for meaningful/practical gains]. In other words, we can (and do, as in 'green threads') emulate the synchronized activity of nL logical actors with nR real physical actors, where 1 <= nR <= nL. Of course we can also emulate nL parallel activities with only 1 real actor but it is of no practical value.

So Amdahl's Law is very much about parallelism but it has absolutely nothing to do with 'concurrency'. [Whereas Queueing Theory is of relevance to analyzing concurrent algorithms.]

This is because in parallel algorithms, the concurrent actors are all doing the same thing (see SIMD), whereas in concurrent algorithms, the concurrent actors may be doing entirely different things (see typical GUI component of an OS).

This fact also informs why arrays and array processing languages are such good fits for parallel programming.

Then, it seems to me that parallel processing is orchestrated (deterministic) in nature, whereas concurrent processing is all about synchronized pseudo-chaos. [So parallelism is a simple, well ordered, type of concurrency. Think of it as symmetric and synchronized concurrency.]

Finally, the 'fuzzy' distinguishing matter is a sense of the degree of contention for mutating ops on registers. A concurrent algorithm is pretty much all about dealing with that: highly contended R/W registers.

p.s. so more examples of making this distinction: fork-join is about parallelism. green-threads/fibers are about concurrency. continuations are also about concurrency. Go is a language tuned for concurrency. Chapel is one tuned for parallelism.

p.p.s. things can be "embarrassingly parallel", but dealing effectively with concurrency is never something to be embarrassed about. /g

https://en.wikipedia.org/wiki/Embarrassingly_parallel


I see.

I think I understand better the motivation behind your initial comment that I've criticized. Thanks for taking time to explain.


NP. Actually it was good of you to insist on a clear definition. I would say applicability of Amdahl's Law is a decisive litmus test.


As I understand it, parallelism describes tasks making progress simultaneously, while concurrency describes tasks that can be broken into smaller, independent tasks (that could still be executed serially).

For example, I could have a map-reduce problem executing sequentially on a single-core CPU. Each map function would operate on the data sequentially, then the reduce step. The process is concurrent because I could work on any map step in any order, or even stop midway and work on another task set. Once all steps are complete I have my output. But since only one task was ever making progress at a time the computation wasn't parallel; it only becomes parallel once I add another execution unit.

Consider multitasking on early operating systems: a user may have several programs running concurrently, but early CPUs could only do one thing at a time, hence there was no parallelism. Due to this architecture there are situations where all execution units are blocked and only one of them can make progress (e.g. reading from a spinning disk), making CPU synchronization primitives a "poor fit" for GPUs.

In other words, while any parallel algorithm, by definition, is a (trivially) concurrent algorithm, the inverse is not true, and is still a harder problem.


The "elevator algorithm" makes hard drives (and tapes) more-and-more efficient the more work is queued up for them.

Let's say you have an elevator serving 10 floors (floor#0 through floor#9). You can imagine that I/O requests are handled by the elevator moving to a specific floor, reading/writing at that floor, and then moving to another floor.

Physically, this corresponds to the hard-drive arm being positioned over various sectors of the drive: you can think of floor#0 as the "inner ring" of the hard drive and floor#9 as the "outer ring". The further apart the floors are, the longer it takes for the arm to physically move across the drive.

----------------------

Now let's say a bunch of requests have come in: read floor#5, write floor#9, read floor#1, write floor#2, write floor#7. Executing these 5 requests in arrival order means visiting 5->9->1->2->7, which is clearly inefficient.

If all 5 requests are known at the same time, the hard drive can instead reorder them into 1->2->5->7->9, which moves the arm a much shorter distance. This is the innate benefit of "concurrency" on 1980s-era computers.

You wanted multiple threads to buffer up as many requests to the hard drive as possible. The more "load" you give the hard drive, the more overall efficiency is achieved, because the drive controller can optimize those read/write requests and minimize the movement of the arm (ex: rearranging the "floor visit pattern" of the elevator algorithm to minimize movement of the hard-drive arm).
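
A tiny sketch of that reordering, with made-up names (a full elevator/SCAN implementation would also account for the arm's current position and sweep direction):

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct Request { int track; bool is_write; };

    int main() {
        // Arrival order from the example above: 5 -> 9 -> 1 -> 2 -> 7.
        std::vector<Request> pending{{5, false}, {9, true}, {1, false}, {2, true}, {7, true}};

        // One elevator sweep: service the whole batch in ascending track order.
        std::sort(pending.begin(), pending.end(),
                  [](const Request& a, const Request& b) { return a.track < b.track; });

        for (const auto& r : pending)
            std::printf("%s track %d\n", r.is_write ? "write" : "read", r.track);
        // Visits 1 -> 2 -> 5 -> 7 -> 9: far less arm travel than 5 -> 9 -> 1 -> 2 -> 7.
    }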

----------------------

So we can see that "concurrency", or "multithreaded" programming, was about maximizing I/O performance rather than CPU performance.

From this perspective, highly CPU-inefficient structures, such as Mutexes, are not really a penalty. The CPU is so much faster than the hard drive that it doesn't matter. Mutexes, task-switching, etc. etc. Spend all the CPU cycles you want on this, the CPU is obviously not the bottleneck here.

In contrast, today is 2020. We have solid state drives that are much much faster (still benefiting from parallel requests though!!), and the CPU itself can be the bottleneck.

Not only do we want to fill up our I/O systems with a huge number of parallel requests (modern SSDs can perform far, far more requests with high-queue depths), you also want to take advantage of multi-CPU clusters through parallel programming techniques.

-------------

As such, my particular usage of the words are:

* Concurrency is the set of techniques to maximize I/O usage, even at a high cost in CPU-time / CPU-power (mutexes, semaphores, pthreads, task-switching). If you have a technique that can "invent new I/O requests out of nothingness", that's ideal in the concurrency mindset (agents, golang-threads, etc.).

* Parallelism is the set of techniques to maximize CPU usage, and therefore must be as low-cost in CPU-power as possible (thread barriers, SIMD processing, spinlocks, thread-pools, thread-affinity, NUMA processing).

All of those golang-threads that generate more I/O requests are worthless in the scope of CPU-limited "parallel compute". But they're extremely helpful in I/O-limited "concurrency". The more I/O requests that hit the controller simultaneously, the better those I/O devices perform. This is true for all forms of I/O, be it a GPU, network adapter, hard drive, tape drive, or solid state drive.


Sorry about the tone!(?)


I apologize for my defensiveness. I tend to get defensive and emotional when under stress. I don't like that about myself.


One nitpick with this: when the author first uses a mutex to ensure safety, he says "the operations took over 50 times longer!"

Well sure. But "the operations" were single reads and writes to a small array likely held in cache. For more complex data structures, it's easy to see that the operations themselves would take ten times or more longer than reading from a fixed location -- leading to only a 5X slowdown. That might still be a lot, but I'm still going to nitpick the "50" comment :-)


This is why it isn’t called Amdahl's suggestion.

The use of std::mutex (especially in frequently accessed structures) will bring your application to a grinding halt. Better solutions abound with communicating sequential processes, or the Actor architecture.


> The use of std::mutex (especially in frequently accessed structures) will bring your application to a grinding halt. Better solutions abound with communicating sequential processes, or the Actor architecture.

I might be misunderstanding what you mean, but if heavy contention around a specific resource becomes a bottleneck, then putting it in an actor won't help you at all, because you've replaced the contention around the mutex with contention around accessing the actor's input queue.

The solution to contention is usually some kind of replication. So for example, if you have a shared hashmap which is protected by a mutex, the solution isn't to put the hashmap into an actor, but to split it off into, say, 16 hashmaps, and partition the entries according to the lower 4 bits of the hash.
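
A minimal sketch of that replication-by-sharding idea; ShardedMap and its methods are made-up names, and real concurrent/striped hash maps do considerably more:

    #include <array>
    #include <cstddef>
    #include <mutex>
    #include <optional>
    #include <unordered_map>

    template <typename K, typename V>
    class ShardedMap {
        static constexpr std::size_t kShards = 16;

        struct Shard {
            std::mutex m;
            std::unordered_map<K, V> map;
        };
        std::array<Shard, kShards> shards_;

        Shard& shard_for(const K& key) {
            return shards_[std::hash<K>{}(key) & (kShards - 1)];  // low 4 bits pick the shard
        }

    public:
        void insert(const K& key, V value) {
            Shard& s = shard_for(key);
            std::lock_guard lock(s.m);        // contention is confined to one shard
            s.map[key] = std::move(value);
        }

        std::optional<V> find(const K& key) {
            Shard& s = shard_for(key);
            std::lock_guard lock(s.m);
            auto it = s.map.find(key);
            if (it == s.map.end()) return std::nullopt;
            return it->second;
        }
    };

    // Usage: ShardedMap<std::string, int> counts; counts.insert("hits", 1);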


Anyone studying Amdahl's law should also study the inverse, literally 1/(Amdahl's law), which leads to Gustafson's law.

It's really strange what a simple reciprocal does to the entire equation. Instead of trying to cut your time down and perform work faster, Gustafson's law holds time constant and measures the amount of work done.

Lo and behold: the parallel portion of the work grows the more work is available. That is to say, if you double the data to process, you can almost always double the parallelism in the same amount of time.

------------

So, you have some model (neural net?) that's difficult to parallelize? No problem, simple solution. Double, triple, quadruple the size of the model. Done. Suddenly more of it can be performed in parallel.

Shrinking time (i.e. Amdahl's law) is an exercise in futility. It turns out that a lot of people's work days are measured in hours, and going from 30 seconds of wait time to 10 seconds of wait time is not really a crazy benefit. Instead, going from "analyzed 2 GB of data in 30 seconds" to "analyzed 6 GB of data in 30 seconds" does lead to improvements.

> The use of std::mutex (especially in frequently accessed structures) will bring your application to a grinding halt. Better solutions abound with communicating sequential processes, or the Actor architecture.

This is a separate issue. Contention is O(n), where n is the number of contending threads. 1000 threads writing the same variable can only proceed one at a time, so it will take O(1000) time to perform all those reads/writes. It doesn't matter if you use mutexes, atomic compare-and-swap, or anything else. You can't "beat" the laws of physics: 1000 serialized updates require 1000x time.

You "solve" this by splitting up your variables and minimizing contention. Instead of 1000x writes to 1x variable, you can perform 1000x writes to 1000x variables, then read the 1000x variables in parallel one step later.


It will only bring your application to a grinding halt if it's on the critical path. Most data in most programs can be protected with a plain old mutex and everyone is happy with the performance. Even on the hot path you may get trivial scaling to 2 or 4 threads with a global mutex protecting critical data, as we see with memcached. Sharing doesn't start to offend Amdahl until you have many threads and the critical section is relatively costly compared to the rest of the program.


Worth calling out that the introduction of multiple locks can lead to deadlocks etc.

It would actually make for a nice library to have lock acquisition decoupled from lock scope in the caller. E.g. every call to lock-acquire passes some identifier, and the lock implementation decides what to do with that identifier (e.g. global lock, scoped lock, etc.).


I find that thread safety annotations in clang help prevent deadlocks when there’s a tricky section.


> The use of std::mutex (especially in frequently accessed structures) will bring your application to a grinding halt.

Not necessarily. It really depends on your problem domain and how you architected your solution.

For example, a basic producer-consumer can suffer if you implement it with a common non-thread safe task queue where all consumers and producers have to lock a mutex to push/pop. However, adopting a queue whose pushes and pops can be performed independently without any contention, and using work-stealing consumers that manage their own work queue are enough to eliminate contention to a residual level.

It all depends on your problem domain and how much time and effort and thought you're willing to invest in a problem.
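
A minimal sketch of the work-stealing shape described above, using a mutex per queue for brevity (WorkStealingPool is a made-up name; production versions use lock-free deques such as Chase-Lev and don't busy-wait):

    #include <atomic>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <thread>
    #include <vector>

    class WorkStealingPool {
    public:
        explicit WorkStealingPool(unsigned n) : queues_(n) {}

        // Producers: distribute tasks round-robin across per-worker queues,
        // so there is no single shared queue to fight over. (Call before run().)
        void submit(std::function<void()> task) {
            auto& q = queues_[next_++ % queues_.size()];
            std::lock_guard lock(q.m);
            q.tasks.push_back(std::move(task));
            ++remaining_;
        }

        // Consumers: drain their own queue first, steal from the others when dry.
        void run() {
            std::vector<std::thread> workers;
            for (unsigned id = 0; id < queues_.size(); ++id)
                workers.emplace_back([this, id] {
                    while (remaining_.load() > 0) {
                        std::function<void()> task;
                        if (pop(id, task) || steal(id, task)) {
                            task();
                            --remaining_;
                        } else {
                            std::this_thread::yield();
                        }
                    }
                });
            for (auto& w : workers) w.join();
        }

    private:
        struct Queue { std::mutex m; std::deque<std::function<void()>> tasks; };

        bool pop(unsigned id, std::function<void()>& out) {
            auto& q = queues_[id];
            std::lock_guard lock(q.m);
            if (q.tasks.empty()) return false;
            out = std::move(q.tasks.back());     // own queue: take the most recent task
            q.tasks.pop_back();
            return true;
        }

        bool steal(unsigned thief, std::function<void()>& out) {
            for (unsigned i = 1; i < queues_.size(); ++i) {
                auto& q = queues_[(thief + i) % queues_.size()];
                std::lock_guard lock(q.m);
                if (q.tasks.empty()) continue;
                out = std::move(q.tasks.front()); // steal from the opposite end
                q.tasks.pop_front();
                return true;
            }
            return false;
        }

        std::vector<Queue> queues_;
        std::atomic<int> remaining_{0};
        unsigned next_ = 0;
    };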


Agreed.

The problem with mutexes is that they're often hard to get right, since a mutex is a low-level primitive for serializing access to shared memory, and it does not compose (you cannot atomically lock two mutexes).

The CSP/Actor model simply takes the problem that the mutex solves at the low level - concurrency - and makes the whole world concurrent-by-default, so that mutexes are not needed.

Note the 1st Go proverb [0]:

    "Don't communicate by sharing memory, share memory by communicating."
[0] https://www.youtube.com/watch?v=PAAkCSZUG1c&t=168s


Sure you can compose mutexes: https://en.cppreference.com/w/cpp/thread/lock

In practice, if you have a global total ordering of mutexes, it's safe to lock any number of them provided you lock them in order.
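
A small sketch of both points, with Account as a made-up type (and assuming the two accounts are distinct objects):

    #include <mutex>

    struct Account {
        std::mutex m;
        long balance = 0;
    };

    // Option 1: let the standard library order the acquisitions (deadlock-avoidance algorithm).
    void transfer(Account& from, Account& to, long amount) {
        std::scoped_lock lock(from.m, to.m);   // locks both, never deadlocks
        from.balance -= amount;
        to.balance += amount;
    }

    // Option 2: impose your own global total order (here: by address).
    void transfer_ordered(Account& from, Account& to, long amount) {
        Account* first  = &from < &to ? &from : &to;    // std::less gives a guaranteed
        Account* second = &from < &to ? &to : &from;    // total order if you want to be strict
        std::lock_guard l1(first->m);
        std::lock_guard l2(second->m);
        from.balance -= amount;
        to.balance += amount;
    }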

CSP is interesting but it doesn't apply in a lot of cases, e.g. designing a lock striping mechanism over a database as described in the article would not be very fun with CSP or actors, as the article calls out at the bottom.


> In practice, if you have a global total ordering of mutexes, it's safe to lock any number of them provided you lock them in order.

Very interesting! Such a simple and elegant solution.


Amdahl's law applies regardless of whether you use shared memory synchronization, CSP or Actors. CSP and Actors are easier to think about, but they aren't inherently faster.


Yes.

What's often ignored is that the inter-CPU communication mutexes rely on is itself a limited resource. So heavy use of mutexes can slow down other, unrelated processes as well.


A very nice approach for read heavy structures is the Left-Right algorithm.

http://concurrencyfreaks.blogspot.com/2013/12/left-right-cla...

It does cost an extra instance, but it generalizes to any data structure and gives you wait-free reads.


It's a fine algorithm, but in left-right those atomic increments/decrements on read can be costly, up to some 200x slower than the read itself. The "wait until" steps in the writer can also be an issue if the atomic decrements in the reads are not signalling writer wakeups, in a pre-emptive scheduling environment.

The description doesn't mention the memory barriers that are also required, and which are also a little costly on some architectures.

This is still better than a mutex or reader-writer lock for performance. (But it has different read-after-write behaviour than a mutex, so they aren't directly comparable; sometimes you still need a mutex).

Where possible and optimising specifically for wait-free read performance, I would aim for RCU-style reads for speed.

RCU reads are as fast as normal reads, because they are atomic-free as well as wait-free, and on almost all architectures do not require barrier instructions either. However the writes are more expensive and sometimes a lot more complicated. It is possible to combine left-right and RCU into a hybrid algorithm that uses fixed memory with grace periods, to combine the no-allocation property with the fast atomic-free reads property.


Seems like it starts with some examples of what doesn't work well and then teases the reader with potential solutions, which don't have examples.


It does read like at some point the OP got tired of writing and just wrapped it up with a to-do list.


Ha, this is my article. Believe me when I say I had developed a truly remarkable solution to this problem which this blog post is too small to contain.

More believably, this is a very basic introduction to the problem. More complex solutions are possible but are very tied to the specific datastructure and access patterns. It is hard to generalize advice once the problems get more involved so I just declared them out-of-scope and only mentioned some ideas in passing.

Maybe I will write a part II sometime. The comments here are interesting.


Sounds like the famous "How to Draw an Owl" tutorial. ;)


I'm surprised they didn't try std::atomic_long or a compare-and-swap.
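
For reference, a minimal sketch of what that would look like for a counter-style case (the clamped update is a made-up example of when the CAS loop earns its keep):

    #include <atomic>

    std::atomic<long> counter{0};

    void increment() {
        counter.fetch_add(1, std::memory_order_relaxed);   // atomic, no lock at all
    }

    // A CAS loop is useful when the update is more than a plain add, e.g. clamping to a maximum.
    void add_clamped(long delta, long max) {
        long cur = counter.load(std::memory_order_relaxed);
        long next;
        do {
            next = cur + delta;
            if (next > max) next = max;
        } while (!counter.compare_exchange_weak(cur, next, std::memory_order_relaxed));
    }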


Why? It doesn't matter. That will only get you so far; as you scale, you'll still run into synchronization contention.

Maybe going from 16 to 64 threads you'll get some nice wins, but so what? What are you going to do when you need to double the number of threads again?


Not much in the way of new stuff here. Good use of mutexes is the norm. But I was thinking they'd show some novel atomic or lock-free algos.


The story is manageable as long as we use local operations, like reading/writing a value. Once we need iterators (read: "aggregate operations"), life becomes tough. In that case only a copy-on-write strategy seems to work, doesn't it?
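
A minimal sketch of that copy-on-write approach, assuming C++20's std::atomic<std::shared_ptr> (CowVector is a made-up name; concurrent writers would additionally need a compare_exchange loop or a writer-side mutex):

    #include <atomic>
    #include <memory>
    #include <vector>

    template <typename T>
    class CowVector {
        std::atomic<std::shared_ptr<const std::vector<T>>> data_{
            std::make_shared<const std::vector<T>>()};

    public:
        // Readers: grab an immutable snapshot; iterating it never races with writers.
        std::shared_ptr<const std::vector<T>> snapshot() const { return data_.load(); }

        // Writers: copy the current version, modify the copy, publish it atomically.
        void push_back(const T& value) {
            auto current = data_.load();
            auto next = std::make_shared<std::vector<T>>(*current);  // the "copy" in copy-on-write
            next->push_back(value);
            data_.store(std::move(next));
        }
    };

    // Usage: auto snap = cow.snapshot(); for (const auto& x : *snap) { /* ... */ }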


Transactional memory would presumably work...if someone bothered to implement it properly.



