
How is this done in Rust or C?


This is not a language problem. It’s an algorithm design problem. There’s no silver bullet. The basic principle is to divide the problem space into independent blocks of work. How to achieve that depends on the problem.


For a good example, there's a tricky parallelization problem in physical simulation: you have to update edge/triangle/bending-wing forces in a mesh structure without any race conditions. (This becomes especially thorny if you want to move your algorithm to the GPU.) A surprising solution is graph coloring, where you "color" each element so that no two elements that interfere with each other get the same color. Then you can safely parallelize the updates of all elements within each color group, since sharing a color guarantees absolutely no interference.


Algorithm design problems, when general enough, warrant being treated as language problems!

Every feature of programming languages started in this fashion.


Indeed, the holy grail (or perhaps the rapture) of programming languages is a compiler which generates the entire program with zero human-written code. I fear it may be on the horizon.


There are programming languages e.g. occam designed with the intention of exposing parallelism algebraically, but I wouldn't call them a silver bullet either.


They are great in theory, but suck in practice.

We don't have good optimising compilers for that.


For something small like a counter you can use thread_local (since C11). For substantial parallelized computation, designing the division of work to avoid shared writes typically entails a scheduling function that parcels up the work, sets up memory allocation in advance to avoid conflicts, and then hands an entirely unshared execution context to each thread's start function, most likely as a pointer to an app-specific struct (since that is what pthread_create allows for). A combining operation is then applied in the thread reaper loop to collate the results (extra brownie points accrue when the reduce function is written to vectorize).

The memory allocator plays a significant role, since the allocation strategy needs to be per-thread-/per-CPU-cache-aware. Choosing and then tuning a different malloc (e.g. tcmalloc, jemalloc) from the one in your platform's default library is a non-trivial matter, but it may have an enormous impact on both overall performance and memory demand.

In addition, when you design computation this way it is relatively easy to hadoopify it later, since it's basically map-reduce writ small.


MPI partitioning by rank? I’m curious what other solutions there may be.


In Rust, you can usually use the rayon library, which handles partitioning and scheduling for you.


That's just multi-threading in general.

I feel like the post I was responding to was talking about handling/pinning a thread to a specific CPU core?



