The point is specifically about parallel vs sequential programs. Legacy C code is sequential, and the C model makes parallel programming very difficult.
I met a guy back in college, a PhD who went to work at Intel, who told me the same thing. In theory, the future of general purpose computing was tons of small cores. In practice, Intel's customers just wanted existing C code to keep running exponentially faster.
> Legacy C code is sequential, and the C model makes parallel programming very difficult.
Neither of these statements is true, unless "Legacy" refers to the early days of UNIX.
Tasks that parallelize poorly do not benefit from many small cores. This is usually the result of either a problem that does not parallelize, or an implementation that does not parallelize (because of a poor design). Neither of these attributes is related to language choice.
An example of something that does not parallelize at all is AES-256-CBC encryption: each block's input depends on the previous block's ciphertext. It doesn't matter what your tool is: Erlang, Haskell, Go, Rust, even VHDL. It cannot be parallelized or pipelined. INFLATE has a similar issue.
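To make the dependency concrete, here's a minimal C sketch of CBC encryption (aes256_encrypt_block is a hypothetical stand-in for any single-block AES primitive, not a real library call): the loop is a strict chain, so iteration i cannot start before iteration i-1 has finished, no matter how many cores you throw at it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 16  /* AES block size in bytes */

/* Hypothetical single-block primitive; any AES implementation would do. */
void aes256_encrypt_block(const uint8_t key[32], const uint8_t in[BLOCK],
                          uint8_t out[BLOCK]);

/* CBC mode: ciphertext block i is E(plaintext_i XOR ciphertext_{i-1}).
 * Each iteration consumes the previous iteration's output, so a single
 * stream cannot be split across cores. */
void cbc_encrypt(const uint8_t key[32], const uint8_t iv[BLOCK],
                 const uint8_t *pt, uint8_t *ct, size_t nblocks)
{
    uint8_t chain[BLOCK];
    memcpy(chain, iv, BLOCK);
    for (size_t i = 0; i < nblocks; i++) {
        uint8_t tmp[BLOCK];
        for (size_t b = 0; b < BLOCK; b++)
            tmp[b] = pt[i * BLOCK + b] ^ chain[b];   /* depends on previous ciphertext */
        aes256_encrypt_block(key, tmp, ct + i * BLOCK);
        memcpy(chain, ct + i * BLOCK, BLOCK);        /* feeds the next iteration */
    }
}
```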
For such algorithms, the only way to increase throughput is to increase single-threaded performance. Adding cores increases total capacity, but cannot increase the throughput of a single stream. For other tasks, the synchronization cost of parallelization is too high. I work for a high-performance network equipment manufacturer (100Gb/s+), and we are certainly limited by sequential performance. We have custom hardware to load-balance data to different CPU sockets, as software-based load distribution would be several orders of magnitude too slow. The CPUs just can't access memory fast enough, and many slower cores wouldn't help, as they'd be individually slower and incur extra overhead.
Go and Erlang of course provide built-in language support for easy parallelism, while in C you need to pull in pthreads or a CSP library yourself, but the C model doesn't make parallel programming "very difficult", nor is C any more sequential by nature than Rust. It is also incorrect to assume that you can always parallelize your way to performance. In reality, "tons of small cores" is mostly good at increasing total capacity, not per-task throughput.
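For what it's worth, here's a minimal sketch of what that looks like in plain C with pthreads (the worker count and the work function are made up for illustration); spawning and joining threads is a few lines, not something the language fights you on.

```c
#include <pthread.h>
#include <stdio.h>

#define NWORKERS 4  /* arbitrary worker count for the sketch */

/* Placeholder for some independent chunk of work. */
static void *worker(void *arg)
{
    long id = (long)arg;
    printf("worker %ld running on its own thread\n", id);
    return NULL;
}

int main(void)
{
    pthread_t tid[NWORKERS];

    /* Spawn one thread per worker... */
    for (long i = 0; i < NWORKERS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);

    /* ...and wait for all of them to finish. */
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);

    return 0;
}
```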
I admit it's not fair to blame C in particular. The comparison is between how we write and execute software and how we could write and execute software, and the language absolutely comes into play, in addition to how the language is conventionally used. "Legacy" code in this context is code that was written in the past and is not going to be updated or rewritten.
I disagree that the tasks a computer performs either don't parallelize or carry synchronization costs that are too high. At a fine-grained level, our compilers vectorize (i.e. parallelize) our code -- within limits imposed by C's "fake low-levelness" as described in the article -- and then our processors exploit all the parallelism they can find in the instruction stream. At a coarser level, even if calculating a SHA (say) isn't parallelizable, running a git command computes many SHAs. The reasons why independent computations are not done on separate processors -- even automatically -- come down to programming language features (how easy it is to express or discover the independence, one way or another) and real or perceived performance overhead. Hardware can be designed so that synchronization overhead doesn't kill the benefits of parallelization. GPUs are a case in point.
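A hypothetical illustration of the fine-grained point: in C the compiler often has to assume that pointers may alias, which is one example of the independence information that gets lost, and `restrict` is how the programmer hands it back.

```c
#include <stddef.h>

/* Without restrict, the compiler must assume dst and src may overlap;
 * it either falls back to scalar code or inserts runtime overlap checks. */
void scale_may_alias(float *dst, const float *src, float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = k * src[i];
}

/* With restrict, the programmer promises the buffers are independent,
 * and compilers will typically emit straight SIMD code for the loop. */
void scale_no_alias(float *restrict dst, const float *restrict src,
                    float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = k * src[i];
}
```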
The world is going in the direction of N cores. We'll probably get something like a mash-up of a GPU and a modern CPU, eventually. If C had been overtaken by a less imperative, more data-flow-oriented language, such that everyone could recompile their code and take advantage of more cores, maybe these processors would have come sooner.
> "Legacy" code in this context is code that was written in the past and is not going to be updated or rewritten.
In that case, I would not say Legacy code is sequential. For the past few decades, SMP has been the target where sensible/possible.
> At a fine-grained level, our compilers vectorize (i.e. parallelize) our code.
Vectorization is a hardware optimization designed for a very specific use case: performing an instruction f N times over a buffer of N inputs, by replacing the N instantiations of f with a single SIMD instruction fN.
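For concreteness, here's roughly what that substitution looks like on x86 with AVX intrinsics (a sketch, assuming a CPU with AVX and n a multiple of 8): one vector add does the work of eight scalar adds, all inside a single core.

```c
#include <immintrin.h>
#include <stddef.h>

/* Scalar version: the instruction f (an add) runs once per element. */
void add_scalar(float *out, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Vectorized version: one AVX add (the "fN instance") handles
 * 8 floats per iteration. Assumes n is a multiple of 8 for brevity. */
void add_avx(float *out, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
}
```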
If this counts as parallelization, then an Intel Skylake processor is already a massively parallel unit, with each core already executing massively in parallel by having the micro-op scheduler distribute work across the available execution ports and units.
In reality, vectorization has very little to do with parallelization. Vectorization is much faster than parallelization (in many cases, parallelization would be slower than purely sequential execution), and in a world where all the silicon budget goes to parallelization, vector instructions would likely be killed in the process. You can't have both absurd core counts and fat cores. If you did, it would just be adding cores to a Skylake processor.
(GPUs have reduced feature sets compared to Skylake processors not because they don't want the features, but because they don't have room; they specialize to save space.)
> At a coarser level, even if calculating a SHA (say) isn't parallelizable, running a git command computes many SHAs.
And this is exactly why Git starts worker processes on all cores whenever it needs to do heavy lifting.
This has been the approach for the past few decades, which is why I twitch a bit at your use of "legacy" to mean "sequential": if a task can be parallelized across multiple cores (which is not a language issue), and the task is even remotely computationally expensive, then the developer parallelizes it to use all available resources.
However, if the task is simple and fast already, parallelization is unnecessary. Unused cores are not wasted cores on a multi-tasking machine. Quite the contrary. Parallelization has an overhead, and that overhead takes cycles from other tasks. If your target execution time is already met on slow processors in sequential operation, then remaining sequential is probably the best choice, even on massively parallel processors.
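In code, that trade-off usually shows up as a grain-size check; a sketch with hypothetical names and an arbitrary cutoff, where sum_parallel stands in for a pthreads fan-out like the one sketched earlier:

```c
#include <stddef.h>
#include <stdint.h>

#define PAR_THRESHOLD (1u << 20)  /* arbitrary cutoff for the sketch */

static uint64_t sum_sequential(const uint32_t *data, size_t n)
{
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += data[i];
    return s;
}

/* Placeholder for a multi-threaded version of the same reduction. */
uint64_t sum_parallel(const uint32_t *data, size_t n);

/* Only pay the thread start-up and synchronization cost when the input
 * is large enough that the parallel speed-up can cover it. */
uint64_t sum(const uint32_t *data, size_t n)
{
    return n < PAR_THRESHOLD ? sum_sequential(data, n)
                             : sum_parallel(data, n);
}
```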
Git has many commands in both of those buckets. Clone/fetch/push/gc are examples of "heavy tasks" which utilize all available resources. show-ref is obviously sequential. If a Git command that is currently sequential ends up taking noticeable time, and is a parallelizable problem (as in, computing a thousand independent SHAs), then it would be parallelized very quickly.
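As a rough sketch of that worker-process pattern (not Git's actual implementation; process_chunk and the worker count are placeholders), the structure is just fork-per-core and wait:

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NWORKERS 8  /* in practice, the number of online cores */

/* Placeholder: each worker grinds through its own slice of the objects. */
void process_chunk(int worker_id);

int main(void)
{
    /* One child process per core, each handling an independent chunk. */
    for (int i = 0; i < NWORKERS; i++) {
        pid_t pid = fork();
        if (pid == 0) {           /* child */
            process_chunk(i);
            _exit(0);
        }
    }
    /* Parent waits for every worker to finish. */
    while (wait(NULL) > 0)
        ;
    return 0;
}
```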
Unless something revolutionary happens in programming language development, it will always be an active decision to parallelize. Even Haskell requires explicit parallelization markers (such as par/pseq), despite being about as magical as programming can get ("magical" meaning "not even remotely describing CPU execution flow").
> Hardware can be designed so that synchronization overhead doesn't kill the benefits of parallelization. GPUs are a case in point.
I do not believe that this is true at all. GPUs do not combat synchronization overhead in the slightest: they lack the features a CPU uses for efficient synchronization (they cannot yield to other tasks or sleep, only spin), and they run at much lower clocks, which amplifies the inefficiency.
After reading some papers on GPU synchronization primitives (this one in particular: https://arxiv.org/pdf/1110.4623.pdf), it would appear that GPU synchronization is not only no better than CPU synchronization, but a total mess. At the time the paper was written, the normal approach to synchronization was hacks like terminating the kernel entirely to force global synchronization (extremely slow!) or just using spinlocks, which are far less efficient than what we do on CPUs. Even the methods proposed by that paper are in reality just spinlocks (the XF barrier is just a spinning volatile access, as GPUs cannot sleep or yield).
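The gap is easy to see in plain C11: a spinlock (which is essentially all those GPU barriers amount to) just burns cycles on an atomic flag, while a pthread mutex can put the waiting thread to sleep so the core does something else. A sketch:

```c
#include <stdatomic.h>
#include <pthread.h>

/* Spinlock: the only option available to the GPU barriers in the paper.
 * A waiting thread keeps executing, burning cycles and power. */
typedef struct { atomic_flag locked; } spinlock_t;

static void spin_lock(spinlock_t *l)
{
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire))
        ;  /* busy-wait: the core does nothing useful until the flag clears */
}

static void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

/* CPU-style lock: under contention, pthread_mutex_lock can block the
 * thread (e.g. via futex on Linux), letting the scheduler run other
 * work on that core instead of spinning. */
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;

static void cpu_style_critical_section(void)
{
    pthread_mutex_lock(&mu);
    /* ... update shared state ... */
    pthread_mutex_unlock(&mu);
}
```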
All of this effectively makes a GPU much worse at synchronizing than a CPU. So why are GPUs fast? Because the kinds of tasks GPUs were designed for do not involve synchronization. That is the best-case parallel programming scenario, and the scenario where GPUs shine.
I'd also argue that if GPUs had a trick up their sleeve for synchronizing cores, Intel would have added it to x86 CPUs in a heartbeat, at which point synchronization libraries and language constructs would be updated to use it where available. They don't hesitate to add new instruction sets, and the GPU paradigm is not actually all that different from a CPU's.
> The world is going in the direction of N cores. We'll probably get something like a mash-up of a GPU and a modern CPU.
It's the only option, due to physics. If physics didn't matter, I don't think anyone would mind having a single 100GHz core.
However, it won't be a "mash-up of a GPU and a modern CPU", simply because a GPU is not fundamentally different from a CPU. A GPU mostly just has a different silicon budget and a more graphics-oriented choice of execution units than a CPU, but the overall concept is the same.
> If C had been overtaken by a less imperative, more data-flow-oriented language, such that everyone could recompile their code and take advantage of more cores, maybe these processors would have come sooner.
A language that could automatically parallelize a task based on data-flow analysis (without incurring a massive overhead) would be cool. I don't know of any, though. It seems like a natural fit for something like Haskell or Prolog, but neither can do it.
However, tasks that would benefit from parallelization are already easy to tune to a different degree of parallelism, and parallelizing something that parallelizes poorly is not useful on any architecture.
Parallelization hasn't really been a problem for at least the last two decades, and I certainly can't see it as the limiting factor for building massively parallel CPUs. However, massively parallel CPUs are not magical, and many problems cannot benefit from them at all. It will almost always be a trade of individual task throughput for total task capacity.