Some of this seems focused on working around broken primitives and frameworks.
Concurrency primitives in Java will not go to the kernel unless a thread actually has to park, and if you use an executor service or similar, threads will not go to sleep while the task queue is non-empty. I suspect the pthread primitives are similarly careful to avoid coordination when none is needed.
I don't actually see how a task queue would unschedule the consumer for every element when the queue is non-empty. What would wake a thread that is blocked on a non-empty queue in the first place? I guess I am spoiled by having queues solved well in a standard library.
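A minimal sketch of what "solved well" looks like, using java.util.concurrent (the producer/consumer setup is a made-up toy, not anyone's production code): the consumer only parks when the queue is actually empty, and the producer's signal is cheap when nobody is parked.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class QueueDemo {
        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(1024);

            Thread consumer = new Thread(() -> {
                try {
                    while (true) {
                        // take() only parks this thread when the queue is empty;
                        // while elements are available it dequeues under a briefly
                        // held lock, with no scheduler involvement.
                        process(queue.take());
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            consumer.start();

            for (int i = 0; i < 100_000; i++) {
                queue.put(i); // only needs to wake the consumer if it has parked
            }
            consumer.interrupt();
        }

        static void process(int item) { /* stand-in for real work */ }
    }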
Context switching also comes up a lot as the sort of thing that is expensive. A context switch is only expensive if the task is small relative to the cost of the switch, and that cost isn't that large. My experience is that for small non-blocking tasks you can run one thread per core to expose the available parallelism, and everything else will tolerate context switching.
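Concretely, in Java the thread-per-core setup is one line; this toy sketch assumes the tasks really are small, CPU-bound, and non-blocking.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class PerCorePool {
        public static void main(String[] args) throws InterruptedException {
            // One worker per hardware thread exposes the available parallelism;
            // more than that just adds switching with nothing extra to run on.
            int cores = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(cores);

            for (int i = 0; i < 1_000_000; i++) {
                final int n = i;
                pool.submit(() -> cheapTask(n)); // small, non-blocking task
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
        }

        static long cheapTask(int n) {
            long h = n;
            for (int i = 0; i < 64; i++) h = h * 31 + i; // stand-in for real work
            return h;
        }
    }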
I am also perpetually hearing about the importance of hot caches and not switching threads. Caches are just tracking state for tasks, and unless you have actually done something to create locality between tasks, there is nothing to make them stay hotter anyway.
If the state the cache is tracking is multiple thread stacks, well... the CPU doesn't know the difference between data on a stack and data that it is chasing through some pointer.
The real problem is having a task migrate to a different CPU instead of waiting its turn in the right spot, and that can be solved in other ways (pinning threads with CPU affinity, for instance).
Access pattern matters as well. If you are going to sequentially process buffers, then prefetching will work and there is no benefit to a hot cache. That is where the emphasis on zero copy tends to show holes. Think about the difference in speed between RAM and a network or disk interface: a 10 GbE link tops out around 1.25 GB/s, while even a single memory channel streams tens of GB/s, and then think about how much processing you are going to do beyond just the copying.
My main beef is that pushing this kind of performance thinking without measurements showing where it does and doesn't matter encourages unproductive premature optimization.
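Measuring at least puts numbers on it. A deliberately crude sketch (a real benchmark should use JMH, and the sizes here are arbitrary) that prices the copy zero-copy designs are trying to avoid:

    public class CopyBench {
        public static void main(String[] args) {
            // Crude copy-throughput probe. 64 MiB source and destination.
            byte[] src = new byte[64 << 20];
            byte[] dst = new byte[64 << 20];

            // Warm up so the JIT compiles the path before we time it.
            for (int i = 0; i < 10; i++) System.arraycopy(src, 0, dst, 0, src.length);

            int iters = 100;
            long t0 = System.nanoTime();
            for (int i = 0; i < iters; i++) System.arraycopy(src, 0, dst, 0, src.length);
            double secs = (System.nanoTime() - t0) / 1e9;

            double gbPerSec = (double) src.length * iters / secs / 1e9;
            System.out.printf("copy throughput: %.1f GB/s%n", gbPerSec);
        }
    }

Compare whatever that prints against your NIC or disk bandwidth and the copy usually stops looking like the bottleneck.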
"Caches are just tracking state for tasks, and unless you have actually done something to create locality between tasks there is nothing to make them stay hotter anyways."
This is only true if your data is small enough that it never gets evicted. Otherwise, there are certainly things that can make them stay hotter without involving multiple cores: You can shrink your data, reorder your data, or reorder your traversal of your data.
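In Java terms, "shrink and reorder" often just means pulling the hot field out of an object graph into a dense primitive array. A toy sketch (the Item class and sizes are made up for illustration):

    import java.util.Random;

    public class Locality {
        // Hypothetical record: only 'value' is needed by the hot loop.
        static final class Item {
            long value;
            byte[] payload = new byte[64]; // cold data dragged along with it
            Item(long v) { value = v; }
        }

        public static void main(String[] args) {
            int n = 1 << 20;
            Random rnd = new Random(42);

            // Layout 1: array of references, each Item a separate heap object.
            Item[] items = new Item[n];
            for (int i = 0; i < n; i++) items[i] = new Item(rnd.nextLong());

            // Layout 2: the hot field extracted into a dense primitive array.
            long[] values = new long[n];
            for (int i = 0; i < n; i++) values[i] = items[i].value;

            long a = 0, b = 0;
            for (Item it : items) a += it.value;   // pointer chase per element
            for (long v : values) b += v;          // sequential, prefetch-friendly
            System.out.println(a == b);
        }
    }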
Makes me suspicious. See http://mechanical-sympathy.blogspot.com/2013/02/cpu-cache-fl...