
"I wrote two versions of each. One directly iterates over a []int, the other introduces a struct and uses a []*struct which it needs to extract the data from. The goal was to hopefully avoid benefits from cache coherence"

Can someone explain to me that last sentence? What's cache coherence?



Cache coherence is when data that you need close together in time is also physically close in memory, so loading the first piece of data pulls the second into cache and makes it quick to access. Fetching from CPU cache versus main memory differs by roughly a factor of 10, so this is a significant effect.

When you're writing performance critical code this is a good thing to have happen, and it can be worth a lot of effort to make sure that you benefit as often as possible. However, general purpose code usually does not wind up with data that is optimally organized in memory. But in a benchmark the usage patterns are simple enough that it is trivial to get well-organized data in memory without even intending to. Whether this is likely to be predictive of real world performance depends on how carefully coded the real world program is.
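
To make the two access patterns concrete, here's a minimal Go sketch (the names are made up for illustration, not taken from the article): summing a contiguous []int versus chasing pointers through a []*item.

    package main

    // item wraps a value behind a pointer; iterating a []*item forces a
    // pointer dereference at every step.
    type item struct{ v int }

    // sumInts walks a contiguous []int: each cache line loaded brings
    // the next several values in with it.
    func sumInts(xs []int) int {
        total := 0
        for _, x := range xs {
            total += x
        }
        return total
    }

    // sumPtrs walks a []*item: the slice of pointers is contiguous, but
    // each pointed-to struct lives wherever the allocator put it.
    func sumPtrs(ps []*item) int {
        total := 0
        for _, p := range ps {
            total += p.v
        }
        return total
    }

    func main() {
        println(sumInts([]int{1, 2, 3}), sumPtrs([]*item{{1}, {2}, {3}}))
    }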


Cache coherence can also be a problem for performance, leading to what is called false sharing.

When some value is read from RAM, the values in the same "cache line" (the ones right next to it) get cached along with it. However, if another thread modifies one of those values that just got cached, the cache line has to be invalidated. This often happens in producer/consumer implementations.

If this is discovered to be a problem, it's often solved with what is called cache padding. I.e. if you've got variables A and B that are right next to each other and they are modified from different threads, then you add dummy variables (padding) between them that are the size of a cache line (anywhere from 32 to 256 bytes, depending on the architecture).
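
A minimal Go sketch of that padding trick, assuming a 64-byte cache line (the field names are invented for illustration):

    package main

    import (
        "fmt"
        "unsafe"
    )

    // shared has two counters updated by different goroutines. Without the
    // padding they would sit on the same cache line, and every write by one
    // thread would invalidate the other's cached copy (false sharing).
    type shared struct {
        a uint64   // written by goroutine 1
        _ [56]byte // padding: pushes b onto the next 64-byte line
        b uint64   // written by goroutine 2
    }

    func main() {
        var s shared
        fmt.Println(unsafe.Offsetof(s.b)) // prints 64: b starts a new line
    }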


The author is actually a little bit confused here.

Cache coherence means that CPU cores with separate caches, all accessing the same RAM, maintain a coherent view of memory. It is usually implemented at the hardware level with cache coherence protocols (MESI etc.).

What the author wanted to say: the goal was to hopefully avoid cache misses. With []int the values are sequential in memory, so the probability is high that the next value is already loaded into the cache. With []*struct there is a sequential array of pointers which have to be dereferenced. Each dereference means a random access to a memory location that could be anywhere in RAM, which likely means more cache misses.

The author is correct to measure this, because cache miss patterns are hard to predict and this optimization might not be worth it.
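
A hedged sketch of how such a measurement might look with Go's testing package; the type and the sizes are assumptions, not the author's actual benchmark. Save it as a _test.go file and run it with "go test -bench=.".

    package cachebench

    import "testing"

    type item struct{ v int }

    const n = 1 << 20 // enough elements to overflow the cache

    // BenchmarkInts iterates a contiguous []int.
    func BenchmarkInts(b *testing.B) {
        xs := make([]int, n)
        for i := range xs {
            xs[i] = i
        }
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            total := 0
            for _, x := range xs {
                total += x
            }
            _ = total
        }
    }

    // BenchmarkPtrs iterates a []*item, dereferencing on every step.
    func BenchmarkPtrs(b *testing.B) {
        ps := make([]*item, n)
        for i := range ps {
            ps[i] = &item{v: i}
        }
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            total := 0
            for _, p := range ps {
                total += p.v
            }
            _ = total
        }
    }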


Ah, sorry, I was using it in the locality sense (as in, coherent access). Admittedly less common, but not uncommon, in algorithms; downright confusing if you're in an architecture frame of mind.


Testing performance by iterating over an array can be dangerous because all the data is conveniently grouped together such that the processor will have the next data it needs in cache once it's loaded the first item (most of the time; when it crosses cache lines it'll cause another miss, then a bunch more hits). This can lead to huge performance boosts, but rarely reflects how data is laid out in memory unless you're very lucky.

The change made was to still iterate over a slice (i.e. an array), but the array now contains pointers to structs, which are allocated separately. This should scatter them across memory such that each step of the iteration needs to chase a different pointer to a different chunk of memory. This is a benchmarking issue which has bitten me in the past, so I made at least a feeble attempt to account for it.

However, I actually don't know enough about Go's memory allocation to really make sure I could force the conditions I wanted, so it's still possible all the data was lined up nicely, but just spread a bit farther apart. (In fact, this may even be likely since the allocation approach was very simple. Trying to do a lot of allocations and deallocations and filling in the data over a longer period would probably do a better job of addressing this concern.)
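
For what it's worth, one way to try to force scattering might look like the sketch below. This is an assumption about allocator behavior; Go's size-class allocator may still group same-sized objects together.

    package main

    type item struct{ v int }

    // buildScattered interleaves throwaway allocations of varying sizes
    // between the structs we keep, hoping to spread them across the heap.
    // Go's GC does not move heap objects, so the items keep whatever
    // addresses they got even after the filler is collected.
    func buildScattered(n int) []*item {
        ps := make([]*item, n)
        filler := make([][]byte, 0, n)
        for i := range ps {
            ps[i] = &item{v: i}
            filler = append(filler, make([]byte, 16+i%4096)) // varied-size junk
        }
        return ps
    }

    func main() {
        _ = buildScattered(1 << 20)
    }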



