The key effect of Robin Hood hashing is just that it gives you confidence and/or hubris to push a table to 90% capacity, which greatly exacerbates the problem.
If you push a hash table which handles collisions by linear probing that hard, and any non-randomness appears, you can get long runs and are in trouble. I knew about Robin Hood hashing, but didn't know it was being used to justify 90% table density. Usually, above 50% full and it's time to double the size of the table.
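For concreteness, here is a minimal sketch of the Robin Hood rule over linear probing (toy code with a fixed power-of-two capacity and an identity "hash"; it is not the actual standard library implementation): on a collision, the incoming key steals the slot if the incumbent sits closer to its home slot, and the evicted key keeps probing.

    // Minimal sketch of Robin Hood insertion over linear probing (toy code,
    // not the actual std HashMap). Fixed power-of-two capacity, identity "hash".
    const CAP: usize = 16;

    fn home_slot(key: u64) -> usize {
        (key as usize) & (CAP - 1)
    }

    // How far `key` would sit from its home slot if stored at `slot`.
    fn probe_dist(key: u64, slot: usize) -> usize {
        (slot + CAP - home_slot(key)) & (CAP - 1)
    }

    fn insert(table: &mut [Option<u64>; CAP], mut key: u64) {
        let mut slot = home_slot(key);
        let mut dist = 0;
        loop {
            match table[slot] {
                None => {
                    table[slot] = Some(key);
                    return;
                }
                Some(existing) => {
                    let existing_dist = probe_dist(existing, slot);
                    // Robin Hood rule: if the incumbent is "richer" (closer to its
                    // home slot) than the key being inserted, evict it and continue
                    // probing with the evicted key instead.
                    if existing_dist < dist {
                        table[slot] = Some(key);
                        key = existing;
                        dist = existing_dist;
                    }
                }
            }
            slot = (slot + 1) & (CAP - 1);
            dist += 1;
        }
    }

    fn main() {
        let mut table: [Option<u64>; CAP] = [None; CAP];
        for k in [1u64, 17, 33, 2, 18] {
            insert(&mut table, k); // 1, 17 and 33 all share home slot 1
        }
        println!("{:?}", table);
    }

The effect is that probe lengths are more evenly distributed across entries, which is the usual argument for running the table at higher load factors.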
What does the Rust implementation do for deletions? The classic solution is to mark the deleted cell as available for reuse, but non-empty for the linear part of the search during lookup. If you do that, at some point you have to compact the table or recopy it or something. This is an argument for a hash table with linear lists for each hash slot. More memory allocations and cache misses, but better behavior as the table fills up.
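A hedged sketch of the tombstone scheme described above (again toy code, not a description of what Rust actually does): deletion leaves a marker so that lookups probe past it, while inserts may reuse the slot.

    // Sketch of tombstone deletion in a linear-probing table (toy code; not a
    // description of any particular standard library). A removed entry becomes a
    // Tombstone: lookups probe past it, inserts may reuse it.
    #[derive(Clone, Copy, Debug, PartialEq)]
    enum Slot {
        Empty,
        Tombstone,
        Occupied(u64),
    }

    const CAP: usize = 16;

    fn home_slot(key: u64) -> usize {
        (key as usize) & (CAP - 1) // identity "hash" keeps the example traceable
    }

    fn lookup(table: &[Slot; CAP], key: u64) -> Option<usize> {
        let mut slot = home_slot(key);
        for _ in 0..CAP {
            match table[slot] {
                Slot::Empty => return None, // a truly empty slot ends the probe
                Slot::Occupied(k) if k == key => return Some(slot),
                _ => slot = (slot + 1) & (CAP - 1), // probe past tombstones and other keys
            }
        }
        None
    }

    fn remove(table: &mut [Slot; CAP], key: u64) {
        if let Some(slot) = lookup(table, key) {
            // Mark, don't empty, so probe chains for other keys stay intact.
            table[slot] = Slot::Tombstone;
        }
    }

    fn insert(table: &mut [Slot; CAP], key: u64) {
        // Note: a real table would first check the key isn't already present
        // further along the chain before reusing a tombstone.
        let mut slot = home_slot(key);
        loop {
            match table[slot] {
                Slot::Empty | Slot::Tombstone => {
                    table[slot] = Slot::Occupied(key);
                    return;
                }
                _ => slot = (slot + 1) & (CAP - 1),
            }
        }
    }

    fn main() {
        let mut table = [Slot::Empty; CAP];
        insert(&mut table, 1);
        insert(&mut table, 17); // collides with 1, lands in slot 2
        remove(&mut table, 1);  // slot 1 becomes a tombstone; 17 stays reachable
        assert_eq!(lookup(&table, 17), Some(2));
        println!("{:?}", table);
    }

The cost is that tombstones accumulate and lengthen probe runs, which is why such tables eventually need to be compacted or rebuilt.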
(Remember the YC discussion in which some job interview question insisted that hash tables operated in constant (O(1)) time? Not quite. There's collision overhead and growth overhead.)
> If you push a hash table which handles collisions by linear probing that hard, and any non-randomness appears, you can get long runs and are in trouble. I knew about Robin Hood hashing, but didn't know it was being used to justify 90% table density. Usually, above 50% full and it's time to double the size of the table.
Wouldn't non-randomness be an issue for any hash table, regardless of how full it is? Doubling it will not make the input any more random; theoretically we could see the same slowness whether it's a smaller table that's 90% full or a bigger one "just" 45% full. If N things hash to the same location, we'll have to dig through O(N) slots until we find what we need.
I wrote a small hash set for my side project (https://github.com/jbakic/Shielded/blob/master/Shielded/Simp...; it does not need to support removal, so it's probably the dumbest linear-probing hash table one could imagine), and there I set the limit at 75%, even though I don't use Robin Hood (perhaps I should?), because the lib has to iterate through the whole set at the end of a transaction, at least twice. I know this is not very relevant, being a very special use case, but I'm giving it as an example of how a bigger load factor can actually be advantageous: the Shielded lib is overall slower if the load factor limit is set lower.
I hope someone more versed in the theory behind all this can add something; I'm curious whether my thinking really makes sense...
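On the question above: with a sufficiently non-random hash, probe runs grow with the number of colliding keys no matter how sparsely loaded the table is. A quick sketch against std's HashMap, using a deliberately constant hasher purely for illustration (the ConstantHasher type is made up for the demo; timings are machine-dependent):

    use std::collections::HashMap;
    use std::hash::{BuildHasherDefault, Hasher};
    use std::time::Instant;

    // Deliberately terrible hasher: every key hashes to 0, so all keys collide
    // no matter how large (or empty) the table is. For illustration only.
    #[derive(Default)]
    struct ConstantHasher;

    impl Hasher for ConstantHasher {
        fn finish(&self) -> u64 { 0 }
        fn write(&mut self, _bytes: &[u8]) {}
    }

    fn main() {
        for n in [1_000usize, 2_000, 4_000, 8_000] {
            // Capacity is 4x the element count, so the table stays sparsely loaded.
            let mut map: HashMap<u64, (), BuildHasherDefault<ConstantHasher>> =
                HashMap::with_capacity_and_hasher(4 * n, BuildHasherDefault::default());
            let start = Instant::now();
            for i in 0..n as u64 {
                // Every insert walks the same collision run: roughly O(n) work per
                // insert, O(n^2) in total, despite the low load factor.
                map.insert(i, ());
            }
            println!("n = {:>5}: {:?}", n, start.elapsed());
        }
    }

Doubling n should roughly quadruple the time, even though the load factor never gets anywhere near the limit.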
Quadratic complexity is bad, but when I saw the example code I automatically thought "that's definitely not a good way to make a copy". I'm not familiar with Rust, but I assume the class contains a special cloning method which does something more like a memcpy()? If so, and you want to make a slightly modified copy of a hash table, it's probably still faster to copy the whole thing in one go and then add/remove the desired elements than to insert each one and do the filtering element-wise.
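A sketch of that copy-then-modify idea in Rust (HashMap does implement Clone; the filtered-out key and the variable names are just placeholders):

    use std::collections::HashMap;

    fn main() {
        let original: HashMap<u64, u64> = (0..1_000_000).map(|i| (i, i * 2)).collect();

        // Rebuild element-wise (the shape of the article's example): iterate,
        // filter, and re-insert every surviving entry into a fresh map.
        let rebuilt: HashMap<u64, u64> = original
            .iter()
            .filter(|&(&k, _)| k != 42)
            .map(|(&k, &v)| (k, v))
            .collect();

        // Copy-then-modify: clone the whole map in one go, then drop the one
        // entry we don't want. Same result, no per-element rebuild.
        let mut copied = original.clone();
        copied.remove(&42);

        assert_eq!(rebuilt, copied);
    }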
> when I saw the example code I automatically thought "that's definitely not a good way to make a copy"
Sure but that's not really the point, it's just an easy way to trigger the issue. If you look at the original issue[0], the initial use case is merging two maps:
    let first_map: HashMap<u64, _> = (0..900000).map(|i| (i, ())).collect();
    let second_map: HashMap<u64, _> = (900000..1800000).map(|i| (i, ())).collect();
> The user wants to merge the hash maps, and does so naïvely,
    let mut merged = first_map;
    merged.extend(second_map);
> Time for merge when second_map is shuffled: 0.4s
> Time for merge when second_map is not shuffled: 40s (x100 amplification)
The example in TFA is simplified, boiled down to the barest version that exposes the issue, from a more real-world case of "was transforming one HashMap into another (with natural language words as keys), iterating over the keys of the first HashMap, inserting them in the second with different values."
> I'm not familiar with Rust, but I assume the class contains a special cloning method which does something more like a memcpy()?
You can't memcpy the hashmap since it could contain non-trivial keys or values (e.g. refcounted heap allocations)[1], but you could clone() it for a one-shot copy. As noted above, you're missing the forest for the trees, though: that's just an easy trigger case; the point is not cloning maps.
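For completeness, here is a self-contained version of the quoted reproduction. The roughly 100x figure was reported against the standard HashMap of that era, so a current toolchain may not show the blowup:

    use std::collections::HashMap;
    use std::time::Instant;

    fn main() {
        let first_map: HashMap<u64, ()> = (0..900_000).map(|i| (i, ())).collect();
        let second_map: HashMap<u64, ()> = (900_000..1_800_000).map(|i| (i, ())).collect();

        let mut merged = first_map;
        let start = Instant::now();
        // Extending with second_map's entries in its internal iteration order is
        // the pattern that triggered the pathological probe runs.
        merged.extend(second_map);
        println!("merge took {:?}, merged.len() = {}", start.elapsed(), merged.len());
    }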
> I'm not familiar with Rust, but I assume the class contains a special cloning method
Yes, the Clone trait is implemented for HashMap. These are simplified examples to show the problem. I posted the example on which the one in the blog was based [0]. In the actual setting where I encountered this I wasn't copying a Hash{Map,Set}, as one would obviously use Clone. I was constructing a new HashMap that had the same keys as the original HashMap, but different values (both in value and type). Since this is quite a common scenario, quadratic complexity is bad.
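A sketch of that scenario, with hypothetical word-count data standing in for the real keys and values:

    use std::collections::HashMap;

    fn main() {
        // Hypothetical stand-in for "natural language words as keys".
        let counts: HashMap<String, u64> = [("the", 120), ("cat", 7), ("sat", 3)]
            .into_iter()
            .map(|(w, n)| (w.to_string(), n))
            .collect();

        // New map with the same keys but values of a different type: iterating one
        // map's keys and inserting them into another of similar size is exactly the
        // pattern that hit the quadratic behavior.
        let total: u64 = counts.values().sum();
        let frequencies: HashMap<String, f64> = counts
            .iter()
            .map(|(word, &n)| (word.clone(), n as f64 / total as f64))
            .collect();

        println!("{:?}", frequencies);
    }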
Any algorithm you pick is going to have edge cases, and the C++ STL only defines the interfaces for interacting with its containers. This means that there is no "typical" implementation. Do you actually know how the implementation you're currently using is written?
I might be wrong, but I thought the STL design docs also had restrictions on algorithm speed? As in: you are free to choose any design you want, as long as it has a given API and O(1) access.
It's been a while, so my memory could be playing tricks on me. However, I believe the specified std::unordered_map iterator invalidation behavior rules out efficient use of open addressing unless someone comes up with some timsort-level novel implementation ideas.
I know some of the core algorithms used by GNU's implementation (libstdc++). But what the Rust team could do is compare the performance of different implementations, and pick the best one for each data structure.
... which they have done, for some definition of "best".
They've chosen an implementation with good cache locality and lower memory overhead (and presumably better average-case speed) than the chaining implementation usually used for std::unordered_map. The downside is that its worst-case speed is pretty bad.
On the other hand, I believe they use SipHash by default, which means it's very difficult to even intentionally trigger worst-case behavior without being able to inspect the process's memory for the random SipHash key.
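A small illustration of the randomized default: the iteration order below typically differs between runs of the program, because each process gets its own random SipHash keys via RandomState.

    use std::collections::HashMap;

    fn main() {
        let map: HashMap<&str, u32> =
            [("a", 1), ("b", 2), ("c", 3), ("d", 4)].into_iter().collect();

        // With the default RandomState/SipHash, the hash keys are randomized per
        // process, so this iteration order typically changes from run to run and
        // an attacker can't precompute colliding keys without that key material.
        for (k, v) in &map {
            println!("{} = {}", k, v);
        }
    }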
std::unordered_map isn't particularly good. Its strength is that it can handle nearly all workloads without having abysmal performance. The price paid for that is that it's never actually all that good. If you know your workload, a "specialised" hash table can easily be 2x faster (e.g. Google's dense_hash_map).
This is because the standard makes some very specific complexity guarantees on operations and iterator validity. This practically forces standard-compliant implementations to use hashing with chaining, and not in the most efficient way. That's why it's universally bad and not just a bad implementation in any particular standard library. I don't remember the exact details, though.
But we are talking about a standard hash table here. I.e., one that should have consistent performance for general problems, and no surprising behavior whatsoever.
Imho, if the user wants better performance for special use cases, then they can search for better algorithms.
Despite the downvotes, I think you're absolutely right. "No surprises, okay performance" is what a standard library data structure or algorithm should strive for. Sometimes that works out really well for both criteria (libstdc++'s std::sort), but other times performance suffers as "no surprises" is more important (std::unordered_map).