
Sort of off topic: In theory hash tables are supposed to be O(1) (under the RAM model). But looking at these graphs, they seem to grow linearly after they stop fitting in the cache, which is especially apparent on the unsuccessful lookup graph. Since it's on a log-scale graph, O(log(n)) would probably fit better.

Just thought it's an interesting divergence between what's taught in school and in practice. I wonder if there's something that's more predictive than big Oh for these types of analysis.



The biggest divergence between what's taught in school and what happens in practice is that the Big O bounds you saw in school use a different abstract machine than the machines we actually run on today. Specifically, I'm talking about the assumption that random memory access is O(1), where in reality modern machines' cache hierarchy makes it behave more like O(log n) or O(sqrt n) or something to that effect.


Is that really true, though? It's certainly the case that a cache miss takes longer than a cache hit, but the time for a cache miss doesn't depend on how much memory you have. Or does it? Like, a cache miss isn't going to be slower if you have 64 gigabytes of memory compared to having 1 gigabyte of memory, it's just a cache miss. That makes random memory access O(1) in the sense of complexity analysis.


There are typically L1 cache, L2 cache, L3 cache, RAM, and swap. These range in size from KB to MB to GB, and each level in that list is smaller and faster than the ones that come after it.

Most of the relevant memory here is L2 and L3. As you have more items in memory, the statistical likelihood that the page you're requesting will be in an efficient cache level decreases. Eventually you have so many items that you're statistically only getting the performance of RAM (let's ignore the swap case)... that's the point where you'll get closer to a flat `O(1)`... but it will be a lot slower than the `O(1)` with more cache hits.
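
To make that concrete, here's a rough, unscientific sketch (the array sizes, iteration count, and use of java.util.Random are arbitrary choices, and the RNG itself adds overhead, so treat the numbers as relative) that times random reads as the working set outgrows successive cache levels:

```java
import java.util.Random;

// Rough sketch: time random reads over working sets of increasing size.
// Per-access cost grows as the array outgrows L1, L2 and L3 and spills into
// RAM, even though each read is a single "O(1)" operation in the RAM model.
public class RandomAccessLatency {
    public static void main(String[] args) {
        Random rng = new Random(42);
        int iters = 10_000_000;
        for (int size = 1 << 12; size <= 1 << 24; size <<= 2) {
            long[] data = new long[size];     // 8 * size bytes
            int mask = size - 1;              // size is a power of two
            long sink = 0;
            long start = System.nanoTime();
            for (int i = 0; i < iters; i++) {
                sink += data[rng.nextInt() & mask];
            }
            long elapsed = System.nanoTime() - start;
            System.out.printf("%,12d longs (%,d KB): %.2f ns/access (sink=%d)%n",
                    size, size * 8L / 1024, (double) elapsed / iters, sink);
        }
    }
}
```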


All true. To add to this, there are all sorts of other caveats on modern processors, e.g. it may be slower to access any of L1-L3 or RAM depending on which processor core you're on.

This is because even though your program can transparently access all that memory, the underlying machine is implemented in terms of memory areas and it can be costly to "jump across" a memory boundary[1][2][3].

1. https://linux.die.net/man/8/numactl

2. https://linux.die.net/man/1/taskset

3. http://iopscience.iop.org/article/10.1088/1742-6596/664/9/09...


More RAM doesn't mean more cache misses, but more elements in your data set does — "n" isn't the amount of memory in your system, it has the same meaning here as it does everywhere else in complexity analysis.

Small datasets fit in L3, smaller datasets fit in L2, and if they're smaller still they'll fit in L1, all of which are progressively faster. If your data set is bigger than L3, you need to hit RAM, and bigger still you'll need to hit the SSD or HDD — going progressively slower.


I mean, yeah, obviously, I'm not disputing that, but the point is that it's still "constant time", it doesn't scale with the amount of memory.

My point wasn't that it doesn't get slower if you have larger datasets: of course there's a change when you go from cache-hits to cache-misses. My point was that trying to say "it has higher complexity than O(1)" is an inaccurate way of describing that slow-down. "O(1) random memory access" doesn't mean "fast memory access", it means "memory access where the time is bounded by a constant".


That's because people use big-oh notation when they really mean something closer to big-theta and are looking for a tighter average-case bound (e.g. any graph where you benchmark 'average' case data).

E.g. for almost any commonly used algo you could say it is O(e^n) and you would be technically correct as far as what big-oh demands.

O(1) is indeed an upper bound on memory access, but in the presence of cache it is not the tightest possible bound, hence one reason for divergence against numerical estimates given by rough big-oh bounds.


> O(1) is indeed an upper bound on memory access, but in the presence of cache it is not the tightest possible bound, hence one reason for divergence against numerical estimates given by rough big-oh bounds.

Quite the opposite — O(1) is in fact too tight.


We're really talking about a system with finite memory, and taking O(1) to be equal to log(n) when n is such that the data entirely fills the memory. Under these conditions, O(log(n)) gives a tighter and more accurate bound, but at the expense of not being strictly correct (i.e., it is not truly an asymptotic bound.)


It all depends on what you want your constant to be.

If your constant is "the worst-case time it takes to swap in the memory from my disk", and your dataset is guaranteed to fit in local disk then memory access is in fact O(1). With a really bad constant. But in practice people care about how much better than that you can do...

Now technically the "guaranteed to fit in local disk" assumption means you're really not considering asymptotic behavior. But then the question is what one really means by asymptotic behavior in this case. What do you consider your "memory access" latency to be once your dataset is too big for the sum total of all existing storage media?


Yes. Here is a four-part series exploring RAM models: http://www.ilikebigbits.com/blog/2014/4/21/the-myth-of-ram-p...


You should see it from the other side: a cache miss takes longer the more data you have to fit in your caches (as you go through bigger and slower cache levels).

So, as your N (length of data) changes, access time changes too.


You just need to use a machine model that takes into account the cost of address translation and the TLB. Check out https://arxiv.org/abs/1212.0703 for some introduction to the topic.


You should think of memory as fixed-length contiguous blocks of different sizes.

Inside the processor there are cache lines (usually 64 bytes). They are blocks of memory that are tagged by the CPU at once and moved together.

In the typical n-way set-associative cache architecture, the main RAM is divided into blocks, and each set in the cache can hold up to n lines from the same memory block. An 8-way cache would divide 1 GB of RAM into 1,024 blocks of 1 MB each. If you work with more than 512 bytes (= 8 x 64) at a time within that block, there will be cache misses. In other words, CPU caches have a limited number of cache lines dedicated to any large contiguous block of RAM (unless they are fully associative caches).
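
For reference, a minimal sketch of the usual address-to-set mapping, assuming a 32 KB, 8-way cache with 64-byte lines (typical L1d figures, but not universal; check your own CPU):

```java
// Sketch: which set does an address map to in a set-associative cache?
public class CacheSets {
    static final int LINE_BYTES = 64;
    static final int WAYS = 8;
    static final int CACHE_BYTES = 32 * 1024;
    static final int SETS = CACHE_BYTES / (LINE_BYTES * WAYS); // 64 sets

    static int setIndex(long address) {
        return (int) ((address / LINE_BYTES) % SETS);
    }

    public static void main(String[] args) {
        // Addresses 4096 bytes apart (LINE_BYTES * SETS) map to the same set,
        // so touching more than WAYS of them forces evictions even though the
        // cache as a whole is nowhere near full.
        for (int i = 0; i < 10; i++) {
            long addr = (long) i * LINE_BYTES * SETS;
            System.out.println("address " + addr + " -> set " + setIndex(addr));
        }
    }
}
```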

Between the CPU and DRAM there are typically 64-byte and 4096-byte regions with range-crossing penalties. I think the 64-byte crossing penalty is typically less than 10 cycles, and the 4096-byte region crossing penalty is several tens of cycles (this is on top of the penalty of accessing DRAM).


Interestingly, in Java 8 they changed HashMap buckets to hold a binary tree instead of a linked list, changing worst-case lookup within a bucket from O(n) to O(log n).

https://stackoverflow.com/questions/24673567/change-to-hashm...


To be more accurate, binary trees are constructed only after a bucket ends up containing more than a threshold number of elements, which should only happen rarely unless a bad hash function is chosen.


That's defending against hash collisions which isn't what the grandparent is noticing (the slowing down of memory accesses due to not fitting into cache).


I think they implemented that change to deal with a denial-of-service attack on HTTP servers. A malicious client could send a large list of parameters that all share an identical String::hashCode() value, resulting in a single large bucket. Other workarounds included a size limit on the parameter list or a hash function with some parameters randomized at program start.
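
A small sketch of why those collisions are so cheap to produce: "Aa" and "BB" happen to share a String hashCode, and that property survives concatenation, so colliding keys can be mass-produced. (This only illustrates the ingredient of the attack, not the actual exploit payloads from those reports.)

```java
import java.util.ArrayList;
import java.util.List;

public class CollidingKeys {
    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        keys.add("");
        for (int round = 0; round < 4; round++) {      // 2^4 = 16 keys
            List<String> next = new ArrayList<>();
            for (String k : keys) {
                next.add(k + "Aa");                    // "Aa".hashCode() == 2112
                next.add(k + "BB");                    // "BB".hashCode() == 2112
            }
            keys = next;
        }
        for (String k : keys) {
            System.out.println(k + " -> " + k.hashCode());  // all identical
        }
    }
}
```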


FYI this only happens if the keys implement Comparable; if they do not, they will not be turned into binary trees.


Big O is about doing the big picture high level analysis without wasting time on every little detail like the complexity of addition and multiplication. High performance implementation on the other hand is all about twiddling the constant factors in your algorithm(s) for your specific hardware architecture/topology.

See:

http://kaldewey.com/pubs/FAST__SIGMOD10.pdf


The parent's claim implies that on a machine with an abstract cache, the complexity of hashmaps could be proven to be something else. You may be able to recover the right behavior by modeling a machine with infinite memory, divided up into an infinite tower of caches where each level is P times bigger and Q times slower.


You definitely have to take the time complexity of addition/multiplication into account when doing big O analysis. But in most cases they are O(1). It is the constant factors you don’t worry about.


Yeah, but the same could be said about memory accesses. You can always pessimize and assume every access misses the LLC, at which point it becomes a constant factor.


Yes you also have to take memory access into account if you are doing a proper big O analysis, but not for the reasons you implied. I don’t think you are thinking about time complexity correctly. The question is not how long does a memory access take, but how does it scale as the number of bits to be fetched increases. The naive answer for a Von Neumann arch is O(N) where N is the number of bits, although the answers here regarding the speed of light bounds are interesting.

For addition and multiplication the current best time complexity is some long term with logs and other terms I can’t remember, but it is not constant.

However, in most cases such as sorting involving memory accesses or mult/addition we are using a fixed bit size, so we can correctly think of them as being O(1).
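
A small sketch of that last point (sizes are arbitrary and single-shot timings are noisy): fixed-width arithmetic is effectively constant-time, while arbitrary-precision multiplication visibly slows down as the operands grow.

```java
import java.math.BigInteger;
import java.util.Random;

public class MulScaling {
    public static void main(String[] args) {
        Random rng = new Random(1);
        for (int bits = 1 << 12; bits <= 1 << 20; bits <<= 2) {
            BigInteger a = new BigInteger(bits, rng);  // random operand of up to `bits` bits
            BigInteger b = new BigInteger(bits, rng);
            long start = System.nanoTime();
            BigInteger c = a.multiply(b);
            long elapsed = System.nanoTime() - start;
            System.out.printf("%,9d-bit operands: %,d ns (result has %d bits)%n",
                    bits, elapsed, c.bitLength());
        }
    }
}
```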


> divergence between what's taught in school and in practice

I remember this sort of thing being mentioned, but not in as much detail as I would have gone into (though I'd taught myself a lot of this before uni, so for me it was more an exercise in unlearning mistakes and formalising & rounding off knowledge than taking in new concepts; perhaps for others, cramming the extra detail in would have been counterproductive). Algorithm courses would say things like "of course caching methods should be taken into consideration, you'll cover this in architecture modules" and architecture modules would say "you'll cover the effect on some algorithms in much more detail in your software engineering modules"!

All the JS sorting demos that were an active fad a short while ago were simplified in the same way: they assumed that either a comparison or a swap was always the most expensive operation and the other could be discounted, or that all swaps/comparisons were equal due to uniform memory speed (no accounting for caching), and no consideration was made for what was being sorted (inserts are much more expensive in an array than in a linked list, for instance). I keep meaning to find the time to put a less superficial demo together that allows such things to be modelled at a basic level, perhaps even trying to illustrate the effect of a more distributed architecture (where there is more than one processing unit involved and everything isn't in local fast memory).

Of course, for the most part these simplifications are perfectly valid, but they should always carry a caveat noting what simplifying assumptions are being applied, so that people learning know the issues can be more complicated and some critical thinking should be applied in any real-world situation.


> Just thought it's an interesting divergence between what's taught in school and in practice.

Not really, no. This isn't a matter of complexity theory being less useful than promised, it's just a matter of understanding what it tells you.

In complexity theoretic terms, platform-specific micro-optimisations are constant factors. They can have a real impact, but they're only worth worrying about after you've got good complexity.

Bubble-sort in fine-tuned SIMD assembly code will out-perform bubble-sort in Python, but will be far slower than a Python quicksort (except for small inputs).
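
A rough sketch of that crossover, using a same-language comparison (plain Java, a naive bubble sort vs the library sort) rather than the cross-language one described above; sizes are arbitrary:

```java
import java.util.Arrays;
import java.util.Random;

public class SortGap {
    // O(n^2) bubble sort, deliberately naive.
    static void bubbleSort(int[] a) {
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j + 1 < a.length - i; j++)
                if (a[j] > a[j + 1]) { int t = a[j]; a[j] = a[j + 1]; a[j + 1] = t; }
    }

    public static void main(String[] args) {
        Random rng = new Random(7);
        for (int n = 1_000; n <= 64_000; n *= 4) {
            int[] a = rng.ints(n).toArray();
            int[] b = a.clone();
            long t0 = System.nanoTime();
            bubbleSort(a);
            long t1 = System.nanoTime();
            Arrays.sort(b);                    // O(n log n) library sort
            long t2 = System.nanoTime();
            System.out.printf("n=%,7d  bubble: %,12d ns   Arrays.sort: %,12d ns%n",
                    n, t1 - t0, t2 - t1);
        }
    }
}
```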

Notice that we're discussing which hash-map to use, as opposed to which data-structure to use in the first place. Their basic time-complexities are the same, so we have the luxury of worrying about the constant factors.

Linear-scan dictionaries can't compete here, no matter how well micro-optimised. (Again, except for very small inputs.) Complexity theory tells us why.

Additionally, complexity theory is useful when reasoning about large problems (high values of n), but it doesn't tell you how to do the platform-specific micro-optimisations that are important when attacking small problems. Implementation details will dominate there, rather than the number of algorithmic operations.

Anyone who teaches complexity theory without explaining these points has done their students a disservice.

> I wonder if there's something that's more predictive than big Oh for these types of analysis.

When it comes to real-world performance, there's no substitute for just measuring real-world performance. Detailed analysis of things like cache behaviour, and behaviour under highly concurrent access, could be neat though.

Edit: For completeness, I should mention that in extreme cases there are indeed algorithms with superior complexity which are in practice completely useless, as the break-even point is so awfully high that it is never reached. Matrix multiplication, for instance, where the algorithms with the best time-complexity are quite literally never used. https://en.wikipedia.org/w/index.php?title=Computational_com...

Perhaps that contradicts my whole point :-P

The other limitation of complexity theory is that in practice we often care about average-case performance, whereas theorists tend to care more about worst-case performance.


The increasing latency [1] of access to larger amounts of storage isn't a constant factor, though, even asymptotically. There was a fun post on HN a while ago arguing that random access latency to N bytes is asymptotically O(N^0.5), from both empirical data about real systems and theoretically from the Bekenstein bound and the speed of light. And even if you don't buy that, it clearly can't be less than O(N^(1/3)). Don't go preaching this in job interviews, though :-)

The truth is that complexity theory is always done relative to some abstract machine model, and as with all models it's up to you to determine if the model is a good fit for your practical reality. "To the extent that mathematics reflects reality it is not certain, and to the extent that it is certain it does not reflect reality." There exist abstract models, like the cache-oblivious model, that do a somewhat better job of grappling with this issue but don't require you to just throw up your hands and rely solely on benchmarks of real hardware. There is probably room for new theory in this area.

[1] not throughput, which isn't as hard to scale up


The article in question: https://news.ycombinator.com/item?id=12383012

It's fun to think about, but the presentation played fast and loose with a few concepts. Sparked some lively discussion because of that. I really like the idea of needing to count random memory accesses as costing more than sequential accesses when analyzing algorithms in a big-O-type notation.


Back when HDDs were the norm, sequential accesses were much, much faster than random accesses.

So people developed algorithms that maximised locality of reference, like B-trees. An interesting line of work in this vein is that of cache-oblivious algorithms :)


> even if you don't buy that it clearly can't be less than O(N^1/3)

Maybe I'm missing something, but why's that?

To the rest of your comment: all good points.


Well, if with your best technology you can pack B bits of information in 1 m^3 of space, and you need to store N bits of information, you are going to need (N/B) m^3 of space to store it. And then the speed of light limits you to around (N/B)^(1/3)/c seconds of latency to get to the farthest bits.

(This ultimately violates the Bekenstein bound - as your server cube grows it will eventually collapse into a black hole - but this is not really a concern with foreseeable technology! More practically, though, Earth's gravity makes it hard to build very high in the air, which limits you to O(N^1/2), and things like cooling and maintenance access probably impose a limit somewhere in between.)
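
The same back-of-the-envelope argument as a few lines of code; B here is a made-up storage density purely for illustration, not a real technology figure:

```java
public class LightLatencyBound {
    static final double C = 299_792_458.0;   // speed of light, m/s
    static final double B = 1e21;            // assumed bits per m^3 (purely illustrative)

    // Lower bound on latency to the farthest bit of an N-bit store packed at density B.
    static double latencySeconds(double nBits) {
        double sideMetres = Math.cbrt(nBits / B);
        return sideMetres / C;
    }

    public static void main(String[] args) {
        for (double n = 1e12; n <= 1e24; n *= 1e3) {
            System.out.printf("N = %.0e bits -> latency lower bound ~ %.3e s%n",
                    n, latencySeconds(n));
        }
    }
}
```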


Makes sense, thanks.


Imagine your bits stored in a sphere around your CPU. Take into account the time light needs to reach the bit you want to touch.


The performance gap between cache-hot lookups and lookups that miss can be 10x. The increase is more a reflection of that than of doing more operations.

The RAM model doesn't concern itself with memory hierarchies or slow ops (i.e. divisions are assumed to be as fast as additions).

For example, if you were benchmarking cuckoo hashing lookups, the results would look similar even though the number of operations is deterministically O(1) rather than expected O(1).


It's not a log-scale graph for the things you talk about: the Y-axes are linear scale. So for an O(1) algorithm, time has an upper bound and plateaus once caches stop being helpful and every operation has to go all the way out to RAM.


I really liked this series: http://www.ilikebigbits.com/blog/2014/4/21/the-myth-of-ram-p...

While, as others have said, big O and actual execution time are different things, in practice the article makes a lot of sense when evaluating execution.


If we were talking about the worst lookups, then even in the RAM model they would become slower as the map grows.

To be more specific, if you place n balls into n bins, on average each bin will contain 1 ball, but the largest one, with high probability, will contain O(log(n)/log(log(n))) of them.
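
A quick simulation of that claim (sizes and seed are arbitrary):

```java
import java.util.Random;

public class BallsIntoBins {
    public static void main(String[] args) {
        Random rng = new Random(123);
        for (int n = 1_000; n <= 10_000_000; n *= 10) {
            int[] bins = new int[n];
            int max = 0;
            for (int i = 0; i < n; i++) {
                int load = ++bins[rng.nextInt(n)];   // throw one ball into a random bin
                if (load > max) max = load;
            }
            double predicted = Math.log(n) / Math.log(Math.log(n));
            System.out.printf("n=%,10d  fullest bin=%d  log n / log log n ~ %.1f%n",
                    n, max, predicted);
        }
    }
}
```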


can any hash table whose hash function operates over the entire string ever show O(1) time?


The O(1) amortized access of a hash table is for the number of entries, not key hashing or comparison. Doubling the number of entries does not double the retrieval time.

As a hash table gets larger, the cost of key hashing (for the same domain of keys) does not increase. Of course hashing and comparison performance can still be important, it's just not what is generally being analyzed with basic complexity theory.
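
A minimal sketch separating those two costs; the key lengths and iteration counts are arbitrary, and the hash below just mirrors String.hashCode's polynomial over a char array so result caching doesn't hide the work:

```java
import java.util.Arrays;

public class KeyHashCost {
    // Same polynomial as String.hashCode, but over a char array so nothing is cached.
    static int polyHash(char[] key) {
        int h = 0;
        for (char c : key) h = 31 * h + c;
        return h;
    }

    public static void main(String[] args) {
        int iters = 100_000;
        for (int len = 16; len <= 16_384; len *= 4) {
            char[] key = new char[len];
            Arrays.fill(key, 'x');
            long sink = 0;
            long start = System.nanoTime();
            for (int i = 0; i < iters; i++) sink += polyHash(key);
            long elapsed = System.nanoTime() - start;
            // Hashing cost scales with key length, independent of any table's size.
            System.out.printf("key length %,6d: %.1f ns/hash (sink=%d)%n",
                    len, (double) elapsed / iters, sink);
        }
    }
}
```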


I guess I get the idea that the O(1) is for the number of entries, but in my experience 99% of prod time is spent hashing keys on amortized-O(1) systems, so focusing on the basic complexity theory would seem to leave a lot of performance on the table and also contributes to confusion.


> can any hash table whose hash function operates over the entire string ever show O(1) time?

If we limit the length of the hash key, the time taken by the hash function is also limited and turns into a constant. Big-O is all about asymptotic behavior.

OTOH noticing how the hash function's run time depends on the size of the key can be separately useful. I suspect all such functions are linear in practice, though.


I'm talking about functions where the entire string is used; that's necessary on the internet when dealing with billions of strings that are similar to each other.


One thing to think about is that accessing a byte of memory in a system with n bytes is pretty much guaranteed to be at best a log(n) operation. After all, you need a log(n)-sized address to index the memory, and you need to read every single one of the bits in that address to index the memory.

In practice on any particular machine all the addresses are the same length (and we have finite memory), so it doesn't quite explain the effect you are observing in the article. But it does hint that it makes sense.


I thought a hash table has O(n) access.


Average O(1), worst case O(n)

So worse than a B-tree or a Red-black tree for the worst case.


"Average O(1)" is an oxymoron. Big Oh is about the worst case. The only thing that makes sense is "Amortized O(1)", but I believe its possible to create access cases where hash tables are slower.


Only some primitive hash table algorithms have O(n) worst case access. No one implements those algorithms anymore though.


Big-O specifically refers to the worst case, though in practice people use it to refer to either average-case or worst-case scaling.



