Calendar Queues: A Fast O(1) Priority Queue Implementation (1988)

bob1029 · 2024-08-30T07:56:28 1725004588

I spent a solid few days chasing this exact damn rabbit.

I thought I could beat the PQ implementation in .NET with something like this but I never even got close.

I think my use case breaks the assumptions in this paper due to the volatility of the distribution over time.

Edit: For reference, this is the approach taken by .NET - https://en.m.wikipedia.org/wiki/D-ary_heap

neonsunset · 2024-08-30T12:57:43 1725022663

If you would like to contribute, there might be a better optimization opportunity in the current bounded Channel<T> implementation: https://github.com/dotnet/runtime/discussions/104791#discuss...

nbingham · 2024-09-05T14:23:29 1725546209

I tested two C++ implementations of Calendar Queues again the standard library priority_queue. The first implementation uses a deque as a backing container and then fills the calendar with linked lists. The second implementation just uses vectors in the calendar with no backing container.

The calendar queues have an interesting failure mode. If you are randomly inserting elements with a priority that is less than "now", and then pop some number of elements, and doing this repeatedly, then the front of the calendar empties out. As a result, the random insertions create a single value in the front of the calendar, then there are many many days that are empty after that. So subsequent pops will always have to search a lot of days in the calendar. So, these calendar queues are only fast if you keep track of "now" and only insert events after "now".

https://gist.github.com/nbingham1/611d37fce31334a1520213ce5d...

seed 1725545662 priority_queue 0.644354 calendar_queue 0.215860 calendar_queue_vector 0.405788

seed 1725545667 priority_queue 0.572672 calendar_queue 0.196812 calendar_queue_vector 0.392303

seed 1725545672 priority_queue 0.622041 calendar_queue 0.241419 calendar_queue_vector 0.413713

seed 1725545676 priority_queue 0.590372 calendar_queue 0.204428 calendar_queue_vector 0.386992

throwaway81523 · 2024-08-30T06:53:45 1725000825

It would be nice to note in the title that this is a pdf. The algorithm is something like the timer wheels in the Linux kernel. Related to radix sorting more or less. Basically there are a bunch of buckets containing sorted lists of events. I didn't read too carefully since most people use a heap for this, which is O(log n) but likely has better constants.

rhelz · 2024-08-30T09:43:00 1725010980

I've had very similar experience as other commenters have stated W.R.T. calendar queues vs just good-old-fashioned std::priority_queue.

Then, one day, my team hired this ancient soviet engineer who looked like he could have been Lenin's drinking buddy. He was not impressed that I was using std::priority_queue, and he sat down and wrote a calendar queue.

I'll be damned if that thing wasn't 7 to 9 times faster. I thought I was an engineer, but next to this guy, I was just a monkey poking at the typewriter.

It is possible to make a calendar queue which will absolutely mop the floor with any other queue, but the algorithms given in these papers is just a starting point. Going from the published algorithm to an actual performant, production-ready product is always the hardest part.

throwaway81523 · 2024-08-30T10:13:35 1725012815

As mentioned, something like it already exists inside Linux. Maybe it could be pulled out and turned into an app library, if it's so much better than a heap queue. Info: https://duckduckgo.com/?q=timer+wheel+linux

I remember writing a heap queue in C++ myself because std::priority_queue had some kind of shortcoming whose specifics I don't remember. Maybe I can find that program and check what it wanted. It wasn't a performance issue, but rather, something I needed was missing from the stdlib API and I remember thinking that it was silly that they omitted it.

Ono-Sendai · 2024-08-30T12:39:39 1725021579

What you are thinking of is probably that you can't erase elements (apart from the top element) from the priority queue.

throwaway81523 · 2024-08-31T05:41:25 1725082885

No that wasn't it. It may have had to do with constraints on the types of values that the stdlib priority queue could hold.

kazinator · 2024-08-30T17:10:37 1725037837

I independently invented something similar around 1993 inside the scheduler of a threading implementation. I wanted to have a priority scheme whereby the ratios of priority values determined the amount of CPU quanta given to the thread. E.g. a priority 5 thread would twice the CPU time compared to a priority 10.

I called the algorithm "appointment calendar". Threads were scheduled in a calendar, with the lower priority threads (higher value) getting appointments farther in the future. The scheduler just marched through the calendar in order, taking the appointments.

kevinventullo · 2024-08-30T11:43:39 1725018219

I believe the Re-Pair algorithm used for doing linear time byte-pair encoding makes use of a similar idea: https://en.m.wikipedia.org/wiki/Re-Pair

There, instead of dates, the “priority index” reflects the frequencies of pairs seen in the input string. This leads to guaranteed O(1) runtime amortized over the input string, since the largest frequency count is bounded by the size of the input and can only decrease as merges happen.

Ono-Sendai · 2024-08-30T06:52:36 1725000756

I think I reinvented and started implementing something like this, but then just ended up using std::priority_queue (the C++ standard library priority queue) which is pretty fast.

timClicks · 2024-08-30T12:56:11 1725022571

Strictly speaking, you ended up using your compiler's priority queue. The standard defines the interface and invariants, but the implementer has discretion about the implementation. There are also several knobs you can turn to tweak the performance characteristics. std::priority_queue is a container adapter that can be applied to many of the standard containers. https://en.cppreference.com/w/cpp/container/priority_queue

quuxplusone · 2024-08-30T15:32:27 1725031947

[Nit: Not "your compiler's" but "your library's." The C++ Standard Library is generally provided by the compiler vendor, but it's not built into the compiler, except for tiny pieces like `std::bad_alloc` and `std::strong_ordering`.]

The implementor has far less freedom than your answer seems to be implying. The standard specifies, for example:

https://eel.is/c++draft/priority.queue#priqueue.cons-4

> The constructor calls make_heap(c.begin(), c.end(), comp).

https://eel.is/c++draft/priority.queue#priqueue.members-5

> emplace calls push_heap(c.begin(), c.end(), comp).

And so on. In fact, if I weren't trying to "yes and" you, I'd say there is essentially no implementation freedom. In particular, the user-programmer is allowed, at any point, to extract the protected data member `c` and verify that it is in fact heapified in the same way that `std::push_heap` would have heapified it.

That said, std::priority_queue::pop is specified to behave "as if by pop_heap followed by pop_back," and in fact the vendor can do better there, by using Floyd's "bottom-up" algorithm. LLVM's libc++ switched from the naïve implementation to Floyd's version back in early 2022, thus closing a feature request that had been open for 11 years at the time:

https://github.com/llvm/llvm-project/commit/79d08e398c17e83b...

I think the other two major vendors had already switched by then, although I'm not sure.

The implementation definitely does not have the freedom to switch from the mandated heap-based PQ to any alternative kind of PQ, including but not limited to (TAOCP §5.2.3) "leftist or balanced trees, stratified trees, binomial queues, pagodas, pairing heaps, skew heaps, Fibonacci heaps, calendar queues, relaxed heaps, fishspear, hot queues, etc."

I once wrote an STL-style implementation of Fishspear, with some analysis of its pros and cons. https://quuxplusone.github.io/blog/2021/05/23/fishspear/

Validark · 2024-08-31T10:11:55 1725099115

Floyd's algorithm scales worse than the naïve implementation though (for small data elements, when comparisons are cheap).

See "Performance Engineering Case Study: Heap Construction" by Jesper Bojesen, Jyrki Katajainen & Maz Spork

The reason why the naïve algorithm scales better is because with each successive push into the heap, you are likely to hit on the same cache lines (assuming your individual elements aren't the size of the cache line or bigger). This is significantly more important for performance than saving a few comparisons, assuming comparisons are cheap.

timClicks · 2024-08-31T04:27:05 1725078425

Thank you very much for taking the time to comment. One of the motivations for mine was to spur someone else who is much more knowledgeable to provide their thoughts. Thank you.

Sesse__ · 2024-08-30T13:10:58 1725023458

std::priority_queue is sorely missing the operation “change the priority of this element” (you need to do it using a delete and then a new insert, which is rather slow), which comes up all the time in e.g. Dijkstra's algorithm.

quuxplusone · 2024-08-30T15:52:25 1725033145

Yes, and, there are two related but different operations there:

- Look up an arbitrary element by its value, and then change that element's priority. This is often needed in real life, but is fundamentally incompatible with std::priority_queue's highly restricted design. There is no public API at all for dealing with "arbitrary elements" of a std::priority_queue; you interact only with the .top() element.

- Change the top element's priority, i.e. handle it and then throw it back down to be dealt with again sometime later. This operation is used in e.g. the Sieve of Eratosthenes.

I'm not sure which operation you're thinking of w.r.t. Dijkstra's algorithm; I'd wildly guess it's the first operation, not the second.

Changing the top element's priority is easy to graft onto the STL priority_queue's API. I've done it myself here: https://quuxplusone.github.io/blog/2018/04/27/pq-replace-top... The proper name of this operation is `pq.replace_top(value)`, and for the perfect-forwarding version, `pq.reemplace_top(args...)`.

Search `reemplace_top` in this Sieve of Eratosthenes code: https://godbolt.org/z/bvY4Mr1GE

OskarS · 2024-08-30T21:13:59 1725052439

You don't actually need to have the "adjust priority of element" operation to implement Dijkstra or A-star. The standard description of the algorithm always include this, but it is not actually necessary: instead of adjusting element priority, you just push duplicate vertices with new priorities on to the queue, and when you pop the queue, you just check if you've already seen this vertex before. If so discard it and pop the next one. The algorithm still works, since the first time you pop a vertex that is the shortest path, and the rest of the time you can ignore it. Simple to implement and plenty fast. There's no difference in time complexity: you have to consider the "duplicate" case at some point, you're just pushing to a later time when you pop it from the queue.

You might argue that is wasteful of space pushing these duplicates, but your other options are either to graft this functionality on to a normal priority queue in which case you're using that space anyway, or to use a much more complex and usually slower kind of priority queue with this operation naturally (e.g. Fibonacci heaps). The space wasted is quite small in practice, since the only time this happens is if multiple nodes on the frontier points to the same element, but most nodes ("in practice") have small degree of incoming paths. The benefit of being able to use standard (and very fast!) priority queues without this weird operation is well worth it.

In my experience of implementing Dijkstra and A-star a couple of dozen times (I like Advent of Code problems!) this has always been the better way to do it. I mean, I haven't put Dijkstra/A-star into production or anything (I don't work for Google Maps or whatever), but in my experience this is the simplest and fastest way in practice to implement these algorithms.

Sesse__ · 2024-08-31T11:32:11 1725103931

> You might argue that is wasteful of space pushing these duplicates, but your other options are either to graft this functionality on to a normal priority queue in which case you're using that space anyway, or to use a much more complex and usually slower kind of priority queue with this operation naturally (e.g. Fibonacci heaps).

FWIW, my favorite solution for Dijkstra is the winner-tree (a binary heap except that all the nodes are in leaves, so the interior nodes are duplicates). Simple to implement, O(log n) insert/delete/update, can be made branch-free in important cases, constant extra overhead over a binary heap. I've found it to be _much_ faster in practice than stashing duplicates into a regular maxheap (I'm not sure if I agree with your notion that there are few of them).

alexhutcheson · 2024-08-30T15:09:18 1725030558

Boost.Heap has this functionality. Or if you want to stick with the standard library it’s fairly easy to use the *_heap functions from <algorithm> and just hand-code your own fix_heap(first, last, changed) function. Agree it would be more convenient to have it built-in, though.

packetlost · 2024-08-30T15:30:35 1725031835

Everything can be O(1) if you put bounds on every operation.

kevindamm · 2024-08-30T17:22:42 1725038562

The difference of note is when the problem statement has an implicit constraint that makes the bounds reasonable vs. when the bounds are arbitrary and artificially constrain the problems that can be solved.

Sorting integer elements that can only be represented by 64-bit ints and smaller? Radix sort for the win (*). Sorting strings which may be of any length? Well, saying you can do that in linear time is a bit disingenuous.

It is always important to recognize the properties of the problem domain when stating complexity bounds.

(*) of course, any vanilla comparison-based sort is a better first-implementation than radix sort, but we're talking about linear time algorithms here

amelius · 2024-08-30T12:04:45 1725019485

If this is really O(1), then that makes sorting O(N).

mananaysiempre · 2024-08-30T12:08:00 1725019680

It does, radix sort is in fact O(N) if size of the universe of values counts as a constant. It’s just slow in practice.

The definition of the machine for which the O(N log N) bound is proved is very delicate: you have to allow O(1) operations on an arbitrarily large set of values but not encoding tricks allowing multiple values to be packed into one and then manipulated unrealistically cheaply using those operations. In particular, the machine must not be able to do arbitrary arithmetic.

Sesse__ · 2024-08-30T13:09:07 1725023347

Or stated equivalently: The only operations allowed on elements are binary comparisons and two-element swaps, but both are O(1).

mananaysiempre · 2024-08-30T13:42:44 1725025364

Kiiinda. Two-element swaps are a stretch already for merge sort, especially the O(log N)-space linked list version, let alone search trees and so on. At some point you also need to make sure you can’t sneak arbitrary computation into the (necessarily unlimited-magnitude) array index.

nwellnhof · 2024-08-30T14:26:24 1725027984

A better comparison is bucket sort which is O(N) with uniformly distributed keys.

amelius · 2024-08-30T12:11:03 1725019863

> if size of the universe of values counts as a constant

But of course, that is cheating.

mananaysiempre · 2024-08-30T12:18:46 1725020326

I mean, it depends. In Unicode normalization you have to do a stable sort of an arbitrary number of values (code points) that can only ever map to a small finite number ( < 256) of sort keys (combining classes). Insertion sort is the best choice for ordinary inputs, but for adversarial ones a counting sort is probably your best bet.

tonyg · 2024-08-30T12:11:33 1725019893

It looks like it's rather sensitive to the distributions of inputs. The claim in the abstract is that it's O(1) "for the priority increment distributions recently considered by Jones in his review article." The conclusion gives a bit more detail.

tonyg · 2024-08-30T12:08:43 1725019723

Year in title is wrong: the paper is from 1988, not 1998.