Index 1.6B Keys with Automata and Rust (2015)

gopalv · on Aug 21, 2021

The FST mechanism for using an acyclic graph to store information is very reminiscent of GADDAG[1] structure for Scrabble dictionaries.

In my search for interesting problems in the compression space, I ran into quite a lot of genomics research which suggests combining a compression algorithm with a lookup method, because identifying common items between samples reduces both storage and speeds up lookup.

> An FST based data structure as presented in this article does not have this capability. Namely, once it is produced, it cannot be changed.

That is somewhat of a weakness in general for succinct data structures of this sort, which makes it harder to remove data (much easier to maintain a "deleted set" to go with it, but that has problems with a delete + re-add).

[1] - https://en.wikipedia.org/wiki/GADDAG

chromatin · on Aug 21, 2021

Great article, but should be marked (2015) as it is a bit of a classic ;)

jstgord · on Aug 21, 2021

charming article, as always, from burntsushi.

You may also enjoy his xsv util [ https://github.com/BurntSushi/xsv ] for wrangling large csv files.

mfbx9da4 · on Aug 21, 2021

Interesting that there are so many ways to implement ordered sets and maps. V8 takes a very different approach when implementing ordered sets and maps. It works more like a linked list.

https://itnext.io/v8-deep-dives-understanding-map-internals-...

masklinn · on Aug 21, 2021

> It works more like a linked list.

After looking at it, it doesn't. The "linked list" bit is just that it uses closed addressing (separate chaining), but instead of having a linked list of buckets, all the buckets are part of the array and the links point back into the array.

Raymond Hettinger's naturally ordered hashmap uses a similar "trick" for an open addressed map: a sparse array of hashes, and a separate dense array of actual data.

In both cases, the ordering is preserved through a dense array and the sparse "hash map" indexes into that array, this is as opposed to the older method of ordering (e.g. LinkedHashMap or PHP's own implementation) where the map items would be augmented with a (usually doubly) linked list maintaining the ordering information.

Of note: this order version can still be useful if the ordering information can be manipulated, manipulating ordering in the dense-array version would be expensive and convoluted. That's why even though CPython switched to the "hettinger map" back in 3.6 the OrderedDict collection still uses a doubly linked list to maintain its ordering information.

fanf2 · on Aug 21, 2021

Note that there are two different meanings of “ordered” in this context. In an ordered hash map (like PHP, JavaScript, and Python) it means insertion order. But the article is talking about lexicographic order, for which you need a tree structure of some kind (binary tree, btree, trie, or this article’s fst).

seeekr · on Aug 21, 2021

(2015)

afr0ck · on Aug 21, 2021

cool but smthng is telling me I've seen smthng exactly like this before.

dvaun · on Aug 21, 2021

Seems to be quite popular given its 8 total submissions[0] :)

[0]: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

vendiddy · on Aug 21, 2021

Glad it was reposted as I've never seen it before!