Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Index 1.6B Keys with Automata and Rust (2015) (burntsushi.net)
174 points by ithkuil on Aug 20, 2021 | hide | past | favorite | 10 comments


The FST mechanism for using an acyclic graph to store information is very reminiscent of GADDAG[1] structure for Scrabble dictionaries.

In my search for interesting problems in the compression space, I ran into quite a lot of genomics research which suggests combining a compression algorithm with a lookup method, because identifying common items between samples reduces both storage and speeds up lookup.

> An FST based data structure as presented in this article does not have this capability. Namely, once it is produced, it cannot be changed.

That is somewhat of a weakness in general for succinct data structures of this sort, which makes it harder to remove data (much easier to maintain a "deleted set" to go with it, but that has problems with a delete + re-add).

[1] - https://en.wikipedia.org/wiki/GADDAG


Great article, but should be marked (2015) as it is a bit of a classic ;)


charming article, as always, from burntsushi.

You may also enjoy his xsv util [ https://github.com/BurntSushi/xsv ] for wrangling large csv files.


Interesting that there are so many ways to implement ordered sets and maps. V8 takes a very different approach when implementing ordered sets and maps. It works more like a linked list.

https://itnext.io/v8-deep-dives-understanding-map-internals-...


> It works more like a linked list.

After looking at it, it doesn't. The "linked list" bit is just that it uses closed addressing (separate chaining), but instead of having a linked list of buckets, all the buckets are part of the array and the links point back into the array.

Raymond Hettinger's naturally ordered hashmap uses a similar "trick" for an open addressed map: a sparse array of hashes, and a separate dense array of actual data.

In both cases, the ordering is preserved through a dense array and the sparse "hash map" indexes into that array, this is as opposed to the older method of ordering (e.g. LinkedHashMap or PHP's own implementation) where the map items would be augmented with a (usually doubly) linked list maintaining the ordering information.

Of note: this order version can still be useful if the ordering information can be manipulated, manipulating ordering in the dense-array version would be expensive and convoluted. That's why even though CPython switched to the "hettinger map" back in 3.6 the OrderedDict collection still uses a doubly linked list to maintain its ordering information.


Note that there are two different meanings of “ordered” in this context. In an ordered hash map (like PHP, JavaScript, and Python) it means insertion order. But the article is talking about lexicographic order, for which you need a tree structure of some kind (binary tree, btree, trie, or this article’s fst).


(2015)


cool but smthng is telling me I've seen smthng exactly like this before.


Seems to be quite popular given its 8 total submissions[0] :)

[0]: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...


Glad it was reposted as I've never seen it before!




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: