Why Polars rewrote its Arrow string data type

Rygian · on Aug 6, 2024

> “short string optimization”: A short enough string can be stored “in place” [...] An optimization that’s impossible in Rust, by the way ;).

Author is not aware of https://docs.rs/compact_str/latest/compact_str/ or https://github.com/bodil/smartstring

jtrueb · on Aug 7, 2024

This is false. The polars api has used smart string for a long time.

https://github.com/pola-rs/polars/blob/32a2325b55f9bce81d019...

diggan · on Aug 6, 2024

That documentation talks about all the benefits and "can mostly be used as a drop in replacement for String", but what are the tradeoffs? When cannot it be used?

tialaramex · on Aug 6, 2024

It looks like there's a bunch of "garbage collection" type activity where you may have some bytes which were once part of a string but aren't now used, and you're always paying for the overhead of this optimisation even if it's useless for your problem.

Suppose you work only with 500-4000 byte strings, maybe they're short reviews, and each ends with a rating in star emoji, ***** is the best * is the worst. [[HN ate my star emoji of course]]

So your reviews never fit in the "optimised" string slot, but also the prefix is just opening words from a review, which in some review styles will be the start of seemingly unrelated anecdotes. "My grandfather used to tell me" it'll get to the review eventually, and you'll see why they're connected, but the suffix is useful and that's not stored in a "German string" data structure.

Or maybe you have a high turnover of somewhat related medium size strings, so then that garbage collection step costs quite a lot of overhead.

LegionMammal978 · on Aug 7, 2024

Any code built closely around String's power-of-2 reallocation pattern may have to be reworked. I don't think there's any case when it cannot be used as a String replacement at all, except maybe when interfacing with an API that expects a &mut String as an output parameter.

runeblaze · on Aug 7, 2024

My uninformed guess is that at least it will cost you the branching because you need to check if it is inlined or not, and you pay that for every string. Branch prediction is likely very good for this case though.

charliermarsh · on Aug 6, 2024

Was this section removed? I'm not seeing it in the linked post.

tialaramex · on Aug 6, 2024

It seems to be a quote from https://cedardb.com/blog/german_strings/ which is about this German Strings type (implemented in Polars)

But yeah, it's pretty ignorant to assume Rust can't do this since the best available examples (as with many things) are in Rust. CompactString is really nice. On a typical modern (64-bit) computer CompactString takes 24 bytes and holds up to 24 bytes of UTF-8 text inline, while also having a niche.

I guess the confusion arises because C++ people tend to assume that anywhere Rust differs from the practice in the C++ community it's a mistake, even though that's often because C++ made the wrong choice? Rust's &str is "just" &[u8] plus a rule about the meaning of these bytes, and Rust's String is correspondingly "just" Vec<u8> plus a rule about the meaning of those bytes. C++ couldn't have done the former because it only belatedly got the fat pointer slice reference (as the troubled std::span) years after having a string data type.

Rust didn't do this in the stdlib, but not because it's impossible, because it's a trade off and they wanted the provided stdlib type to be straightforward. If you need or even just want the trade off, you can just cargo add compact_str

throwaway48540 · on Aug 7, 2024

I remember some C++ colleagues raving on about the standard library having anything and everything they might ever need. Makes sense in a world without sane package managers and package registries, but that mindset just doesn't carry over.

kalaksi · on Aug 7, 2024

A decent standard library very much makes sense even with sane package managers and registries. Just look at JS. It's awful that you need to hunt for packages for simple stuff (or implement yourself). Stdlib is usually straightforward to use, good quality, trustworthy and good and lasting support

rob74 · on Aug 7, 2024

The "good and lasting support" part is the most important IMHO. Nothing is more annoying than having to switch to another library because the one you are currently using goes unmaintained. This can happen in the "stdlib" too (e.g. PHP deprecating the mcrypt library), but it happens much less often, and typically with much more time to prepare. Also important is that, when there is a "standard" stdlib package, other dependencies you might use will probably use that - having two dependencies that use different packages for doing the same thing is also annoying.

giancarlostoro · on Aug 7, 2024

I attribute Go's success to a rich OOTB standard library. You can build so much before you reach for third party packages. Heck, for web services, the built-in + their templating gets you pretty far.

tux3 · on Aug 7, 2024

That's a bit silly, the C++ standard library is pretty small ("the intersection of your needs, not the union" is what they say), and it's often not even really good at the few things that it does. Whether that's core stuff like hashmaps and iostreams, or more niche stuff like std::regex.

adrianN · on Aug 7, 2024

That development is quite recent. The stdlib used to be very barebones. It still is compared to for example the Python one.

saagarjha · on Aug 7, 2024

> I guess the confusion arises because C++ people tend to assume that anywhere Rust differs from the practice in the C++ community it's a mistake, even though that's often because C++ made the wrong choice?

Funny, I hear that a lot from the Rust folks.

ladyanita22 · on Aug 6, 2024

Why is it impossible in Rust? Do you have any source for that?

steveklabnik · on Aug 6, 2024

It is not. It is not implemented in std::string::String, but (as pointed out elsewhere in this thread) there are other string implementations that have it.

It was decided explicitly against for the standard library, because not every optimization is universally good, and keeping String as a thin wrapper over Vec<T> is a good default.

jerf · on Aug 7, 2024

This is one of those things that I wish people would speak more carefully about. I've seen it in every programming language community I've participated in, so it's not a language-specific thing, but... one should not say "language X does not do a thing" when they mean "language X's standard library does not do a thing". That the "language" doesn't do a thing should be reserved for the cases where the language itself really does preclude some particular thing for some reason. Otherwise the relatively inexperienced programmers end up coming away with some really weird mental models of what programming languages can and can not do, just like here. Of course Rust qua Rust can store strings like this, it's basically built for things like this. Any mental model of Rust that thinks Rust qua Rust can't do this is a weird mental model of Rust.

steveklabnik · on Aug 7, 2024

Yep, especially with a systems language. As you say, they're basically built for "I need to do something specific."

mhh__ · on Aug 7, 2024

When are small strings bad? Parallelism?

workingjubilee · on Aug 7, 2024

The "small string optimization" makes the strings harder to manipulate with `unsafe` code. However, being easy to manipulate with `unsafe` code is in fact a priority for most std types, so that they are easily understood, extended, rearranged, passed over FFI and then later reconstituted, etc.

You can tear apart these types and reassemble them very easily. For many C++ std types, you cannot do this.

kibwen · on Aug 7, 2024

Rust APIs use string slices (&str) extensively. With its current design, converting from a String to a &str is a no-op; if String instead did small string optimization, converting to &str wouldn't be free. Furthermore, thanks to the borrow checker, Rust code tends to avoid copying strings around, so the benefit of SSO is reduced. C++ does more string copying and didn't have a standard string_view for a long time, so considering the tradeoffs both languages made reasonable decisions.

steveklabnik · on Aug 7, 2024

Just wanted to say that I only got back to this thread now, but I agree with my sibling commentors.

Here's a discussion about it from a few years back, with some links to the primary discussions: https://news.ycombinator.com/item?id=18372332

CJefferson · on Aug 7, 2024

It adds a branch every time you access the string (to check if it is small or not), and can stop various optimisations. g++ used to have small string optimisation, but (eventually) removed it.

n_plus_1_acc · on Aug 7, 2024

You'll need to branch on most operations to check "am i small?". This may cause issues with the Branche predictor.

Rygian · on Aug 6, 2024

> When you access a transient string, the string itself won’t know whether the data it points to is still valid, so you as a programmer need to ensure that every transient string you use is actually still valid.

In my mind this reads identical to "if you're a security practitioner, worry about this bit here."

kristianp · on Aug 7, 2024

Don't you lose the in-memory interop with other libraries by doing this? I'm thinking that duckdb will no longer be able to read polars data that has been loaded into memory, as it can currently do, due to duckdb supporting Arrow. Isn't the benefit of arrow that it's supported by many languages and libraries as a standard?

Will there be an option to use the "compatible" string format?

mau · on Aug 7, 2024

Arrow supports this string format: https://arrow.apache.org/docs/format/Columnar.html#variable-...

From the article:

> As luck would have it, the Arrow spec was also finally making progress with adding the long anticipated German Style string types to the specification. Which, spoiler alert, is the type we implemented.

ritchie46 · on Aug 7, 2024

DuckDB has their own string type (quite similar to this) that deviates from Arrow (large)-string type, so it had to do a copy anyway. Nothing has changed on that front.

emidln · on Aug 7, 2024

Wouldnt you just recompile your duckdb with the hacked Arrow?

suremarc · on Aug 7, 2024

Arrow is a cross-language standard. Polars is written in Rust but DuckDB is written in C++. So it's not that simple unfortunately.

emidln · on Aug 8, 2024

If you're using Arrow from duckdb, you're using C or C++ bindings. Hack the arrow lib, recompile (and adjust if necessary) whatever bindings/programs using your hacked arrow. If you're using multiple implementations of arrow that arent just bindings, you might have to hack it in more than one place.

You don't have to solve the entire world, you only need to hack what you actually use. If the issue with an optimized arrow impl is memory incompat in duckdb, recompile duckdb with your hacked arrow. This is where "overengineered" stuff like mono repos really shine. I've done exactly this type of thing at past jobs.

Don't let perfect be the enemy of good enough.

xiaodai · on Aug 7, 2024

For short strings, this is basically the same concept as ShortStrings.jl in julia

https://github.com/JuliaString/ShortStrings.jl

lmeyerov · on Aug 7, 2024

I wonder if something like arrow's custom extension types mechanism would streamline building novel data representations without full forks? It may highlight gaps in the extension mechanism

For similar reasons, we've been curious about new compression modes on indexes

ismailmaj · on Aug 6, 2024

I might have missed it in the article but I'm not sure why the prefix is stored for strings that can't be inlined.

pgwhalen · on Aug 6, 2024

Ctrl-F for "Some motivations are as follows" under the "String view with short string optimizations" section here: https://docs.google.com/document/d/12aZi8Inez9L_JCtZ6gi2XDbQ...

Copying here:

> Having the 4-byte prefix directly accessible (without indirection through an offset into a separate data buffer) can substantially improve the performance of comparisons returning false. This prefix can be encoded with multi-column hash keys to accelerate aggregations, joins. Sorts would likely also be significantly faster with this representation (experiments would tell for certain)

> Certain algorithms (for example “prefix of string” or “suffix of string” — e.g. PREFIX(“foobar”, 3) -> “bar”) can execute by manipulating StringView values only and not requiring any memory copying of large strings.

This document was an early proposal for adding what is now called the StringView (and ByteView) types to the Arrow format itself.

make3 · on Aug 6, 2024

the first n bytes are likely by far the most often accessed in practices, specifically for sorting & filtering, etc. Storing them inline is likely a huge optimization for little cost.

NikkiA · on Aug 7, 2024

It's zero cost, since you want the pointer to be 64bit aligned anyway.

Joker_vD · on Aug 7, 2024

> As I mentioned above already pre-allocating the required size of data is hard. This leads to many reallocations and memcopy’s during the building of this type.

Well. Reallocations have to happen mostly because the virtual memory space is flat, so you can't just grow your allocations without the possibility to accidentally bumping into some other object. But having non-flat virtual memory space is really inconvenient (Segment selectors! CHERI! And what about muh address arithmetic?) for other reasons, so here we are.

I toyed with the idea of having a specialized memory allocator for incrementally growing, potentially very large buffers by having it space allocations by, say, 16 GiB, and then there would be the "finalize" operation which would hand over the buffer's contents to malloc by asking malloc to allocate the exact final buffer size (rounded up to the page size) and then, instead of memcpy-ing the data, I'd persuade the OS to remap the physical pages of the existing allocation into the virtual address returned by malloc. The original buffer's virtual addresses then would become unmapped and could be reused.

Unfortunately, I couldn't quite persuade the OS to do that with user-available memory management API so it all came to nothing. I believe there was similar research in the early 90s and it failed because it too required custom OS modifications.

Karellen · on Aug 7, 2024

Given that you're writing a custom allocator, why were you trying to hand allocations over to malloc()? Why not entirely replace malloc() for the process?

(If you still need libc malloc for smaller non-growable allocations under the hood, you should be able to privately access it via dlopen()/dlsym() in your code, shouldn't you?)

Joker_vD · on Aug 8, 2024

> why were you trying to hand allocations over to malloc()?

So that they could be released with free(). For historical reasons, on Linux, most libraries don't generally provide foo_free() functions to be used for freeing objects returned from those libraries, everyone is supposed to use free(), under the tacit assumption that there is only one version of libc loaded in the process which everyone will use. The Windows world has somewhat better culture in this regard.

> Why not entirely replace malloc() for the process?

Now that's just rude.

Karellen · on Aug 8, 2024

> So that they could be released with free().

Well, yes, you have to replace free() and calloc()/realloc() too. Sorry, didn't think that needed spelling out.

> > Why not entirely replace malloc() for the process?

> Now that's just rude.

Isn't that generally what alternate custom allocators do, though? Like dmalloc and jemalloc?

Joker_vD · on Aug 8, 2024

Well, it's one thing when you, as an author of a program, links libraries into your code, and then you pick a custom allocator, and then the libraries you've picked will (transparently) use it. It's an altogether different thing when you, an author of a library, decide to use a custom allocator and then, since your users probably can't be persuaded to use your custom foo_free() that you could export, you hijack the libc's malloc implementation from anyone who links against you and replace it with your own. I personally think it's just rude.

So, with those constraints it's either memcpy-ing into the buffer that you get from the global malloc(), or trying to remap the addresses that that buffer spans onto your own buffer; I thought the latter could be faster starting with moderately large buffers but since I couldn't make it work, I couldn't benchmark it so nothing came out of it.

Karellen · on Aug 9, 2024

Ah.

One other approach that does occur is to just malloc() a 16GiB buffer to begin with. Only pages you touch should end up being backed by RAM(/swap). Then your finalize() operation is just a realloc() to "shrink" the buffer down to its final size. Any decent allocator should keep the data where it is, and just make the now-unused tail portion of address space available again, without ever having needed to back it.

Joker_vD · on Aug 9, 2024

It's possible, absolutely, but if you allocate N such fragments one after another, do some allocations in them, and then finalize them from first to the last, there will be gaps left less than 16 GiB in size, so they could only be reused for small allocations; and IIRC allocators with special support for huge allocations (e.g. using so-called huge pages) do not reuse that memory for "normal-sized" allocations (although I can be wrong on this).

So it's a tradeoff: this fragmentation is not that bad but it's still noticeable in a sufficiently long-running program because 16 GiB is 2**34 so you only can make 16 Ki such allocations before you hit the 2**48 limit. And if you could just simply remap them!..

zeristor · on Aug 7, 2024

No mention of byte alignment, am I just thinking too low level for optimisation?

sakras · on Aug 7, 2024

Well, the array of Umbra strings will almost certainly be aligned to 16-bytes, since each string view is a fixed 16 bytes. For the secondary buffer containing long strings, they would not have any particular alignment because they contain bytes, which are 1-byte aligned. Enforcing an alignment for each string would waste a lot of space, and also it's not clear what that alignment would be. For example, if you had a 37-byte side string, what would you align it to?

sixtram · on Aug 7, 2024

Does anyone know what tool and font was used in the illustrations in the article?

viraptor · on Aug 7, 2024

Most likely Excalidraw and Virgil.

sixtram · on Aug 7, 2024

Thanks for the help, you were right.

mijoharas · on Aug 7, 2024

> The huge (in my opinion pathological) downside of the this data type is that the gather and filter operation on this type scale linear in time complexity with the string sizes.

Can anyone explain why this is?

rvilj · on Aug 8, 2024

The length of the buffer for entire column in O(n * k). If you want to gather a subset of the strings in that buffer, you essentially need to scan across the buffer and copy out the subset you want. Because that buffer grows with string size k, it's linear wrt k. The new implementation doesn't have this problem - you only need to deal with views which are constant size.

justinsaccount · on Aug 6, 2024

Is there a standalone c/c++ implementation of this string type anywhere?

lor_louis · on Aug 6, 2024

It looks a lot like cedarDB's "german strings".

<https://cedardb.com/blog/german_strings/>

You could probably write a C++ implementation based on the article.

Note that it's not all that useful if you don't plan on searching for text based on their prefix. From my understanding, this is mostly a better way to store SStables in RAM/partially on disk if you mmap.

o11c · on Aug 6, 2024

Hm, both the main article and your link are wasteful for strings of length 13-15, which are still pretty common. As a rule for SSO, if you're already take up the bytes unconditionally, it's going to be worth complexifying your "size" calculation to squeeze a little more into the no-alloc case.

That said, note that there are a lot of strings around 20 bytes (stringified 64-bit integers or floats), so pushing to 24-1 is a reasonable choice too.

I'd use 3 size classes:

* 0-15 bytes, stored in-line. If you need C-string compatibility (which is fairly likely somewhere in your application), ensure that size 15 is encoded as a zero at the last byte.

* up to 3.75GB (if my mental math is right), stored with a mangled 32-bit size. Alternatively you could use a simpler cutoff if it makes the mangling easier. Another possibility would be to have a 16-bit size class.

* add support for very large strings (likely with a completely different allocation method) too; a 4GB limit sucks and is easily exceeded. If need be this can use an extra indirection.

Honestly, with a 16-byte budget, I'd consider spending more of that on the prefix - you can get 8 bytes with enough contortion elsewhere.

Duplicating the prefix is probably worth it in more cases than you might think, since it does speed up comparisons. Just remember, you have to check the size to distinguish "a\0" from "a" too, not just from "a\0\0\0\0".

NikkiA · on Aug 7, 2024

The point of a 12-char short string is that it fits, with the size, into a 16byte cache unit and a single memory transfer on current architectures. Anything else will involve some loss of performance.

As for short strings, they don't have a prefix at all - the 12 bytes simply follow the 4 bytes of the length.

Long strings will need the pointer 64bit aligned, so a 4 byte length means you'd have 4 bytes wasted to the pointer alignment anyway, and you fill those with the preview 'prefix'. dword length + 4 bytes + 64bit address = 16 bytes, again. They both occupy the same space in the cache, and only the data at the other end of a pointer on long strings gets pulled into cache if you decide the prefix matches and you need to follow it.

lor_louis · on Aug 6, 2024

Yeah I tend to work with strings data around the 15 to 30 ish char count so I'm also sceptical of german strings when it comes to raw memory usage. What really interests me, is that in theory, an SSTable built from German Strings that point into a larger text buffer further down could result in less pagefaults during a binary search? Maybe?

o11c · on Aug 7, 2024

I mean, if you're doing binary searches, the first thing to do is to switch to Eytzinger searches (at least for the prefixes; it's unclear if Eytzing-sorting large strings matters) which are much more cache friendly. (incidentally, Eytzinger was Austrian, not German - same language though!)

Second, I'd probably switch to struct-of-array instead of array-of-struct, since the search part only needs the (head of the) key. Possibly some trie-like logic would also help (that is, avoid duplicating the common prefix, instead point to an array of continuations rather than a single one), but beware of costly indirections.

Third, I would make sure the user is able to specify the SSO limit, since different domains will have different requirements.

Fourth ... if you're column-oriented (or for simple data in the first place), consider implicit union of different `Set` (or whatever) implementations. For example, store all short strings in one array, and all long strings in a different array.

lor_louis · on Aug 7, 2024

I did not know about Eytzing-sorting, I'll look into it after my vacation, thanks! And yeah our current system is column oriented, and I already tried most of your recommendations (including tries) but the biggest limitation we face is that different kinds of queries will be better served by different kinds of data structures but you can only store a few of them on disk before the cost to index becomes too big.

o11c · on Aug 7, 2024

Whoops, I dropped the "er" the second time. I've also seen it labeled "cache-friendly binary sort", and sometimes the Breadth-First-Search aspect is played up (but that's unsearchable).

Logical-order (in-order DFS) iteration is obvious and trivial (though a lot of tasks are fine with physical-order (BFS) iteration. Beware that some tasks that nominally don't care about input order perform much better on already-sorted input though).

Conversion between logical index and physical index (if you need it - chances are you don't) can still be done efficiently, but requires some (fun!) thought especially for the non-power-of-2 tail.

Explicit prefetching is trivial to arrange for k-great grandchildren if needed (though the width of course grows exponentially). Using an n-ary rather than binary tree is possible but I'm not sure of the cache effects in practice. Switching to linear search just at the last level may be worth it? All of this means more complex logic though.

twic · on Aug 7, 2024

As an aside, what is "German" about these?

danvonk · on Aug 7, 2024

They were developed by the database group at TU Munich.

renewiltord · on Aug 6, 2024

Haha, quite a huge improvement in a point-release. 0.20.5 to 0.20.6.

This is a cool article. Good motivation. Good explanation. Plots. ~~One small thing is that the plots are missing a legend so I had to hover~~. Nevermind, I don't know why it didn't show (or why I thought that) because I can clearly see them on the x-axis now.

Kalanos · on Aug 7, 2024

interesting. strings was supposed to be one of arrow's big features in comparison to numpy, right?

sakras · on Aug 7, 2024

They operate in somewhat different domains - numpy is to numerical data as pyarrow is to dataframe data. This means Arrow has strings, maps, nested types, etc.

Note that I compared to pyarrow and not Arrow - Arrow refers to the arrangement of bytes in memory and isn’t tied to Python or anything.

padthai · on Aug 6, 2024

I am getting Vietnam flashbacks of all the small incompatibilities between Numpy arrays and Pandas Series

grahamjameson · on Aug 6, 2024

https://youtu.be/J0zjJKBK9ys