I might have missed it in the article but I'm not sure why the prefix is stored ...

pgwhalen · on Aug 6, 2024

Ctrl-F for "Some motivations are as follows" under the "String view with short string optimizations" section here: https://docs.google.com/document/d/12aZi8Inez9L_JCtZ6gi2XDbQ...

Copying here:

> Having the 4-byte prefix directly accessible (without indirection through an offset into a separate data buffer) can substantially improve the performance of comparisons returning false. This prefix can be encoded with multi-column hash keys to accelerate aggregations, joins. Sorts would likely also be significantly faster with this representation (experiments would tell for certain)

> Certain algorithms (for example “prefix of string” or “suffix of string” — e.g. PREFIX(“foobar”, 3) -> “bar”) can execute by manipulating StringView values only and not requiring any memory copying of large strings.
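To make the two quoted motivations concrete, here is a minimal sketch of a 16-byte view in Python. It is an illustration, not Arrow's implementation. It assumes strings of 12 bytes or fewer are inlined, and longer ones store a 4-byte prefix plus a buffer index and offset. The helper names (`make_view`, `definitely_not_equal`, `suffix_view`) are hypothetical.

```python
import struct

INLINE_MAX = 12  # strings of up to 12 bytes live entirely inside the 16-byte view


def make_view(data: bytes, buffer_index: int = 0, offset: int = 0) -> bytes:
    """Pack one 16-byte view for `data`, assumed stored at `offset` in a buffer."""
    if len(data) <= INLINE_MAX:
        # Short string: 4-byte length, then the bytes themselves, zero-padded.
        return struct.pack("<I12s", len(data), data)
    # Long string: 4-byte length, 4-byte prefix, buffer index, offset.
    return struct.pack("<I4sII", len(data), data[:4], buffer_index, offset)


def definitely_not_equal(view_a: bytes, view_b: bytes) -> bool:
    """True when length or prefix differ, so no data buffer needs to be read."""
    return view_a[:8] != view_b[:8]


def suffix_view(view: bytes, buffers: list, n: int) -> bytes:
    """Return a view of the last `n` bytes without copying the long string."""
    length = struct.unpack_from("<I", view)[0]
    n = min(n, length)
    if length <= INLINE_MAX:
        # Inline string: slice the bytes already held in the view.
        return make_view(view[4 + length - n : 4 + length])
    _, _, buffer_index, offset = struct.unpack("<I4sII", view)
    start = offset + length - n
    if n <= INLINE_MAX:
        # The result is short, so its few bytes are inlined into the new view.
        return make_view(bytes(buffers[buffer_index][start : start + n]))
    # Still long: only the length, prefix, and offset change.
    prefix = bytes(buffers[buffer_index][start : start + 4])
    return struct.pack("<I4sII", n, prefix, buffer_index, start)
```

When `definitely_not_equal` returns False, the strings may still differ, and a full comparison against the data buffer is needed. `suffix_view` reads at most 12 bytes from the buffer to build the new view, so the large string itself is never copied.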

This document was an early proposal for adding what is now called the StringView (and ByteView) types to the Arrow format itself.

make3 · on Aug 6, 2024

The first n bytes are likely by far the most often accessed in practice, specifically for sorting, filtering, etc. Storing them inline is likely a huge optimization for little cost.

NikkiA · on Aug 7, 2024

It's zero cost, since you want the pointer to be 64-bit aligned anyway.
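A rough way to check the alignment argument is to compare struct sizes with `ctypes`. This sketch assumes a pointer-based variant of the layout (a 4-byte length followed by a 64-bit pointer) and uses `c_uint64` as a stand-in for that pointer. The struct names are hypothetical, and the result depends on the platform ABI.

```python
import ctypes


class LengthAndPointer(ctypes.Structure):
    # 4-byte length followed directly by an 8-byte pointer-sized field.
    _fields_ = [("length", ctypes.c_uint32), ("data", ctypes.c_uint64)]


class LengthPrefixPointer(ctypes.Structure):
    # Same struct with a 4-byte prefix placed between length and pointer.
    _fields_ = [
        ("length", ctypes.c_uint32),
        ("prefix", ctypes.c_char * 4),
        ("data", ctypes.c_uint64),
    ]


size_without_prefix = ctypes.sizeof(LengthAndPointer)
size_with_prefix = ctypes.sizeof(LengthPrefixPointer)
pointer_offset = LengthPrefixPointer.data.offset
```

On a typical 64-bit platform, where the 8-byte field is aligned to 8 bytes, the two sizes should come out equal: the prefix occupies bytes that would otherwise be padding.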