
I might have missed it in the article but I'm not sure why the prefix is stored for strings that can't be inlined.


Ctrl-F for "Some motivations are as follows" under the "String view with short string optimizations" section here: https://docs.google.com/document/d/12aZi8Inez9L_JCtZ6gi2XDbQ...

Copying here:

> Having the 4-byte prefix directly accessible (without indirection through an offset into a separate data buffer) can substantially improve the performance of comparisons returning false. This prefix can be encoded with multi-column hash keys to accelerate aggregations, joins. Sorts would likely also be significantly faster with this representation (experiments would tell for certain)

> Certain algorithms (for example “prefix of string” or “suffix of string” — e.g. PREFIX(“foobar”, 3) -> “bar”) can execute by manipulating StringView values only and not requiring any memory copying of large strings.

This document was an early proposal for adding what is now called the StringView (and ByteView) types to the Arrow format itself.
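
To make the two quoted points concrete, here is a rough sketch of the idea in Python. It is not Arrow's implementation. The `StringView` class, the 12-byte inline limit, and the `make_view`, `full_bytes`, `views_equal` and `suffix` helpers are all made up for illustration. The `stats` counter only exists to show when the separate data buffer actually gets touched.

```python
from dataclasses import dataclass
from typing import Optional

INLINE_MAX = 12          # assumption: strings this short are stored inside the view
stats = {"derefs": 0}    # counts reads that go through to a separate data buffer


@dataclass(frozen=True)
class StringView:
    """Hypothetical view: length, 4-byte prefix, then inline bytes or a buffer reference."""
    length: int
    prefix: bytes                  # first 4 bytes, zero-padded
    inline: Optional[bytes]        # whole string when length <= INLINE_MAX
    buffer: Optional[bytes] = None # backing buffer for long strings (shared, not copied)
    offset: int = 0


def make_view(buffer: bytes, offset: int, length: int) -> StringView:
    if length <= INLINE_MAX:
        data = buffer[offset:offset + length]  # copies at most 12 bytes
        return StringView(length, data[:4].ljust(4, b"\0"), data)
    # Long string: read only the 4-byte prefix, keep a reference to the buffer.
    return StringView(length, buffer[offset:offset + 4], None, buffer, offset)


def full_bytes(v: StringView) -> bytes:
    if v.inline is not None:
        return v.inline
    stats["derefs"] += 1           # this is the indirection the prefix helps avoid
    return v.buffer[v.offset:v.offset + v.length]


def views_equal(a: StringView, b: StringView) -> bool:
    # Fast reject: length and prefix live in the view itself.
    if a.length != b.length or a.prefix != b.prefix:
        return False
    return full_bytes(a) == full_bytes(b)


def suffix(v: StringView, n: int) -> StringView:
    """Last n bytes of v as a new view; a long result just re-points into the same buffer."""
    n = min(n, v.length)
    if v.inline is not None:
        return make_view(v.inline, v.length - n, n)
    return make_view(v.buffer, v.offset + v.length - n, n)
```

Comparing two long strings whose first 4 bytes differ returns False without ever touching the data buffer. Taking a long suffix of a long string yields a view with a new offset, length and prefix over the same buffer object.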


The first n bytes are likely by far the most often accessed in practice, specifically for sorting, filtering, etc. Storing them inline is likely a huge optimization for little cost.


It's zero cost, since you want the pointer to be 64-bit aligned anyway. With a 4-byte length in front, the 4 bytes the prefix occupies would otherwise just be padding.
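
The alignment argument can be checked with a bit of arithmetic. This is only a sketch under assumed sizes: a 4-byte length, a 4-byte prefix, and an 8-byte pointer that must sit on an 8-byte boundary, laid out with C-like struct rules. The `align_up` and `layout` helpers are hypothetical names, not part of any library.

```python
def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def layout(fields):
    """fields: list of (name, size, alignment) in declaration order.
    Returns (offsets, padding_bytes, total_size) under C-like layout rules."""
    offsets, pos, padding, max_align = {}, 0, 0, 1
    for name, size, align in fields:
        start = align_up(pos, align)
        padding += start - pos       # gap inserted before this field
        offsets[name] = start
        pos = start + size
        max_align = max(max_align, align)
    total = align_up(pos, max_align) # tail padding so arrays stay aligned
    padding += total - pos
    return offsets, padding, total


# Assumed field sizes; the prefix is a plain byte array, so alignment 1.
without_prefix = layout([("length", 4, 4), ("pointer", 8, 8)])
with_prefix = layout([("length", 4, 4), ("prefix", 4, 1), ("pointer", 8, 8)])
```

Under these assumptions both layouts come out at 16 bytes. Without the prefix there are 4 bytes of padding between the length and the pointer. With the prefix there is no padding at all.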



