I'm excited to share a new Go package that I've been working on, called `u8p`. It's designed to help developers handle UTF-8 encoded strings safely and efficiently, especially when you need to truncate these strings without causing data corruption.
The core of `u8p` is the `Find` function, which returns the index of the last UTF-8 lead byte within a given length limit of a string. This can be incredibly useful when dealing with log data or any large text content that needs to be segmented or truncated without damaging the integrity of the UTF-8 encoding.
For example, in scenarios where sending entire logs can overwhelm server resources, knowing where to safely cut a UTF-8 string can save both bandwidth and computational overhead, while preserving the validity of the data.
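To make this concrete, here is a minimal, self-contained sketch of the idea (`findLead` is a hypothetical stand-in, not the package's actual source): it scans backwards from the limit for the last UTF-8 lead byte, i.e. the last byte that is not a continuation byte.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// findLead is a hypothetical stand-in for u8p.Find: it returns the index
// of the last UTF-8 lead byte within a[:l], or an error if l is out of
// range. A lead byte is any byte that is not a continuation byte
// (0b10xxxxxx), which utf8.RuneStart checks for us.
func findLead(a string, l int) (int, error) {
	if l <= 3 || len(a) < l {
		return 0, errors.New("l is out of range")
	}
	for i := l - 1; i >= 0; i-- {
		if utf8.RuneStart(a[i]) {
			return i, nil
		}
	}
	return 0, errors.New("no lead byte found")
}

func main() {
	i, err := findLead("dreißig", 5)
	fmt.Println(i, err) // the two-byte ß starts at byte index 4
}
```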
Additionally, I've tested the `u8p` package for safety using fuzzing. Fuzzing exercises the utility with many different inputs, including unexpected or malformed data, and helps ensure that `u8p` does not crash or misbehave. This makes it reliable for real-world applications where data needs to be handled safely and predictably.
I believe `u8p` can be a valuable package for anyone who needs to manage large datasets or logs in Go and would love to hear your thoughts or any feedback. Contributions are also welcome!
Thanks for checking it out!
https://github.com/catatsuy/u8p
1. `LimitPrefix(a, n)` should always return some prefix of `a`, namely `a[:m]` where `0 <= m <= n`.
1a. In particular, `m` can be zero if `n` is small enough.
1b. And `m` is expected to equal `len(a)` if `len(a) <= n`.
2. `a[:m]` should of course be a valid UTF-8 string if `a` already was.
3. `m` should be maximized under these conditions.
There are some edge cases that we have to fill in as well:
4. Conditions 1 and 1b are a reasonable expectation even for non-UTF-8 inputs, and they are easy to guarantee.
5. Condition 2 can't be efficiently extended to non-UTF-8 inputs, and no justifiable use case for doing so exists.
6. However, condition 3 depends on condition 2 (and condition 1, of course). It should therefore be replaced with something concrete; otherwise we risk an unintentional incompatibility.
7. A negative `n` may arise from a size calculation with a missing bounds check, so treating it as zero sounds fair.
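These conditions can be sketched as follows (my reading of them, untested against the package; as a concrete stand-in for condition 3, the sketch backs up over at most `utf8.UTFMax-1` continuation bytes, which also pins down the behavior on non-UTF-8 input):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// limitPrefix is a hypothetical sketch of the proposed LimitPrefix. It
// returns a[:m] with 0 <= m <= n, never splitting a UTF-8 sequence, and
// maximizes m within a concrete bound on how far it backs up.
func limitPrefix(a string, n int) string {
	if n < 0 { // condition 7: treat negative n as zero
		n = 0
	}
	if len(a) <= n { // condition 1b: the whole string fits
		return a
	}
	m := n
	// Back up while a[m] is a continuation byte (0b10xxxxxx) so the cut
	// lands on a rune boundary; utf8.RuneStart reports a non-continuation
	// byte. At most UTFMax-1 = 3 steps are ever needed for valid UTF-8,
	// which makes the behavior on invalid input predictable as well.
	for m > 0 && n-m < utf8.UTFMax-1 && !utf8.RuneStart(a[m]) {
		m--
	}
	return a[:m]
}

func main() {
	fmt.Println(limitPrefix("dreißig", 5)) // drei
	fmt.Println(limitPrefix("thirty", 5))  // thirt
	fmt.Println(limitPrefix("one", 5))     // one
}
```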
The current design, in comparison, just finds the last UTF-8 lead byte within `a[:l]`. It doesn't even help with truncation: both `Find("thirty", 5)` and `Find("dreißig", 5)` return 4, but `"thirty"[:5]` is valid UTF-8 while `"dreißig"[:5]` is not. Also, `Find("one", 5)` unexpectedly fails! The arbitrary `l <= 3` error condition is even more confusing.
---
Based on the aforementioned conditions, I propose the following instead (warning: never tested):