Show HN: u8p – A Go Utility for Precise UTF-8 String Truncation (github.com/catatsuy)
3 points by catatsuy on May 4, 2024 | 3 comments
I'm excited to share a new Go package that I've been working on, called `u8p`. It's designed to help developers handle UTF-8 encoded strings safely and efficiently, especially when you need to truncate these strings without causing data corruption.

The core of `u8p` is the `Find` function, which identifies the index of the leading UTF-8 byte in a string. This can be incredibly useful when dealing with log data or any large text content that needs to be segmented or truncated without damaging the integrity of the UTF-8 encoding.

For example, in scenarios where sending entire logs can overwhelm server resources, knowing where to safely cut a UTF-8 string can save both bandwidth and computational overhead, while preserving the validity of the data.

Additionally, I've tested the u8p package for safety using fuzzing. Fuzzing tests how the utility handles many different inputs, including unexpected or incorrect data. This helps ensure that u8p does not crash or act strangely. It's reliable for use in real-world applications where data needs to be secure and stable.

I believe `u8p` can be a valuable package for anyone who needs to manage large datasets or logs in Go and would love to hear your thoughts or any feedback. Contributions are also welcome!

Thanks for checking it out!

https://github.com/catatsuy/u8p



I don't quite understand the motivation for this package. Yes, truncating a UTF-8 string to a byte size limit without making it invalid is a valid problem. But if that were the only motivation, the function signature ought to be:

    func LimitPrefix(a string, n int) string
...and it should never have an error condition (which does allocate memory). The name and signature should immediately suggest the following requirements:

1. `LimitPrefix(a, n)` should always return some prefix of `a`, namely `a[:m]` where 0 <= m <= n.

1a. In particular, `m` can be zero if `n` is small enough.

1b. And `m` is expected to be equal to `len(a)` if `len(a) <= n`.

2. `a[:m]` should of course be a valid UTF-8 string if `a` already was.

3. `m` should be maximized under these conditions.

There are some edge cases that we have to fill in as well:

4. The conditions 1 and 1b are a reasonable expectation even for non-UTF-8 inputs. They are also easy to guarantee.

5. The condition 2 can't be efficiently extended for non-UTF-8 inputs and no justifiable use cases exist.

6. However the condition 3 depends on the condition 2 (and 1 of course). Therefore it should be replaced with something concrete, otherwise we risk an unintentional incompatibility.

7. Negative n may arise from a size calculation with a missing bound check, so treating it as zero sounds fair.

The current design, in comparison, just finds the last UTF-8 lead byte within `a[:l]`. It doesn't even help with the truncation: both `Find("thirty", 5)` and `Find("dreißig", 5)` return 4, but `"thirty"[:5]` is valid while `"dreißig"[:5]` is invalid. Also `Find("one", 5)` unexpectedly fails! An arbitrary condition of `l <= 3` is even more confusing.
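The difference between the two slices is easy to check with `utf8.ValidString` from the standard library. The helper `cutIsValid` below is only an illustrative name:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// cutIsValid reports whether the first n bytes of s form valid UTF-8.
// It assumes 0 <= n <= len(s).
func cutIsValid(s string, n int) bool {
	return utf8.ValidString(s[:n])
}

func main() {
	// "thirty" is pure ASCII, so any cut is valid.
	fmt.Println(cutIsValid("thirty", 5))
	// In "dreißig" the two-byte 'ß' sits at bytes 4 and 5, so a cut at 5
	// keeps its lead byte but drops its continuation byte.
	fmt.Println(cutIsValid("dreißig", 5))
}
```

The first call reports true and the second false, even though the last lead byte sits at index 4 in both strings.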

---

Based on the aforementioned conditions, I propose the following instead (warning: never tested):

    func LimitPrefix(a string, n int) string {
        if len(a) <= n { // Condition 1b
            return a
        }
        n = max(n, 0) // Condition 7; n < len(a) already holds here (condition 1)

        // Condition 3: in valid UTF-8 the lead byte of the character covering
        // a[n-1] lies within a[n-4:n], so scan back at most four bytes.
        for i := n - 1; i >= max(n-4, 0); i-- {
            var extent int
            switch a[i] >> 4 {
            case 8, 9, 0xa, 0xb:
                continue // Continuation byte: keep scanning
            case 0xc, 0xd:
                extent = 2
            case 0xe:
                extent = 3
            case 0xf:
                extent = 4
            default: // 0 through 7: ASCII
                extent = 1
            }
            if i+extent <= n { // Condition 2: the character fits within n bytes
                return a[:i+extent]
            }
            return a[:i] // Conditions 1a and 2: drop the character cut off at n
        }
        return a[:n] // Condition 6: Do not truncate if no lead byte is found.
    }


Thank you for your reply.

I was thinking about whether to return an error. If we can’t find a UTF-8 start byte in the nearby 4 bytes, it’s unclear what to return. I thought maybe we could ignore this problem.

I don’t return the string itself because I don’t know if users want the start or the end of the string. Also, I want to avoid copying large strings. It’s up to the users how they use this function.

Since no one is using this package yet, we might consider changing the interface.


> I was thinking about whether to return an error. If we can’t find a UTF-8 start byte in the nearby 4 bytes, it’s unclear what to return. I thought maybe we could ignore this problem.

This highly depends on the use case, and your stated use case doesn't seem to need any sort of error.

> Also, I want to avoid copying large strings.

Go strings are not copied in that way; they are implemented like immutable slices [1].

[1] https://research.swtch.com/godata



