Show HN: u8p – A Go Utility for Precise UTF-8 String Truncation

lifthrasiir · on May 4, 2024

I don't quite understand the motivation for this package. Yes, truncating a UTF-8 string to a byte size limit without making it invalid is a valid problem. But if that were the only motivation, the function signature ought to be:

    func LimitPrefix(a string, n int) string

...and it should never have an error condition (which does allocate memory). The name and signature should immediately suggest the following requirements:

1. `LimitPrefix(a, n)` should always return some prefix of `a`, namely `a[:m]` where 0 <= m <= n.

1a. In particular, `m` can be zero if `n` is small enough.

1b. And `m` is expected to be equal to `len(a)` if `len(a) <= n`.

2. `a[:m]` should of course be a valid UTF-8 string if `a` already was.

3. `m` should be maximized under these conditions.

There are some edge cases that we have to fill in as well:

4. The conditions 1 and 1b are a reasonable expectation even for non-UTF-8 inputs. They are also easy to guarantee.

5. The condition 2 can't be efficiently extended for non-UTF-8 inputs and no justifiable use cases exist.

6. However the condition 3 depends on the condition 2 (and 1 of course). Therefore it should be replaced with something concrete, otherwise we risk an unintentional incompatibility.

7. Negative n may arise from a size calculation with a missing bound check, so treating it as zero sounds fair.
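To make conditions 1 through 3 and 7 concrete, here is a deliberately naive sketch that tries every candidate length from `n` downwards. The name `limitPrefixNaive` is hypothetical and not part of u8p, and the sketch assumes that `a` is already valid UTF-8. It can take quadratic time, so it is only useful as an oracle to compare a faster implementation against.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// limitPrefixNaive is a hypothetical reference implementation, not part of u8p.
// It assumes a is valid UTF-8 and tries every candidate length from n downwards.
func limitPrefixNaive(a string, n int) string {
	n = max(n, 0)      // Condition 7: a negative n behaves like zero.
	n = min(n, len(a)) // Conditions 1 and 1b: never go past the end of a.
	for m := n; m > 0; m-- {
		if utf8.ValidString(a[:m]) { // Condition 2
			return a[:m] // Condition 3: the first hit is the largest m.
		}
	}
	return "" // Condition 1a: m may be zero.
}

func main() {
	fmt.Printf("%q\n", limitPrefixNaive("thirty", 5))  // "thirt"
	fmt.Printf("%q\n", limitPrefixNaive("dreißig", 5)) // "drei"
	fmt.Printf("%q\n", limitPrefixNaive("one", 5))     // "one"
	fmt.Printf("%q\n", limitPrefixNaive("€", 2))       // ""
}
```

The built-in `min` and `max` need Go 1.21 or later.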

The current design, in comparison, just finds the last UTF-8 lead byte within `a[:l]`. It doesn't even help with the truncation: both `Find("thirty", 5)` and `Find("dreißig", 5)` return 4, but `"thirty"[:5]` is valid while `"dreißig"[:5]` is invalid. Also `Find("one", 5)` unexpectedly fails! An arbitrary condition of `l <= 3` is even more confusing.
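The difference between the two cuts can be checked with the standard library alone. The sketch below does not call u8p; `lastLeadByte` and `cutIsValid` are hypothetical helpers that only mirror the behavior described above (scan back for a lead byte, then test the byte prefix).

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// lastLeadByte returns the index of the last non-continuation byte in s[:n],
// or -1 if there is none. Hypothetical helper; it assumes 0 <= n <= len(s).
func lastLeadByte(s string, n int) int {
	for i := n - 1; i >= 0; i-- {
		if utf8.RuneStart(s[i]) {
			return i
		}
	}
	return -1
}

// cutIsValid reports whether the byte prefix s[:n] is valid UTF-8.
// Hypothetical helper; it assumes 0 <= n <= len(s).
func cutIsValid(s string, n int) bool {
	return utf8.ValidString(s[:n])
}

func main() {
	fmt.Println(lastLeadByte("thirty", 5), cutIsValid("thirty", 5))   // 4 true
	fmt.Println(lastLeadByte("dreißig", 5), cutIsValid("dreißig", 5)) // 4 false
}
```

Both strings report the same lead byte index, yet only one of the 5-byte cuts is valid, because "ß" occupies bytes 4 and 5 of "dreißig".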

---

Based on the aforementioned conditions, I propose the following instead (warning: never tested):

    func LimitPrefix(a string, n int) string {
        if len(a) <= n { // Condition 1b
            return a
        }

        n = max(n, 0)      // Condition 7
        n = min(n, len(a)) // Condition 1

        bound := n - 4        // Condition 3: Assume that a[n-4:n] has one or more lead bytes.
        bound = max(bound, 0) // Condition 1a

        extent := 0 // Condition 6: 0 means that no lead byte was found.
        i := n - 1
    scan:
        for ; i >= bound; i-- {
            switch a[i] >> 4 {
            case 0, 1, 2, 3, 4, 5, 6, 7:
                extent = 1
                break scan // A bare break would only leave the switch.
            case 8, 9, 0xa, 0xb:
                // Continuation byte
            case 0xc, 0xd:
                extent = 2
                break scan
            case 0xe:
                extent = 3
                break scan
            case 0xf:
                extent = 4
                break scan
            }
        }

        if extent == 0 { // Condition 6: Do not truncate if no lead byte is found.
            return a[:n]
        }
        if i+extent <= n { // Condition 2: The last character fits entirely.
            return a[:i+extent]
        }
        return a[:i]
    }
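For comparison, a shorter variant can lean on `utf8.RuneStart` from the standard library: look at the first byte that would be dropped and back up until it starts a character. This is a different technique from the byte-class switch above, not a rewrite of it. The name `limitPrefixRuneStart` is hypothetical, and the sketch assumes valid UTF-8 input; for other inputs it still returns a prefix of at most `n` bytes, but it may back up further than 4 bytes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// limitPrefixRuneStart is a hypothetical variant, not the code proposed above.
// It assumes a is valid UTF-8.
func limitPrefixRuneStart(a string, n int) string {
	if n >= len(a) {
		return a
	}
	n = max(n, 0) // A negative n behaves like zero.
	// a[n] is the first byte that would be dropped. If it begins a character,
	// a[:n] already ends on a boundary; otherwise back up to the start of the
	// character that straddles the cut.
	for n > 0 && !utf8.RuneStart(a[n]) {
		n--
	}
	return a[:n]
}

func main() {
	fmt.Printf("%q\n", limitPrefixRuneStart("thirty", 5))  // "thirt"
	fmt.Printf("%q\n", limitPrefixRuneStart("dreißig", 5)) // "drei"
	fmt.Printf("%q\n", limitPrefixRuneStart("one", 5))     // "one"
}
```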

catatsuy · on May 4, 2024

Thank you for your reply.

I was thinking about whether to return an error. If we can’t find a UTF-8 start byte in the nearby 4 bytes, it’s unclear what to return. I thought maybe we could ignore this problem.

I don’t return the string itself because I don’t know if users want the start or the end of the string. Also, I want to avoid copying large strings. It’s up to the users how they use this function.

Since no one is using this package yet, we might consider changing the interface.

lifthrasiir · on May 5, 2024

> I was thinking about whether to return an error. If we can’t find a UTF-8 start byte in the nearby 4 bytes, it’s unclear what to return. I thought maybe we could ignore this problem.

This highly depends on the use case, and your stated use case doesn't seem to need any sort of error.

> Also, I want to avoid copying large strings.

Slicing a Go string does not copy its bytes; strings are implemented like immutable slices, so a substring shares the original's data [1].
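A quick way to see this is to compare data pointers before and after slicing. This is only a sketch: `sameStart` is a hypothetical helper, `unsafe.StringData` needs Go 1.20 or later, and the result reflects how the standard gc toolchain behaves rather than something the language specification spells out.

```go
package main

import (
	"fmt"
	"strings"
	"unsafe"
)

// sameStart reports whether two non-empty strings begin at the same address.
// Hypothetical helper for illustration only.
func sameStart(s, t string) bool {
	return unsafe.StringData(s) == unsafe.StringData(t)
}

func main() {
	big := strings.Repeat("ß", 1<<20) // 2 MiB of UTF-8 text
	prefix := big[:4]                 // a new string header, not a new buffer
	// With the gc toolchain this prints: 4 true
	fmt.Println(len(prefix), sameStart(big, prefix))
}
```

Returning `a[:m]` from a truncation function therefore costs a pointer and a length, regardless of how large `a` is.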

[1] https://research.swtch.com/godata