
How would it have helped? You either design a DFA by hand or use one compiled from a regular language.


You have answered your own question: your algorithm looks at each byte twice, not once.

It's even more obvious in the UTF-8 case, where the classic implementation first looks at 1-4 bytes to parse a character and only then checks whether it's a space.
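
Roughly the shape of that classic loop, assuming the usual mbrtowc/iswspace pairing (a simplified sketch, not any particular wc's actual source):

    #include <string.h>
    #include <wchar.h>
    #include <wctype.h>

    /* "Decode first, classify second": every byte is consumed once by
       mbrtowc, and the resulting character is then inspected again by
       iswspace. Assumes the caller has done setlocale(LC_CTYPE, "") so
       multibyte decoding works. */
    long count_words_classic(const char *buf, size_t len) {
        mbstate_t st;
        memset(&st, 0, sizeof st);
        long words = 0;
        int in_word = 0;
        size_t i = 0;
        while (i < len) {
            wchar_t wc;
            size_t n = mbrtowc(&wc, buf + i, len - i, &st); /* step 1: decode 1-4 bytes */
            if (n == (size_t)-1 || n == (size_t)-2)
                break;                                      /* invalid or truncated sequence */
            if (n == 0)
                n = 1;                                      /* embedded NUL consumes one byte */
            if (iswspace((wint_t)wc)) {                     /* step 2: classify the character */
                in_word = 0;
            } else if (!in_word) {
                in_word = 1;
                words++;
            }
            i += n;
        }
        return words;
    }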


My understanding of the article's use of "scalable" was "fixed overhead more or less regardless of the complexity of the state machine and input", not "fastest implementation available".


Because classic wc doesn't look at every byte just once, but multiple times.

It's especially obvious in the Unicode case, where it first consumes 1-4 bytes to decode a character and then passes that character to another function to check whether it's whitespace.

But even with the naive ASCII approach, if you don't hand-roll a state machine you are checking multiple conditions on each byte (is it a space, am I leaving a word, etc.).

Using a DFA gives fixed compute per byte.
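
For contrast, a minimal sketch of what a hand-rolled DFA could look like for the plain ASCII case; the state/transition layout here is illustrative, not taken from the article:

    #include <stddef.h>
    #include <string.h>

    /* Each byte is mapped to a class once, then the class drives fixed
       table lookups, so the per-byte work stays constant no matter how
       many conditions the automaton is tracking. */
    enum { OUTSIDE = 0, INSIDE = 1 };

    long count_words_dfa(const unsigned char *buf, size_t len) {
        /* byte -> class table (0 = word byte, 1 = whitespace), built once */
        static unsigned char cls[256];
        static int cls_ready = 0;
        if (!cls_ready) {
            memset(cls, 0, sizeof cls);
            for (const char *p = " \t\n\v\f\r"; *p; p++)
                cls[(unsigned char)*p] = 1;
            cls_ready = 1;
        }

        /* transition and output tables indexed by [state][class] */
        static const unsigned char next[2][2] = {
            { INSIDE, OUTSIDE },  /* OUTSIDE: word byte enters a word  */
            { INSIDE, OUTSIDE },  /* INSIDE:  whitespace leaves it     */
        };
        static const unsigned char emit[2][2] = {
            { 1, 0 },             /* the OUTSIDE -> INSIDE edge counts a word */
            { 0, 0 },
        };

        unsigned state = OUTSIDE;
        long words = 0;
        for (size_t i = 0; i < len; i++) {
            unsigned c = cls[buf[i]];
            words += emit[state][c];
            state  = next[state][c];
        }
        return words;
    }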


Those 1-4 bytes are sitting in a register the entire time and thus basically free to read as often as you want, though.

An actual sampled profile comparing the two approaches would be interesting. Naively, it seems like the difference is just faster UTF-8 handling and has nothing to do with being a state machine as such.


According to the authors it's also faster on files full of 'x' or ' ', so there must be more to it than just better Unicode support.


Even something as simple as wc calling a libc function for UTF-8 parsing that doesn't get inlined would destroy its performance relative to anything optimized.

Personally, I'd expect SIMD to win over all of these. wc sounds like the kind of challenge that's very easy to partition and process in chunks, though UTF-8 might ruin that.


> very easy to partition and process in chunks

Which counters to increment at each byte depends on the previous bytes, though. You could probably succeed using overlapping chunks, but I wouldn't call it very easy.
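
As a rough sketch of the bookkeeping (names like chunk_result and count_chunk are made up, and this only handles ASCII): instead of overlapping, each chunk can be counted independently, and a word that straddles a boundary gets subtracted once at merge time, since both sides will have counted it. The independent per-chunk pass is the part that could go to SIMD or threads.

    #include <stddef.h>

    struct chunk_result {
        long words;
        int starts_in_word;   /* first byte is a non-space byte */
        int ends_in_word;     /* last byte is a non-space byte  */
    };

    static int is_space(unsigned char c) {
        return c == ' ' || c == '\t' || c == '\n' ||
               c == '\v' || c == '\f' || c == '\r';
    }

    /* Count one chunk in isolation; a word cut off at either edge is
       still counted here and reconciled by the caller. */
    static struct chunk_result count_chunk(const unsigned char *p, size_t n) {
        struct chunk_result r = {0, 0, 0};
        if (n == 0) return r;
        r.starts_in_word = !is_space(p[0]);
        r.ends_in_word   = !is_space(p[n - 1]);
        int in_word = 0;
        for (size_t i = 0; i < n; i++) {
            if (is_space(p[i])) in_word = 0;
            else if (!in_word) { in_word = 1; r.words++; }
        }
        return r;
    }

    long count_words_chunked(const unsigned char *buf, size_t len, size_t chunk) {
        long total = 0;
        int prev_ends_in_word = 0;
        for (size_t off = 0; off < len; off += chunk) {
            size_t n = (len - off < chunk) ? len - off : chunk;
            struct chunk_result r = count_chunk(buf + off, n);
            total += r.words;
            /* a word split across the boundary was counted on both sides */
            if (prev_ends_in_word && r.starts_in_word)
                total--;
            prev_ends_in_word = r.ends_in_word;
        }
        return total;
    }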


That sort of "find the correct chunk boundary" logic was very common in all the MapReduce processing that was done back when people still used the phrase "big data".

