Totally agreed: almost all users (me/GoAWK included) want performance and don't care nearly as much about simplicity under the hood. Simplicity of implementation is of value for educational purposes, but we could easily have a small, simple 3rd party package for that. Go's regexp package is kinda too complex for a simple educational demonstration and too simple to be fast. :-)
I actually tried BurntSushi's https://github.com/BurntSushi/rure-go (bindings to Rust's regex engine) with GoAWK and it made regex handling 4-5x as fast for many regexes, despite the CGo overhead. However, rure-go (and CGo in general) is a bit painful to build, so I'm not going to use that. Maybe I'll create a branch for speed freaks who want it.
I've also thought of using https://gitlab.com/cznic/ccgo to convert Mawk's fast regex engine to Go source and see how that performs. Maybe on the next rainy day...
Yeah, that's a good idea; I did consider it but haven't tried it yet. Do you hook in and look at the regex string before it's compiled, or do you hook in at the parsed regex AST level (e.g. regexp/syntax in Go)?
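To make the AST option concrete, here's a minimal sketch of poking at the parse tree with Go's regexp/syntax (the pattern is just an illustration):

    package main

    import (
        "fmt"
        "regexp/syntax"
    )

    func main() {
        // Parse (don't compile) the pattern into regexp/syntax's AST.
        re, err := syntax.Parse(`foo(bar|baz)+`, syntax.Perl)
        if err != nil {
            panic(err)
        }
        re = re.Simplify()
        // At this level you can inspect structure, e.g. pull out a
        // leading literal that a matcher could scan for first.
        if re.Op == syntax.OpConcat && re.Sub[0].Op == syntax.OpLiteral {
            fmt.Printf("literal prefix: %q\n", string(re.Sub[0].Rune))
        }
    }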
In the Go code where I did this, it was a little different, with a static pattern: something like "(\w+) apple" to find all apple adjectives or whatever. The regexp wasted a lot of time matching words that weren't followed by "apple", so a quick scan for "apple" to eliminate impossible matches made it faster (rough sketch below). This depends on knowing the regex and the corpus, though, so it's probably less relevant for awk.
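Something like this (a sketch, not the original code; the helper name is made up):

    package main

    import (
        "fmt"
        "regexp"
        "strings"
    )

    var appleRE = regexp.MustCompile(`(\w+) apple`)

    // findAppleAdjectives runs the regex only on lines that can possibly
    // match: a cheap substring scan rejects most of the corpus first.
    func findAppleAdjectives(lines []string) []string {
        var adjectives []string
        for _, line := range lines {
            if !strings.Contains(line, "apple") { // cheap prefilter
                continue
            }
            for _, m := range appleRE.FindAllStringSubmatch(line, -1) {
                adjectives = append(adjectives, m[1])
            }
        }
        return adjectives
    }

    func main() {
        fmt.Println(findAppleAdjectives([]string{
            "a red apple and a green apple",
            "bananas everywhere", // skipped by the prefilter
        }))
    }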
It's been a while since I've looked at the source code, but it is almost certainly already doing basic prefix literal optimizations.
The more advanced literal optimizations come into play when every match must start with one of a few possible characters or strings. Then you get into things like Aho-Corasick or fancy SIMD algorithms (Teddy).
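As a toy illustration of the multi-literal idea (a byte-at-a-time stand-in for what Aho-Corasick or Teddy do properly; the pattern and helper are made up):

    package main

    import (
        "fmt"
        "regexp"
        "strings"
    )

    // Every match of this pattern must start with 'f' or 'b', so we can
    // skip ahead to those bytes and only run the regex at candidate
    // positions, anchored with \A.
    var re = regexp.MustCompile(`\A(foo|bar|baz)\d+`)

    func findAll(haystack string) []string {
        var out []string
        for i := 0; i < len(haystack); {
            j := strings.IndexAny(haystack[i:], "fb") // candidate start bytes
            if j < 0 {
                break
            }
            i += j
            if loc := re.FindStringIndex(haystack[i:]); loc != nil {
                out = append(out, haystack[i:i+loc[1]])
                i += loc[1]
                continue
            }
            i++
        }
        return out
    }

    func main() {
        fmt.Println(findAll("xx foo123 yy bar7 zz fizz")) // [foo123 bar7]
    }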
Oh, regarding rure-go: the note in its bugs section about using int is inaccurate. The Go spec guarantees that the length of any object fits in an int, so you can't build a haystack that's too big.
> almost all users (me/GoAWK included) want performance and don't care nearly as much about simplicity under the hood.
That works up until you're at scale, where the complexity of the implementation causes hard-to-understand edge conditions, and bugs creep in because the complexity is beyond what the developers are able to handle.
Exactly. At some point the code is such a mess that further optimizations and features are harder to get merged. And the smaller the number of contributors, the bigger the problem gets, until the main maintainer calls it quits and there's a need for a completely new library.
Is that also true when even the essential complexity of the project is quite demanding? Runtimes, especially ones with JIT compilers, aren't easy to begin with.