Simplicity of implementation isn't what users need, though; they need performanc...

benhoyt · on Feb 4, 2022

Totally agreed: almost all users (me/GoAWK included) want performance and don't care nearly as much about simplicity under the hood. Simplicity of implementation is of value for educational purposes, but we could easily have a small, simple 3rd party package for that. Go's regexp package is kinda too complex for a simple educational demonstration and too simple to be fast. :-)

I actually tried BurntSushi's https://github.com/BurntSushi/rure-go (bindings to Rust's regex engine) with GoAWK and it made regex handling 4-5x as fast for many regexes, despite the CGo overhead. However, rure-go (and CGo in general) is a bit painful to build, so I'm not going to use that. Maybe I'll create a branch for speed freaks who want it.

I've also thought of using https://gitlab.com/cznic/ccgo to convert Mawk's fast regex engine to Go source and see how that performs. Maybe on the next rainy day...

tedunangst · on Feb 4, 2022

Have you considered writing your own string matcher for the simple cases like fixed patterns?

I got some pretty solid wins just by guarding some regex executions with simple strings.indexof calls.

benhoyt · on Feb 4, 2022

Yeah, that's a good idea, I did consider it, but haven't tried it yet. Do you hook and look at the regex string before it's compiled, or do you hook in at the parsed regex AST level? (eg: regexp/syntax in Go).

tedunangst · on Feb 4, 2022

For something like awk, I think you'd look before compiling, then create your own matcher. With an abstract Matcher interface that regexp implements.

It's C, but openbsd grep does something like this because libc regex is super slow. Look for fastcomp on https://github.com/openbsd/src/blob/master/usr.bin/grep/util... It's not super sophisticated, but enough to beat the full regex engine.

In the go code where I did this, it was a little different, with a static pattern. Something like "(\w+) apple" to find all apple adjectives or whatever, but the regexp wasted so much time matching words before not apples. A quick scan for "apple" to eliminate impossible matches made it faster. This depends more on knowing regex and corpus, so probably less relevant for awk.

burntsushi · on Feb 4, 2022

Go's regexp package even exposes a routine for this: https://pkg.go.dev/regexp#Regexp.LiteralPrefix

It's been a while since I've looked at the source code, but it is almost certainly already doing basic prefix literal optimizations.

The more advanced literal optimizations come into play when every match must start with one of a few possible characters or strings. Then you get into things like Aho-Corasick or fancy SIMD algorithms (Teddy).

tedunangst · on Feb 4, 2022

Oh, regarding rure go, the bugs note about using int is inaccurate. Go spec says max length of any object can be stored in int. You can't build a too big haystack.

burntsushi · on Feb 4, 2022

Ah that's right! Nice catch, thanks.

alex_muscar · on Feb 4, 2022

I think GNU grep does something similar. When it has a fixed patter it uses Boyer-Moore [1].

[1]: https://lists.freebsd.org/pipermail/freebsd-current/2010-Aug...

lamontcg · on Feb 4, 2022

> almost all users (me/GoAWK included) want performance and don't care nearly as much about simplicity under the hood.

that works up until you're at scale and the complexity of the implementation causes hard to understand edge conditions and just bugs from the complexity being beyond what the developers are able to handle.

epolanski · on Feb 4, 2022

Exactly, at some point code is such a mess that further optimization and features are harder to get merged. And the smaller the number of contributors the bigger the problem gets, till the main maintainer calls it quits and there's need for a completely new library.

kaba0 · on Feb 4, 2022

Is it also true when even the essential complexity of the project is quite demanding? Like runtimes, especially with JIT compilers are not easy to begin with.

lamontcg · on Feb 4, 2022

It is even more important then.

tcmart14 · on Feb 4, 2022

I don't exactly agree. Sure end users don't care about implementation directly. But simplicity of implementation does affect them indirectly. Go is already over 10 years old with maybe many more years ahead. All code bases rot. I think the simpler the implementation, the easier it is to cure rot and code smells which hopefully means Go has a long life as the implementation becomes easier to work on over time. While user's maybe don't care, it does impact them.

pcwalton · on Feb 4, 2022

You can make the same argument about CPUs. Modern CPUs are horrendously complex. But nobody is asking to remove, say, out-of-order execution on simplicity grounds, because that would hurt users and cost millions of dollars at scale for no reason other than engineering aesthetics.

It's only in a few areas, like programming languages and operating systems, that we're obsessed with simplicity, and it makes little sense to me.

Beltalowda · on Feb 4, 2022

Branch prediction is a CPU "complexity" that got us in to some amount of trouble.

I don't see simplicity as a "virtue", as such: it's all about the ability to reason about a system. The more complex it is, the harder this becomes. This makes it harder for implementers to work on it, and harder for users to figure out what is going wrong.

On the other hand, complexity often offers useful enhancements such as features or performance. There is some amount of tension here, which is hardly unique to software: you see this in, say, tax systems, regulation/law, design of cars and all sort of things, etc.

tcmart14 · on Feb 4, 2022

I'd say it is probably because we are all worst at writing code than we'd like to imagine. So writing necessarily complex code, especially in a FOSS compiler or system, makes little sense since some day someone else is going to have to step in and learn it.

jchw · on Feb 4, 2022

I agree, but simplicity of implementation is a net positive in a vacuum. When balanced against things like performance, it's definitely worth some trade-offs for better performance... but simplicity of implementation definitely has lots of upsides that users indirectly benefit from. Therefore, I think it's important to at least have a balance.

pdpi · on Feb 4, 2022

Simplicity of implementation also contributes to Go’s fast compile times, which is a different sort of performance. Trying to find a sweet spot between “slow” interpreted languages and “fast” compiled languages with long compile times (e.g. C++ template hell) is a worthy goal.

kaba0 · on Feb 4, 2022

I think multi-profile compilations is much better - have a really fast debug build that hardly does any optimizations, and a release one that can take whatever amount of time but will be really optimized.

Cthulhu_ · on Feb 4, 2022

> Simplicity of implementation isn't what users need, though; they need performance.

It's a tradeoff, in the end. I mean sure, users don't really need to know how things work under the hood, but the people building and maintaining the language do; Go's philosophies on maintainability extend to the language's internals as well.

This is one reason why generics took over ten years to land; they needed to find the middle ground between functionality and sanity within the compiler. Java's generics implementation takes up a huge chunk of the spec and tooling. I don't even want to know about Scala's. It added so much more complexity to the language that I'm not surprised Java stagnated for as long as it did.

kaba0 · on Feb 4, 2022

> Java's generics implementation takes up a huge chunk of the spec and tooling.

Does it? It is a pretty straightforward generic implementation, imo.

rplnt · on Feb 4, 2022

Regexp is one of the things you see attacks on all the time (mostly DOS). Users care a lot about security, and simplicity of implementation correlates with it. It's not something users need, but they do benefit from it.

cfors · on Feb 4, 2022

Performance is relative, so I'm not really sure the point being made here. Sure if your program has regex as the constrained resource this matters, but again it's all relative.

Philip-J-Fry · on Feb 4, 2022

Simplicity is what users indirectly need.

Go is not about providing the fastest implementations out of the box, it's about having a broad toolset in the standard library to solve the problems Go was built for.

Faster (and often more complex) implementations are a maintenance burden for Go contributors. It's far better for a high performance regex library to be a third party package for those that need it.

For those where regex is a limiting factor in performance they'll soon find out why. But for most people fast regex is nothing compared to the overhead of a simple HTTP request.