I imagine this works for regex engines like RE2 that don't support backtracking,...

burntsushi · on March 10, 2021

Hi again. :-)

In practice, production grade general purpose regex engines are never implemented with formal DFAs in all cases. They take too long to build and use way too much memory, especially when Unicode is involved. For example:

    $ regex-cli debug dfa dense '(?-u)\w{10}' -qm
                parse time:  45.418µs
            translate time:  14.626µs
          compile nfa time:  373.919µs
    compile dense dfa time:  564.7µs
          dense dfa memory:  5436
     dense alphabet length:  23
              dense stride:  32

    $ regex-cli debug dfa dense '\w{10}' -qm
                parse time:  41.344µs
            translate time:  27.072µs
          compile nfa time:  6.360692ms
    compile dense dfa time:  1.038092145s
          dense dfa memory:  2998332
     dense alphabet length:  115
              dense stride:  128

And this is including oodles of tricks for shrink the DFA's size. (Note though that the above examples minimize the DFA, which adds significant time. Without minimization, the latter is about an order of magnitude faster.)

So even this simple regex is 3MB in size.

Finally, DFAs can grow exponentially in the size of the regex, which means you can't really use them with untrusted regexes.

In practice, regex engines like RE2 use a hybrid NFA/DFA, or "lazy DFA." The DFA is computed at search time and cached. If too much cache is used, then the regex engine falls back to an NFA simulation.

Beating the performance of a DFA without going into more exotic things like vectorized algorithms or more specialized regex engines (bit-parallel NFAs) is pretty tricky.

And also, do not repeat the same mistake that the OP seems to make: conflating what DFA and NFA mean. A lot of people like to call things like PCRE an "NFA" implementation, but it's not, because meaningfully more powerful than what an NFA can do. General purpose regex engines tend to be split between ones that are based on finite automata and ones that are based on backtracking.

emidoots · on March 11, 2021

Thanks for the super detailed response again, Andrew! I have a ton to learn from you about how production grade regex engines work honestly :) Do you have any tips for learning as much as you have aside from mulling over papers and existing implementations?

Also, by "do not repeat the same mistake that the OP seems to make: conflating what DFA and NFA mean." do you mean in the article? I wrote that mostly with the intent to say _"this ain't how things are normally done, DFA/NFA exist and are used"_ but admittedly don't have more than a surface-level understanding of them. If you have tips on how to clarify that wording so I don't confuse others, at the very least, I'd much appreciate it! :)

burntsushi · on March 11, 2021

Yes, I mean in the article. Perhaps I was too brief, but the TL;DR is to use "finite automata based regex engines" and "backtracking based regex engines."

Basically, there's a long tradition of calling the latter "NFA engines" and the former "DFA engines." But doing so is inaccurate and confusing. Especially since "DFA engines" also include things called "NFA engines" that are different from the former "NFA engines" as used when describing backtracking engines.

If that doesn't make sense, then just remember this: DFAs and NFAs have equivalent expressive power. So if you're describing a regex engine that gives you thinks like backreferences, recursion and all sorts of other bells and whistles, then it can't be an NFA. It is almost certainly implemented with backtracking.

> Do you have any tips for learning as much as you have aside from mulling over papers and existing implementations?

I started with Russ Cox's article series and read the RE2 source code. Then I built my own.

I'd also check out anything related to Hyperscan. It's a deep rabbit hole to tumble down though.

emidoots · on March 11, 2021

Understood, this helped me a lot, thank you. I definitely had confusion around DFA/NFA being a _type of regex engine implementation_ and see now why that is incorrect.

I updated the article[0] so as to hopefully not confuse others, really appreciate you taking the time to respond to me again, and sorry for not understanding your earlier comment :)

[0] https://github.com/hexops/devlog/commit/0159b510556c2943cbca...

burntsushi · on March 11, 2021

No worries. Feel free to email me if you want to geek out about regexes. :-) My email is on my web site, linked in my profile.