
> 1) it's guaranteed linear time in the length of the input

What’s the multiplicative factor? Does it dominate for “typical” regexes?




For most regexes, backtracking and the Thompson NFA have the same asymptotic complexity, which is why most languages adopted backtracking. The implementors of those languages knew what they were doing, especially when you consider that adopting the Thompson NFA means giving up backreferences. The differences only arise with pathological regexes.

I used to think that backtracking was superior to the Thompson NFA in practice on typical regexes, but modern implementations of the Thompson NFA have shown I was wrong about that: the Thompson NFA can match backtracking's performance. Still, the situation isn't nearly as simple as Russ's article makes it out to be by focusing only on /a?a?a?aaa/, a regex nobody would write. (This is not to deny that people do write pathological regexes from time to time. But how large is the userbase that writes pathological regexes compared to the userbase that uses backreferences?)
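For anyone who wants to see the pathological case concretely, here is a rough, purely illustrative demonstration using Python's backtracking re module (the timing loop is mine and the numbers are machine-dependent): the a?^n a^n family from the article forces the engine through on the order of 2^n failed attempts before it finds the assignment where every a? matches empty, so bumping n by 2 roughly quadruples the runtime.

    import re
    import time

    # Russ's pathological family: n optional a's followed by n literal a's,
    # matched against a string of exactly n a's.  A backtracker explores on the
    # order of 2^n ways to split the input before finding "every a? matches empty".
    for n in range(16, 25, 2):
        pattern = re.compile("a?" * n + "a" * n)
        text = "a" * n
        start = time.perf_counter()
        assert pattern.fullmatch(text) is not None
        print(f"n={n:2d}  {time.perf_counter() - start:.3f}s")

On the same inputs a Thompson NFA (RE2, Rust's regex crate, Go's regexp) stays linear, which is the article's point; the question upthread is whether that guarantee costs anything on the regexes people actually write.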


FWIW, I would say that it's difficult for a Thompson NFA on its own to beat backtracking in non-pathological cases. So I actually think your prior is still mostly correct.

Now a hybrid NFA/DFA (or a "lazy DFA") that does subset construction at search time using a Thompson NFA can definitely beat out a backtracker. A lazy DFA will generally match the speed of a fully compiled DFA, and is about an order of magnitude faster than a Thompson NFA simulation:

    [andrew@frink regex-automata]$ regex-cli find nfa thompson pikevm "@$medium" '(?m)^\w{30}$'
    build pike vm time:  6.584596ms
     create cache time:  278.231µs
           search time:  7.798138892s
                counts:  [3]
    [andrew@frink regex-automata]$ regex-cli find hybrid dfa "@$medium" '(?m)^\w{30}$'
               parse time:  24.256µs
           translate time:  20.202µs
         compile nfa time:  5.966991ms
               nfa memory:  486196
    build hybrid dfa time:  3.137µs
        hybrid dfa memory:  486196
        create cache time:  21.793µs
              search time:  406.746917ms
        cache clear count:  0
                   counts:  [3]
    [andrew@frink regex-automata]$ regex-cli find dfa dense "@$medium" '(?m)^\w{30}$'
                parse time:  22.824µs
            translate time:  15.513µs
          compile nfa time:  6.000195ms
                nfa memory:  486196
    compile dense dfa time:  106.35009ms
          dense dfa memory:  4501024
     dense alphabet length:  117
              dense stride:  128
               search time:  448.568888ms
                    counts:  [3]
    $ ls -lh $medium
    -rw-rw-r-- 1 andrew users 280M Jul 14  2021 /home/andrew/data/benchsuite/subtitles/2018/OpenSubtitles2018.raw.sample.medium.en
(This is using a yet-unreleased tool that is part of ongoing regex development.)
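For readers wondering what "subset construction at search time" looks like, here is a deliberately tiny Python sketch of the idea, nothing like regex-automata's actual implementation: the NFA is hand-built for the regex a(b|c)*d, and DFA states (which are just sets of NFA states) are computed and cached only when the input first forces a given transition.

    # Toy lazy DFA: subset construction performed during the search over a
    # hand-built Thompson-style NFA for the regex a(b|c)*d.
    # NFA: 0 -a-> 1; 1 -eps-> {2,3,4}; 2 -b-> 1; 3 -c-> 1; 4 -d-> 5 (accepting).
    EPS = {1: {2, 3, 4}}
    TRANS = {(0, "a"): {1}, (2, "b"): {1}, (3, "c"): {1}, (4, "d"): {5}}
    START, ACCEPT = 0, {5}

    def eps_closure(states):
        """Expand a set of NFA states through epsilon transitions."""
        stack, seen = list(states), set(states)
        while stack:
            for t in EPS.get(stack.pop(), ()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return frozenset(seen)

    def lazy_dfa_is_match(text):
        """Run the NFA as a DFA, building each DFA transition the first time it is needed."""
        cache = {}                               # (dfa_state, char) -> dfa_state
        state = eps_closure({START})
        for ch in text:
            key = (state, ch)
            if key not in cache:                 # subset construction, on demand
                nxt = set()
                for s in state:
                    nxt |= TRANS.get((s, ch), set())
                cache[key] = eps_closure(nxt)
            state = cache[key]
            if not state:                        # dead state: match is impossible
                return False
        return bool(state & ACCEPT)

    print(lazy_dfa_is_match("abcbd"))   # True
    print(lazy_dfa_is_match("abx"))     # False

Once the relevant DFA states exist, the hot loop is a single table lookup per input character, which is where the order-of-magnitude gap over a straight NFA simulation comes from; a real hybrid DFA also bounds the cache and clears or falls back when it fills up (hence the "cache clear count" line in the output above).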


People will gleefully write that regex if it causes a denial of your service. People would similarly blow up FTP servers with "ls ****" globs.


Arguably the best implementation would be a fast backtracking one with a strong API for controlling the amount of effort the engine puts into matching, because the major concern is the truly pathological cases. For the most part I'm not all that concerned about cases where one or the other implementation is slightly faster on this or that input; I'm worried about the DoS attack. However, determining the correct amount of "effort" to put into a match could be nontrivial. There's the obvious "time out after this amount of clock time", which has its uses for security, but if you want something more sophisticated (including measuring actual effort applied rather than clock time, which only loosely correlates with how much work the match has actually done), it's difficult even to express what you want, let alone implement it sensibly in the current programming environment. (A rough sketch of the effort-budget idea appears after this comment.)

If such a thing exists, I haven't seen it, but that's not to say it doesn't. In general, though, programming languages and libraries have poor-to-nonexistent support for bounding the resources spent on a given internal computation. That's understandable: it's a niche enough need that it's hard to justify much effort, especially since "bound the resources for a whole program", when implemented well, tends to cover most of the use cases.
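To make the effort-budget idea a bit more concrete, here is a hypothetical Python sketch: a toy backtracking matcher (supporting only literals, '.', and '*', nothing like a production engine) that counts its own backtracking steps and gives up once a caller-supplied budget is exhausted, rather than relying on a wall-clock timeout.

    # Hedged sketch of an effort budget: a toy backtracking full-matcher that
    # counts its own steps and aborts once the budget is spent.
    class BudgetExceeded(Exception):
        pass

    def match(pattern, text, budget=100_000):
        """Full-match `text` against `pattern`; raise BudgetExceeded after `budget` steps."""
        steps = 0

        def tick():
            nonlocal steps
            steps += 1
            if steps > budget:
                raise BudgetExceeded(f"gave up after {steps} steps")

        def match_here(p, t):
            tick()
            if not p:
                return not t
            if len(p) > 1 and p[1] == "*":
                i = 0
                while True:                      # try 0, 1, 2, ... copies of p[0]
                    if match_here(p[2:], t[i:]):
                        return True
                    if i < len(t) and p[0] in (".", t[i]):
                        i += 1
                    else:
                        return False
            if t and p[0] in (".", t[0]):
                return match_here(p[1:], t[1:])
            return False

        return match_here(pattern, text)

    print(match("a*b", "aaab"))                  # True, well under budget
    try:
        match("a*a*a*a*a*a*c", "a" * 40)         # pathological: hits the budget
    except BudgetExceeded as e:
        print(e)

The same counter could just as easily be surfaced as "remaining fuel" that the caller passes in and gets back, which is closer to the kind of API being asked for above than a clock-based timeout.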



