For any other state machine nuts such as myself, you won't want to miss their pa...

panic · on March 1, 2019

Also worth mentioning is Danny Dubé's SILex (http://wiki.call-cc.org/eggref/5/silex), which extracts submatches from a DFA trace by walking the trace in reverse, reconstructing the corresponding NFA trace (or traces) using a side table generated during determinization. The technique is described in these two papers: http://www.iro.umontreal.ca/~feeley/papers/DubeFeeleyACTAINF... http://www.schemeworkshop.org/2006/14-dube.pdf

jhallenworld · on March 1, 2019

It's interesting that it's using Laurikari's algorithm- I looked at the 2000 paper years ago and could not get a working algorithm out of it. Eventually I found another way to do this that I use for JOE's regular expression library. So now I'm curious which one is faster...

https://sourceforge.net/p/joe-editor/mercurial/ci/default/tr...

It uses a variant a Thompson's matcher, see:

https://swtch.com/~rsc/regexp/regexp1.html

"However, Thompson-style algorithms can be adapted to track submatch boundaries without giving up efficient performance. The Eighth Edition Unix regexp(3) library implemented such an algorithm as early as 1985, though as explained below, it was not very widely used or even noticed. "

So in JOE's matcher, when simulating the NFA with the multithreaded machine, each succeeding thread will correspond to one of the sub-match possibilities. Augment this machine to keep the ones you want (the ones with the longest matching strings starting from the left usually).

burntsushi · on March 2, 2019

> So in JOE's matcher, when simulating the NFA with the multithreaded machine, each succeeding thread will correspond to one of the sub-match possibilities. Augment this machine to keep the ones you want (the ones with the longest matching strings starting from the left usually).

Yeah, this is the standard way to do it. It's what I meant by "fall back to a slower engine," since simulating the NFA is much slower than a DFA.

pjmlp · on March 1, 2019

Thanks, got something to read over the weekend. :)