Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

For any other state machine nuts such as myself, you won't want to miss their paper on tagged DFAs: http://re2c.org/2017_trofimovich_tagged_deterministic_finite... It is a somewhat recent follow on from Laurikari's 2000 paper and Kuklewicz 2007 paper on the same topic. Specifically, tagged DFAs let you do submatch extraction via the DFA itself. Other finite state machine regex libraries either don't do this at all, or fall back to slower engines in the general case (such as RE2). I think the downside of tagged DFAs is that they can be quite large!


Also worth mentioning is Danny Dubé's SILex (http://wiki.call-cc.org/eggref/5/silex), which extracts submatches from a DFA trace by walking the trace in reverse, reconstructing the corresponding NFA trace (or traces) using a side table generated during determinization. The technique is described in these two papers: http://www.iro.umontreal.ca/~feeley/papers/DubeFeeleyACTAINF... http://www.schemeworkshop.org/2006/14-dube.pdf


It's interesting that it's using Laurikari's algorithm- I looked at the 2000 paper years ago and could not get a working algorithm out of it. Eventually I found another way to do this that I use for JOE's regular expression library. So now I'm curious which one is faster...

https://sourceforge.net/p/joe-editor/mercurial/ci/default/tr...

It uses a variant a Thompson's matcher, see:

https://swtch.com/~rsc/regexp/regexp1.html

"However, Thompson-style algorithms can be adapted to track submatch boundaries without giving up efficient performance. The Eighth Edition Unix regexp(3) library implemented such an algorithm as early as 1985, though as explained below, it was not very widely used or even noticed. "

So in JOE's matcher, when simulating the NFA with the multithreaded machine, each succeeding thread will correspond to one of the sub-match possibilities. Augment this machine to keep the ones you want (the ones with the longest matching strings starting from the left usually).


> So in JOE's matcher, when simulating the NFA with the multithreaded machine, each succeeding thread will correspond to one of the sub-match possibilities. Augment this machine to keep the ones you want (the ones with the longest matching strings starting from the left usually).

Yeah, this is the standard way to do it. It's what I meant by "fall back to a slower engine," since simulating the NFA is much slower than a DFA.


Thanks, got something to read over the weekend. :)




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: