Has anyone built a production grade regex engine using derivatives? I don't think I've seen one. I personally always get stuck at how to handle things like captures or the very large Unicode character classes. Or hacking in look-around. (It's been a while since I've given this thought though, so I'm not sure I'll be able to elaborate much.)
I've made some attempts, but nothing production grade.
About large character classes: how are those harder than in approaches? If you build any FSM you have to deal with those, don't you?
One way to handle them that works well when the characters in your classes are mostly next to each other unicode, is to express your state transition function as an 'interval map'
What I mean is that eg a hash table or an array lets you build representations of mathematical functions that map points to values.
You want something that can model a step function.
You can either roll your own, or write something around a sorted-map data structure.
The keys in your sorted map are the 'edges' of your characters classes (eg where they start and end).
Does that make sense? Or am I misunderstanding the problem?
> I personally always get stuck at how to handle things like captures [...]
Let me think about that one for a while. Some Googling suggests https://github.com/elfsternberg/barre but they don't seem to support intersection, complement or stripping prefixes.
What do you want your capture groups to do? Do you eg just want to return pointers to where you captured them (if any)?
But that's only for parsing the regex itself. I don't see any match APIs that utilize them. I wouldn't expect to either, because you can't implement capturing inside a DFA. (You need a tagged DFA, which is a strictly more powerful thing. But in that case, the DFA size explodes. See the re2c project and their associated papers.)
If I'm remembering correctly, I think the problem with derivatives is that they jump straight to a DFA. You can't do that in a production regex engine because a DFA's worst case size is exponential in the size of the regex.
> If I'm remembering correctly, I think the problem with derivatives is that they jump straight to a DFA. You can't do that in a production regex engine because a DFA's worst case size is exponential in the size of the regex.
Oh, that's interesting! Because I actually worked on some approaches that don't jump directly to the DFA.
The problem is the notion of (extended) NFA you need is quite a bit more complicated when you support intersection and complement.
Indeed. And in the regex crate and RE2, for example, captures are only implemented in the "NFA" engines (specifically, the PikeVM and the bounded backtracker). So if you support captures, then those engines have to be able to support everything.