Has anyone built a production grade regex engine using derivatives? I don't thin...

eru · on March 14, 2021

I've made some attempts, but nothing production grade.

About large character classes: how are those harder than in other approaches? If you build any FSM you have to deal with those, don't you?

One way to handle them, which works well when the characters in your classes are mostly next to each other in Unicode, is to express your state transition function as an 'interval map'.

What I mean is that eg a hash table or an array lets you build representations of mathematical functions that map individual points to values.

You want something that can model a step function instead, ie one value per contiguous range of inputs.

You can either roll your own, or write something around a sorted-map data structure.

Eg in C++ you'd base the whole thing around https://en.cppreference.com/w/cpp/container/map/upper_bound (or https://hackage.haskell.org/package/containers-0.4.0.0/docs/... in Haskell.)

The keys in your sorted map are the 'edges' of your character classes (eg where they start and end).
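To make the interval map idea concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the class name `IntervalMap` and its methods are made up for this example, and `bisect.bisect_right` on a sorted list of edges plays the role that `upper_bound` on a sorted map would play in C++.

```python
import bisect


class IntervalMap:
    """Step function from code points to values (hypothetical helper).

    Invariant: _starts is sorted, and _values[i] applies to every code
    point from _starts[i] up to (but not including) _starts[i + 1].
    """

    def __init__(self, default=None):
        self._starts = [0]          # the 'edges' of the step function
        self._values = [default]

    def get(self, cp):
        # Find the last edge that is <= cp, like upper_bound followed by
        # stepping back one entry.
        i = bisect.bisect_right(self._starts, cp) - 1
        return self._values[i]

    def set_range(self, lo, hi, value):
        """Map every code point in [lo, hi] (inclusive) to value."""
        after = self.get(hi + 1)    # value that resumes past the range
        i = bisect.bisect_left(self._starts, lo)
        j = bisect.bisect_right(self._starts, hi + 1)
        # Replace all edges inside [lo, hi + 1] with exactly two edges.
        self._starts[i:j] = [lo, hi + 1]
        self._values[i:j] = [value, after]


# Transitions out of one state: two character classes, two edges each.
delta = IntervalMap(default="dead")
delta.set_range(ord("a"), ord("z"), "state_word")
delta.set_range(ord("0"), ord("9"), "state_digit")
```

A lookup is one binary search over the edges, and the size of the table depends on the number of ranges rather than on the number of characters in the class. This sketch does not merge neighbouring ranges that happen to carry the same value.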

Does that make sense? Or am I misunderstanding the problem?

> I personally always get stuck at how to handle things like captures [...]

Let me think about that one for a while. Some Googling suggests https://github.com/elfsternberg/barre but they don't seem to support intersection, complement or stripping prefixes.

What do you want your capture groups to do? Do you eg just want to return pointers to where you captured them (if any)?

I have an inkling that something inspired by https://en.wikipedia.org/wiki/Viterbi_algorithm might work.

https://github.com/google/redgrep/blob/main/parser.yy mentions something about capture, but not sure if that has anything to do with capture groups.

burntsushi · on March 14, 2021

> About large character classes: how are those harder than in other approaches? If you build any FSM you have to deal with those, don't you?

I mean specifically in the context of derivatives. IIRC, the formulation used in Turon's paper wasn't amenable to large classes.

Yes, interval sets work great: https://github.com/rust-lang/regex/blob/master/regex-syntax/...

This is why I asked if a production grade regex engine based on derivatives exists. Because I want to see how the engineering is actually done.

> What do you want your capture groups to do? Do you eg just want to return pointers to where you captured them (if any)?

Look at any production grade regex engine. It will implement captures. It should do what they do.

> I have an inkling that something inspired by https://en.wikipedia.org/wiki/Viterbi_algorithm might work.

Nothing about Viterbi is fast, in my experience implementing it in the past. :-)

> https://github.com/google/redgrep/blob/main/parser.yy mentions something about capture, but not sure if that has anything to do with capture groups.

It looks like it does, and in particular see: https://github.com/google/redgrep/blob/6b9d5b02753c4ece17e2f...

But that's only for parsing the regex itself. I don't see any match APIs that utilize them. I wouldn't expect to either, because you can't implement capturing inside a DFA. (You need a tagged DFA, which is a strictly more powerful thing. But in that case, the DFA size explodes. See the re2c project and their associated papers.)

If I'm remembering correctly, I think the problem with derivatives is that they jump straight to a DFA. You can't do that in a production regex engine because a DFA's worst case size is exponential in the size of the regex.
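As a rough illustration of that blowup (a toy sketch, not taken from any production engine or from the papers mentioned here), the Python below treats each distinct Brzozowski derivative as a DFA state and counts how many are reachable. All names (`alt`, `cat`, `star`, `deriv`, `dfa_states`, `blowup`) are made up for this example, and the only simplification applied is deduplicating and flattening alternations.

```python
EMPTY = ("empty",)   # matches nothing
EPS = ("eps",)       # matches only the empty string


def char(c):
    return ("char", c)


def alt(r, s):
    # Flatten and dedupe so equivalent alternations compare equal.
    items = set()
    for x in (r, s):
        if x == EMPTY:
            continue
        items.update(x[1] if x[0] == "alt" else [x])
    if not items:
        return EMPTY
    return next(iter(items)) if len(items) == 1 else ("alt", frozenset(items))


def cat(r, s):
    if EMPTY in (r, s):
        return EMPTY
    if r == EPS:
        return s
    return r if s == EPS else ("cat", r, s)


def star(r):
    return ("star", r)


def nullable(r):
    tag = r[0]
    if tag == "alt":
        return any(nullable(x) for x in r[1])
    if tag == "cat":
        return nullable(r[1]) and nullable(r[2])
    return tag in ("eps", "star")


def deriv(r, c):
    tag = r[0]
    if tag == "char":
        return EPS if r[1] == c else EMPTY
    if tag == "alt":
        out = EMPTY
        for x in r[1]:
            out = alt(out, deriv(x, c))
        return out
    if tag == "star":
        return cat(deriv(r[1], c), r)
    if tag == "cat":
        d = cat(deriv(r[1], c), r[2])
        return alt(d, deriv(r[2], c)) if nullable(r[1]) else d
    return EMPTY  # "empty" and "eps" have no non-empty continuation


def matches(r, text):
    for c in text:
        r = deriv(r, c)
    return nullable(r)


def dfa_states(r, alphabet):
    # Every distinct derivative reachable from r is one DFA state.
    seen, todo = {r}, [r]
    while todo:
        state = todo.pop()
        for c in alphabet:
            d = deriv(state, c)
            if d not in seen:
                seen.add(d)
                todo.append(d)
    return seen


def blowup(n):
    # (a|b)* a (a|b){n}: the classic exponential-DFA example.
    ab = alt(char("a"), char("b"))
    tail = EPS
    for _ in range(n):
        tail = cat(ab, tail)
    return cat(star(ab), cat(char("a"), tail))
```

With this sketch, `len(dfa_states(blowup(n), "ab"))` comes out as 2^(n+1), so each extra `(a|b)` in the regex doubles the number of states. Matching one input with `matches` only ever visits the states that input reaches, which is the lazy alternative to building the whole table up front.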

eru · on March 15, 2021

> If I'm remembering correctly, I think the problem with derivatives is that they jump straight to a DFA. You can't do that in a production regex engine because a DFA's worst case size is exponential in the size of the regex.

Oh, that's interesting! Because I actually worked on some approaches that don't jump directly to the DFA.

The problem is that the notion of (extended) NFA you need gets quite a bit more complicated once you support intersection and complement.

burntsushi · on March 15, 2021

Indeed. And in the regex crate and RE2, for example, captures are only implemented in the "NFA" engines (specifically, the PikeVM and the bounded backtracker). So if you support captures, then those engines have to be able to support everything.

tsegratis · on March 16, 2021

Can I ask: what about a ZDD?

They seem similar to closed languages with a disjunct and conjunct.

Though I don't think I will, I was considering adding a ZDD or BDD to a PEG, to provide that conjunct.

Ofc, a SAT solver can represent a regex with conjuncts, but is this a good way of going about it, particularly with unbounded strings?

Would love to hear your thoughts on that