Show HN: Tiny public domain regex library with UTF-8 support (github.com/torstenvl)
15 points by torstenvl on Oct 2, 2022 | 4 comments



Pretty simple, low-impact code here. I wrote it to address three shortcomings of the most popular alternative.

First, the version I wrote compares Unicode code points decoded from UTF-8, not bytes. It also uses locale-aware standard library functions, so classes like \w should work with non-English text, as long as you set the locale in your program.
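To illustrate the difference between comparing bytes and comparing code points, here is a minimal sketch in C. The names `utf8_decode` and `is_word_char` are made up for this example and are not the library's API. It also assumes a platform where `wchar_t` holds Unicode code points (common on Linux and macOS, not on Windows).

```c
#include <stddef.h>
#include <stdint.h>
#include <wctype.h>

/* Hypothetical helper (not the library's API): decode one UTF-8 sequence
   starting at s into *cp and return the number of bytes consumed.
   Returns 0 on a malformed lead or continuation byte. Overlong forms and
   surrogates are not rejected, to keep the sketch short. */
size_t utf8_decode(const char *s, uint32_t *cp)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t len;
    uint32_t c;

    if (p[0] < 0x80)                { *cp = p[0]; return 1; }
    else if ((p[0] & 0xE0) == 0xC0) { c = p[0] & 0x1F; len = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { c = p[0] & 0x0F; len = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { c = p[0] & 0x07; len = 4; }
    else return 0;

    for (size_t i = 1; i < len; i++) {
        /* A NUL terminator fails this check too, so we never read past it. */
        if ((p[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (p[i] & 0x3F);
    }
    *cp = c;
    return len;
}

/* Hypothetical word-character test: classify the whole decoded code point,
   not its first byte. Assumes wchar_t values are Unicode code points. */
int is_word_char(const char *s)
{
    uint32_t cp;
    if (utf8_decode(s, &cp) == 0) return 0;
    return iswalnum((wint_t)cp) != 0;
}
```

For non-ASCII text to classify correctly, the calling program would first need something like `setlocale(LC_ALL, "")` from `<locale.h>`; without it, only the "C" locale rules apply.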

Second, my library works on the regex as written, with no compile step. The popular alternative spends a lot of time compiling regexes even though the compiled versions can't easily be reused.
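For readers unfamiliar with matching without a compile step, the classic example is Rob Pike's recursive matcher from The Practice of Programming, which walks the pattern string directly. The sketch below is that well-known technique, not this library's code. It is byte-based for brevity, supports only literals, `.`, `*`, `^` and `$`, and uses the made-up names `sketch_match`, `match_here` and `match_star`.

```c
/* Forward declarations for the two mutually recursive helpers. */
static int match_here(const char *re, const char *text);
static int match_star(int c, const char *re, const char *text);

/* Search for re anywhere in text; returns 1 on match, 0 otherwise. */
int sketch_match(const char *re, const char *text)
{
    if (re[0] == '^')
        return match_here(re + 1, text);
    do {  /* must try even when text is empty */
        if (match_here(re, text))
            return 1;
    } while (*text++ != '\0');
    return 0;
}

/* Match re at the beginning of text. */
static int match_here(const char *re, const char *text)
{
    if (re[0] == '\0')
        return 1;
    if (re[1] == '*')
        return match_star(re[0], re + 2, text);
    if (re[0] == '$' && re[1] == '\0')
        return *text == '\0';
    if (*text != '\0' && (re[0] == '.' || re[0] == *text))
        return match_here(re + 1, text + 1);
    return 0;
}

/* Match c*re at the beginning of text (zero or more occurrences of c). */
static int match_star(int c, const char *re, const char *text)
{
    do {
        if (match_here(re, text))
            return 1;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return 0;
}
```

The pattern is re-read on every call, so there is no setup cost and no compiled object to manage. The trade-off is backtracking on patterns with many `*` operators.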

Third, my library intentionally doesn't support regex character sets/ranges. For my use case, it's preferable to define another metacharacter class than to waste compute cycles on more complicated syntactic sugar.
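As a rough illustration of why adding a metacharacter class can be cheaper than parsing sets and ranges, a class can be a single `switch` case over an already-decoded code point. The function `class_match` and the `\h` hex-digit class below are invented for this sketch and are not taken from the library.

```c
#include <wctype.h>

/* Hypothetical dispatcher: meta is the letter after the backslash and
   cp is an already-decoded code point. Returns 1 on match, 0 on no
   match, and -1 for an unknown class letter. */
int class_match(char meta, wint_t cp)
{
    switch (meta) {
    case 'd': return iswdigit(cp) != 0;
    case 'w': return iswalnum(cp) != 0;
    case 's': return iswspace(cp) != 0;
    case 'h': return iswxdigit(cp) != 0;  /* invented class, stands in for [0-9A-Fa-f] */
    default:  return -1;
    }
}
```

Under this design, supporting a new class means adding one `case` line rather than a parser for bracket expressions.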

If these things matter to anyone else, maybe you'll find my tiny project helpful.


Good job! Simple and pretty elegant.

Just curious, is there a reason to use the wide-character versions of isdigit, isalnum, etc. but pass them a single UTF-8 byte? `iswdigit(cdpt(txt))` would seem more logical to me than `iswdigit(txt[0])`.


Thanks! And good catch, it's fixed now.

I found something interesting while trying to add a test case. I had to come up with something that would fail iswalnum() in ranges above 128, and I knew Japanese has its own punctuation marks, so I tried that. However, the isw* functions are apparently broken for East Asian languages and Arabic in macOS's system library (even with the locale set to, e.g., ja_JP.UTF-8). There are characters for which iswalpha, iswcntrl, iswdigit, iswpunct, and iswspace are all false, in violation of the standard.
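For anyone who wants to check their own platform, a probe along these lines would find such characters. This is only a sketch: `is_unclassified` and `count_unclassified` are made-up names, and it assumes `wchar_t` holds Unicode code points. Results depend entirely on the OS and locale, so no expected output is given.

```c
#include <wctype.h>

/* Nonzero when none of the five classes named above claims c. */
int is_unclassified(wint_t c)
{
    return !iswalpha(c) && !iswcntrl(c) && !iswdigit(c) &&
           !iswpunct(c) && !iswspace(c);
}

/* Count unclassified code points in the inclusive range [lo, hi]. */
unsigned long count_unclassified(unsigned long lo, unsigned long hi)
{
    unsigned long n = 0;
    for (unsigned long c = lo; c <= hi; c++)
        if (is_unclassified((wint_t)c))
            n++;
    return n;
}
```

A caller might run `setlocale(LC_ALL, "ja_JP.UTF-8")` and then `count_unclassified(0x3000, 0x303F)` to scan the CJK Symbols and Punctuation block. Unassigned code points will show up in the count too, so a nonzero result needs checking against the Unicode tables before calling it a bug.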

For the purposes of testing my library (rather than underlying OS locale support), I ended up going with Armenian, but it's a disappointing issue nonetheless.

I may have to roll my own at some point.


I'm not sure that complete reliability in this matter can be achieved without dragging ICU into a project, which is a mildly depressing thought.



