Disclosure: I've got zero C/C++ on my resume. I was asked to diagnose a kernel panic and backport kernel security patches once, but it was very uncomfortable. ("Hey, Terr knows the build system, that's close enough, right?")
That said, perhaps something like disabling the default -fextended-identifiers [0] and enabling the -Wbidi-chars [1] warning would help.
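As a rough sketch of what those two flags catch (file name and identifier are hypothetical; -Wbidi-chars needs GCC 12 or newer, and the 'с' in the identifier below is Cyrillic U+0441, not the ASCII letter):

    // trojan.cc (hypothetical) -- build with:
    //   g++ -fno-extended-identifiers -Wbidi-chars=any -c trojan.cc
    int soсket_ready() {   // the 'с' is U+0441; GCC rejects it as a stray byte
        return 1;
    }

With extended identifiers disabled, GCC fails with a "stray '\321' in program" style error instead of silently accepting the look-alike identifier; -Wbidi-chars separately warns about bidirectional control characters.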
Not yet. I was working on standardizing this for C23, but it was postponed to C26, and it doesn't have much support. MSVC and SDCC liked it; Clang and GCC, not so much.
It seems like there should be a way to catch these types of “bugs” - some form of dynamic analysis tool that extracts the feature detection code snippets and tries to compile them; if they fail for something like a syntax error, flag it as a broken check.
Macros expanding differently on different OSes could complicate things though, as could determining what flags to build the feature-check code with; so perhaps filtering based on the type of error would be best done as part of the build system's own feature-checking functionality.
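Something like this, as a minimal sketch (hypothetical tool; assumes GCC on a POSIX system, and the error-classification heuristic is pure guesswork):

    // check-lint.cc (hypothetical): test-compile a feature-check snippet and
    // guess whether a failure means "broken check" rather than "missing feature".
    #include <array>
    #include <cstdio>
    #include <iostream>
    #include <string>

    // Run a command and capture its combined stdout/stderr (POSIX popen).
    static std::string run(const std::string& cmd) {
        std::string out;
        std::array<char, 4096> buf{};
        FILE* pipe = popen((cmd + " 2>&1").c_str(), "r");
        if (!pipe) return out;
        while (fgets(buf.data(), buf.size(), pipe)) out += buf.data();
        pclose(pipe);
        return out;
    }

    int main(int argc, char** argv) {
        if (argc < 2) { std::cerr << "usage: check-lint <snippet.c>\n"; return 2; }
        // -fsyntax-only parses the snippet without generating code.
        std::string diag = run(std::string("gcc -fsyntax-only ") + argv[1]);
        if (diag.empty()) { std::cout << "snippet compiles: feature present\n"; return 0; }
        // Crude heuristic: parse errors ("expected ...", "stray ...") suggest the
        // check itself is malformed; a missing header or undeclared symbol
        // suggests the feature is genuinely absent on this system.
        bool looks_broken = diag.find("expected") != std::string::npos ||
                            diag.find("stray") != std::string::npos;
        std::cout << (looks_broken ? "suspicious: check may be broken\n"
                                   : "feature appears to be missing\n");
        return looks_broken ? 1 : 0;
    }

Matching on English substrings is obviously fragile; GCC's -fdiagnostics-format=json (GCC 9+) would at least make the classification structured rather than textual.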
I'd prefer if the ecosystem standardized on some dependency management primitives so critical projects aren't expected to invent buggy and insecure hacks ("does this string parse?") in order to accurately add dependencies.
It would be interesting to see what the most common compile feature checks are for, and what alternative ways could be used to make the same information available to a build system. It seems like any solution that requires libraries to be updated to “export” information on the features they provide would have difficulty getting adoption (and would not be backwards compatible with older versions of desired dependencies).
At least for newer C++ standards there seems to be decent support for feature test macros, which could reduce the need for feature checks that compile a snippet of test code to decide whether a feature is available: https://en.cppreference.com/w/cpp/feature_test
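For example, a check that C++20 code can answer entirely in the preprocessor, with no test compilation at all (the HAVE_ macro name is my own):

    #include <version>   // C++20: defines the library feature-test macros

    #if defined(__cpp_lib_span) && __cpp_lib_span >= 202002L
      #include <span>
      #define HAVE_STD_SPAN 1
    #else
      #define HAVE_STD_SPAN 0
    #endif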
Handling the output from several of the most recent GCC and Clang versions would probably cover the majority of cases, with MSVC added in for Windows. If the output isn’t from a recognized compiler or doesn’t match expectations, the only option is falling back to current behavior. Not ideal, but better than the status quo…
That would only work for projects that care only about current compilers, whereas the C world in general has more desire to support niche compilers.
A mitigation here would be to make the result of autoconf only provide instructions for humans to change their build config, instead of doing it for them silently. The latter is an anti-pattern.
FWIW, the approach you propose is how the UI tests for rustc work, with checks for specific annotations on specific lines, but those have the benefit of being tied to a single implementation/version and modified in tandem with the compiler. Unless all compilers could be made to provide reasonable machine-readable output for errors, doing that for this use case isn't workable.
Well, that would be exactly the point of the attacker. GCC errors out, and it does not matter whether that is because the intentionally typoed header does not exist or because non-English characters are not allowed. Errors encountered while compiling test programs go to /dev/null anyway, since they are expected not to compile successfully on some systems; that's the whole point of the test. So no, this option would not have helped.
Probably the code review tools should be hardened as well, to indicate when extended identifiers have been introduced on a line where there weren't any before. That would help catch the replacement of a 'c' with a visually identical Cyrillic one.
Btw, the -fno-extended-identifiers compiler parameter gives an error if UTF-8 identifiers are used in the code:
    <source>:3:11: error: stray '\317' in program
    float <U+03C9><U+2083> = 0.5f;
> Probably the code review tools should be hardened as well, to indicate if extended identifiers had been introduced to a line where there wasn't any.
Maybe in the future more languages/tools will have the concept of per-project character sets, as opposed to trying to wrangle all possible Unicode ambiguity problems.
I suppose then the problem is how to permit exceptions when integrating with some library written in another (human) language.
Or we could just accept English as the lingua franca of computing and not try to support anything other than ASCII in source code (at least not outside string constants). That way we not only eliminate a whole class of possible exploits but also widen the pool of people who can understand the code and spot issues.
The -fno-extended-identifiers option seems to do something in this area, but I don't know if it is sufficient. It may also block some characters which are used in standard English (for some values of standard).
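For instance (assuming GCC; the file name and identifier are hypothetical), even loanwords that arguably count as standard English get rejected:

    // g++ -fno-extended-identifiers -c cafe.cc
    int café_visits = 0;   // 'é' (U+00E9) becomes a stray-byte error with the flag set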
> But it may block some characters which are used in standard English
So what? Source code was fine with ASCII for a long time; this push for Unicode symbols is a recent endeavor and IMO a huge mistake, not just because of the security implications.
I mean, GCC erroring out is how the exploit works here. cmake tries to compile the source code: if it compiles then the feature is available, if it fails then the feature is not available. Forcing that check to fail is exactly what the attacker wants.
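For anyone unfamiliar with the mechanism, here is a sketch of what such a check looks like (modeled on CMake's check_cxx_source_compiles; the getrandom() probe is just an illustrative example):

    // The build system compiles (but never runs) this program in isolation;
    // if compilation and linking succeed, it defines e.g. HAVE_GETRANDOM.
    #include <sys/random.h>

    int main() {
        char buf[16];
        return getrandom(buf, sizeof buf, 0) < 0 ? 1 : 0;
    }

A failure caused by a genuinely missing header and a failure caused by a single smuggled character look identical to the build system, which is why this kind of sabotage is so hard to spot.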