I am trying to formalise this with Cosmopolitan Identifiers (https://obua.com/publications/cosmo-id/3/). These identifiers consist of words and symbols. Symbols are normalised based on how they look, so Latin / Cyrillic / Greek symbols that look alike are mapped to the same symbol. Words are normalised differently, so that "Tree" and "tree" map to the same normal form, whereas as symbols "T" and "t" are obviously different. I am not totally happy with the concept yet, but I have implemented a fourth, simpler iteration of it as a TypeScript package: https://www.npmjs.com/package/cosmo-id .

One of the problems is how to distinguish symbols from words. A simple way to do this is to classify something as a symbol if it is just a single character, and as a word otherwise. For example, "α-β" would consist of two symbols, separated by a hyphen, but "αβ" is a word and normalised to "av" based on some convention on how to "latinise" Greek words.
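
To make the single-character rule concrete, here is a rough sketch of just the classification step. It is not the API of the cosmo-id package: the names `Part`, `codePointLength` and `classifyParts` are made up for illustration, and I am assuming that parts are separated only by hyphens and that "a single character" means a single Unicode code point. Normalisation itself (look-alike folding for symbols, latinisation for words) is left out.

```typescript
// Hypothetical types and names, for illustration only.
type Part = { kind: "symbol" | "word"; text: string };

// Count Unicode code points rather than UTF-16 code units, so that a
// character outside the Basic Multilingual Plane still counts as one.
function codePointLength(s: string): number {
  return Array.from(s).length;
}

// Split an identifier at hyphens and classify each part:
// exactly one code point is a symbol, anything longer is a word.
function classifyParts(identifier: string): Part[] {
  return identifier
    .split("-")
    .filter((text) => text.length > 0) // ignore empty parts, e.g. from "a--b"
    .map((text): Part => ({
      kind: codePointLength(text) === 1 ? "symbol" : "word",
      text,
    }));
}

console.log(classifyParts("α-β")); // two parts, both of kind "symbol"
console.log(classifyParts("αβ")); // one part of kind "word"
```

Each part would then go to the normaliser for its kind, which is where "T" and "t" stay distinct as symbols while "Tree" and "tree" collapse as words.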