
Why are so many American models multilingual, supporting hundreds of languages not commonly spoken in the United States?

Could it be that being multilingual draws on a larger pool of human knowledge on the technical side, compared to training on just one or two languages? And on the business side, supporting more languages means a larger TAM (total addressable market). Using English-language datasets for training LLMs is the default, not the other way around as you insinuate.




That's clearly a different question. It'd be possible for these models to be Mandarin-first while still supporting other languages, just as American models are English-first while doing the same, but that's not what's happening.


> That's clearly a different question. It'd be possible for these models to be Mandarin-first while still supporting other languages

What would a hypothetical "Mandarin-first" model look like to you?

I challenge the notion that the current models are "English-first" - that is an unsubstantiated opinion, not a fact. I bet, dollars to donuts, these models are SoTA in Mandarin as well. When framed that way, questions like "Why are they marketed as English-speaking models outside of China?" or "Why are they really good at English?" are simply not interesting - they have obvious answers.


> What would a hypothetical "Mandarin-first" model look like to you?

Given a language-agnostic prompt like "12 + 89", any explanatory text it outputs could be expected to be in Mandarin most of the time.

According to this test, Xiaomi's MiMo-7B-RL is an English-first model.
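
Roughly, that test can be reproduced like this - a minimal sketch with Hugging Face transformers, where the Hub id "XiaomiMiMo/MiMo-7B-RL" and the trust_remote_code flag are assumptions rather than a description of the exact setup:

    # Sketch of the "language-agnostic prompt" test. The checkpoint id and the
    # trust_remote_code flag are assumptions; swap in whatever model you want to probe.
    from transformers import pipeline

    generator = pipeline(
        "text-generation",
        model="XiaomiMiMo/MiMo-7B-RL",
        trust_remote_code=True,  # may be unnecessary for a given checkpoint
    )

    out = generator("12 + 89", max_new_tokens=64, do_sample=False)
    completion = out[0]["generated_text"]

    # Crude check of the output language: does the continuation contain any CJK ideographs?
    has_cjk = any("\u4e00" <= ch <= "\u9fff" for ch in completion)
    print(completion)
    print("contains CJK characters:", has_cjk)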


"12 + 89" uses the latin alphabet and is in no way language-agnostic in this context. I expect borrowed constructs to appear relatively more frequently in the language they were borrowed from.

Now I'm curious how Mistral models would respond to "language-agnostic" phrases like "Rendezvous" or "coup d'etat".


You may think of these symbols as "Latin" because they're how people writing in Latin script happen to write mathematical expressions, but the exact same symbols are also used by Mandarin speakers, as well as in numerous other scripts. Writing math in Chinese characters is literally as uncommon as someone writing "twelve plus eighty-nine" in English.

In contrast, your examples would be spelled « rendez-vous » and « coup d’État » in French, i.e. easily distinguishable from their English descendants.


> You may think of these symbols as "Latin" because they're how people writing in Latin script happen to write mathematical expressions

No need for scare quotes - Latin script is a proper noun and a technical term with a precise meaning with respect to text encoding, not just "what I think."

> the exact same symbols are also used by Mandarin speakers, as well as in numerous other scripts. Writing math in Chinese

Which Unicode code points do Mandarin speakers and those "numerous other scripts" use to write "12 + 89"? Could it be the very same code points as Latin script, which are then mapped to the same tokens - the ones whose embeddings the LLMs learn to associate more with English text than with CJK in the latent space? (Quick check below.)

> i.e. easily distinguishable from their English descendants.

You're making broad assumptions about the tokenization design here that do not apply universally.
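
This is easy to check. A quick sketch, using the Python stdlib plus tiktoken's cl100k_base purely as a stand-in BPE tokenizer (an assumption for illustration, not any particular model's tokenizer):

    import unicodedata

    expr = "12 + 89"

    # The expression is made of plain ASCII code points; there is no separate
    # "Chinese" encoding of the digits or the plus sign in ordinary use.
    for ch in expr:
        print(repr(ch), hex(ord(ch)), unicodedata.name(ch))
    # '1' 0x31 DIGIT ONE, '2' 0x32 DIGIT TWO, ' ' 0x20 SPACE,
    # '+' 0x2b PLUS SIGN, '8' 0x38 DIGIT EIGHT, '9' 0x39 DIGIT NINE

    # With one common BPE tokenizer, the digit and plus-sign tokens come out
    # identical whether the expression sits inside English or Chinese text.
    import tiktoken
    enc = tiktoken.get_encoding("cl100k_base")
    english = enc.encode("Please compute 12 + 89.")
    chinese = enc.encode("请计算 12 + 89。")  # "Please compute 12 + 89."
    shared = set(english) & set(chinese)
    print(sorted(shared))  # includes the tokens covering "12", "+", and "89"

Different tokenizers merge differently, of course, but the input code points are identical either way.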


Precisely because the exact same code points are used for digits and mathematical symbols, there's nothing script-specific about them, and their linguistic association is determined by the training data mixture. A model trained predominantly on text scraped from Chinese websites would learn to associate them more with Mandarin than with English in the latent space, since that would be the context in which they most often appear.
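
As a toy illustration of that point (two made-up mini-corpora below, nothing like real training data), the association available to be learned for script-neutral characters simply tracks whatever text they co-occur with in the mixture:

    import unicodedata

    def cjk_ratio(text):
        """Fraction of the letter-like characters in text that are CJK ideographs."""
        letters = [c for c in text if c.isalpha()]
        if not letters:
            return 0.0
        cjk = [c for c in letters if unicodedata.name(c, "").startswith("CJK")]
        return len(cjk) / len(letters)

    # Hypothetical mini-corpora standing in for an English-heavy and a
    # Chinese-heavy training mixture.
    english_heavy = ["What is 12 + 89?", "12 + 89 equals 101.", "Add 12 + 89 step by step."]
    # "What does 12 + 89 equal?", "Please compute 12 + 89.", "Work out 12 + 89 step by step."
    chinese_heavy = ["12 + 89 等于多少？", "请计算 12 + 89。", "一步一步算出 12 + 89。"]

    for name, corpus in [("English-heavy", english_heavy), ("Chinese-heavy", chinese_heavy)]:
        docs = [d for d in corpus if "12 + 89" in d]
        avg = sum(cjk_ratio(d) for d in docs) / len(docs)
        print(name, "average CJK share of the text around the expression:", round(avg, 2))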



