
Why are there so many English-first AI models from China? Are they not interested in serving their own population? Or is it that if they publish Chinese-first models it won't get publicity in the West?


Common Crawl [1] is the biggest and most easily accessible legally acquired crawl dataset around; it has been collecting data since 2008. Pretty much everyone uses it as the base dataset for training foundation LLMs, and since it's mostly English, all models perform well in English.

[1] https://commoncrawl.org/


Haven't we reached a situation where English is the de facto language of scientific research, especially AI benchmarks?

It's clearly impossible for me to try anything in Chinese, I'd need a translation.


Correct. Lingua franca for at least the last 75 years, if not longer.


For publishing results, yes, but not necessarily for the generation part of it.


Less and less, it feels like, every year. I wonder if anybody has hard numbers on that.


One thing I thought was interesting about this paper [1] on understanding LLMs was how the models associate words/concepts in different languages with each other in what they call Multilingual Circuits.

So the example they give:

English: The opposite of "small" is " → big

French: Le contraire de "petit" est " → grand

Chinese: "小"的反义词是" → 大

Cool graphic for the above [2]

So while English is the lingua franca of the internet and represents the largest corpus of data, the primary models being built are able to use an English dataset to build associations across languages. This might create significantly stronger AI and reasoning even for languages and regions that lack the data, tech, and resources to build local models.

[1] https://www.anthropic.com/research/tracing-thoughts-language...

[2] https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-...


I assume a large portion of high-quality training material is in English.


You'd be correct. The largest portion of all languages in Common Crawl (aka the "whole open internet" training corpus) is English with 43%. No other language even reaches double digit percentages. The next biggest one is Russian at 6%, followed by German at 5%.
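
(If you're curious how those per-page language labels get produced: language identification over crawled text is usually done with an off-the-shelf classifier. A minimal sketch using fastText's lid.176 model, which is my choice for illustration and downloadable from fasttext.cc; Common Crawl's own statistics may use a different detector.)

    # Sketch: classify the language of a snippet of crawled text.
    # Assumes lid.176.bin has already been downloaded from fasttext.cc.
    import fasttext

    lid = fasttext.load_model("lid.176.bin")
    labels, probs = lid.predict("Das ist ein Beispielsatz aus dem Web.", k=1)
    print(labels[0], round(float(probs[0]), 3))  # e.g. __label__de 0.99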


I wonder where you're getting your data. According to Wikipedia, Russian is #7: https://en.wikipedia.org/wiki/Languages_used_on_the_Internet

The only place where Russian is in the top 5 is Wikipedia views. The Russian part of the internet is steadily shrinking as Russian imperialism crumbles.


> The largest portion of all languages in Common Crawl

https://commoncrawl.github.io/cc-crawl-statistics/plots/lang...


Thanks!

I wonder where this discrepancy comes from


Probably under-indexing of non-English sources by these crawlers.

Would be interesting if Yandex opened some datasets!


And lots of people write on the web using English as a second language, which both reduces the presence of their native language and increases the presence of English.


Yep, not a native English speaker here, and yet my online footprint is mostly English due to software pushing me to learn it.


My guess is that reference counting at depth=1 only captures non-$LANG content whose text parts don't matter much, e.g. photo galleries.


The Chinese internet mostly consists of a few walled gardens tightly controlled by big corps. Crawlers simply don't work when each company employs an army of engineers to guard its data. Many of the most popular websites are also app-only. It's impossible to get the corpus necessary to train a good LLM.


DeepSeek claims they had 12% more Chinese tokens than English in their training corpus for DeepSeek-V2, FWIW.

https://arxiv.org/pdf/2405.04434#page=12

> Our tokenized pretraining corpus contains 8.1T tokens, where Chinese tokens are approximately 12% more than English ones.


Do we have estimates on the corpus that is available? This model's repo describes "multiple strategies to generate massive diverse synthetic reasoning data." FWIW, AI 2027 forecasts heavy emphasis on synthetic data creation.

Is the lack of existing corpus just an extra hurdle for Hanzi-first models that are also leading the pack in benchmarks?


All LLMs are trained on the same basic blob of data - mostly in English, mostly pirated books and stuff.


That's wrong.

Many LLMs are trained on synthetic data produced by other LLMs. (Indirectly, they may be trained on pirated books. Sure. But not directly.)


Likely the case for established model makers, but barring illegal use of outputs from other companies' models, a "first generation" model would still need this as a basis, no?


Why illegal? The more open models (or at least open-weight models) should allow using their outputs. Details depend on license.

But yes, 'first generation' models would be trained on human text almost by definition. My comment was only to contradict the claim that 'all LLMs' are trained from stolen text, by noting that some LLMs aren't trained (directly) on human text at all.


>Or is it that if they publish Chinese-first models it won't get publicity in the West?

This is a large part of it. Kai-Fu Lee's company (https://www.01.ai/) has been publishing open-source, Chinese-language/market-focused models since pretty early on, but the entire conversation around Chinese tech just isn't available to you if you don't speak Chinese, particularly these days, given that good English-language reporting on the Chinese tech sector seems very scarce.


They are not "English-first". Deepseek-R1, for example, reasons in Chinese when you ask it a question in Chinese.


I've seen one of the ChatGPT models produce the occasional Chinese phrase even when otherwise reasoning in English about a problem given in English.


Does that apply in other languages too, like French?


One reason is that there is no "good" search engine in China. The most popular one, Baidu, is like garbage compared to Google search. The most useful training data in Chinese would likely be from the social media and video sharing platforms, which I guess is much more difficult to crawl and clean up.


A few thousand years of literature ain’t nothing…


Peanuts compared to the discourse available on the internet.

The literature that survived thousands of years is the cream of the crop; you won't find much of the random, unimportant dialogue between people from thousands of years ago, but you will find that on Reddit.


Given premodern population sizes and literacy rates, historical texts probably don't exist in anything like the quantity that internet posts do. Even if they did, the information may not be relevant to the modern world.


> The most popular one, Baidu, is like garbage compared to Google search

It must be very bad when you see the walking turd that Google search has become over the years…


It is. In Chinese-speaking countries where Google is available, no one is using Baidu.


There's only ONE* Chinese-speaking country, at least if you only count those that have a Chinese-speaking majority population or use Chinese as the official language.

* for various interpretations of one.


Chinese is one of the official languages of Singapore.


Do any of those countries have a good relationship with China and/or countries from there?


Singapore has a pretty good relationship with China (with all Chinas, actually). And we have plenty of Chinese speakers, too. I'm not sure how prevalent Baidu is, however.


I was under the impression that we just see the English stuff given that we're using English news channels.


I don’t see any indication that it’s English-first?


I’m going to go with: to ensure it is not disadvantaged in benchmarks


I wonder whether English text having fewer characters provides an advantage somehow.


Not really, since tokenization combines multiple characters into single tokens.
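
A rough sketch of what that means in practice, using OpenAI's tiktoken library with the cl100k_base encoding as an example (other models use different tokenizers, so exact counts will differ):

    # Character count vs. token count for an English and a Chinese sentence.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for text in ['The opposite of "small" is "big".', '"小"的反义词是"大"。']:
        tokens = enc.encode(text)
        print(f"{text!r}: {len(text)} chars -> {len(tokens)} tokens")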


Chinese is hard.


Why are so many American models multi-lingual, supporting hundreds of languages not commonly spoken in the United States?

Could it be that being multilingual results in a larger pool of human knowledge on the technical side, compared to training on just one or two languages? And on the business side, supporting more languages results in a larger TAM (total addressable market). Using an English-language dataset for training LLMs is the default, not the deliberate choice you insinuate.


That's clearly a different question. It'd be possible for these models to be Mandarin-first while still supporting other languages, like American models are English-first while doing the same, but that's not what's happening.


> That's clearly a different question. It'd be possible for these models to be Mandarin-first while still supporting other languages

What would a hypothetical "Mandarin-first" model look like to you?

I challenge the notion that the current models are "English-first" - that is an unsubstantiated opinion not supported by fact. I bet, dollars to donuts, these models are SoTA in Mandarin as well. When framed that way, asking "Why are they marketed as English-speaking models outside of China" or "Why are they really good at English" are simply not interesting questions - they have obvious answers.


> What would a hypothetical "Mandarin-first" model look like to you?

Given a language-agnostic prompt like "12 + 89", any explanatory text it outputs could be expected to be in Mandarin most of the time.

According to this test, Xiaomi's MiMo-7B-RL is an English-first model.
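
A rough sketch of that probe using the transformers library (the model repo name is my guess for illustration; substitute whatever checkpoint you want to test):

    # Feed a language-agnostic prompt and check which script the reply uses.
    import re
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "XiaomiMiMo/MiMo-7B-RL"  # assumed repo name, for illustration
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    inputs = tok("12 + 89", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=100)
    reply = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

    print(reply)
    # Crude heuristic: does the explanation contain CJK characters?
    print("contains Chinese characters:", bool(re.search(r"[\u4e00-\u9fff]", reply)))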


"12 + 89" uses the latin alphabet and is in no way language-agnostic in this context. I expect borrowed constructs to appear relatively more frequently in the language they were borrowed from.

Now I'm curious how Mistral models would respond to a "language-agnostic" phrases like "Rendezvous" or "coup d'etat"


You may think of these symbols as "Latin" because they're how people writing in Latin script happen to write mathematical expressions, but the exact same symbols are also used by Mandarin speakers, as well as in numerous other scripts. Writing math in Chinese characters is literally as uncommon as someone writing "twelve plus eighty-nine" in English.

In contrast, your examples would be spelled « rendez-vous » and « coup d’État » in French, i.e. easily distinguishable from their English descendants.


> You may think of these symbols as "Latin" because they're how people writing in Latin script happen to write mathematical expressions

No need for scare-quotes, Latin script is a proper noun and a technical term with precise meaning wrt text encoding - not "what I think."

> the exact same symbols are also used by Mandarin speakers, as well as in numerous other scripts. Writing math in Chinese

Which Unicode code points do the Mandarin speakers and "numerous other scripts" use to write "12 + 89"? Could it be the very same code points as Latin script, which are then tokenized to the same vectors that the LLMs learn to associate more with English text than with CJK in the latent space?

> i.e. easily distinguishable from their English descendants.

You're making broad assumptions about the tokenization design here that do not apply universally.


Precisely because the exact same codepoints are used for digits and mathematical symbols, there's nothing script-specific about them and their linguistic association is determined by the training data mixture. A model trained predominantly on text scraped from Chinese websites would learn to associate them more with Mandarin than English in the latent space, since that would be the context where they most often appear.
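
A quick way to see the shared code points, plus the fullwidth variants that are the only CJK-specific forms of these symbols (plain Python, just for illustration):

    # ASCII digits/operator, shared by Latin-script and Chinese text alike:
    for ch in "12+89":
        print(ch, hex(ord(ch)))   # 0x31 0x32 0x2b 0x38 0x39

    # Fullwidth variants occasionally seen in CJK text use different code points:
    for ch in "１２＋８９":
        print(ch, hex(ord(ch)))   # 0xff11 0xff12 0xff0b 0xff18 0xff19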


English won. The Chinese youth now struggle to write their own calligraphy characters, even though they can still read them. Typing favors English.


It's easy and fast to type Chinese sentences using a keyboard.


Source?

This smacks of "I saw a headline once"-itis. Especially the fact that you refer to the Chinese characters as "calligraphy characters", as if that were the general term or something.


These are probably the headlines they're thinking about,

https://www.globaltimes.cn/content/747853.shtml

https://www.bbc.com/news/blogs-china-blog-28599392

Or more recently this one about character amnesia

https://globalchinapulse.net/character-amnesia-in-china/

None of these really mean that English has won, though. Rather that phonetics-based writing systems are easier to remember and use, especially in conjunction with digital systems that make it easy to map sound and context to symbols.

I wouldn't be surprised if characters are faster to read, though. In English we have all these subconscious shortcuts like looking at the shape of the word, first and last letters, etc. But I think symbology can convey more at a glance; hence the popularity of emoji.


Ah no, I do know there have been headlines here and there.

I'm pretty sure there was some controversy in the linguistic blogging community at some stage over the last couple of years, with someone writing an essay claiming the Chinese character system was in some sense less advanced and maybe on the way out, and this leading to a serious response or two; the usual fiery academic affair. I can't locate it this instant, though.

I mainly meant for OP's low-effort dramatisation not to go unanswered. Framing it as "winning" some sort of language battle is particularly silly.

Your musings are interesting though, and the topic certainly is a fascinating one. Languages that use morphemes for writing are wild. Symbology is a cool word also - surely there has to be a lisp blog somewhere with that word in the title.


The pendulum has already swung back. The current generation under 20 grew up with touchscreens. That obsoletes input with pinyin; many don't care if the device has no keyboard.


Input is so interesting in China: basically a sort of T9, but with single letters and picking the right characters, with common/frequently used characters suggested first, using pinyin. For example, to say "How are you?" you just type "nhm" (ni hao ma) and 你好吗 shows up as a suggestion/autofill. You can make surprisingly long sentences using this method.
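
A toy sketch of that abbreviation-style lookup (real input method editors rank candidates by frequency and context; this dictionary is just a made-up stand-in):

    # Map syllable initials to candidate phrases, most common first.
    CANDIDATES = {
        "nhm": ["你好吗"],            # ni hao ma - "how are you?"
        "nh":  ["你好", "你会"],      # ni hao / ni hui
        "xx":  ["谢谢", "学习"],      # xie xie / xue xi
    }

    def suggest(initials: str) -> list[str]:
        """Return candidate phrases whose pinyin initials match the keystrokes."""
        return CANDIDATES.get(initials, [])

    print(suggest("nhm"))  # ['你好吗']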


> That obsoletes input with pinyin

Uh? Pinyin input is by far the most popular input technique in China. I rarely see anyone using handwriting input.

That being said, it has nothing to do with English winning. It's just a Chinese input technique that uses the Latin alphabet. English fluency in China is not very common, especially spoken English.


My father-in-law here in China uses handwriting input, but everyone else I've seen here uses Pinyin, and it's totally fast and natural for them.

And very true about the English. With some exceptions (of course), folks here maybe know a handful of words at best, and even then, pronunciation is usually pretty rough. People here really aren't using it; they are perfectly comfortable with their Chinese, and why wouldn't they be?

Anyone saying otherwise clearly hasn't been here to see it firsthand.


Just like German is written with almost the same alphabet as English, but that doesn't give you English fluency.


If only Unicode decomposed Chinese characters on a per-stroke basis: it would be so much easier to build keyboards around that.


This has nothing to do with Unicode and everything to do with input methods, of which there are a variety. Some methods are indeed shape-based like you suggest: https://en.wikipedia.org/wiki/Chinese_input_method#Shape-bas...

By the looks of it, Pinyin (a phonetic one) won by a landslide, which I suspect is the result of a long effort by the Chinese government to establish Mandarin as the official language of China, above regional dialects (different regions would write similar characters but pronounce them differently, and defaulting to Pinyin has this "nice" effect of having people "think of how it would be pronounced in Mandarin first", even when the resulting characters would be read by a Cantonese speaker).


What? The only people I've seen use the handwriting input mode were old people.


Nearly everyone in the urban areas of China spoke some English when I visited way back in 1995. It's a bilingual society.


This is not true. I was in Beijing around then and never met a single person who spoke English if they hadn't learned it for professional reasons (they worked in tourism, international business, etc.).

It could not have been further from a bilingual society.


I suppose you were probably visiting some university districts/CBDs where people are likely to have received higher education. Elsewhere, aside from a basic "hello"/"how are you", locals in general are not able to communicate in English.


I lived in Beijing and Shanghai for 9 years (2010-2019) and this is NOT my impression at all.


Not sure which part you were in, but this is just not true in my experience. I've been to Beijing, Shenzhen, Guangzhou, and some others, and Mandarin really is a must if you want to even have a chance of communicating. I can't imagine how I'd function here if I only had English.

I've not yet been to Shanghai, and while I would expect the English-speaking percentage to be a bit higher, it would still likely only be in the single-digits by my estimation.


The Mandarin-language models obviously exist, but what would you do with them if they provided access to them? And what knowledge would be in them? What is the body of knowledge encoded in Mandarin? What does that look like?

The sad reality is that not many outside of China have the facility with Mandarin to use those models. Even non-native Mandarin speakers who claim to be "fluent" often mess up the intended meaning in text, or make literal translations that wind up making no sense.

Inside China, LLM use will be Mandarin-based. Outside, it seems to me English is the natural choice.

Irony of ironies: probably the best way for a non-Mandarin-speaking layman to test a Mandarin-based model would be to use another LLM to translate prompts to Mandarin.

It's a sad future we're looking at.

Or a brilliant one.

Time will tell.


For it to be brilliant, AI needs to be a benevolent tool all the time. It would take just a few malignant actors to turn our world upside down. I suspect it'll follow the same path as the internet and social media: great at first, growing markets, bringing us together, and then taking a turn.


You're right of course. That's why these open-source / open-weight releases are so critically important.


> Even non-native Mandarin speakers who claim to be "fluent" often mess up the intended meaning in text, or make literal translations that wind up making no sense.

Happens with English as well, but non-native speakers of English still benefit from these models.



