Sorting in Japanese – An Unsolved Problem (2011)

mopreme · on April 12, 2019

Blog author here. Surprised to see one of my old posts on the front page while browsing Hacker News.

It's interesting to reflect on what has improved since I wrote it, and what has not.

Both Android and iOS, for instance, provide mechanisms to get this right, if you know to use them and expose them for those locales (and only those locales). For example, both have a Contact object that contains corresponding phonetic-reading fields for first and last names.

iOS Contact - see phoneticGivenName, phoneticFamilyName https://developer.apple.com/documentation/contacts/cncontact

Android contact - see PHONETIC_GIVEN_NAME, PHONETIC_FAMILY_NAME https://developer.android.com/reference/android/provider/Con...
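A minimal sketch of why those phonetic fields matter for sorting, in Python. The `Contact` class and its field names here are hypothetical stand-ins, not either platform's API, and plain code point order is only an approximation of proper Japanese collation (a real app would want a locale-aware collator):

```python
from dataclasses import dataclass


@dataclass
class Contact:
    # Hypothetical record for illustration; not the iOS or Android type.
    name: str           # the name as written in kanji
    phonetic_name: str  # the reading in hiragana, supplied by the user


def sort_key(contact: Contact) -> str:
    # Sort on the reading; fall back to the written form if none was given.
    return contact.phonetic_name or contact.name


contacts = [
    Contact("淳子", "じゅんこ"),  # Junko
    Contact("淳子", "あつこ"),    # Atsuko
    Contact("淳子", "きよこ"),    # Kiyoko
    Contact("淳子", "あきこ"),    # Akiko
]

sorted_contacts = sorted(contacts, key=sort_key)
readings = [c.phonetic_name for c in sorted_contacts]
print(readings)
```

Sorting on `name` alone could not order these four at all, since the kanji are identical; the reading is the only thing that distinguishes them.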

For fun I tried using Google Translate to translate the kanji name in the post 淳子 in various contexts to see what Google thinks it is:

- 淳子 translated to "Dumpling"

- 淳子さん translated to "Atsuko"

- 淳子様 translated to "Sadako"

- 淳子さま translated to "Mrs. Lion"

- 淳子殿 translated to "Mr. Reiko"

- 私の名前は淳子です translated to "My name is Miko"

- 私の名前は淳子です。 translated to "My name is Reiko."

- 私の名前は淳子です！ translated to "My name is gyoza!"

I expected them all to translate to Junko or Atsuko. The variation and unexpected results for what should be exactly the same thing are very interesting.

lifthrasiir · on April 12, 2019

Many statistical machine translators like Google Translate are very sensitive to the availability of bilingual corpora. In this case GT seems to have learned that -子 means either a dumpling (餃子) or a female name ending with -ko, but hasn't seen enough text to determine whether the preceding 淳 is pronounced Atsu- or Jun- in the given context, so it is guessing. Combined with a user-contributed corpus this can be rather disastrous: several machine translators had translated Japanese "初音ミク" [1] to Korean "이명박" [2] ;-)

[1] https://en.wikipedia.org/wiki/Hatsune_Miku

[2] https://en.wikipedia.org/wiki/Lee_Myung-bak

fenomas · on April 12, 2019

Cool work! But my experience with Google Translate is that nowadays it seems to be wholly focused on machine-learning translations of full sentences, so for single words the results can be almost entirely random. E.g. you type in the characters for the Japanese word for "raisin", and depending on which parts you change to kanji the results can be starfish, dried fish, dried persimmons, etc.

On the topic of sorting, I've always found it interesting that Japanese input forms with a pulldown menu for prefecture almost always list them in north to south order, rather than AIUEO or charcode or anything else. I think I read somewhere that this stems from some kind of official government encoding that was baked into Shift-JIS, or something along those lines.

expat2003 · on April 12, 2019

People don't seem to understand that they cannot have word-for-word translation when it comes to Japanese vs. Latin-root languages. There are also grammatical and syntactical rules that you cannot get right by an educated guess, and "educated guesses" are all that Google Translate makes. Google Translate doesn't care if a person doesn't know that, when it comes to Proper names such as the name of a person, the Kanji reads in a completely different way than in regular dictionary words. There is no translation for this class of words, just as there is none for words like Jason, Tom, Samantha, etc. In such cases one needs to memorize the reading of the character, which reads that way only when it is used for Proper names. If the Kanji of a person's name has multiple readings, such as 淳子 (Atsuko, Kiyoko and Junko), that person would use what is called Furigana, the Kana-only reading showing the sound of that particular Kanji. This happens when you are filling in a form, in a contract, in IDs, etc. In other words, feeding people's names to Google shows only one thing: that person doesn't know what they are doing.

Sadako is the Proper name of a popular, iconic character in Japanese culture, a killer ghost in the form of a little girl, and it is based on a real Japanese person who was said to be a psychic and lived in the 1900s. They made at least one Japanese horror movie out of that story and even Hollywood had a go at it! Now, how many girls do you think have been named Sadako by their parents in Japan? Once more, Google Translate doesn't give a damn if the person using it doesn't know what they are doing.

In regular writing, a number of suffixes are employed to mark Proper names: さま, どの, さん, くん, ちゃん and more. Every suffix has a different and unequivocal use case. Each can be written in Hiragana, like I just did, or in Kanji. When Kanji is used, 様 reads さま, 殿 reads どの, etc. For such words there also isn't a clear-cut translation, because they are employed for the implications they come with. For instance 様 or さま is used when referring to customers or clients, but in a business letter you would use 殿 or どの with similar significance.

Most amusing is "淳子殿 translated to "Mr. Reiko"". In the bottom-left corner of Google Translate's left-hand pane there is one of the possible correct readings of 淳子殿, namely Junko-dono. But on the right it's messing it right up. That's the demonstration that Google Translate has no brain. By the way, Reiko is written 麗子 as a Proper name... and so are Rika, Akirako, Yoshiko, Urarako. Get the point? You need Furigana.

About gyoza: that's the name of a very popular kind of Chinese dumpling made in Japan. It is often written in Katakana, as ギョーザ, but since it's a Chinese word you can use the two sounds ぎょう and ざ, written here in Hiragana, to read the two corresponding Kanji 餃子. To understand why, you need to read my post down the page. It seems Google Translate cannot read the first Kanji right.

If you feel so inclined, mopreme, read my post down below which covers your original blog post extensively and throughout.

hermitdev · on April 12, 2019

Am I missing something subtle in this Kanji example, or are all 4 names actually written the same?:

"There are four Japanese women whose names you have to sort: Junko, Atsuko, Kiyoko, and Akiko. This does not seem difficult, until they each show you how they write their names in kanji:

淳子 (Junko) 淳子 (Atsuko) 淳子 (Kiyoko) 淳子 (Akiko)"

I'm not familiar with Japanese at all (and have never had to deal with localization beyond date formats), even less so written, but these seem like wildly different pronunciations for something spelled the same.

I know English has its own large set of warts with pronunciations and spellings (even disregarding US vs every other English speaking nation), but this seems overly odd.

How do you get further context on how to pronounce a proper name like this? The post mentions context, but obviously in the above example, lacking the "abc" spelling (as the post terms it), what context do you have to know the proper pronunciation?

Iv · on April 12, 2019

It seems odd because it is. I live in Japan, and tried to learn Japanese like I learned English and German: through reading, as the true asocial geek I am. Well, that's not possible unless you know about 1000 kanjis, or unless you stick with kids' books or some manga (which, actually, I am not that interested in).

To learn Japanese you need to learn to speak it first.

I stopped caring about kanjis when I realized that a 15 year old, top of her class, was having difficulties reading news articles because of some words she did not know.

That's polemical and I would not tell it that upfront to my Japanese friends but I feel like kanjis are a vastly inferior writing system compared to the alphabets they have. The reason why it still exists is to create an educational hierarchy: the more kanjis you know, the more educated you are.

Kanjis are of Chinese origin; they occupy a place in Japanese culture similar to the one Latin does in European cultures. I wish they would do like Koreans and get rid of it altogether, but it will never happen. The culture here is far too conservative for such a brutal change. They even describe furigana (kanji "subtitles" that make the pronunciation explicit) as a dumbing down.

This is an example where obfuscation is confused for depth.

mikekchar · on April 12, 2019

And as a counter example, I like kanji and learned Japanese by reading (though I also like manga, so perhaps it was helpful). I find kanji very helpful in learning vocabulary. In fact, I actually measured how fast I could learn vocabulary containing kanji I didn't know by learning the kanji first, versus learning the vocabulary phonetically. It was about the same. However, as there are only about 2200 kanji in the list of commonly used kanji and there are about 20,000 words you need for adult level proficiency, it means that each kanji is used in, on average 10 words. Actually, it's even better than that because the first 1000 most common kanji are used in 90% of the words you will encounter. I can learn vocabulary with kanji I know dramatically faster than vocabulary with phonetics alone. Ironically, I think it's even more helpful in Japanese than it is in Chinese because there are so many readings for the character that if you learn the word phonetically, you may have no idea what the root of the word is.

Kanji is frustrating to write, but a joy to read, IMHO. Yes, you need to learn 2000 (or so) characters, but, let's face it -- you've got years to do it. I can now read words that I don't know and have a pretty good idea what they mean -- something that would be practically impossible phonetically in Japanese. In fact, when I'm faced with new vocabulary, I often ask people to write it for me (just in the air) -- as soon as I see the kanji I almost always know what the word means.

I've often thought how superior this writing system is to roman phonetic characters (with bizarre historical spelling anachronisms to boot) ;-) I appreciate that you don't agree, but I hope you'll also understand that I have no intention of grading your education level on the number of kanji you know.

echelon · on April 12, 2019

I agree with you wholeheartedly.

Kanji has aided in my study of the Japanese language in ways I couldn't fathom before trying to do so. I made it to N5-level proficiency [1] after a few semesters of study and only begrudgingly studied Kanji because it was required by the text. I hated it and really only wanted to speak and understand spoken Japanese.

Then I discovered Wanikani [2], a spaced repetition service that focuses on teaching Kanji reading. Unlike Anki decks, Kanji is literally the only thing Wanikani focuses on, but it does so incredibly well. With diligence it's even possible to learn all Joyo Kanji within about a year.

After taking Kanji seriously I began to understand the root meanings of the vocabulary I had known all along. It let me form associations I previously wasn't privy to and enabled a much faster uptake of new vocabulary. Instead of remembering why the syllables さいこう (saikou, pronounced "sigh co") means "best" or "utmost", I can simply recall the kanji: 最高. The character 最 means "most" and the character 高 means "tall". "most tall" = "utmost".

Japanese is extremely logical like this, as is their grammar. For fans of logic it's really top-notch.

If you're learning Japanese, please don't do yourself a disservice by skipping Kanji. A fully-integrated learning approach that includes reading will yield dividends in the long term.

[1] http://jlpt.jp/e/about/levelsummary.html [2] https://www.wanikani.com/

Iv · on April 12, 2019

Of course kanji helps. Like Latin helps with many European languages. If you learn Latin before learning French it will feel extremely logical.

Making it a pre-requisite to write a single sentence in French, however, seems like a waste of time and goodwill.

Riverheart · on April 12, 2019

Not to discount your experience, but I found WaniKani to be very tedious and slow paced, which steered me away from learning Kanji.

james_s_tayler · on April 12, 2019

I learned just over 17,000 words and for that needed to know somewhere north of 3000 kanji. ~2200 is just the joyou list but once you get into novels and the like you can expect to encounter i'd say around 3500 would be close to maxing it out with the last 0.01% being another couple of thousand, but that's getting in to really obscure shit that only a smattering of the general population know.

I learned kanji first before speaking. Learned to read and listen for about a year before venturing out to conversation clubs and the like. Without a shadow of a doubt reading massively, massively, massively boosted my ability in all other areas. Fastest way to learn new vocabulary.

But I gotta be honest - I used to love kanji and found joy in it. These days I think Korean really got it right with their writing system. That shit is super logical. Kanji is a giant, beautiful pain in the ass.

mikekchar · on April 12, 2019

It's been years since I did any studying. You've inspired me to start again :-)

w00kie · on April 12, 2019

I love that I can see a word I don't know and get good grasp of its meaning (you can somewhat do that in Latin languages if you have a good grasp of latin and greek roots, but not as reliably). However I still cannot read it since I don't know with any certainty how to pronounce it if I've never heard it before.

And in the old analog days, even if you're an expert with keys and stroke counts, it was still much faster to search a roman character word in a dictionary than a Japanese word you don't know the pronunciation of.

throwitallaway8 · on April 12, 2019

Agreed. In a pure utilitarian sense, Kanji might not be that helpful, but it has undeniable cultural value.

The same ambiguity mentioned in this article that makes it hard to learn and approach for beginners can be and has been wielded to great effect by writers in both literary and comedic works in ways that have no real equivalent in languages without similar mechanisms.

It's yet another one of those things that makes Japanese such a joy to learn.

SllX · on April 12, 2019

I’ll back you up on this and add one additional point: even in the West where we have our Roman alphabet which largely works fine, and our Arabic numerals which also work fine, I’ve often thought adopting a limited number of kanji, literally less than 10, could add some important precision to our own writing. The rest of this post isn’t really for you specifically, but anyone reading through this comment thread with an open mind.

Date formats would really only require three:

- 年: year

- 月: month

- 日: day

We debate about both separators and order of presentation, but using three characters as suffixes would add some semantics where none really existed; without them you might have to make some assumptions about which number is the day and which is the month.

For those that don’t know, 3月 is March, 12月 is December, 31日 is the 31st day and 2019年 is the current year. So April 12th 2019 which might be presented as 2019.04.12, 4/12/2019 or 12/4/2019 would instead be written as 2019年4月12日 or 12日4月2019年 or even 4月12日2019年.

The beauty here is the ideographs provide enough semantic meaning on their own that the order of presentation stops mattering and they are perfectly fine date separators. They are always appended as suffixes to the numeric designator and they don’t add much to the cognitive burden for anyone learning how to read. We already mix Arabic numerals into just about every writing system, Latin, Cyrillic, Japanese, Chinese, et cetera, we have distinct punctuation marks in most writing systems, and we pretty much all know the difference between $12.34 and £12.34 and how 56¢ differs from 56%.
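To make the "order stops mattering" point concrete, here is a small Python sketch. The function name `parse_suffixed_date` is made up for this illustration; it simply reads each number by the suffix that follows it rather than by its position:

```python
import re
from datetime import date

# Each suffix names the field that the preceding number belongs to.
SUFFIX_TO_FIELD = {"年": "year", "月": "month", "日": "day"}


def parse_suffixed_date(text: str) -> date:
    """Parse a date whose numbers carry 年/月/日 suffixes, in any order."""
    fields = {}
    for number, suffix in re.findall(r"(\d+)([年月日])", text):
        fields[SUFFIX_TO_FIELD[suffix]] = int(number)
    # Raises TypeError if the year, month or day is missing from the text.
    return date(**fields)


a = parse_suffixed_date("2019年4月12日")
b = parse_suffixed_date("12日4月2019年")
c = parse_suffixed_date("4月12日2019年")
print(a, b, c)
```

All three spellings parse to the same date, which is not something a parser can promise for 4/12/2019 versus 12/4/2019 without being told the convention in advance.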

This is a small example, but I hope it provides some context to some who don’t know any kanji the power and precision that ideographs can provide at a glance.

One more thing, while I offered up these three specific ideographs specifically as date separators, they do provide standalone meanings as well.

年 is a counting word for years. It doesn’t have to be AD, and in Japanese or Chinese they do have their own calendars, but 2019年 is still unambiguously understood to be 2019 anno domini.

月 is a standalone word for the moon, which is functionally why it is used as a month counter. 日 is a standalone word for the Sun. This perfectly preserves the association between months to the moon cycle and days to the solar cycle, which I think is pretty cool, and it certainly does so a lot better than mere dots or slashes or dashes.

Iv · on April 12, 2019

Arguably, emojis are going in that direction.

And, seriously, the rest of the world considers it totally illogical to use MM-DD-YYYY. In almost any country, MM are the middle digits. 4月12日2019年 is a format I have never seen used.

SllX · on April 12, 2019

It’s not a format I’ve seen used either, but it isn’t as if the Japanese perfectly retained Chinese conventions for ideograph usage either. They derived both Katakana and Hiragana from Kanji, integrated them all into a cohesive written Japanese grammar, formalized the stroke order into a strict system propagated through their school system and from that, developed the SKIP dictionary lookup method which is the only lookup system for Kanji that ever made sense to my foreign brain.

My point? Writing systems are flexible, and more so the more semantic notations you can integrate and propagate through society. You would never see 4月12日2019年 today, but hypothetically it wouldn’t matter if you did because it is inherently unambiguous so long as you understand the meaning of the Kanji involved.

Besides that, while you won’t see 4月12日2019年, you still might see 4月12日.

kondro · on April 12, 2019

Fun fact, emoji was invented by the Japanese. :-)

h1d · on April 12, 2019

I wonder what people think its root is.

jayd16 · on April 12, 2019

01d/01m/01y could be used just as well, no?

SllX · on April 12, 2019

I’ve actually tried that and variations of that, and I’m not going to lie, it’s about as aesthetically pleasing to the eye as going back to Roman numerals.

Iv · on April 12, 2019

You know, I started by loving kanjis too. I like the way they combine, the stories they tell. And I can imagine that when becoming proficient with them, they form a nice cozy system.

But face it: you say I have years to learn them. I could learn 3 languages with the effort wasted to learn kanjis (which are half a language because then you also need to learn the pronunciation of words). No, I'll perfect my English instead and learn spoken Japanese, as it is almost a separate language.

In terms of communication, evolution, ease to learn and use, kanjis are an inferior system. It took me a long time to accept saying that, because I am of course biased as a lazy gaijin, but really, seeing how easy Korean feels compared to Japanese was the nail that convinced me that there is a fundamental truth in that.

Interestingly, hangul (the Korean alphabet) was adopted to replace Chinese characters and was opposed on the ground that it made the written language too easy to learn. Hah.

fenomas · on April 12, 2019

I'm not sure it's even possible to get proficient at spoken Japanese without learning kanji, past some intermediate point. You either pick them up without intending to (enough to read anyway, if not write), or you don't get proficient.

The issue being that proficiency in Japanese means navigating endless homophones and kanji compounds, and kanji are what disambiguates that process. Japanese vocabulary has a huge variety of words whose readings are all permutations of a handful of syllables - words like koutai, koutei, taikou, teikou, keitai, koukei, taitei, etc, etc.. And it's not just a matter of keeping them all separate - a proficient speaker might know two or three unrelated meanings for any given reading. Kanji are what makes that possible. I mean, it's challenging to learn even 10 kanji, but once you do it's far easier to learn 100 or 1000 compounds made out of them.

So if you want to say kanji are annoying or not worth your time, that's obviously fair enough - but they're not useless and they're not just there to obfuscate things.

Wowfunhappy · on April 12, 2019

> I'm not sure it's even possible to get proficient at spoken Japanese without learning kanji, past some intermediate point.

Presumably the spoken Japanese language predates the writing system, no? Or if not, I would imagine that a couple hundred years ago, a significant portion of the population couldn't read Japanese but could speak it. How did they learn?

(Of course, let me know if both my assumptions are false—I don't know much about Japanese history beyond a single class in college, which focused on the modern era.)

fenomas · on April 12, 2019

> Presumably the spoken Japanese language predates the writing system

Its roots do, naturally, but the kanji compounds I mentioned don't. They're words made from kanji, not words that kanji were created to write down.

It's a hugely messy topic and I'm no expert, but the broad strokes are that Japanese derives from an "original" language (which had no writing system), which was greatly affected by multiple waves of Chinese influence over many centuries, and much of what seems complex in Japanese is the result. Anyway suffice to say that my comment was about Japanese today, not 1500 years ago before there was a writing system ;)

danans · on April 12, 2019

> Presumably the spoken Japanese language predates the writing system, no? Or if not, I would imagine that a couple hundred years ago, a significant portion of the population couldn't read Japanese but could speak it.

You're correct. Prior to the industrial era, most societies had very low literacy rates. But nearly everyone could speak, and many people who were illiterate could even excel at speech and language related tasks.

Among the world's most grammatically complex languages are the Inuit languages of the far north areas. These languages were completely unwritten until very recently.

Iv · on April 13, 2019

A typical 12 year old Japanese kid will be fluent in spoken Japanese and pretty bad at kanjis.

fenomas · on April 13, 2019

Standard school curriculum for a 12 year old covers 1000-1400 kanji, and I said "proficiently", not "fluently".

wodenokoto · on April 12, 2019

I completely disagree. Learning Korean vocabulary is so much easier than Japanese.

With somewhat phonetic spellings you can link vocabulary you read to vocabulary you hear.

In Korean (just like English) you can much more easily misspell a word and be understood. In Japanese you oftentimes cannot even “give it a go”, unless you are aided by an IME, which is essentially phonetic.

fenomas · on April 12, 2019

> The reason why it still exists is to create an educational hierarchy: the more kanjis you know, the more educated you are.

Things are nowhere near this simple. If there was any value to abandoning kanji it could be done tomorrow, but the reality is that all-kana text is incredibly hard to read, and trying to learn Japanese's myriad homophones would be nigh-impossible without kanji to disambiguate them.

Kanji are complex and archaic, sure, but it's not just conservative values keeping them around. The Japanese language just isn't feasible to learn or to use without them.

Nadya · on April 12, 2019

>That's polemical and I would not tell it that upfront to my Japanese friends but I feel like kanjis are a vastly inferior writing system compared to the alphabets they have.

The reason Kanji are used instead of Kana is the vast number of homophones in a language limited in the number of phonetic sounds it has. It is how you can tell はし「橋」 from はし「端」 from はし「箸」. Sure, context would almost always let you know which word is used but that requires some additional parsing. Other languages have more sounds, thus fewer homophones, and it becomes less of a problem to determine which witch is which. Or as just shown - "witch" and "which" can be spelled differently even when they sound the same, something less possible with kana. Japanese has an insane number of homophones and reading kana-only sentences quickly becomes a headache because of it.

I don't "read" sentences that contain kanji. I glance over the sentence and understand what it is saying. It's like watching someone act out a scene vs reading the scene's script. Although I am still learning Japanese, for sentences where I know all of the kanji, I can understand the Japanese much faster than the English translation (my native language).

ehaliewicz2 · on April 12, 2019

Kanji are not obfuscation, they are useful mostly because Japanese has many homonyms. Without them, text is far more obfuscated! I'm sure you've read kana only text before.

baud147258 · on April 12, 2019

Except that you don't need Latin to read a news article; at most you'd get to know the etymology of some words and be able to read untranslated older texts.

yongjik · on April 12, 2019

> To learn Japanese you need to learn to speak it first.

Well, to be fair, that's true for any language. Learning English by reading does not work: ask any Korean (or Japanese) who went through public English education in the 80s or before.

ssttevee · on April 12, 2019

It's actually not possible.

There are some cases where the kanji characters are only used for the meaning and have literally no contribution to the pronunciation.

That is why every Japanese service provider will ask you for the pronunciation, without fail.

asutekku · on April 12, 2019

That’s why Japanese people, when introducing themselves, sometimes also mention the kanjis, because of the wildly different readings. There is literally no way of knowing the reading if the user doesn’t provide it.

csours · on April 12, 2019

Is this why business cards are so ubiquitous?

mikekchar · on April 12, 2019

I don't know if this is really true, but I've been told it's really so that you don't forget people's names when you first meet them. In Japanese, it is impolite to use the second person pronoun ("you") with people you don't know. So if you forget their name, it can get pretty awkward if you want to talk about them. When you receive a business card, you are supposed to inspect it with great attention, and it is common to hold it in your hand or put it on the table in front of you while the person is there. Whether or not that really is meant as an aid to help you remember names, it certainly does help! It's the thing about Japanese culture that I'm the worst at (and to add insult to injury, I've yet to print any business cards after 10 years here -- it's so embarrassing when I meet people :-P ).

fenomas · on April 12, 2019

> wildly different pronunciations for something spelled the same

The fundamental point to understand is that those names aren't spelled the same, they're written the same. The writings and readings are essentially distinct things; neither is a representation of the other.

(As an aside, Japanese doesn't really have spelling as we think of it. It has phonetic characters, but their names map 1:1 with their pronunciation, so "spelling" a Japanese word is the same thing as just saying it.)

a1369209993 · on April 12, 2019

> their names map 1:1 with their pronunciation

Not quite, eg ki'yu vs kyu vs ki'u and n'a (んあ) vs na (な), but unlike in english those tend to be pathological corner cases that you can either ignore entirely in practice or easily handle (eg the Ci+yV <-> CyV convention).

teraflop · on April 12, 2019

All of the examples you mentioned are pronounced differently, so I'm not sure what point you're trying to make.

a1369209993 · on April 12, 2019

n'a (んあ) and na (な) are pronounced the same (not 1:1)

k-ee-y-oo and k-y-oo are distinct pronunciations that are both written as kiyu (きゅ) (also not 1:1, though this could be considered an accent maybe?)

I wasn't trying to make a point, beyond "spelling a [any] word is not just the same thing as saying it".

level3 · on April 12, 2019

You're confusing spelling (using the alphabet) with spelling (using hiragana/katakana). In Japanese, spelling IS the same as saying it.

きゆ and きゅ are pronounced differently because they're spelled differently (notice the size of the ゆ). Same goes for んあ and な.

fenomas · on April 12, 2019

I'm afraid I don't follow either. I meant that in Japanese spelling a word is just saying it, not anything about how words are romanized, etc.

That is, you say the word たに by saying た and then に, and you say the word たんい by saying たん and then い. Equivalently if you were to "spell" the word きゅう you'd say きゅ and then う -- you wouldn't say き・ゆ・う because that would be a different word.

(Incidentally na and n'a are not pronounced the same, as the previous example hopefully illustrates.)
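A rough Python sketch of that "spelling is just saying it" idea. The helper name `split_units` and the set of small kana are assumptions made for this illustration; it glues a small kana such as ゅ onto the kana before it and, as a simplification, keeps ん as a unit of its own:

```python
# Small kana that combine with the preceding kana instead of standing alone.
SMALL_KANA = set("ゃゅょぁぃぅぇぉ")


def split_units(word: str) -> list[str]:
    """Split a hiragana word into the units you would say one after another."""
    units: list[str] = []
    for ch in word:
        if ch in SMALL_KANA and units:
            # Attach the small kana to the previous unit, e.g. き + ゅ -> きゅ.
            units[-1] += ch
        else:
            units.append(ch)
    return units


kyuu = split_units("きゅう")   # small ゅ
kiyuu = split_units("きゆう")  # full-size ゆ
tani = split_units("たに")
tani_n = split_units("たんい")
print(kyuu, kiyuu, tani, tani_n)
```

The two pairs come out as different sequences, which is the point: the written forms already differ wherever the spoken forms do.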

yorwba · on April 12, 2019

I'm not aware of any words beginning with ん, it's usually placed between two other morae. If the following mora begins with a stop, it acts as a nasalization of that stop, similar to English n in "blank" vs. "bland". Otherwise, such as for んあ, it nasalizes the preceding vowel. E.g. the name 闇亞 (あんあ) is pronounced [ãa], whereas あな would be pronounced [ana].

I'm also not sure what you mean by kiyu and kyu, since きゆ and きゅ are clearly orthographically distinct.

gizmo686 · on April 12, 2019

This is coming more from having looked at Japanese in my phonology class than learning Japanese itself (although I have done the latter and noticed the distinction there as well): んあ is 2 syllables, な is one. This changes the pronunciation (most notably the length of the nasal stop).

Ki-yu (きゆ) and Kyu (きゅ) are distinguished by the size of the ゆ character.

rococode · on April 12, 2019

Many kanji have numerous pronunciations. I have a dictionary extension on my browser that may help illustrate:

https://i.imgur.com/Nftkjks.png

According to the dictionary, the word 淳 can be read as jun, shun, atsui, atsu, atsushi, kiyo, kiyoshi, makoto, or sunao. And evidently this is not fully complete, since Akiko can also use that character.

It's definitely one of the more annoying parts of Japanese, and ultimately you just have to learn common words and for ones where there are multiple common pronunciations (like these names), someone will ask for the pronunciation (for names) or it will be clear from context (for words).

Also, if you weren't aware, there's also a thing called furigana, which is essentially writing the alphabet really small above a kanji word so you know the pronunciation. Really useful and something that Chinese sorely lacks imo.

bigger_cheese · on April 12, 2019

>According to the dictionary, the word 淳 can be read as jun, shun, atsui, atsu, atsushi, kiyo, kiyoshi, makoto, or sunao. And evidently this is not fully complete, since Akiko can also use that character.

I think the "ko" suffix comes from the second Kanji character "子" which means child (I think, it has been over a decade since i studied Japanese but that particular Kanji character looks familiar to me). So I believe it is similar to how a lot of English surnames end in "son" (Johnson, Anderson, Harrison) etc.

rococode · on April 12, 2019

Oh yeah definitely! I never noticed the similarity with -son, that's really interesting. I was actually specifically referring to the "Aki" part of Akiko which isn't listed in the dictionary I'm using.

Now that I think about it though, I've never seen that read as Akiko before (or Akiko written that way). I don't see it on any websites and Microsoft IME doesn't list that as a version of Akiko at all (~10 pages of suggestions). Makes me curious about where the author got that reading from...

fenomas · on April 12, 2019

It doesn't mean much for a particular kanji/reading to not appear in your IME - there are just too many combinations, and sometimes a given reading will only be common in certain areas or even families. Plus people sometimes just make up new ones :)

With that said, here's a site showing the "akiko" reading: https://name.sijisuru.com/Pname/pdetail?pname=%E6%B7%B3%E5%A...

(Searching by reading, the same site lists 291 different ways to write "akiko". Yipes.)

thaumasiotes · on April 12, 2019

You're right that 子 means child.

But there is no similarity to English (or any language) patronymics. 子 was used as an element of girls' names because it's good for a girl to be youthful. "Johnson" lets you know that the person is the son of John, not that the person is himself named John.

bigger_cheese · on April 12, 2019

Ahh that makes sense, thanks for confirming. I studied Japanese in primary school until grade 10. We had only just started on kanji when I stopped taking the classes. I was surprised I could still read all the hiragana and katakana in the source article; I guess I retained more of it than I thought.

needle0 · on April 12, 2019

Yes they are all written exactly the same, and the reason it seems so bizarre to you is because you're trying to think of everything in the framing of phonograms, where rule number one is sound and characters are inseparably bound. Since that was probably the only type of language you dealt with your entire life, that feels like the natural way of things, and anything that goes against that rule feels wrong.

Japanese and other Asian languages operate under the concept of ideograms, where sound and characters are NOT 1:1 mapped. As unbelievable as it may be, not many problems occur in everyday life when everyone involved is aware of those rules and operates under that assumption.

The writing system you were born to is not the only valid way to do things, much like Imperial is not the only measurement system, Christianity is not the only religion, vi/Emacs are not the only text editors, etc, etc.

pif · on April 12, 2019

It's not a matter of who has it bigger!

The basic question the OP asked, and you did not answer, is: if those names are written the same, how do you know which woman we are talking about?

needle0 · on April 12, 2019

If there are people present in the same space with identically-appearing names, they are merely given secondary characteristics (be it family names, explicit pronunciations via yomigana, or some other property) to differentiate. How is that any different from if there are people present in western-language space with identical first names?

wodenokoto · on April 12, 2019

You are missing the subtle fact that they are written the same :)

For that situation, where you only have the name, you simply can’t tell with confidence the reading.

Japanese name law (at least last time it was paraphrased to me) allows you to attach any non-obscene reading to any non-obscene kanji configuration.

All business cards clearly show how to read the name printed on it.

Yes, this is odd.

segfaultbuserr · on April 12, 2019

https://en.wikipedia.org/wiki/Japanese_name#Difficulty_of_re...

gwilkes · on April 12, 2019

Yes, they are all written the same. This is not too much of an issue as not only do sites like Amazon ask for the pronunciation of your name, most any important paper form would have it as well. In more informal contexts you would just have to ask the person how to pronounce the name. We often need to do this in English with unusual or foreign names.

jhanschoo · on April 12, 2019

On Japanese forms that require you to use your name, you provide both how your name looks in kanji and how it is read (in their syllabic alphabet). When introducing yourself in speaking, you may also mention how your name is written.

It's not a problem in the sense that when you are in a position to ask someone for their name, you are also in a position to ask them for both the orthographic and phonetic versions of their name; it's just not something speakers of most other languages are familiar with handling.

Also note that most Japanese people you encounter will have names with a couple obvious readings. In the 淳子 example, Junko should be the most likely reading, followed by Atsuko, seeing as "atsu-" is typically associated with a different kanji. Similarly for Kiyoko and Akiko; in usage out of names, "kiyo-" and "aka"/"aki" are primarily written with other kanji. Since all "native-Japanese" readings for the word are typically written with other kanji, and this kanji is rarely seen in normal text with a native reading, what's left is the most common "Chinese" reading for the word, "jun".

There are more "species" of Japanese personal names, less common, including names written completely in their alphabet, or names where kanji is used only for their phonetic reading so that when written together it forms a native-Japanese word (and hence name), or archaic names with archaic or metaphorical readings, or names of foreign East-Asians that are read as an approximation of how they are pronounced in their respective native language.

Put together, most names have an obvious reading and with enough time in Japanese society you should know the obvious exceptions, but the rules for pronunciation are still sufficiently haphazard and irregular that asking for pronunciation is necessary to be sure. Cf. the mispronunciation of Saoirse in English for a related phenomenon. In forms, one still does better by asking for both the orthographic and phonetic versions rather than having huge rulesets and tables to divine pronunciations from orthography (with necessary mispredictions), except perhaps in interactive contexts like in IMEs.

You may think this all is complex, but when it comes to names, there is always a lot of complexity based on historical linguistic and orthographic phenomena; one is simply blind to it when one grows up with it. Consider that the pronunciations of many English city names are not what you'd expect from just looking at how they are written, or that York comes from Proto-Celtic Eburos + akom -> Latin Eboracum -> Old English Eoforwic -> Norse Jorvik -> Middle English York. Also consider that in many languages, e.g. Latin, words need to be memorized in more than one form, since historical processes make it that the stems for different tenses cannot be regularly constructed.

yomly · on April 12, 2019

This is a top quality post!

Just came to add that in a vacuum, every language seems insane:

- Most European languages have genders attached to every noun, and you have to memorize them. French/Spanish/Italian have 2, Russian and German have 3

- every Chinese word has one of 4 tones (for mandarin, good luck if you want to learn a dialect) you also have to memorize these

- in European languages you have verb conjugation and some languages also have declension which is foreign to other languages.

That said, there seems to be a lot of implicit logic where most native speakers get by with a combination of rote memory and also some heuristics which capture most cases well enough.

hermitdev · on April 14, 2019

Thanks, this was very informative.

Japanese, kanji in particular, seems to me to be unnecessarily complex (from an American that's spent most of his life in the boonies).

You mention they have a separate phonetic alphabet to describe kanji? I know I'm barking up the wrong tree, but why not abandon kanji and just use the phonetic alphabet? Seems to get to where everyone wants to be and without the ambiguity that kanji involves.

I imagine the answer to these probably comes down to some combination of tradition or cultural pride. I don't mean this with any malice, it's just my curiosity and likely cultural ignorance.

jhanschoo · on April 17, 2019

> You mention they have a separate phonetic alphabet to describe kanji? I know I'm barking up the wrong tree, but why not abandon kanji and just use the phonetic alphabet?

It is tradition + cultural pride, and cost of switching. It's worth noting that in the modern era, governments of almost all countries that traditionally used Chinese characters for writing undertook massive reforms to limit their use.

* In Vietnam, French colonization forcibly replaced the traditional education system, and along with that introduced the Latin script for writing.

* In Korea, Hangul was invented by one of its kings, with the letterforms created from an abstract representation of the oral organs that are exercised in creating each particular sound. Nevertheless, the script only really took off centuries later alongside a growing Korean nationalism.

* In Japan, a phonetic alphabet developed that can be written in two styles (think capital/small letters); one from components of Chinese characters, the other from cursive writing of those characters. There were certainly proposals to eliminate Chinese characters, but what happened was that the number of Chinese characters that could be used in publications without being accompanied by a phonetic spelling was limited to a couple thousand, and this persists to this day. Note that Japanese names are allowed to draw from a much larger repertoire of characters.

* In communist China, under communist influence, the government embarked on a project of simplifications of Chinese characters, with possibly an objective of reforming it to a phonetic alphabet. The first round of simplifications were based on common shorthand and cursive styles. The second round incorporated more drastic simplifications with less precedent, and it failed miserably, and the Chinese abandoned the project.

* Hong Kong and Taiwan did not seem to have embarked on any simplification project. There probably exists some pride at continuing to use traditional characters.

In all these countries, resistance to reform came from the literati, for whom knowledge of Chinese characters was a shibboleth for elitism. Two arguments are frequently trotted out:

* One is that Chinese characters helped to distinguish homonyms. But investigating the situation in Vietnam and Korea, this is likely to be quite unimportant in clear communication. I suppose that this is because context will make clear in most instances, and use of synonyms should make up for the others.

* Another is that a change of script would cut one off from their existing literature. But already, the respective languages have changed so much that modern readers cannot read ancient texts without a lot of guidance and learning new meaning.

Interestingly, in modern Japan, there exists a problem with kanji familiarity, and I think this phenomenon is growing in China too. More and more people are forgetting how to write Chinese characters, even though they can recognize them perfectly. How one enters kanji into computers is by entering the phonetic pronunciation, and using computer software to identify the correct characters to convert them to. So you can see that one no longer creates the character from memory very much anymore, and when you don't practice it you lose it.

If you look at historical context, the reforms were always associated with westernization and nationalism. I think that if Japan were to seriously consider reform to a completely phonetic script, it would be by associating with the phonetic script a notion of national pride. In this respect the Japanese phonetic scripts hold less promise than Hangul, for while the former is a derivative of Chinese letterforms, the latter was created by a national king, no less, and based rationally on depicting fundamental linguistic phenomena.

P.S. y'all Americans should adopt metric.

devilsbabe · on April 12, 2019

> How do you get further context on how to pronounce a proper name like this?

You don't. There's no way to know which pronunciation to use other than the person telling you.

unsignedint · on April 12, 2019

You can't, and legally, you can register any way to read certain kanji -- the Japanese system only registers how it's written and does not track how to read them. (Some other government documents require you to register how to read them, but there's no limitation.)

So if you really wanted 淳子 to be read completely outside of how it's traditionally read, that's acceptable. So-called DQN names in Japan are largely because of this leniency.

needle0 · on April 12, 2019

As a Japanese native, it pains me a lot when westerners come upon linguistic differences like this and start calling them things like "overly odd" or "inferior" as they seem to be doing in the comments here. I can't help but smell a whiff of anglophonic condescension - I had thought political correctness had swept over the west over the last few years, but perhaps it only selectively applies to those who can talk back to you?

cmroanirgo · on April 12, 2019

Reading the article showed me how ignorant I was to this problem... so much so that I'm surprised I haven't seen more articles like this in the past.

For all the talk on ML and PC here, it's surprising that language is still a huge unsolved issue. As someone who knows a tiny amount of Japanese, I'm surprised that a common solution is to provide the katakana equivalent. Outside of the tech realm, is this really what a native Japanese would write on (say) a physical/paper document: both the Kanji and Katakana?

fenomas · on April 12, 2019

Yes, entering a name's kanji and kana separately is standard for most any form in Japan, whether digital or on paper.

opan · on April 12, 2019

There was a scene in Sakurada Reset where the main character writes his name on a form in katakana and then is questioned about it. He says something about how people typically pronounce his name wrong so he writes it that way.

Isinlor · on April 12, 2019

I don't think languages deserve any special protection. People do.

Also, I think that all languages are super weird, and because of that they have their own unique cultural value that can't be measured. But in terms of economic value:

1. English happens to have the most speakers if you sum L1 and L2 speakers (1.132 billion)

2. English has by far the biggest non native speaker representation (753.3 million)

3. Speaking English gives a lot of economic opportunities, because English is widely spoken among the richest people in the world. By rich people I mean top 1% with more than 30 000$ income per year i.e. people living in USA, Western Europe, Australia etc.

Because of these 3 points I do consider all other languages, including my native Polish, inferior as tools. And I say this even though I'm fully aware that millions were dying across the world and throughout history in defense of their native speech. The history of my own nation and my native language is a perfect example. Millions of Poles died in defense of the Polish language against Germans and Russians.

"We won't forsake the land we came from, We won't let our speech be buried. We are the Polish nation, the Polish people, From the royal line of Piast. We won't let the enemy oppress us. (...) The German won't spit in our face, Nor Germanise our children, Our host will arise in arms, Spirit will lead the way. We will go when the golden horn sounds. (...)"

But I think it is exactly the fact that we were not able to communicate with each other that led to the World Wars in the first place.

I'm now studying in English at a super diverse Dutch university with Germans, Russians, Chinese, Koreans, Arabs, Belgians, French and many other nations, and the thought that some 100 years ago we would all be trying to kill each other while living in trenches is just unimaginable. English facilitates that and I would want more people across the world to participate. Being defensive and overly proud of our native language will not help with that.

snrji · on April 12, 2019

Couldn't disagree more. As you said, people do deserve protection, but who do you think that you are protecting when you protect languages? Exactly, when you protect a language you are protecting people who speak it. If you want to get rid of your native language, fine, but do not force other people to do so.

Isinlor · on April 13, 2019

I don't want to force anyone to do anything. I just express my opinion. And certain things are just facts, like that learning English will give you superior economic power.

I don't think there is any reason to protect language from criticism. Polish declension and conjugation, as well as the use of a lot of "sz", "cz", "dz" etc., make the language hard to learn. It's a fair criticism. If you want to do machine learning in Polish you will have a lot of issues. No one should get offended by that.

Also, there are a lot of forms that make Polish difficult to use for native speakers, like "wyszedłem" (I, as a man, went out), the correct spelling, vs. "wyszłem", an incorrect spelling that seems to be easier to use for many people.

Therefore, a lot of language protection is focused on excluding people. You don't know how to use language correctly so you are inferior. It's very common sentiment among Polish people. We even have a special body to say what should be considered correct - Polish Language Council. French have Académie française.

Even in English, especially in academia, people like to have this snobbish attitude that you should not mix British and American forms in a single text. This is certainly not intended to protect anyone, but to exclude people less proficient in English.

I think the most stark example of excluding people based on language I saw myself was in Belgium, in Dutch-speaking Flanders. If I'm not mistaken they have a law that explicitly forbids teaching bachelor degrees in anything but Dutch. This excludes a lot of French-speaking Walloons from participating in Flemish universities. Some of them end up in the Netherlands, where the attitude to language is less protectionist and a lot of bachelor programs are in English.

On the other hand, in Wallonia you may come across people who will pretend to not speak English. They will understand you, but they will be hostile. It's their part of country, their rules, but I'm not afraid to call it rude.

Again, people deserve protection. For example, you should be able to conduct your business with your government in the language that you speak. But there is no need to be protective of languages themselves and often this protectionism is a weapon on its own.

Baeocystin · on April 12, 2019

There are many wonderful things about both the Japanese language and writing system, and I say that with sincerity.

Simplicity of sorting is not one of those things.

cryptnotic · on April 12, 2019

[flagged]

tinus_hn · on April 12, 2019

What do you mean with ‘fashionable minority’? It sounds pretty condescending.

trw999 · on April 12, 2019

https://en.m.wikipedia.org/wiki/Four-Corner_Method

You can sort Chinese characters (including Kanji, but I'm not sure they use the Four Corners Method) by the Four Corners method. Why would you need to sort kanji phonetically in the first place? Do Japanese users actually expect names to be sorted phonetically? English speakers don't expect names to be sorted by IPA so consistency of the sorting scheme should be all that matters.

aidenn0 · on April 12, 2019

Kanji in Japanese are sorted by writing them out phonetically in Kana and then sorting by the kana.

English speakers don't expect names to be sorted by IPA because most English speakers don't know IPA, but all Japanese speakers do know the kana.

The fact that they are sorted as-if written in kana means that the exact same written character will be sorted differently according to context. For "not names" we actually do have fairly good algorithms for determining the pronunciation of Kanji, but for many names there is literally no way to know how to pronounce them without asking.
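A minimal sketch of "sort as if written in kana", in Python. It assumes every entry already carries a kana reading supplied by the person; the `Person` class, its field names and the sample entries are made up for illustration. Plain code-point order is used on the reading, which only matches the usual kana order for simple hiragana strings (voiced marks and small kana need more care).

```python
from dataclasses import dataclass


@dataclass
class Person:
    kanji: str    # how the name is written
    reading: str  # hiragana reading supplied by the person (hypothetical field)


people = [
    Person("淳子", "じゅんこ"),
    Person("田中", "たなか"),
    Person("淳子", "あつこ"),
]

# Sort on the reading, display the kanji.
by_reading = sorted(people, key=lambda p: p.reading)

for p in by_reading:
    print(p.kanji, p.reading)
```

The two 淳子 entries are written identically but land in different places, which is exactly the context dependence described above.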

cooper12 · on April 12, 2019

I actually made a serious attempt at learning the Four-Corner Method for kanji [0] and it was very frustrating. It would be difficult to determine what parts of the kanji belonged to which corner, and which exact shape they corresponded to. And strokes wouldn't always be interpreted the way I thought they'd be since it's based on a handwritten representation of the character. Many characters also have multiple FC numbers! The FC method was never meant to uniquely identify specific characters, but just to help narrow down a list of candidates in a dictionary. Funnily enough, I also argue something similar in this thread that has the same drawbacks :)

[0]: Because I was interested in typing characters while not actually knowing the kanji. The Tagaini Jisho app (https://www.tagaini.net/) was indispensable because it lets you search on multiple parameters including partial FC # and simpler methods like SKIP codes (http://nihongo.monash.edu/SKIP.html). The only characters I couldn't transcribe with this method were those printed so small that the individual strokes were difficult to make out.

segfaultbuserr · on April 12, 2019

Don't know about Japanese, but other well-known input schemes for Chinese include Cangjie (https://en.wikipedia.org/wiki/Cangjie_input_method), Zhengma (https://en.wikipedia.org/wiki/Zhengma_method), and Wubi (https://en.wikipedia.org/wiki/Wubi_method).

In fact, all of these non-phonetic input/encoding systems are highly non-intuitive and have a reputation for steep learning curves; frustration is expected. This is because, in general, Chinese or Kanji characters are expected to be pronounced or written by the speakers, not to be indexed in a particular encoding system. Only the pronunciation is the natural form in the language.

The encoding schemes are completely foreign, arbitrary to the native speakers. Using them requires extensive and systematic training. In mainland China, Hong Kong and Taiwan, in the 80-90s, learning to use a computer often started from learning the code, and it needed at least two months of mechanical memorization to get started, and years of use to master it, just like how amateur radio operators learn Morse Code (edit: well, you don't need to memorize the code for every single character as if it's Morse Code, but remembering the standard decomposition of characters in the system is comparable to remembering the Morse Code table). And remember, these are native speakers; much greater effort is needed for foreign speakers.

Sure, schemes based on radicals have been used for a thousand years in dictionaries, but all of these schemes used today are a completely artificial creation for typing and searching things into/from computers (often on those with very limited computing power).

The increase of processing power of personal computers in the late 90s allowed phonetic input systems to map pronunciation to characters heuristically, with high correctness rate. So those codes are rarely used by Chinese, and Japanese speakers (I believe) today.

That is, unless you learned computing during the 80s to mid 90s, or you have a job related to language or word processing that requires typing tens of thousands of characters or creating/searching them in a language-related database, or you are someone who emphasizes typing efficiency.

> Because I was interested in typing characters while not actually knowing the kanji.

This is actually a common requirement for people with those jobs, and one of the biggest reason to keep using them. It is especially useful when transcribing texts to computers or searching them in a database.

Users also argue that using them helps prevent the modern disease of forgetting how to write characters due to computerization, and I do see the point; it is similar to the spell-checker problem in English education.

jwong_ · on April 12, 2019

Phonetic systems for Chinese aren't so prolific for non-mandarin speakers.

9 square/Q9 is another popular method.

The stroke-based methods aren't so arbitrary - they are based on the way you write.

segfaultbuserr · on April 12, 2019

> The stroke-based methods aren't so arbitrary - they are based on the way you write.

Indeed, they are not arbitrary.

The very invention of them was meant to create something much more meaningful to users as an alternative to the telegraph code and the like, which is nothing more than a bunch of numbers. But I say "arbitrary" in the sense that the classification and organization of strokes in the system is selected artificially by the designers, not something that inherently exists in the language and is understood by the speakers. The difficulties are overcome once you are familiar with the system, but extensive effort is needed for a new user to master it.

bobthepanda · on April 12, 2019

Based on extremely limited Googling, one of the cases where these codes are still used is written colloquial Cantonese, which lacks any major official support.

segfaultbuserr · on April 12, 2019

Good point.

Some forms of Cantonese romanization (https://en.wikipedia.org/wiki/Hong_Kong_Government_Cantonese...) do exist, but their use is limited mainly to language study and transliteration. And apparently, there are multiple projects to create phonetic input systems for Cantonese (see the reference section of this Cantonese Wikipedia article, https://zh-yue.wikipedia.org/wiki/%E7%B2%B5%E8%AA%9E%E6%8B%B...), but all with very limited standardization and official support.

bobthepanda · on April 12, 2019

As someone who has attempted to learn to get back to my roots, I can definitely say the lack of a standard romanization is a huge barrier. Cantonese textbooks are not interchangeable because each series decides which romanization to use and some even take the liberty of creating their own.

The only people that could sort this out is a government.

- Guangdong won’t, because official Party policy at best discourages use of regional languages. There was a lot of fury back when they attempted to stop provincial broadcasts in Cantonese.

- Hong Kong won’t, because the Government likes to fumble around a lot these days, and because they have a more or less unofficial goal to integrate into China. Cantonese is not an official language, only “Chinese”.

- Macau won’t. They’re too busy trying to sweeten up China to let in more gambling tourists.

The other issue is that there are now some phonetic changes between the mainland and Hong Kong, the two places people would consider authoritative on the subject.

Grue3 · on April 12, 2019

>Do Japanese users actually expect names to be sorted phonetically?

Yes. Usually when there's a list of things you'd see subsections like あ (words that start with a vowel), か (words that start with "k" or "g"), さ (words that start with "s" or "z") and so on, and there's a specific order within each subsection.
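Those subsections can be sketched in Python roughly like this. It is only an illustration: the `ROWS` table and the function names are made up here, readings are assumed to be hiragana, and the order inside each section is plain code-point order rather than a proper collation.

```python
import unicodedata

# Section heading -> the unvoiced kana that belong to that row.
ROWS = {
    "あ": "あいうえお",
    "か": "かきくけこ",
    "さ": "さしすせそ",
    "た": "たちつてと",
    "な": "なにぬねの",
    "は": "はひふへほ",
    "ま": "まみむめも",
    "や": "やゆよ",
    "ら": "らりるれろ",
    "わ": "わをん",
}


def row_of(reading: str) -> str:
    """Return the section heading for a hiragana reading."""
    # NFD splits a voiced kana such as が into か plus a combining mark,
    # so "g" words fall into the か section and "z" words into さ.
    base = unicodedata.normalize("NFD", reading[0])[0]
    for head, members in ROWS.items():
        if base in members:
            return head
    return "#"  # anything that is not a plain kana


def group_by_row(readings):
    groups = {}
    for r in sorted(readings):
        groups.setdefault(row_of(r), []).append(r)
    return groups


print(group_by_row(["たなか", "あつこ", "じゅんこ", "ごとう"]))
```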

jwong_ · on April 12, 2019

Imagine having your contact list sorted by some geometric function run on each letter of the A-Z alphabet.

trw999 · on April 12, 2019

That would be marginally better than memorizing ABCDEFGHIJKLMNOPQRSTUVWXYZ. Is there anything inherent about the letter A that makes it get sorted in front of the letter Z?

fenomas · on April 12, 2019

> Do Japanese users actually expect names to be sorted phonetically?

Yes - or that's how a human would sort them, at any rate.

radicsge · on April 13, 2019

Think about numbers: they are not sorted phonetically. Even the alphabet is not. All languages that use the alphabet sort in exactly the same order regardless of pronunciation.

trw999 · on April 12, 2019

I don't think you're in a position to speak for all of humanity.

fenomas · on April 13, 2019

I meant a human as opposed to an algorithm.

Point being: kanji words aren't always sorted phonetically, for the reasons described in the article, so a user may not be surprised if they aren't. But when a human is sorting kanji words they do so phonetically by reading.

LeoDT · on April 12, 2019

East Asians usually put the last name before the first name, and most Japanese last names are pronounced differently, so this is not a problem for them phonetically.

quelltext · on April 12, 2019

>In English, the logo goes with their saying: “Everything from A to Z.” This is indicated by the arrow. But in Japan, and any other country that doesn’t use English, A and Z aren’t always the first and last letters of the alphabet.

This is grasping at straws.

a) Many English speakers/countries are unfamiliar with the meaning of the logo's arrow.

b) They will get the meaning if explained.

c) Some won't because A-Z standing for "everything" requires a given level of literacy (cf. alpha and omega).

d) They will get it easily when explained because it's a simple concept.

Japanese people learn the English alphabet pretty early on in school, and while some may not be familiar with the saying A-Z, the logo and its meaning still work perfectly well and would get an aha reaction when explained.

level3 · on April 12, 2019

For those who want to know more about this:

The article touches on just the tip of the iceberg. You might think that all you need to do is add an extra field for phonetic readings, and then simply sort on that field, but there are a lot of things that can go wrong. A naive sort (i.e. based simply on character code) will hit the following snags:

1) Hiragana vs Katakana

The article focuses more on Kanji vs Kana, but Japanese users will expect Hiragana and Katakana to be properly sorted together. Either you normalize your sort field (by converting everything to Hiragana, for example) or you use a Kana-insensitive collation.

2) Half-width characters

Katakana can be encoded as full-width or half-width characters (カ vs ｶ). Generally you want these treated as the same, so again you need to normalize or use a width-insensitive collation. There are also full-width alphabet characters (Ａ vs A).

3) Youon

These are actual different characters (ゆ vs ゅ, つ vs っ), so you can't normalize, but you want them sorted together. Here you need a collation that's case-insensitive with respect to these.

4) Dakuten/Handakuten

Like youon, these are also different characters (は vs ば vs ぱ) so you can't normalize, but you want them sorted together (insensitively). A sensitive sort will give you (はね, ばつ, ぱすた) while an insensitive sort will give you (ぱすた, ばつ, はね).

There has been a lot of work around this, resulting in many different database collations over the years, some of which result in sorts that would greatly confuse Japanese users. As of today, you probably want to be using (in the case of MySQL) utf8mb4_0900_ai_ci or something similar.
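For anyone who can't lean on a database collation, here is a rough Python sketch of the four points above done by hand. This is an approximation under stated assumptions, not how MySQL or ICU implement their collations: the helper names are made up, the long vowel mark ー and kanji are not handled, and in practice a proper collation library is the safer choice.

```python
import unicodedata

# Small kana (youon, sokuon) mapped to their full-size counterparts.
SMALL_TO_LARGE = str.maketrans("ぁぃぅぇぉっゃゅょゎ", "あいうえおつやゆよわ")


def to_hiragana(text: str) -> str:
    # Katakana ァ..ヶ sit exactly 0x60 code points above hiragana ぁ..ゖ.
    return "".join(
        chr(ord(ch) - 0x60) if "ァ" <= ch <= "ヶ" else ch
        for ch in text
    )


def primary_key(reading: str) -> str:
    # 2) NFKC folds half-width katakana and full-width Latin letters.
    s = unicodedata.normalize("NFKC", reading)
    # 1) Treat katakana and hiragana as the same.
    s = to_hiragana(s)
    # 4) NFD splits が into か + combining mark; drop the marks.
    s = unicodedata.normalize("NFD", s)
    s = "".join(ch for ch in s if ch not in "\u3099\u309a")
    # 3) Treat small kana like their full-size forms.
    return s.translate(SMALL_TO_LARGE)


def sort_key(reading: str):
    # The original string is only a tie-breaker for equal primary keys.
    return (primary_key(reading), reading)


words = ["はね", "ばつ", "ぱすた"]
print(sorted(words))                # code-point ("sensitive") order
print(sorted(words, key=sort_key))  # "insensitive" order
```

With the three sample words, the plain sort keeps は, ば, ぱ in code-point order, while the keyed sort compares them all as は and decides on the second character.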

txtsd · on April 13, 2019

I'd assume you'd want ゅ and っ to be added to the kana they're attached to, and then sorted. I'd want my き and きゅ together and び and っび together.

level3 · on April 13, 2019

That does make sense in a way, but I don't think that would feel natural to any native Japanese speaker. I'm not native and even I would find that ordering very odd.

At the very least, it would make the sorting algorithm a lot more complex if you had to look ahead at later characters in order to sort the current prefix.

cooper12 · on April 12, 2019

Just a thought experiment, don't take it too seriously:

The crux of the issue is that kanji don't have an inherent "natural" ordering that a user would expect. Sorting by their character code doesn't mean anything to a Japanese person. But what if we made our own standard of what entails a "natural order"? There's nothing about A–Z that makes the alphabet obligated to be in that order (and not something like based on sound or shape) other than it being the convention that developed. Even hiragana can have different orderings (AIUEO vs IROHA [0]).

One proposed method would be to do it how the dictionaries do it: first sort by major radical, [1] and then by stroke count. This is something most Japanese learn when learning how to write characters anyway (of course ambiguities would arise when the radical is shared and the stroke count is the same, but we could just choose a third arbitrary factor; we'd also have to decide on a specific written form as stroke count can differ depending on whether it is handwritten or the font).
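To make the thought experiment concrete, here is a toy Python version of that ordering. The `KANJI_INFO` table is hand-entered for five characters purely for illustration (Kangxi radical number, total stroke count); a real table would have to come from a character database, and the code point is used as the arbitrary third factor.

```python
# character -> (Kangxi radical number, total stroke count), hand-entered
KANJI_INFO = {
    "中": (2, 4),
    "子": (39, 3),
    "池": (85, 6),
    "淳": (85, 11),
    "田": (102, 5),
}


def radical_stroke_key(ch: str):
    radical, strokes = KANJI_INFO[ch]
    # Radical first, then stroke count, then code point as tie-breaker.
    return (radical, strokes, ord(ch))


chars = ["淳", "田", "池", "子", "中"]
print(sorted(chars, key=radical_stroke_key))
```

池 and 淳 share the water radical, so they end up next to each other and are separated only by stroke count.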

We could then teach our approach to schoolchildren and it would just become accepted over time like other things they learn. But wait you say, it's more natural for them to sort on pronunciation. However, if I gave you a list of polygon names and told you to sort by the number of sides they had, you'd be perfectly capable of doing it despite that not being alphabetical. Things are less "unnatural" if you grew up learning them and your brain doesn't experience dissonance.

Anyway, just my hot take.

[0]: https://en.wikipedia.org/wiki/Iroha

[1]: https://en.wikipedia.org/wiki/Radical_(Chinese_characters)

_0nac · on April 12, 2019

The only problem with that is that the resulting order would be useless for many applications. Say you're looking up your friend Tanaka Tarou, but you're not sure which characters his name is written with. If the sort order is phonetic, you can find the name and likely work out that this is the Tanaka you were looking for. But how do you search for a name in a kanji-indexed list if you don't know the kanji?

Incidentally, this is why kanji dictionaries invariably have multiple indices: one by radical, the other by pronunciation(s).

cooper12 · on April 12, 2019

Great point. I was considering a very visual-minded reader but of course not everyone would be so good at it nor would they always just care about how the kanji looks rather than other aspects. It's a difficult problem indeed... My intention is to show that we'd need some sort of radical solution that might not be what we'd immediately jump to (for example the current approach the author mentions is having a separate field for readings, but this is clearly resource-intensive and wouldn't work on arbitrary data). To solve sorting for Japanese, I feel we need to rethink what it means to sort.

mjevans · on April 12, 2019

The concept of having a displayed value and a separate value to actually sort by isn't limited to Japanese words/names. It also comes up whenever the display form is fixed but, for sorting purposes, parts of an entity's name should be added or removed. One such example is book and movie titles in a library.
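A small Python sketch of the library case, assuming the only rule is to drop a leading English article. `Item` and `title_sort_key` are hypothetical names, and real cataloguing rules are more involved:

```python
from dataclasses import dataclass

LEADING_ARTICLES = ("the ", "a ", "an ")


def title_sort_key(title):
    """Derive a sort key by lowercasing and dropping one leading article."""
    lowered = title.casefold()
    for article in LEADING_ARTICLES:
        if lowered.startswith(article):
            return lowered[len(article):]
    return lowered


@dataclass
class Item:
    display: str   # what the user sees
    sort_key: str  # what the list is ordered by


titles = ["The Hobbit", "A Tale of Two Cities", "Dune"]
items = [Item(t, title_sort_key(t)) for t in titles]
items.sort(key=lambda item: item.sort_key)
print([item.display for item in items])
```

The same two-field shape is what a phonetic-reading field gives you for Japanese names: the kanji go in `display` and the kana reading goes in `sort_key`.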

innocenat · on April 12, 2019

> Even hiragana can have different orderings (AIUEO vs IROHA [0])

Unless otherwise mentioned, things are almost always AIUEO nowadays.

tmm84 · on April 12, 2019

I've seen this problem solved a few different ways.

1) Have a romanized version of the value to sort. 2) Have a hiragana or katakana version of the value to sort (hiragana > katakana > roman order for sorting).

Excel seems to sort things without being told the reading. In JS and a few other languages, Japanese is sorted by UTF-8/16 code value. This works for everything but kanji, because determining the reading requires a human.
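A minimal Python sketch of option 2, assuming each record carries a human-supplied kana reading. `to_hiragana` is a hypothetical helper that relies on the fixed 0x60 offset between the katakana and hiragana Unicode blocks, and the sample names are made up:

```python
def to_hiragana(text):
    """Map katakana (U+30A1..U+30F6) onto hiragana so both scripts sort together."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


# (display name, reading) pairs; the readings have to come from a person
contacts = [
    ("田中", "たなか"),
    ("山田", "やまだ"),
    ("アユミ", "アユミ"),
    ("鈴木", "すずき"),
    ("佐藤", "さとう"),
]

by_reading = sorted(contacts, key=lambda c: to_hiragana(c[1]))
print([name for name, _ in by_reading])
```

Code point order over hiragana only approximates あいうえお order: voiced and small kana have their own code points, so か always sorts before が regardless of what follows. A real collation library would be needed to handle that properly.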

Technetium_Hat · on April 12, 2019

In JavaScript, I imagine using string.localeCompare would give a more helpful sort than just UTF-8 value.

GorgeRonde · on April 12, 2019

I think taking people's names as an example is extreme, because when Japanese people name a baby they choose both a phonological form and a kanji-based transcription. Phonologically, they tend to pick quite a common name (like Christian names in the West, in contrast with Native American names, which are a lot more specific to the person), but they try to be clever and original when it comes to writing it in kanji.

And I'm not even sure Amazon has an additional input field just for the sake of sorting names. Isn't this a common practice in the country? (If I have to call a customer, how am I supposed to greet her if I can't pronounce her name?)

aliswe · on April 12, 2019

If I'm not mistaken, another unsolved problem is generating URL segments (or "slugs") from scripts such as Chinese, Arabic and possibly Japanese as well.

kijin · on April 12, 2019

You just use a (somewhat cleaned up) UTF-8 representation as a slug. UTF-8 URL components are universally supported (transparently encoded and decoded) in all modern browsers, and Google shows the real characters instead of the urlencoded version. You still end up with the urlencoded version when you hit Ctrl+C, but I expect that to be fixed in the near future as Unicode becomes even more widely used.

For example, the following link should work perfectly well in all modern browsers:

https://ja.wikipedia.org/wiki/メインページ
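To see what "transparently encoded" means for that link, here is a short Python sketch using the standard library's `urllib.parse`. The `slug` and `url` variable names are just for illustration:

```python
from urllib.parse import quote, unquote

slug = "メインページ"

# quote() percent-encodes the UTF-8 bytes of each non-ASCII character.
encoded = quote(slug)
url = "https://ja.wikipedia.org/wiki/" + encoded
print(url)

# unquote() reverses it, which is what a browser does for display.
print(unquote(encoded))
```

The encoded form is the "urlencoded version" you get when copying the address, and decoding it gives back the readable slug.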

augbog · on April 12, 2019

Ran into this issue at work the other day. Relevant SO question: https://stackoverflow.com/questions/54543528/intl-collator-s...

rootsudo · on April 12, 2019

This is great. I never took this into consideration, and I actually enjoy Japanese as a hobby. I'm JLPT N4 and I'm astounded, as in "wow, yeah, that makes sense, it's a problem."

I always like unique problems like this, I never considered ABC vs Kana, Kanji -- and Romaji.

seanlinmt · on April 12, 2019

Interesting. I didn't know this was a problem. Kanji are Chinese characters. If sorting has been solved for Chinese characters (I'm assuming the existence of Chinese dictionaries means sorting is no longer an issue there), then why can't the same method be used for kanji?

Razengan · on April 12, 2019

I'm not native to Japanese or English, but I do think Kanji is beautiful and kana is "better" than the Alphabet, and although I dislike resorting to whataboutisms† to assert a point, @everyone here who suggests the abolition of kanji, please consider this proposal for "simplifying" the English language:

• Consolidate letters with identical pronunciations into a single letter: A/E, E/I, C/S, C/K, C/Q, G/J, I/Y, K/Q, U/OO, V/W, X/KS

• Split letters with ambiguous/variable pronunciations into multiple letters with fixed pronunciation: A, C, E, G, I, O, S, U, X, Y

• Remove any "silent" letters from all words.

• Disambiguate homographs and homonyms+homographs (e.g. lie, fair.)

I'm sure I missed or left out many examples, but would any of these proposals fly, at all?

Do they not disregard all the subtle nuances and historical significance which only advanced learners might appreciate? Do you feel a little aghast at my naiveté for even suggesting this? What would be the response of native speakers to a foreigner campaigning for a more streamlined and consistent English?

† or more accurately: Tu quoque

----

I'll just leave a piece of fun trivia as an example of the cool stuff that would be lost if kanji was abolished:

The "slang" for a female ninja is kunoichi: く ku, ノ no, 一 ichi (く being hiragana, ノ katakana and 一 kanji)

and the kanji for woman is 女, made up of the following strokes: く, ノ, 一

It's something that I noticed on my own some time ago and it brought a smile to my face, like someone discovering an in-joke. There is a lot of arguably-clever wordplay like this in Japanese, and losing that would just make things bland for everyone, and for what? Just to make the language a little easier to stomach by Westerners?

What we should be campaigning for, is better resources and tools for learning and looking up kanji.

ErotemeObelus · on April 13, 2019

The correct word is collation.

yutori · on April 12, 2019

Protip: Kirakira name

expat2003 · on April 12, 2019

Some statements you have made about the Japanese language are ambiguous at best. Let's clear them up, but first let's be precise with the terminology. "Pronunciation" and "sound" are not interchangeable. Every sound, or sequence of sounds, in Japanese is associated with a specific character. It happens that some Kanji have multiple sounds, or that a group of Kanji shares the same sounds, but how each is read is still unequivocal. There is no pronunciation involved. To give a quick example, あ has one and only one sound, while in English `a` is pronounced differently in pal, Paul, paediatrics, and so on. For the sake of simplicity it is a ubiquitous, sloppy practice to use "reading", "pronunciation" and "intonation" interchangeably, but you should know the difference. Pronunciation is something that changes based on the preceding and/or following characters. There's no such concept in Japanese. The sounds of the Japanese language are distinctly dictated by Hiragana. What you have, therefore, is a "reading". A reading is made of one or more sounds. But once again, readings and sounds do not have multiple pronunciations. I hope I was able to shed some light on this complicated subject. It's a nuance, but it makes an important difference.

Quoting: "I should note that there are two different alphabetical sorting orders in Japanese. For this article I am going to use the a i u e o (あいうえお) sort order." There is only one alphabetical order in Japan too, just as we know it. The order of the Hiragana works exactly like our alphabetical order: a preordered sequence of letters from A to Z is methodologically equivalent to the Hiragana from あ to ん, although no words start with を or ん, the last two characters of the alphabet. The Katakana alphabet is exactly like Hiragana; only the `design` of the characters changes. It is only used to write, in Japanese sounds, words imported from non-Japanese languages (Chinese and Korean are exceptions because you can use Kanji to write them, although when writing just Chinese or Korean sounds you will still be using Katakana).

When talking about `ordering` we normally mean the sequence in which we list the Kanji: listing them by their Hiragana sounds is the most common way. They can also be listed by what in English are called radicals (the foundational shapes composing the Kanji ideograph), by stroke count, by their Chinese reading (the Japanese sound of the equivalent Chinese Kanji), or even by how often they recur. Japanese also frequently uses `ordering` by meaning, in which Kanji may or may not be listed and only Hiragana is used to write all the words. This method is the closest Japanese rendition of what we are accustomed to calling an English dictionary. More here: https://en.wikipedia.org/wiki/Japanese_dictionary

I think it's already clear where the problem of sorting lies. Kanji, Hiragana and Katakana, though they are different-looking alphabets, have a well-defined scope and interconnection in the Japanese language. Kanji are words, and thus have meaning; Hiragana gives the Kanji a vocalization and also fulfils the crucial grammatical role, providing the language with adverbs, prepositions, particles, determiners, verb conjugations and more. Katakana usage is confined to words that are foreign in nature, hence it exclusively represents sounds and carries no meaning. Consequently, when reading Japanese on a general subject you mostly encounter Kanji, with some Hiragana connecting them (it needs to be said that very common words are often written in Hiragana only, e.g. こんにちは means Hello) and sporadic Katakana when an imported word is used. This happens because the Japanese language has hardly any punctuation at all: yes, there's a full stop, a comma, and a way to mark direct speech, but no concept of empty space.

You can read Japanese from any direction you set yourself up for. Traditionally it's read top to bottom, in columns running from right to left. Modern books are read as ours would be, even so sometimes you turn the pages backward to advance through the text, e.g. comics. The English alphabet is used for very technical terms or for proper names, where the context may sometimes be better served by romaji. But that's not a rule whatsoever; Katakana is employed more often. The adoption of the Western alphabet (romaji in Japanese) is similar to how you would take up French, Spanish, Italian or German words in your writing. (continued)

expat2003 · on April 12, 2019

Quoting from "Sorting Settings": "In this example you can see ABC and katakana are separated. Kanji are then separated from katakana. There were no hiragana in this list[...]" The Hiragana on that list are the characters と and の. On that particular list の expresses correlation and と simply conveys `and`. For instance 地域 と 言語 の オプション (notice I have entered the `spaces` myself for clarity's sake, but there should be none) is a very instructive example. Those are actually three distinct nouns you are trying to order as one: 地域 reads ちいき (Hiragana for `chiiki`) and means area; 言語 reads げんご (Hiragana for `gengo`) and means language; オプション reads opushon and is, you guessed it, option in English. Note I didn't write that オプション means option. オプション IS the word `option` written with Japanese sounds or, in other words, written in the Japanese script, i.e. Katakana. So 地域と言語のオプション can be translated as "Regional and Language Options".

Quoting from "Sorting Names": "It is very possible to have different people with the same name write their name in different character sets. The traditional way of writing the Japanese name of Ayumi would be written in kanji; a modern, stylish way would be to write it in hiragana, and a second generation Japanese-American might write their name in katakana or the alphabet." Japanese people always write their names in Kanji. They don't use different sets. When the Kanji composing their names can take several sounds, they write alongside them what's called Furigana, the reading of the Kanji, usually as a subscript or superscript. Furigana is written in Hiragana (not Katakana as you stated later on), although on the Internet, where websites could potentially be read by a non-Japanese crowd, Katakana might be used in some cases. Nonetheless, as I said earlier, Hiragana and Katakana differ only in scripting style, if you wish to call it that. So which script is in use is not a relevant issue for foreigners. All the other statements are a matter of opinion, save for the last: Japanese people do sometimes write their names in romaji, aka the "ABC alphabet", especially when they are dealing with foreigners at any level. Although why being a 2nd gen Japanese-American entails writing one's own name in Katakana beats me. It's a tiny bit like saying, forgive me here, that a 2nd gen Italian-American would write their name in Latin.

Quoting from "Kanji - The Real Problem": "Kanji have multiple pronunciations, determined by the context in which it appears.[...]Only from the context in which the kanji appears do you know how to pronounce it." That's like saying the pronunciation of the word 'pool' is determined by whether the context refers to water or balls. It is not the pronunciation but the meaning of the word that is changed by the context. We all agree on this statement. Single-Kanji words change their meaning based on the context but, like the word 'pool', their reading stays the same. E.g. あめ (reads ame) means 'candy' if written 飴 or 'rain' if written 雨. In this case the Kanji itself controls its meaning and reading, not the context. When a single Kanji is a verb, the meaning and reading change with what's called Okurigana (Hiragana written after the Kanji for the purpose of conjugation), while the Kanji remains invariant. For example 着く and 着る read つく (tsuku) and きる (kiru) respectively; the former means 'to arrive' and the latter 'to wear'. Single Kanji aren't at all like you describe them.

Compound Kanji words follow a different ruleset built upon how many Kanji they contain. Generally speaking, one Kanji in a two-Kanji word has multiple readings depending on which word it appears in and where it appears in that word. You can learn the rules, or you can get used to them just by seeing them used with massive frequency. This case alone is as you've described: you need to know the context in which the Kanji lives. But since the meaning also changes with the reading, you may be able to catch the overall meaning of the sentence without being able to read that single word. Compound Kanji with multiple readings aren't that common, though, and they generally represent common words that take little effort to memorize. Compound words composed of more than two Kanji read unequivocally in one way, as English words do, with very few exceptions. More here: https://ja.wikipedia.org/wiki/%E5%90%8C%E5%BD%A2%E7%95%B0%E9...

If anything, what "[...]keeps students up nights studying for years[...]" isn't the multiple Kanji readings, but the fact that you need to know between 2000 and 3000 Kanji and the combinations of them that build words (mostly two-Kanji words). So it's like having a permutation with repetition (in ordered arrangements) of 3000 syllables that make words in pairs or singly.

Quoting "Here is an example: 私は私立大学で勉強しています。[...]A second year Japanese student could figure this out. For a computer, this is a very difficult problem." The choice of example is particularly unfortunate. This isn't difficult at all for a computer, granted you understand the Japanese language. Let's dissect this phrase: 私は私立大学で勉強しています。

私 わたし (reads watashi) is the English pronoun 'I'. A computer instantly knows this because only the watashi reading/meaning can be followed by the Hiragana は. That's something a 6-year-old Japanese child knows, and something you would learn in your first months or less of Japanese studies. Conversely, a computer knows instantly that the reading/meaning isn't watashi when it sees that another Kanji follows 私. The reading of this compound 私立 can only be しりつ (reads shiritsu), out of a staggering two possible combined readings, and that's only because 立 has two usable Chinese readings, りゅう and りつ (the third would require the Kanji to stand alone), while 私 has only one, し. Kanji usually have one Japanese reading and one or more Chinese readings, governed by strict rules on which reading group has to be used. Coding it isn't that much of a headache. As soon as the computer realizes that the character following the first two Kanji is a Kanji as well, the range of possible readings bottoms out. That's also because the first two Kanji already make a word, meaning 'private' in English; as often happens with Kanji words longer than two characters, they are composed of multiple words, just like some long words in Western languages.

With the same approach the computer instantly finds the reading of the two Kanji 大学 だいがく (reads daigaku), meaning university, another very common noun. I think you already got the gist of it. The last word, 勉強しています, the computer instantly knows is a verb because of the unique okurigana しています (reads shiteimasu), the present continuous of "to do", and 勉強 is both extremely common and has a unique reading, べんきょう (reads benkyou), which is the noun 'study'. "I'm studying at a private university": even a machine translation would be accurate here.

I think the point is that there is no use in sorting words written in the three different Japanese alphabets all together in the same list. Microsoft knows this so well that it has yet to implement it. In your final thoughts you completely fail to understand that you don't need to attack the problem by "pronunciations". You only have to treat the Kanji with a different approach and transliterate Hiragana/Katakana into romaji, which was done a long time ago. I hope you will at least quit using "pronunciation" in favor of "reading" by the time you've finished reading this post. If ever.
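The rule described above for 私 in 私は私立大学で勉強しています。 can be written down as a toy Python sketch. It hard-codes just that one rule (another Kanji follows means し, anything else means わたし), `is_kanji` checks only the basic CJK Unified Ideographs block, and `read_shi_or_watashi` is an invented name. A general solution would need a dictionary of compounds rather than per-character rules like this:

```python
def is_kanji(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def read_shi_or_watashi(text, index):
    """Toy rule: 私 followed by another kanji reads し, otherwise わたし."""
    if text[index] != "私":
        raise ValueError("this toy rule only knows about 私")
    following = text[index + 1] if index + 1 < len(text) else ""
    return "し" if following and is_kanji(following) else "わたし"


sentence = "私は私立大学で勉強しています。"
readings = [
    read_shi_or_watashi(sentence, i)
    for i, ch in enumerate(sentence)
    if ch == "私"
]
print(readings)
```

For this sentence the first 私 is followed by は and the second by 立, so the two occurrences come out with different readings.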