blog

Testing Google’s Language Detection

google code

As Google adds ten more languages to its machine translation service, it seems to be on its way to becoming the most convenient universal translator of the world’s popular languages. Google’s handling of languages of course isn’t perfect, however—in particular, I’ve been complaining to friends for a while about the weaknesses of Google’s handling of queries in Chinese character (漢字/汉字) scripts. In this post, I run some tests using Google’s Language Detection service to try to better understand its handling of Chinese character queries.

Background

Chinese characters have been used all across East Asia, most notably in Chinese, Japanese, Korean, and Vietnamese (the “CJKV”). Prescriptivist writing reforms in Communist China and Japan have simplified many characters, though. Some characters were simplified in the same way, some in different ways, and some in only one country but not the other. For more information, there’s Wikipedia or Ken Lunde’s CJKV Information Processing.

The problem

The issue comes up when you try to search for a word in Chinese characters which clearly came from one Chinese character-using language. From my experience, Google doesn’t consider which language you are a user of, based on the query, and returns many results in other Chinese character-using languages as well.1

Take, for example, a query like “七面鳥”, meaning ‘turkey’ in Japanese. While all characters are very common in traditional Chinese (鳥 is simplified to 鸟 in China), the combination “七面鳥” is quite rare in Chinese. However, when you search for “七面鳥,” many of the first results are in Chinese and only two of the first ten results are in Japanese.

Does Google’s corpus not identify “七面鳥” as a primarily Japanese word? Google does indeed attest to this fact: searching for “七面鳥” and limiting to a certain language yields the following number of hits. A similar effect can be seen with Japanese words such as “芝生” (‘grass’) or “泥棒” (‘burglar’). The “Japanese on first page” column gives the number of results that are in Japanese which come up in a language-unspecified search from the US.

 Chinese (simplified) Chinese (traditional) Japanese Japanese on
first page
七面鳥786926395,0002/10
芝生55,600216,0002,230,0000/10
泥棒13,50022,50010,400,0003/10

In a perfect world, I would like Google to identify the language that the query is in, and then weigh results that are in that language higher in the results list. So the issue comes down to one of language detection.

There are broadly two different approaches to language detection and, indeed, all natural language processing problems: parsing and counting. In this case, parsing involves trying to break apart the query into words and then computing how likely such a string of words is in each given language. Counting simply takes an inventory of the characters given and compares them to their frequencies in each language, computing how likely such a string of characters is in each language. Parsing is the “smarter” approach, but more difficult and computationally intensive.

Google was kind enough to give us an language detection AJAX service so we can get a sense for how their language detection works. This service also gives a “confidence” value on the detection result. For the rest of this entry, we’ll test some hypotheses against this service and conclude at the end.

Do spaces matter?

No. While spaces are sometimes used in Japanese and Chinese writing to represent word boundaries, especially around numbers and roman letters, they also are seen on the web to encourage line breaks. It would make sense for Google’s language detection service to ignore spaces in Chinese character queries and that does seem to be the case. All tests I ran with Chinese character queries gave the same result with same confidence with and without spaces in random places.

Does order matter?

No. This was slightly disappointing to see. I took the Japanese string “骨粗鬆症” (‘osteoporosis’, if you’re curious) and ran every permutation against the language detector and got the same results, including the same confidence values. This is a clear indicator that Google uses only counting, not parsing, in their parser.

Does repetition matter?

Yes. Now that it seems that Google does not use any parsing and only uses character frequencies in identifying the source language, let’s see how repetition can affect the detection service.

First, I took some Chinese character strings and ran them through the detection service with different numbers of repetitions, e.g. “参加”, “参加参加”, “参加参加参加”, “参加参加参加参加”… The queries I used were the following:

 Chinese (traditional)JapaneseChinese (simplified)
XXX
漢字XX 
X  
參加X  
参加 XX
 X 
  X

For each token type, the detection service made up its mind quite quickly. Its confidence, however, was more interesting.

Each of the confidence values dips sharply after three, five, or ten repetitions. Note, however, the length of the tokens which dipped at each of those points. I interpret this to mean that there is a different parser for less than ten characters and ten or more characters. However, the detection service did not change its answer after this point on any of the tokens.

Second, I took two characters, “簡” and “体,” and crossed different numbers of them together to see how that would affect the language detected. Note that “簡” is used in traditional Chinese and Japanese, while “体” is used in simplified Chinese and Japanese.

 簡x0簡x1簡x2簡x3簡x4簡x5簡x6簡x7簡x8簡x9
体x0  0.995 0.998 0.998 0.999 0.999 0.999 0.999 0.999 0.999
体x10.995 0.998 0.998 0.999 0.999 0.999 0.999 0.999 0.999 0.999
体x20.998 0.998 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.531
体x30.998 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.52 0.568
体x40.999 0.999 0.999 0.999 0.999 0.999 0.999 0.516 0.565 0.613
体x50.999 0.999 0.999 0.999 0.999 0.999 0.512 0.561 0.609 0.657
体x60.999 0.999 0.999 0.999 0.999 0.507 0.556 0.605 0.653 0.702
体x70.999 0.999 0.999 0.999 0.502 0.551 0.6 0.649 0.697 0.746
体x80.999 0.999 0.999 1 0.545 0.595 0.644 0.693 0.741 0.79
体x90.999 0.999 1 0.539 0.589 0.638 0.687 0.736 0.785 0.834
JapaneseChinese (traditional)Chinese (simplified)

Conclusion

For Chinese character-based languages, Google’s language detection algorithm uses simple counting rather than parsing, identifying languages by looking at the frequency of characters rather than the frequency of words. As such, the algorithm simply acts as a script detector, not a language detector. Moreover, as a simple counting method is used, duplicating characters used in one language but not another can very easily skew the resulting output.

As a trivial aside, it seems that Google’s algorithm is slightly different for strings less than ten characters, as can be seen in a dip and then rise of confidence values after ten characters.


  1. Just to complicate matters further, there’s also the issue of where you’re accessing Google from. For example, accessing from the US (or via my friend VPN), a query for the Japanese-simplified “天気” seems to only return Japanese pages. However, accessing from Taiwan, Google assumes you may have meant the full-form “天氣”, giving you pages with both “天気” and “天氣”. As a result, Yahoo Japan weather is the first result from the US and third from Taiwan, while Yahoo Taiwan weather is first in Taiwan and doesn’t even show up from the US. This default character substitution in Taiwan is one of my least-favorite Google “features.”

    Similar effects can most likely be seen between the US and China. In the rest of this post, all queries will be made from the US. 

Related posts:

  1. Taipei find: a dictionary of Chinese-Japanese false cognates: The fact that Japanese and Chinese both share the use...
  2. 回収 vs. 収集 and Better Word Meanings Through Usage: Bailey just asked me what the difference between 回収 (kaishū)...
  3. Bailey won the Japanese Language Speech Contest: Bailey just won the Grand Prize at the 22nd Annual...
  4. Setting Language Research to Music: Via LinguistList: ‘Setting Language Research to Music’ is a Newcastle...
  5. Sign language SuperBowl ad: I don’t care much for the game, but always love...

Related posts brought to you by Yet Another Related Posts Plugin.

Tags: , , , , , , , , ,

If you enjoyed this post, make sure you subscribe to my RSS feed (optionally with tweets from my Twitter)!


© 2006-2008 mitcho (Michael 芳貴 Erlewine).
Proudly powered by WordPress.
Entries (RSS) and Comments (RSS).
The views expressed on these pages are mine alone and do not
reflect those of my employers and clients, past and present.