<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>mitcho.com &#187; Google</title>
	<atom:link href="http://mitcho.com/blog/tag/google/feed/" rel="self" type="application/rss+xml" />
	<link>http://mitcho.com</link>
	<description></description>
	<lastBuildDate>Fri, 10 Feb 2012 23:24:05 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.4-alpha-19719</generator>
		<item>
		<title>回収 vs. 収集 and Better Word Meanings Through Usage</title>
		<link>http://mitcho.com/blog/observation/%e5%8f%8e%e9%9b%86-vs-%e5%9b%9e%e5%8f%8e-and-better-word-meanings-through-usage/</link>
		<comments>http://mitcho.com/blog/observation/%e5%8f%8e%e9%9b%86-vs-%e5%9b%9e%e5%8f%8e-and-better-word-meanings-through-usage/#comments</comments>
		<pubDate>Thu, 18 Sep 2008 14:50:27 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[life]]></category>
		<category><![CDATA[observation]]></category>
		<category><![CDATA[Bailey]]></category>
		<category><![CDATA[cognitive linguistics]]></category>
		<category><![CDATA[corpora]]></category>
		<category><![CDATA[corpus]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[frame semantics]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Japanese language]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[language learning]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[synonymy]]></category>
		<category><![CDATA[translation]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=721</guid>
		<description><![CDATA[Bailey just asked me what the difference between 回収 (kaishū) and 収集(shūshū) is—two words that would both map to the English verb &#8220;collect.&#8221; I intuitively came up with a hypothesis to explain the distinction: 回収 may take things away from others when collecting while 収集 does not have that implication. Things that you 回収 may [...]
]]></description>
			<content:encoded><![CDATA[<p><a href="http://bpick.tumblr.com/">Bailey</a> just asked me what the difference is between 回収 (<em>kaishū</em>) and 収集 (<em>shūshū</em>), two words that would both map to the English verb &#8220;collect.&#8221; I intuitively came up with a hypothesis to explain the distinction:</p>

<ul>
<li>回収 may involve taking things away from others when collecting, while 収集 does not have that implication.</li>
<li>Things that you 回収 may have been previously distributed by the collector themselves, while 収集 does not have that implication.<sup id="fnref:3"><a href="#fn:3" rel="footnote">1</a></sup></li>
</ul>

<p>Not content with armchair theorizing, however, I decided to take advantage of one of the largest corpora in the world: <a href="http://en.wikipedia.org/wiki/Google">Google</a>.<sup id="fnref:2"><a href="#fn:2" rel="footnote">2</a></sup> To test my hypothesis, I chose two &#8220;objects of collection,&#8221; one that can be taken away (and is often distributed first) and one that can&#8217;t: アンケート (<em>ankēto</em> &#8220;survey,&#8221; from the French <em>enquête</em>) and 意見 (<em>iken</em> &#8220;opinion&#8221;). I then searched Google for each of the four resulting collocations<sup id="fnref:1"><a href="#fn:1" rel="footnote">3</a></sup> in quotes (&#8220;•&#8221;) and recorded the number of hits.</p>

<p><span id="more-721"></span></p>

<table style="margin-left: auto; margin-right: auto;">
<tr><th>&#8220;意見を収集&#8221;</th><th>&#8220;意見を回収&#8221;</th><th>&#8220;アンケートを収集&#8221;</th><th>&#8220;アンケートを回収&#8221;</th></tr>
<tr><td>218000</td><td>6200</td><td>784</td><td>169000</td></tr>
</table>

<p>A better way to organize this data is as follows:</p>

<table style="margin-left: auto; margin-right: auto;">
<tr><th>&#8220;↓を→&#8221;</th><th>回収</th><th>収集</th></tr>
<tr><th>アンケート</th><td>169000</td><td>784</td></tr>
<tr><th>意見</th><td>6200</td><td>218000</td></tr>
</table>

<p>This data clearly supports the hypothesis I laid out above: アンケート, which can be taken away from people and is often distributed first, occurs much more often with 回収 than with 収集. 意見, on the other hand, which crucially cannot be taken away when collected, occurs much more often with 収集 than with 回収.</p>
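<p>To put numbers on &#8220;much more often,&#8221; here is a minimal sketch that turns the hit counts into per-noun shares. It takes the counts from the first table above; the <code>hits</code> dictionary and the <code>share</code> helper are names made up for this illustration, and raw Google hit counts are only a rough proxy for frequency.</p>

```python
# Hit counts as reported in the first table above (Google, 2008).
hits = {
    ("アンケート", "回収"): 169000,
    ("アンケート", "収集"): 784,
    ("意見", "回収"): 6200,
    ("意見", "収集"): 218000,
}


def share(noun: str, verb: str) -> float:
    """Fraction of this noun's hits that occur with the given verb."""
    total = sum(count for (n, _), count in hits.items() if n == noun)
    return hits[(noun, verb)] / total


for noun in ("アンケート", "意見"):
    for verb in ("回収", "収集"):
        print(f"{noun}を{verb}: {share(noun, verb):.1%}")
```

<p>With these counts, each noun pairs with its preferred verb in well over nine out of ten hits, which is the asymmetry the hypothesis predicts.</p>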

<p>While this one example doesn&#8217;t <em>prove</em> anything in and of itself, it does help clarify with data a nuance between two near synonyms. While my hypothesis was borne out here, native speaker intuitions on word nuances and distinctions can be unreliable.<sup id="fnref:4"><a href="#fn:4" rel="footnote">4</a></sup> This type of quick test can be very helpful for language learners and instructors alike.</p>

<p>Languages very often have words which vary in very subtle ways. Just this Tuesday I went to a <a href="http://linguistic.meetup.com/58/">Tokyo Language Exchange Meetup</a>, a great <a href="http://en.wikipedia.org/wiki/meetup.com">meetup</a> which brought together various language learners and enthusiasts. A hot topic that night was words with very similar meanings—near synonyms. A few English learners were lamenting sets of words like {see, view, watch} and how difficult they are to learn. I myself have had the same experience studying Mandarin.</p>

<p>I noted that these difficulties in offering contrasting definitions are often due to the fact that word meanings are not just &#8220;what the word points to&#8221; but also the implication of &#8220;what it relates to&#8221;.<sup id="fnref:5"><a href="#fn:5" rel="footnote">5</a></sup> For example, &#8220;unborn baby&#8221; and &#8220;fetus&#8221; may point to the same thing, but are used in different contexts, in contrast to different other terms, for differing effect. The same goes for &#8220;Death Tax&#8221; and &#8220;Estate Tax,&#8221; and for &#8220;kneel&#8221; and &#8220;genuflect.&#8221;<sup id="fnref:6"><a href="#fn:6" rel="footnote">6</a></sup></p>

<p>The concept of word meanings being &#8220;what it points to&#8221; and &#8220;what it relates to&#8221; also helps explain why certain words are difficult to translate. Fillmore uses the Japanese example of ぬるい (<em>nurui</em>), which is the de facto translation of &#8220;lukewarm.&#8221; However, some Japanese speakers will only use ぬるい in contrast with &#8220;hot&#8221;: hot tea can become ぬるい over time, but ice water does not. In contrast, English &#8220;lukewarm&#8221; can describe things that are initially or prototypically either hot or cold. &#8220;What the words point to&#8221; is the same in this case, but &#8220;what they relate to&#8221; or, here, &#8220;what they contrast with&#8221; is different, making ぬるい an imperfect (though very close) translation.</p>

<p>Every language has near synonyms which vary slightly in nuance but this nuance or &#8220;feeling&#8221; is borne out objectively in data. Looking at what words certain terms relate to <em>in real usage</em> is often the key to getting a richer understanding of vocabulary.</p>

<div class="footnotes">
<hr />
<ol>

<li id="fn:3">
<p>This second point could also be hypothesized based on the component meaning of 回, which in the verb 回る (<em>mawa=ru</em>) can mean &#8220;circle back.&#8221;&#160;<a href="#fnref:3" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:2">
<p>Google is of course a huge corpus but it has very limited search and can easily be misused and misunderstood, thus making Google an unreliable (unprofessional) source for statistical data. One Google alternative for some different statistics is the <a href="http://en.wikipedia.org/wiki/n-gram">n-gram</a> <a href="http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html">data they offer</a> for research.&#160;<a href="#fnref:2" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:1">
<p><a href="http://en.wikipedia.org/wiki/collocation">&#8220;Collocation&#8221; on Wikipedia</a> says: &#8220;Within the area of corpus linguistics, collocation is defined as a sequence of words or terms which co-occur more often than would be expected by chance.&#8221;&#160;<a href="#fnref:1" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:4">
<p>Hm&#8230; I just made a claim&#8230; looking for a citation.&#160;<a href="#fnref:4" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:5">
<p>&#8220;Relates to&#8221; here is not meant in an etymological sense. In <a href="http://en.wikipedia.org/wiki/frame semantics (linguistics)">frame semantics</a>, a part of <a href="http://en.wikipedia.org/wiki/cognitive linguistics">cognitive linguistics</a>, the &#8220;what the word points to&#8221; may be called a <strong>profile</strong> while the &#8220;what it relates to&#8221; is called the <strong>(semantic) frame</strong>. These distinctions are due to the work of <a href="http://en.wikipedia.org/wiki/Charles J. Fillmore">Fillmore</a> 1976.&#160;<a href="#fnref:5" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:6">
<p>The great examples in this section come from Bill Croft and D. Alan Cruse&#8217;s <em>Cognitive Linguistics</em>, 2004&#160;<a href="#fnref:6" rev="footnote">&#8617;</a></p>
</li>

</ol>
</div>
]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/observation/%e5%8f%8e%e9%9b%86-vs-%e5%9b%9e%e5%8f%8e-and-better-word-meanings-through-usage/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Free licenses for Mailplane 2.0—Mailplane 2.0 の無料ライセンズ</title>
		<link>http://mitcho.com/blog/projects/free-licenses-for-mailplane-20%e2%80%94mailplane-20-%e3%81%ae%e7%84%a1%e6%96%99%e3%83%a9%e3%82%a4%e3%82%bb%e3%83%b3%e3%82%ba/</link>
		<comments>http://mitcho.com/blog/projects/free-licenses-for-mailplane-20%e2%80%94mailplane-20-%e3%81%ae%e7%84%a1%e6%96%99%e3%83%a9%e3%82%a4%e3%82%bb%e3%83%b3%e3%82%ba/#comments</comments>
		<pubDate>Tue, 19 Aug 2008 13:05:44 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[projects]]></category>
		<category><![CDATA[free]]></category>
		<category><![CDATA[Gmail]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Japanese language]]></category>
		<category><![CDATA[localization]]></category>
		<category><![CDATA[Mac OS X]]></category>
		<category><![CDATA[Mailplane]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=620</guid>
		<description><![CDATA[I&#8217;ve written before about Mailplane, a high-quality Gmail client with some great Mac-specific features. I&#8217;ve been happy to be associated with the project as its Japanese localizer. I recently completed the localization for the upcoming version 2.0. As a result, I&#8217;ve received twenty free licenses for Mailplane 2.0 from the developer, Ruben Bakker. Email me [...]
]]></description>
			<content:encoded><![CDATA[<p><img src="http://mitcho.com/blog/wp-content/uploads/2008/08/mailplane-logo.png" alt="" title="mailplane-logo" width="200" height="200" class="alignnone size-full wp-image-621" /></p>

<p>I&#8217;ve <a href="/blog/2007/09/04/mailplane-japanese-localization-available/">written before</a> about <a href="http://en.wikipedia.org/wiki/Mailplane">Mailplane</a>, a high-quality Gmail client with some great Mac-specific features. I&#8217;ve been happy to be associated with the project as its Japanese localizer. I recently completed the localization for the upcoming version 2.0. As a result, I&#8217;ve received twenty free licenses for Mailplane 2.0 from the developer, <a href="http://mailplaneapp.com/info/background.html">Ruben Bakker</a>. <a href="mailto:mitcho+mailplane@mitcho.com">Email me</a> if you&#8217;re interested in one, and keep your eyes peeled for the 2.0 gold release.</p>

<p><img src="http://mitcho.com/blog/wp-content/uploads/2008/08/e38394e382afe38381e383a3-2.png" alt="" title="screenshot" width="500" height="223" class="alignnone size-full wp-image-622" /></p>

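<p>I&#8217;ve <a href="/blog/2007/09/04/mailplane-japanese-localization-available/">brought it up here before</a>, but today let me announce the new version of <a href="http://en.wikipedia.org/wiki/Mailplane">Mailplane</a>. Mailplane is a Gmail client packed with Mac-like features, and the latest version (2.0), which supports Gmail 2.0, will be released soon. Since I&#8217;m in charge of Mailplane&#8217;s Japanese version, I received twenty free licenses for Mailplane 2.0 from the developer, <a href="http://mailplaneapp.com/info/background.html">Ruben Bakker</a>. If you would like one, please email me <a href="mailto:mitcho+mailplane@mitcho.com">here</a>.</p>

<p>Also, I hear that Mailplane will soon be featured in <a href="http://macfan.jp/">MacFan</a> in Japan. I&#8217;m looking forward to it! Finally, if you think there is a problem with the Japanese version, please tell me directly before <a href="http://blog.livedoor.jp/forestk/archives/50369904.html">writing it up on your own</a>. ^^ Thank you. m(__)m</p>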
]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/projects/free-licenses-for-mailplane-20%e2%80%94mailplane-20-%e3%81%ae%e7%84%a1%e6%96%99%e3%83%a9%e3%82%a4%e3%82%bb%e3%83%b3%e3%82%ba/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Testing Google&#8217;s Language Detection</title>
		<link>http://mitcho.com/blog/observation/testing-googles-language-detection/</link>
		<comments>http://mitcho.com/blog/observation/testing-googles-language-detection/#comments</comments>
		<pubDate>Sat, 17 May 2008 09:47:04 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[observation]]></category>
		<category><![CDATA[Chinese]]></category>
		<category><![CDATA[Chinese characters]]></category>
		<category><![CDATA[code]]></category>
		<category><![CDATA[computational linguistics]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Japanese language]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[language detection]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[Mandarin]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=254</guid>
		<description><![CDATA[As Google adds ten more languages to its machine translation service, it seems to be on its way to becoming the most convenient universal translator of the world&#8217;s popular languages. Google&#8217;s handling of languages of course isn&#8217;t perfect, however—in particular, I&#8217;ve been complaining to friends for a while about the weaknesses of Google&#8217;s handling of [...]
]]></description>
			<content:encoded><![CDATA[<p><img src="http://mitcho.com/blog/wp-content/uploads/2008/05/google-code.png" alt="google code" title="google-code" width="156" height="57" /></p>

<p>As <a href="http://googleblog.blogspot.com/2008/05/google-translate-adds-10-new-languages.html">Google adds ten more languages to its machine translation service</a>, it seems to be on its way to becoming the most convenient <a href="http://en.wikipedia.org/wiki/universal translator">universal translator</a> of the world&#8217;s popular languages. Google&#8217;s handling of languages of course isn&#8217;t perfect, however—in particular, I&#8217;ve been complaining to friends for a while about the weaknesses of Google&#8217;s handling of queries in Chinese character (<a href="http://en.wikipedia.org/wiki/Chinese characters">漢字/汉字</a>) scripts. In this post, I run some tests using Google&#8217;s <a href="http://googleblog.blogspot.com/2008/03/new-google-ajax-language-api-tools-for.html">Language Detection service</a> to try to better understand its handling of Chinese character queries.</p>

<h3>Background</h3>

<p>Chinese characters have been used all across East Asia, most notably in Chinese, Japanese, Korean, and Vietnamese (the &#8220;CJKV&#8221;). Prescriptivist writing reforms in Communist China and Japan have simplified many characters, though. Some characters were simplified in the same way, some in different ways, and some in only one country but not the other. For more information, there&#8217;s <a href="http://en.wikipedia.org/wiki/Chinese character">Wikipedia</a> or <a href="http://books.google.com/books?id=htlttpi1KOoC">Ken Lunde&#8217;s CJKV Information Processing</a>.</p>

<h3>The problem</h3>

<p>The issue comes up when you try to search for a word in Chinese characters which clearly comes from one Chinese character-using language. From my experience, <strong>Google doesn&#8217;t use the query to infer which language you are searching in, and returns many results in other Chinese character-using languages as well.</strong><sup id="fnref:1"><a href="#fn:1" rel="footnote">1</a></sup></p>

<p><span id="more-254"></span></p>

<p>Take, for example, a query like &#8220;七面鳥&#8221;, meaning &#8216;turkey&#8217; in Japanese. While all characters are very common in traditional Chinese (鳥 is simplified to 鸟 in China), the combination &#8220;七面鳥&#8221; is quite rare in Chinese. However, when you search for &#8220;七面鳥,&#8221; many of the first results are in Chinese and only two of the first ten results are in Japanese.</p>

<p>Does Google&#8217;s corpus not identify &#8220;七面鳥&#8221; as a primarily Japanese word? It does: searching for &#8220;七面鳥&#8221; with results limited to a single language yields the hit counts below. A similar effect can be seen with Japanese words such as &#8220;芝生&#8221; (&#8216;grass&#8217;) or &#8220;泥棒&#8221; (&#8216;burglar&#8217;). The &#8220;Japanese on first page&#8221; column gives the number of Japanese results on the first page of a language-unspecified search from the US.</p>

<table style="margin-left: auto; margin-right: auto;">
<tr><th>&nbsp;</th><th>Chinese (simplified)</th><th> Chinese (traditional)</th><th> Japanese </th><th>Japanese on<br />first page</th></tr>
<tr><th>七面鳥</th><td>786</td><td>926</td><td>395,000</td><td>2/10</td></tr>
<tr><th>芝生</th><td>55,600</td><td>216,000</td><td>2,230,000</td><td>0/10</td></tr>
<tr><th>泥棒</th><td>13,500</td><td>22,500</td><td>10,400,000</td><td>3/10</td></tr>
</table>
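<p>The mismatch is easier to see as percentages. The sketch below compares the Japanese share of the language-restricted hits with the Japanese share of the first results page. The counts are copied from the table above; the language labels (<code>zh-Hans</code>, <code>zh-Hant</code>, <code>ja</code>) and the helper name <code>japanese_share</code> are my own shorthand for this illustration.</p>

```python
# Language-restricted hit counts copied from the table above.
hits = {
    "七面鳥": {"zh-Hans": 786, "zh-Hant": 926, "ja": 395_000},
    "芝生": {"zh-Hans": 55_600, "zh-Hant": 216_000, "ja": 2_230_000},
    "泥棒": {"zh-Hans": 13_500, "zh-Hant": 22_500, "ja": 10_400_000},
}
# Japanese results among the first ten in a language-unspecified search.
first_page_ja = {"七面鳥": 2, "芝生": 0, "泥棒": 3}


def japanese_share(word: str) -> float:
    """Fraction of the three language-restricted counts that is Japanese."""
    counts = hits[word]
    return counts["ja"] / sum(counts.values())


for word in hits:
    print(f"{word}: {japanese_share(word):.1%} of restricted hits, "
          f"{first_page_ja[word] / 10:.0%} of the first page")
```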

<p>In a perfect world, I would like Google to <strong>identify the language that the query is in</strong>, and then <strong>rank results that are in that language higher</strong> in the results list. So the issue comes down to one of <strong>language detection</strong>.</p>

<iframe src="http://www.google.com/uds/samples/language/detect.html" width='400px' height="200px"></iframe>

<p>There are broadly two different approaches to language detection and, indeed, all natural language processing problems: <em>parsing</em> and <em>counting</em>. In this case, parsing involves trying to break apart the query into words and then computing how likely such a string of <em>words</em> is in each given language. Counting simply takes an inventory of the characters given and compares them to their frequencies in each language, computing how likely such a string of <em>characters</em> is in each language. Parsing is the &#8220;smarter&#8221; approach, but more difficult and computationally intensive.</p>
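<p>As a rough illustration of the counting approach described above, here is a minimal sketch of a character-frequency detector. Everything in it is an assumption for the sake of the example: the probability tables in <code>CHAR_PROBS</code> are invented toy values, the <code>UNSEEN</code> smoothing constant is arbitrary, and the &#8220;confidence&#8221; is just a normalized likelihood. It is not a description of what Google actually does.</p>

```python
import math
from collections import Counter

# Hypothetical per-language character probabilities, invented for illustration.
CHAR_PROBS = {
    "ja": {"気": 0.02, "参": 0.01, "加": 0.01, "木": 0.01},
    "zh-Hant": {"氣": 0.02, "參": 0.01, "加": 0.01, "木": 0.01},
    "zh-Hans": {"气": 0.02, "参": 0.01, "加": 0.01, "木": 0.01},
}
UNSEEN = 1e-6  # assumed probability for characters missing from a table


def log_likelihoods(text: str) -> dict[str, float]:
    """Score the text under each language using character counts only."""
    counts = Counter(ch for ch in text if not ch.isspace())
    return {
        lang: sum(n * math.log(probs.get(ch, UNSEEN)) for ch, n in counts.items())
        for lang, probs in CHAR_PROBS.items()
    }


def detect(text: str) -> tuple[str, float]:
    """Return the best-scoring language and a crude confidence value."""
    scores = log_likelihoods(text)
    top = max(scores.values())
    # Normalize the likelihoods so that they sum to one across languages.
    weights = {lang: math.exp(score - top) for lang, score in scores.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())


print(detect("天気"))
print(detect("天氣"))
```

<p>Because the score depends only on how often each character appears, a detector built this way cannot tell a real word from a shuffled or space-padded version of it.</p>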

<p>Google was kind enough to give us a <a href="http://googleblog.blogspot.com/2008/03/new-google-ajax-language-api-tools-for.html">language detection AJAX service</a> so we can get a sense of how their language detection works. This service also gives a &#8220;confidence&#8221; value on the detection result. For the rest of this entry, we&#8217;ll test some hypotheses against this service and conclude at the end.</p>

<h3>Do spaces matter?</h3>

<p><strong>No.</strong> While spaces are sometimes used in Japanese and Chinese writing to represent word boundaries, especially around numbers and roman letters, they are also seen on the web to encourage line breaks. It would make sense for Google&#8217;s language detection service to ignore spaces in Chinese character queries, and that does seem to be the case. All tests I ran with Chinese character queries gave the same result with the same confidence with and without spaces in random places.</p>

<h3>Does order matter?</h3>

<p><strong>No.</strong> This was slightly disappointing to see. I took the Japanese string &#8220;骨粗鬆症&#8221; (&#8216;osteoporosis&#8217;, if you&#8217;re curious) and ran every permutation against the language detector and got the same results, including the same confidence values. This is a clear indicator that Google uses only counting, not parsing, in their detector.</p>
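<p>The permutation test itself is easy to reproduce. The sketch below only generates the queries and shows why a pure counting method must treat them all alike; sending each query to a detection service is left out, since that depends on whatever API is available. The variable names are hypothetical.</p>

```python
from collections import Counter
from itertools import permutations

word = "骨粗鬆症"
# Every ordering of the four characters.
queries = ["".join(p) for p in permutations(word)]

# A counting-based detector sees only the multiset of characters,
# so every permutation collapses to the same "bag".
bags = {frozenset(Counter(q).items()) for q in queries}

print(len(queries), "permutations,", len(bags), "distinct character bag(s)")
```

<p>Four distinct characters give 24 orderings, all with the same character bag, so identical confidence values across all of them are what a counting detector would produce.</p>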

<h3>Does repetition matter?</h3>

<p><strong>Yes.</strong> Now that it seems that Google does not use any parsing and only uses character frequencies in identifying the source language, let&#8217;s see how repetition can affect the detection service.</p>

<p>First, I took some Chinese character strings and ran them through the detection service with different numbers of repetitions, e.g. &#8220;参加&#8221;, &#8220;参加参加&#8221;, &#8220;参加参加参加&#8221;, &#8220;参加参加参加参加&#8221;&#8230; The queries I used were the following:</p>

<table style="margin-left: auto; margin-right: auto;">
<tr><th>&nbsp;</th><th>Chinese (traditional)</th><th>Japanese</th><th>Chinese (simplified)</th></tr>
<tr><th>木</th><td>X</td><td>X</td><td>X</td></tr>
<tr><th>漢字</th><td>X</td><td>X</td><td>&nbsp;</td></tr>
<tr><th>氣</th><td>X</td><td>&nbsp;</td><td>&nbsp;</td></tr>
<tr><th>參加</th><td>X</td><td>&nbsp;</td><td>&nbsp;</td></tr>
<tr><th>参加</th><td>&nbsp;</td><td>X</td><td>X</td></tr>
<tr><th>気</th><td>&nbsp;</td><td>X</td><td>&nbsp;</td></tr>
<tr><th>气</th><td>&nbsp;</td><td>&nbsp;</td><td>X</td></tr>
</table>

<p>For each token type, the detection service made up its mind quite quickly. Its confidence, however, was more interesting.</p>

<p><center><img src="http://mitcho.com/blog/wp-content/uploads/2008/05/picture-7.png" alt="" title="repetition vs. confidence" /></center></p>

<p>Each of the confidence values dips sharply after three, five, or ten repetitions. Note, however, the length of the tokens which dipped at each of those points. I interpret this to mean that <strong>a different algorithm is used for queries of fewer than ten characters and queries of ten or more characters.</strong> However, the detection service did not change its answer after this point on any of the tokens.</p>

<p>Second, I took two characters, &#8220;簡&#8221; and &#8220;体,&#8221; and crossed different numbers of them together to see how that would affect the language detected. Note that &#8220;簡&#8221; is used in traditional Chinese and Japanese, while &#8220;体&#8221; is used in simplified Chinese and Japanese.</p>

<p><style type="text/css">
table .zh { background-color: #e3d2d2; }
table .zh-Hant { background-color: #d3e3d2; }
table .ja { background-color: #d5d2e3; }
</style></p>

<table style="margin-left:auto;margin-right:auto;">
<tr><th>&nbsp;</th><th>簡x0</th><th>簡x1</th><th>簡x2</th><th>簡x3</th><th>簡x4</th><th>簡x5</th><th>簡x6</th><th>簡x7</th><th>簡x8</th><th>簡x9</th></tr>
<tr><th>体x0</th><td>&nbsp;</td> <td class='zh'>0.995</td> <td class='zh'>0.998</td> <td class='zh'>0.998</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> </tr>
<tr><th>体x1</th><td class='zh-Hant'>0.995</td> <td class='ja'>0.998</td> <td class='ja'>0.998</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> </tr>
<tr><th>体x2</th><td class='zh-Hant'>0.998</td> <td class='ja'>0.998</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='zh'>0.531</td> </tr>
<tr><th>体x3</th><td class='zh-Hant'>0.998</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.52</td> <td class='ja'>0.568</td> </tr>
<tr><th>体x4</th><td class='zh-Hant'>0.999</td> <td class='zh-Hant'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.516</td> <td class='ja'>0.565</td> <td class='ja'>0.613</td> </tr>
<tr><th>体x5</th><td class='zh-Hant'>0.999</td> <td class='zh-Hant'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.512</td> <td class='ja'>0.561</td> <td class='ja'>0.609</td> <td class='ja'>0.657</td> </tr>
<tr><th>体x6</th><td class='zh-Hant'>0.999</td> <td class='zh-Hant'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.507</td> <td class='ja'>0.556</td> <td class='ja'>0.605</td> <td class='ja'>0.653</td> <td class='ja'>0.702</td> </tr>
<tr><th>体x7</th><td class='zh-Hant'>0.999</td> <td class='zh-Hant'>0.999</td> <td class='zh-Hant'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.502</td> <td class='ja'>0.551</td> <td class='ja'>0.6</td> <td class='ja'>0.649</td> <td class='ja'>0.697</td> <td class='ja'>0.746</td> </tr>
<tr><th>体x8</th><td class='zh-Hant'>0.999</td> <td class='zh-Hant'>0.999</td> <td class='zh-Hant'>0.999</td> <td class='ja'>1</td> <td class='ja'>0.545</td> <td class='ja'>0.595</td> <td class='ja'>0.644</td> <td class='ja'>0.693</td> <td class='ja'>0.741</td> <td class='ja'>0.79</td> </tr>
<tr><th>体x9</th><td class='zh-Hant'>0.999</td> <td class='zh-Hant'>0.999</td> <td class='zh-Hant'>1</td> <td class='ja'>0.539</td> <td class='ja'>0.589</td> <td class='ja'>0.638</td> <td class='ja'>0.687</td> <td class='ja'>0.736</td> <td class='ja'>0.785</td> <td class='ja'>0.834</td> </tr>
</table>

<table style="margin-left:auto;margin-right:auto;">
<tr><td class="ja">Japanese</td><td class='zh-Hant'>Chinese (traditional)</td><td class='zh'>Chinese (simplified)</td></tr>
</table>
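<p>For anyone who wants to rerun this experiment, the grid of test strings can be generated along the following lines. This is a sketch under assumptions: the function name <code>mixed_queries</code> is made up, and I put all copies of one character before the other, which should not matter if order is ignored as observed earlier.</p>

```python
def mixed_queries(a: str = "簡", b: str = "体",
                  max_count: int = 9) -> dict[tuple[int, int], str]:
    """Map (count of b, count of a) to a query string, mirroring the table's rows and columns."""
    grid = {}
    for b_count in range(max_count + 1):
        for a_count in range(max_count + 1):
            if a_count == 0 and b_count == 0:
                continue  # the empty string has nothing to detect
            grid[(b_count, a_count)] = a * a_count + b * b_count
    return grid


grid = mixed_queries()
print(len(grid))     # 99 non-empty cells
print(grid[(2, 3)])  # 簡簡簡体体
```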

<h3>Conclusion</h3>

<p>For Chinese character-based languages, Google&#8217;s language detection algorithm uses simple counting rather than parsing, identifying languages by looking at the <em>frequency of characters</em> rather than the <em>frequency of words</em>. As such, the algorithm simply acts as a <strong>script detector, not a language detector.</strong> Moreover, as a simple counting method is used, duplicating characters used in one language but not another can very easily skew the resulting output.</p>

<p>As a trivial aside, it seems that Google&#8217;s algorithm is slightly different for strings of fewer than ten characters, as can be seen in the dip and subsequent rise in confidence values once a query reaches ten characters.</p>

<div class="footnotes">
<hr />
<ol>

<li id="fn:1">
<p>Just to complicate matters further, there&#8217;s also the issue of where you&#8217;re accessing Google from. For example, accessing from the US (or via my friend <a href="http://support.uchicago.edu/docs/network/vpn/">VPN</a>), a query for the Japanese-simplified &#8220;天気&#8221; seems to only return Japanese pages. However, accessing from Taiwan, Google assumes you may have meant the full-form &#8220;天氣&#8221;, giving you pages with both &#8220;天気&#8221; and &#8220;天氣&#8221;. As a result, Yahoo Japan weather is the first result from the US and third from Taiwan, while Yahoo Taiwan weather is first in Taiwan and doesn&#8217;t even show up from the US. This default character substitution in Taiwan is one of my least-favorite Google &#8220;features.&#8221;<br /><a rel="lightbox[google]" href='http://mitcho.com/blog/wp-content/uploads/2008/05/picture-1.png'><img class="images" src="http://mitcho.com/blog/wp-content/uploads/2008/05/picture-1-300x256.png" alt="" title="picture-1"/></a><a rel="lightbox[google]" href='http://mitcho.com/blog/wp-content/uploads/2008/05/picture-2.png'><img class="images" src="http://mitcho.com/blog/wp-content/uploads/2008/05/picture-2-300x256.png" alt="" title="picture-2"/></a><br />Similar effects can most likely be seen between the US and China. In the rest of this post, all queries will be made from the US.&#160;<a href="#fnref:1" rev="footnote">&#8617;</a></p>
</li>

</ol>
</div>
]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/observation/testing-googles-language-detection/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

