
Posts Tagged ‘Google’

回収 vs. 収集 and Better Word Meanings Through Usage

Thursday, September 18th, 2008

Bailey just asked me what the difference is between 回収 (kaishū) and 収集 (shūshū), two words that would both map to the English verb “collect.” I intuitively came up with a hypothesis to explain the distinction:

  • 回収 can imply taking things away from others when collecting, while 収集 carries no such implication.
  • Things that you 回収 may previously have been distributed by the collector themselves, while 収集 carries no such implication.[1]

Not content with armchair theorizing, however, I decided to take advantage of one of the largest corpora in the world: Google.[2] To test my hypothesis, I chose two “objects of collection”: one that you can take away (and that is often distributed first) and one that you can’t take away, namely アンケート (ankēto “survey,” from the French enquête) and 意見 (iken “opinion”). I then searched Google for each of the four resulting collocations[3] in quotes (“•”) and recorded how many hits each returned.
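The two-by-two design described above can be sketched in a few lines of Python. This is only an illustration: the object particle を joining each noun to its verb is an assumption (the excerpt does not say exactly how the phrases were formed), the function names are hypothetical, and no hit counts are included because the excerpt does not report them.

```python
from itertools import product

# Verbs and objects are taken from the post.
VERBS = ["回収", "収集"]
OBJECTS = ["アンケート", "意見"]


def build_queries(objects, verbs, particle="を"):
    """Return one exact-phrase (quoted) query per object/verb pairing.

    The particle is an assumption made for illustration.
    """
    return [f'"{obj}{particle}{verb}"' for obj, verb in product(objects, verbs)]


def verb_shares(hits_by_verb):
    """Turn raw hit counts for one object into per-verb proportions.

    hits_by_verb maps each verb to the hit count recorded by hand.
    """
    total = sum(hits_by_verb.values())
    if total == 0:
        raise ValueError("no hits recorded for this object")
    return {verb: count / total for verb, count in hits_by_verb.items()}


if __name__ == "__main__":
    for query in build_queries(OBJECTS, VERBS):
        print(query)
```

Comparing proportions per object, rather than raw counts, keeps a very common noun from swamping a rarer one. The caveats about Google hit counts in the second footnote still apply.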



  1. This second point could also be hypothesized based on the component meaning of 回, which in the verb 回る (mawa=ru) can mean “circle back.” 

  2. Google is of course a huge corpus, but its search features are very limited and its hit counts are easily misused and misunderstood, which makes it an unreliable (unprofessional) source of statistical data. One alternative from Google, offering a different kind of statistics, is the n-gram data it releases for research. 

  3. “Collocation” on Wikipedia says: “Within the area of corpus linguistics, collocation is defined as a sequence of words or terms which co-occur more often than would be expected by chance.” 

Free licenses for Mailplane 2.0—Mailplane 2.0 の無料ライセンズ

Tuesday, August 19th, 2008

I’ve written before about Mailplane, a high-quality Gmail client with some great Mac-specific features. I’ve been happy to be associated with the project as its Japanese localizer. I recently completed the localization for the upcoming version 2.0. As a result, I’ve received twenty free licenses for Mailplane 2.0 from the developer, Ruben Bakker. Email me if you’re interested in one, and keep your eyes peeled for the 2.0 gold release.

前にもここで話題にしたことはあるが、今日は Mailplane の新バージョンを発表しよう。 Mailplane は Mac 的な機能満載の Gmail クライアントで、Gmail 2.0 対応の最新バージョン (2.0) が近々リリースされる。 自分は Mailplane の日本語版担当なので開発者のルーベン・バッカーさんから Mailplane 2.0 の無料ライセンズを20件頂きました。欲しい方はこちらにメールしてください。

尚、日本ではもうすぐ Mailplane が MacFan で紹介されるとのこと。楽しみ!最後に、日本語版で問題があると思ったら、勝手に書き上げる前に直接教えてね。^^ お願いします。m(__)m

Testing Google’s Language Detection

Saturday, May 17th, 2008


As Google adds ten more languages to its machine translation service, it seems to be on its way to becoming the most convenient universal translator of the world’s popular languages. Google’s handling of languages of course isn’t perfect, however—in particular, I’ve been complaining to friends for a while about the weaknesses of Google’s handling of queries in Chinese character (漢字/汉字) scripts. In this post, I run some tests using Google’s Language Detection service to try to better understand its handling of Chinese character queries.

Background

Chinese characters have been used all across East Asia, most notably in Chinese, Japanese, Korean, and Vietnamese (the “CJKV”). Prescriptivist writing reforms in Communist China and Japan have simplified many characters, though. Some characters were simplified in the same way, some in different ways, and some in only one country but not the other. For more information, there’s Wikipedia or Ken Lunde’s CJKV Information Processing.
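To make the three cases concrete, here is a small Python sketch listing one well-known character for each. The examples and the helper name `codepoints` are my own illustration, not taken from the references above; each row gives the traditional form, the form used in Japan, and the form used in mainland China.

```python
# (traditional, Japanese form, mainland Chinese form)
VARIANTS = {
    "simplified the same way": ("國", "国", "国"),
    "simplified in different ways": ("氣", "気", "气"),
    "simplified only in mainland China": ("車", "車", "车"),
    "simplified only in Japan": ("佛", "仏", "佛"),
}


def codepoints(text):
    """Format each character of text as a Unicode code point, e.g. U+0041."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


if __name__ == "__main__":
    for case, (traditional, japanese, chinese) in VARIANTS.items():
        shared = "shared" if japanese == chinese else "distinct"
        print(f"{case}: {traditional} / {japanese} / {chinese} ({shared} JP/CN form)")
        print("  " + codepoints(traditional + japanese + chinese))
```

When the Japanese and mainland forms are the same code point, the characters in a query cannot by themselves show which language the query came from.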

The problem

The issue comes up when you search for a word in Chinese characters that clearly comes from one particular Chinese character-using language. In my experience, Google doesn’t infer from the query which language you use, and it returns many results in other Chinese character-using languages as well.[^1]



© 2006-2010 mitcho (Michael 芳貴 Erlewine).
The views expressed on these pages are mine alone and do not
reflect those of my employers and clients, past and present.