<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>mitcho.com &#187; observation</title>
	<atom:link href="http://mitcho.com/blog/category/observation/feed/" rel="self" type="application/rss+xml" />
	<link>http://mitcho.com</link>
	<description></description>
	<lastBuildDate>Sat, 11 Feb 2012 12:23:05 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.4-alpha-19719</generator>
		<item>
		<title>The new Apple campus and the Pentagon compared</title>
		<link>http://mitcho.com/blog/observation/apple-campus-pentagon/</link>
		<comments>http://mitcho.com/blog/observation/apple-campus-pentagon/#comments</comments>
		<pubDate>Wed, 08 Jun 2011 15:59:02 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[observation]]></category>
		<category><![CDATA[Apple]]></category>
		<category><![CDATA[Cupertino]]></category>
		<category><![CDATA[Google Maps]]></category>
		<category><![CDATA[map]]></category>
		<category><![CDATA[Pentagon]]></category>

		<guid isPermaLink="false">http://mitcho.com/?p=4502</guid>
		<description><![CDATA[+ + = That is all.
Related posts:<ol>
<li><a href='http://mitcho.com/blog/life/travel/%e5%8c%97%e4%ba%ac-part-2-summer-palace-bargaining-the-tree-and-fried-apple-pie/' rel='bookmark' title='北京 Part 2: Summer Palace, bargaining, The Tree, and fried apple pie'>北京 Part 2: Summer Palace, bargaining, The Tree, and fried apple pie</a></li>
</ol>

Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.]]></description>
			<content:encoded><![CDATA[<p><span style="font-size:18px"><a href="http://maps.google.com/maps?f=q&#038;source=s_q&#038;hl=en&#038;geocode=&#038;sll=37.332563,-122.00901&#038;sspn=0.01167,0.02178&#038;g=infinite+loop,+cupertino&#038;ie=UTF8&#038;hq=&#038;ll=37.332443,-122.009246&#038;spn=0.01167,0.02178&#038;t=h&#038;z=16"><img style="vertical-align:middle;" src="http://mitcho.com/blog/wp-content/uploads/2011/06/cupertino-300x271.png" alt="" title="cupertino" width="150" height="136" class="alignnone size-medium wp-image-4505" /></a> + <a href="http://maps.google.com/maps?f=q&#038;source=s_q&#038;hl=en&#038;geocode=&#038;q=Pentagon,+Arlington,+VA&#038;aq=0&#038;sll=37.332443,-122.009246&#038;sspn=0.01167,0.02178&#038;ie=UTF8&#038;hq=Pentagon,+Arlington,+VA&#038;ll=38.870604,-77.055724&#038;spn=0.011427,0.02178&#038;t=h&#038;z=16"><img style="vertical-align:middle;" src="http://mitcho.com/blog/wp-content/uploads/2011/06/Screen-shot-2011-06-08-at-11.02.19-AM-300x279.png" alt="" title="Screen shot 2011-06-08 at 11.02.19 AM" width="150" height="140" class="alignnone size-medium wp-image-4503" /></a> + <a href="http://www.9to5mac.com/71080/steve-jobs-presents-ideas-for-new-apple-super-campus-to-cupertino-city-council/"><img style="vertical-align:middle;" src="http://mitcho.com/blog/wp-content/uploads/2011/06/before-after-apple-campus-hp-203x300.jpg" alt="" title="before-after-apple-campus-hp" width="150" height="226" class="alignnone size-medium wp-image-4504" /></a> =</span></p>

<p><img src="http://mitcho.com/blog/wp-content/uploads/2011/06/merged.jpg" alt="" title="Apple campus + Pentagon" width="650" height="588" class="alignnone size-full wp-image-4508" /></p>

<p>That is all.</p>
<p>Related posts:</p><ol>
<li><a href='http://mitcho.com/blog/life/travel/%e5%8c%97%e4%ba%ac-part-2-summer-palace-bargaining-the-tree-and-fried-apple-pie/' rel='bookmark' title='北京 Part 2: Summer Palace, bargaining, The Tree, and fried apple pie'>北京 Part 2: Summer Palace, bargaining, The Tree, and fried apple pie</a></li>
</ol>
<p>Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.</p>]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/observation/apple-campus-pentagon/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Voicemail from Jesse</title>
		<link>http://mitcho.com/blog/observation/voicemail-from-jesse/</link>
		<comments>http://mitcho.com/blog/observation/voicemail-from-jesse/#comments</comments>
		<pubDate>Sat, 03 Jul 2010 05:11:24 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[observation]]></category>
		<category><![CDATA[funny]]></category>
		<category><![CDATA[Google voice]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[transcription]]></category>

		<guid isPermaLink="false">http://mitcho.com/?p=3822</guid>
		<description><![CDATA[My friend Jesse left me a voicemail on my Google voice number. Here&#8217;s a demo of the fantastic transcription feature. Voicemail from Jesse from mitcho on Vimeo.
Related posts:<ol>
<li><a href='http://mitcho.com/blog/projects/foxkeh-demos-ubiquity-parser-the-next-generation/' rel='bookmark' title='Foxkeh demos Ubiquity Parser: The Next Generation'>Foxkeh demos Ubiquity Parser: The Next Generation</a></li>
<li><a href='http://mitcho.com/blog/projects/changes-to-ubiquity-parser-2-and-the-playpen/' rel='bookmark' title='Changes to Ubiquity Parser 2 and the Playpen'>Changes to Ubiquity Parser 2 and the Playpen</a></li>
<li><a href='http://mitcho.com/blog/projects/hookpress-webhooks-for-wordpress/' rel='bookmark' title='HookPress: Webhooks for WordPress'>HookPress: Webhooks for WordPress</a></li>
</ol>

Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.]]></description>
			<content:encoded><![CDATA[<p>My friend <a href="http://20bits.com">Jesse</a> left me a voicemail on my Google voice number. Here&#8217;s a demo of the fantastic transcription feature.</p>

<p><object width="649" height="243"><param name="allowfullscreen" value="true" /><param name="allowscriptaccess" value="always" /><param name="movie" value="http://vimeo.com/moogaloop.swf?clip_id=13051068&amp;server=vimeo.com&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=00ADEF&amp;fullscreen=1" /><embed src="http://vimeo.com/moogaloop.swf?clip_id=13051068&amp;server=vimeo.com&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=00ADEF&amp;fullscreen=1" type="application/x-shockwave-flash" allowfullscreen="true" allowscriptaccess="always" width="649" height="243"></embed></object></p>

<p><a href="http://vimeo.com/13051068">Voicemail from Jesse</a> from <a href="http://vimeo.com/mitchoyoshitaka">mitcho</a> on <a href="http://vimeo.com">Vimeo</a>.</p>
<p>Related posts:</p><ol>
<li><a href='http://mitcho.com/blog/projects/foxkeh-demos-ubiquity-parser-the-next-generation/' rel='bookmark' title='Foxkeh demos Ubiquity Parser: The Next Generation'>Foxkeh demos Ubiquity Parser: The Next Generation</a></li>
<li><a href='http://mitcho.com/blog/projects/changes-to-ubiquity-parser-2-and-the-playpen/' rel='bookmark' title='Changes to Ubiquity Parser 2 and the Playpen'>Changes to Ubiquity Parser 2 and the Playpen</a></li>
<li><a href='http://mitcho.com/blog/projects/hookpress-webhooks-for-wordpress/' rel='bookmark' title='HookPress: Webhooks for WordPress'>HookPress: Webhooks for WordPress</a></li>
</ol>
<p>Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.</p>]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/observation/voicemail-from-jesse/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Disgusting Word-formatted HTML and how to fix it</title>
		<link>http://mitcho.com/blog/projects/disgusting-word-formatted-html-and-how-to-fix-it/</link>
		<comments>http://mitcho.com/blog/projects/disgusting-word-formatted-html-and-how-to-fix-it/#comments</comments>
		<pubDate>Wed, 30 Dec 2009 21:29:44 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[observation]]></category>
		<category><![CDATA[projects]]></category>
		<category><![CDATA[code]]></category>
		<category><![CDATA[HTML]]></category>
		<category><![CDATA[markup]]></category>
		<category><![CDATA[microsoft]]></category>
		<category><![CDATA[MITWPL]]></category>
		<category><![CDATA[Office]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[word]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=3287</guid>
		<description><![CDATA[In working on a new website for the MIT Working Papers in Linguistics, I recently inherited a collection of HTML files with all of our books&#8217; abstracts. To my dismay (but not surprise) the markup in these files was horrendous. Here are some of the cardinal sins of markup that I saw committed in these [...]
Related posts:<ol>
<li><a href='http://mitcho.com/blog/projects/markdown-for-wordpress-and-bbpress/' rel='bookmark' title='Markdown for WordPress and bbPress'>Markdown for WordPress and bbPress</a></li>
<li><a href='http://mitcho.com/blog/observation/%e5%8f%8e%e9%9b%86-vs-%e5%9b%9e%e5%8f%8e-and-better-word-meanings-through-usage/' rel='bookmark' title='回収 vs. 収集 and Better Word Meanings Through Usage'>回収 vs. 収集 and Better Word Meanings Through Usage</a></li>
<li><a href='http://mitcho.com/blog/life/the-most-beautiful-word/' rel='bookmark' title='The Most Beautiful Word'>The Most Beautiful Word</a></li>
</ol>

Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.]]></description>
			<content:encoded><![CDATA[<p>In working on a new website for the MIT Working Papers in Linguistics, I recently inherited a collection of HTML files with all of our books&#8217; abstracts. To my dismay (but not surprise) the markup in these files was horrendous. Here are some of the cardinal sins of markup that I saw committed in these files:</p>

<ol>
<li><strong>Confusing <code>id</code>s and <code>class</code>es.</strong> An <code>id</code> should be unique on the page&#8230; but here the same <code>id</code> is repeated so that several elements can be formatted together.<br/></li>
</ol>


<div class="wp_syntax"><div class="code"><pre class="html" style="font-family:monospace;">&lt;div id=&quot;indent&quot;&gt; &lt;div id=&quot;number&quot;&gt;4.2.1&lt;/div&gt; &lt;div id=&quot;page&quot;&gt;161&lt;/div&gt; &lt;div id=&quot;section&quot;&gt;Old French (Adams 1987)&lt;/div&gt;
&lt;/div&gt; &lt;div id=&quot;indent&quot;&gt; &lt;div id=&quot;number&quot;&gt;4.2.2&lt;/div&gt; &lt;div id=&quot;page&quot;&gt;164&lt;/div&gt; &lt;div id=&quot;section&quot;&gt;The evolution of the dialects of northern Italy&lt;/div&gt;</pre></div></div>
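
<p>For what it&#8217;s worth, this particular sin is easy to detect mechanically. The following is a minimal sketch in Python, not part of my cleanup script: the names <code>IdCollector</code> and <code>duplicate_ids</code> are made up for illustration, and the sample string is a shortened version of the markup above.</p>

```python
from collections import Counter
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Count every id attribute value seen in start tags."""

    def __init__(self):
        super().__init__()
        self.ids = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name == "id" and value:
                self.ids[value] += 1


def duplicate_ids(html):
    """Return a sorted list of id values that are used more than once."""
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    return sorted(i for i, count in collector.ids.items() if count > 1)


# Shortened sample modeled on the markup shown above.
sample = (
    '<div id="indent"><div id="number">4.2.1</div><div id="page">161</div></div>'
    '<div id="indent"><div id="number">4.2.2</div><div id="page">164</div></div>'
)
print(duplicate_ids(sample))
```

<p>Any <code>id</code> this reports is probably a <code>class</code> in disguise.</p>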


<ol start="2">
<li><strong>Putting a class on every instance of something.</strong> Every paragraph should be formatted equivalently. We get the point.<br/></li>
</ol>


<div class="wp_syntax"><div class="code"><pre class="html" style="font-family:monospace;">&lt;p class=MsoNormal&gt;&lt;b&gt;The English Noun Phrase in Its Sentential Aspect&lt;/b&gt;&lt;/p&gt;
&lt;p class=MsoNormal&gt;Steven Paul Abney&lt;/p&gt;
&lt;p class=MsoNormal&gt;May 1987&lt;/p&gt;</pre></div></div>


<ol start="3">
<li><strong>Using blank space for formatting.</strong>  <br/></li>
</ol>


<div class="wp_syntax"><div class="code"><pre class="html" style="font-family:monospace;">&lt;p class=MsoNormal&gt;&lt;o:p&gt;&amp;nbsp;&lt;/o:p&gt;&lt;/p&gt;</pre></div></div>


<ol start="4">
<li><strong>CSS styles that don&#8217;t exist.</strong> Browsers just ignore these anyway&#8230; <br/></li>
</ol>


<div class="wp_syntax"><div class="code"><pre class="html" style="font-family:monospace;">&lt;p class=MsoNormal&gt;One factor in determining which worlds a modal quantifies
over is the temporal argument of the modal’s accessibility relation.&lt;span
style='mso-spacerun:yes'&gt;  &lt;/span&gt;It is well-known that a higher tense affects
the accessibility relation of modals.&lt;span style='mso-spacerun:yes'&gt; 
&lt;/span&gt;What is not well-known is that there are aspectual operators high enough
to affect the accessibility relation of modals.&lt;span style='mso-spacerun:yes'&gt; 
&lt;/span&gt;</pre></div></div>


<h3>The solution</h3>

<p>My solution was to write a Perl script that takes care of a number of these issues. It&#8217;s not foolproof and doesn&#8217;t involve any voodoo (for example, it can&#8217;t re-typeset things that were formatted using whitespace), but it does a good job as a first pass.</p>

<div class="files">
<div class="file">
<a href="http://mitcho.com/blog/wp-content/uploads/2009/12/cleanwordhtml.pl_.txt">cleanwordhtml.pl</a><br/>
<span class="specs">perl</span>
</div>
</div>

<p>You can run the script by making it executable (<code>chmod +x cleanwordhtml.pl</code>) then specifying a target filename as an argument. For example,</p>


<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">.<span style="color: #000000; font-weight: bold;">/</span>cleanwordhtml.pl source.html <span style="color: #000000; font-weight: bold;">&gt;</span> clean.html</pre></div></div>


<p>I used this with a simple bash for loop to run over all my files:</p>


<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">for</span> f <span style="color: #000000; font-weight: bold;">in</span> <span style="color: #000000; font-weight: bold;">*/*</span>.html; <span style="color: #000000; font-weight: bold;">do</span> .<span style="color: #000000; font-weight: bold;">/</span>cleanwordhtml.pl &quot;<span style="color: #007800;">$f</span>&quot; <span style="color: #000000; font-weight: bold;">&gt;</span> &quot;<span style="color: #800000;">${f%.html}</span>-clean.html&quot;; <span style="color: #000000; font-weight: bold;">done</span>;</pre></div></div>
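
<p>If you would rather not run Perl, the same kind of first pass can be sketched in a few lines of Python. To be clear, this is <em>not</em> a port of <code>cleanwordhtml.pl</code>: the function name <code>clean_word_html</code> and the exact regular expressions are my own assumptions, targeting only the sins listed above (<code>&lt;o:p&gt;</code> elements, <code>mso-</code> spacer spans and <code>Mso*</code> classes). Regular expressions are a blunt tool for HTML, so treat it as a rough illustration.</p>

```python
import re


def clean_word_html(html):
    """First-pass cleanup of Word-exported HTML (illustrative only)."""
    # Drop Office namespace elements such as <o:p>&nbsp;</o:p>.
    html = re.sub(r"<o:p>.*?</o:p>", "", html, flags=re.DOTALL | re.IGNORECASE)
    # Replace whitespace-only spans styled with mso-* by a single space.
    html = re.sub(
        r"<span\s+style=(['\"])mso-[^'\"]*\1\s*>\s*</span>",
        " ",
        html,
        flags=re.IGNORECASE,
    )
    # Remove per-element Word classes (class=MsoNormal, class="MsoTitle", ...).
    html = re.sub(r"\s+class=(['\"]?)Mso\w+\1", "", html, flags=re.IGNORECASE)
    # Drop paragraphs that are now empty.
    html = re.sub(r"<p>\s*</p>\s*", "", html, flags=re.IGNORECASE)
    return html


# Made-up input imitating the Word output shown earlier.
sample = (
    "<p class=MsoNormal>A.<span\nstyle='mso-spacerun:yes'>  </span>B.</p>\n"
    "<p class=MsoNormal><o:p>&nbsp;</o:p></p>\n"
)
print(clean_word_html(sample))
```

<p>As with the Perl script, anything that was laid out with whitespace still has to be fixed by hand afterwards.</p>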


<p>Hopefully someone else can benefit from my experience.</p>
<p>Related posts:</p><ol>
<li><a href='http://mitcho.com/blog/projects/markdown-for-wordpress-and-bbpress/' rel='bookmark' title='Markdown for WordPress and bbPress'>Markdown for WordPress and bbPress</a></li>
<li><a href='http://mitcho.com/blog/observation/%e5%8f%8e%e9%9b%86-vs-%e5%9b%9e%e5%8f%8e-and-better-word-meanings-through-usage/' rel='bookmark' title='回収 vs. 収集 and Better Word Meanings Through Usage'>回収 vs. 収集 and Better Word Meanings Through Usage</a></li>
<li><a href='http://mitcho.com/blog/life/the-most-beautiful-word/' rel='bookmark' title='The Most Beautiful Word'>The Most Beautiful Word</a></li>
</ol>
<p>Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.</p>]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/projects/disgusting-word-formatted-html-and-how-to-fix-it/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>My friend Evan bought an iPhone</title>
		<link>http://mitcho.com/blog/observation/my-friend-evan-bought-an-iphone/</link>
		<comments>http://mitcho.com/blog/observation/my-friend-evan-bought-an-iphone/#comments</comments>
		<pubDate>Sun, 13 Dec 2009 20:51:16 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[observation]]></category>
		<category><![CDATA[Evan]]></category>
		<category><![CDATA[iPhone]]></category>
		<category><![CDATA[Twitter]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=3206</guid>
		<description><![CDATA[
Related posts:<ol>
<li><a href='http://mitcho.com/blog/life/oh-evan/' rel='bookmark' title='Oh Evan'>Oh Evan</a></li>
<li><a href='http://mitcho.com/blog/link/%e3%83%af%e3%83%b3%e3%82%bb%e3%82%b0-tv-coming-to-the-iphone/' rel='bookmark' title='ワンセグ TV coming to the iPhone'>ワンセグ TV coming to the iPhone</a></li>
<li><a href='http://mitcho.com/blog/life/risk-on-the-iphone/' rel='bookmark' title='RISK on the iPhone'>RISK on the iPhone</a></li>
</ol>

Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.]]></description>
			<content:encoded><![CDATA[<p><center><img src="http://mitcho.com/blog/wp-content/uploads/2009/12/tweeting-3.png" alt="tweeting-3.png" border="0" width="324" height="352" /></center></p>
<p>Related posts:</p><ol>
<li><a href='http://mitcho.com/blog/life/oh-evan/' rel='bookmark' title='Oh Evan'>Oh Evan</a></li>
<li><a href='http://mitcho.com/blog/link/%e3%83%af%e3%83%b3%e3%82%bb%e3%82%b0-tv-coming-to-the-iphone/' rel='bookmark' title='ワンセグ TV coming to the iPhone'>ワンセグ TV coming to the iPhone</a></li>
<li><a href='http://mitcho.com/blog/life/risk-on-the-iphone/' rel='bookmark' title='RISK on the iPhone'>RISK on the iPhone</a></li>
</ol>
<p>Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.</p>]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/observation/my-friend-evan-bought-an-iphone/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Mozilla By The Numbers</title>
		<link>http://mitcho.com/blog/projects/mozilla-by-the-numbers/</link>
		<comments>http://mitcho.com/blog/projects/mozilla-by-the-numbers/#comments</comments>
		<pubDate>Sun, 06 Sep 2009 04:26:54 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[life]]></category>
		<category><![CDATA[observation]]></category>
		<category><![CDATA[projects]]></category>
		<category><![CDATA[Mozilla]]></category>
		<category><![CDATA[Mozilla Planet]]></category>
		<category><![CDATA[reflection]]></category>
		<category><![CDATA[ubiquity]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=2844</guid>
		<description><![CDATA[About six months ago I started working for Mozilla Labs full-time, focusing on Ubiquity, the multilingual natural language interface for the browser. This week marked my last week on contract as I go back to grad school next week. While the work will go on and I hope to continue to stay involved as time [...]
Related posts:<ol>
<li><a href='http://mitcho.com/blog/life/report-from-mozilla-party-jp-10/' rel='bookmark' title='Report from Mozilla Party JP 10!'>Report from Mozilla Party JP 10!</a></li>
<li><a href='http://mitcho.com/blog/projects/this-week-on-ubiquity-parser-the-next-generation/' rel='bookmark' title='This week on Ubiquity Parser: The Next Generation'>This week on Ubiquity Parser: The Next Generation</a></li>
<li><a href='http://mitcho.com/blog/life/notes-from-barcamp-tokyo-2009/' rel='bookmark' title='Notes from BarCamp Tokyo 2009'>Notes from BarCamp Tokyo 2009</a></li>
</ol>

Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.]]></description>
			<content:encoded><![CDATA[<p>About six months ago <a href="http://mitcho.com/blog/projects/how-natural-should-a-natural-interface-be/">I started working</a> for Mozilla Labs full-time, focusing on <a href="http://ubiquity.mozilla.com">Ubiquity</a>, the multilingual natural language interface for the browser. This week marked my last week on contract as I go back to <a href="http://web.mit.edu/linguistics/">grad school</a> next week. While the work will go on and I hope to continue to stay involved as time allows, here&#8217;s a quick bird&#8217;s eye view of my activities in my Mozilla tenure:</p>

<hr/>

<p>Time working for Mozilla: 6.5 months</p>

<p>Mozilla-related blog posts written: <a href="http://mitcho.com/blog/tag/mozilla-planet">69</a></p>

<p>Academic papers written on Ubiquity: <a href="http://mitcho.com/academic/erlewine-sigir.pdf">1</a></p>

<p>Ubiquity presentations given: <a href="http://www.slideshare.net/mitcho">5</a></p>

<p>Screencasts made: <a href="http://vimeo.com/mitchoyoshitaka/videos">8</a></p>

<p>Most popular video on Vimeo: <a href="http://vimeo.com/5420966">Ubiquity 0.5 日本語紹介ビデオ</a>, the Japanese Ubiquity 0.5 introduction video: 2252 views</p>

<p>Languages Ubiquity commands and parser now support: 6</p>

<p>Commits to the <a href="https://ubiquity.mozilla.com/hg/ubiquity-firefox/">Ubiquity repository</a>: 492</p>

<p>Other web projects started during this period: 2+ (<a href="http://tengrandisburiedthere.com">Ten Grand Is Buried There</a>, <a href="http://mitcho.com/code/hookpress/">HookPress</a>)</p>

<p>TechCrunch references: 2 (<a href="http://www.techcrunch.com/2009/06/10/geeksonaplane-meet-tokyo-20-learn-about-the-relation-between-the-web-language/">1</a>, <a href="http://www.techcrunch.com/2009/06/18/mozilla-shows-microsoft-where-10000-is-buried/">2</a>)</p>

<p>Countries worked in: 2</p>

<p>Mythical Kiwis worked with: <a href="http://theunfocused.net/">1</a></p>

<p>References to bugs I introduced as &#8220;glitcho&#8221;s: <a href="https://ubiquity.mozilla.com/hg/ubiquity-firefox/rev/79d40b35ea2b">1</a></p>

<p>Extremely disturbing homages to me and <a href="http://dl-client.getdropbox.com/u/10320/django/wallpaper/magic-pony-django-wallpaper.png">Django</a>: <a href="http://users.skumleren.net/cers/mitchopony.png">1</a></p>

<p>Friends made; experience gained; lessons on Open-ness learned; personal growth: <strike>priceless</strike> enumerable</p>

<hr/>

<p>Thanks to all who made this experience amazing, beginning with Aza, Jono, Atul, Blair and the rest of the Labs team; intern extraordinaire Brandon; the always thoughtful and friendly <a href="http://mozilla.jp">Mozilla Japan team</a>; and of course the <a href="http://groups.google.com/group/ubiquity-firefox">fantastic Ubiquity community</a>! Please visit me in Boston—I should be around for a while. <img src='http://mitcho.com/blog/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<p>Related posts:</p><ol>
<li><a href='http://mitcho.com/blog/life/report-from-mozilla-party-jp-10/' rel='bookmark' title='Report from Mozilla Party JP 10!'>Report from Mozilla Party JP 10!</a></li>
<li><a href='http://mitcho.com/blog/projects/this-week-on-ubiquity-parser-the-next-generation/' rel='bookmark' title='This week on Ubiquity Parser: The Next Generation'>This week on Ubiquity Parser: The Next Generation</a></li>
<li><a href='http://mitcho.com/blog/life/notes-from-barcamp-tokyo-2009/' rel='bookmark' title='Notes from BarCamp Tokyo 2009'>Notes from BarCamp Tokyo 2009</a></li>
</ol>
<p>Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.</p>]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/projects/mozilla-by-the-numbers/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Scoring for Optimization</title>
		<link>http://mitcho.com/blog/observation/scoring-for-optimization/</link>
		<comments>http://mitcho.com/blog/observation/scoring-for-optimization/#comments</comments>
		<pubDate>Fri, 24 Apr 2009 09:51:31 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[observation]]></category>
		<category><![CDATA[algorithm]]></category>
		<category><![CDATA[candidates]]></category>
		<category><![CDATA[code]]></category>
		<category><![CDATA[harmonic analysis]]></category>
		<category><![CDATA[math]]></category>
		<category><![CDATA[Mozilla Planet]]></category>
		<category><![CDATA[order]]></category>
		<category><![CDATA[parser]]></category>
		<category><![CDATA[ranking]]></category>
		<category><![CDATA[score]]></category>
		<category><![CDATA[suggestions]]></category>
		<category><![CDATA[ubiquity]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=1850</guid>
		<description><![CDATA[Suppose you have a number of competing candidates, each of which can be ranked with a score, but it takes a little time to calculate each candidate&#8217;s score. You&#8217;re only interested in the top candidates. You want to come up with a scoring scheme where you can throw the extra candidates out of consideration earlier [...]
Related posts:<ol>
<li><a href='http://mitcho.com/blog/observation/scoring-and-ranking-suggestions/' rel='bookmark' title='Scoring and Ranking Suggestions'>Scoring and Ranking Suggestions</a></li>
<li><a href='http://mitcho.com/blog/projects/ubiquity-parser-the-next-generation-demo/' rel='bookmark' title='Ubiquity Parser: The Next Generation Demo'>Ubiquity Parser: The Next Generation Demo</a></li>
<li><a href='http://mitcho.com/blog/projects/this-week-on-ubiquity-parser-the-next-generation/' rel='bookmark' title='This week on Ubiquity Parser: The Next Generation'>This week on Ubiquity Parser: The Next Generation</a></li>
</ol>

Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.]]></description>
			<content:encoded><![CDATA[<p>Suppose you have a number of competing candidates, each of which can be ranked with a score, but it takes a little time to calculate each candidate&#8217;s score. You&#8217;re only interested in the top <img src='http://s0.wp.com/latex.php?latex=n&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='n' title='n' class='latex' /> candidates. <strong>You want to come up with a scoring scheme where you can throw the extra candidates out of consideration earlier without sacrificing quality.</strong> Such is <a href="http://mitcho.com/blog/observation/scoring-and-ranking-suggestions/">the problem of scoring and ranking suggestions in Ubiquity</a>. What properties must such a scoring system have?</p>

<p><em>This blog post includes a lot of complex CSS-formatted graphs which may be best viewed in — what else? — <a href="http://mozilla.com">Firefox</a>. You may also want to <a href="http://mitcho.com/blog/observation/scoring-for-optimization/">access this blog post directly</a> rather than through a planet.</em></p>

<p><style type='text/css'>
.mitchostable, .mitchostable tr, .mitchostable td, .mitchostable th {
  border:0;
  margin:0;
  padding:1px;
  background-color: transparent;
  text-align:left;
}
tr.cutoff th, tr.cutoff td { border-bottom: 1px #666 solid }
tr.cutoff td.cutoff {
  font-style: italic;
  font-size: 0.8em;
  color: #666;
  border: 0;
}
.mitchostable img { height: 7px }
.mitchostable span.bar { 
  background-color: #ccc;
  display: inline-block;
  height: 7px;
}
.mitchostable span.arrow-right { 
  background: #ccc url(http://mitcho.com/i/cccarrow-right.png) no-repeat scroll center right;
  display: inline-block;
  height: 7px;
}
.mitchostable span.arrow-left { 
  background: #ccc url(http://mitcho.com/i/cccarrow-left.png) no-repeat scroll center left;
  display: inline-block;
  height: 7px;
}
.mitchostable span.bound-right { 
  background: transparent url(http://mitcho.com/i/bound-right.png) no-repeat scroll center right;
  display: inline-block;
  height: 7px;
}
.mitchostable.threshold {
  background: transparent url(http://mitcho.com/i/000.png) repeat-y scroll 180px 0px;
}
.mitchostable.threshold2 {
  background: transparent url(http://mitcho.com/i/000.png) repeat-y scroll 70px 0px;
}
.mitchostable.threshold *, .mitchostable.threshold2 * {
  background: transparent;
}
</style></p>

<table border='0' class='mitchostable'>

<tr><th>candidate 8</th><td><span class='bar' style='width:180px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th>candidate 2</th><td><span class='bar' style='width:166px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th>candidate 9</th><td><span class='bar' style='width:123px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th>candidate 3</th><td><span class='bar' style='width:107px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr class='cutoff'><th>candidate 10</th><td><span class='bar' style='width:96px'>&nbsp;</span></td><td rowspan='2' class='cutoff'>CUTOFF</td></tr>

<tr><th>candidate 5</th><td><span class='bar' style='width:70px'>&nbsp;</span></td></tr>
<tr><th>candidate 1</th><td><span class='bar' style='width:50px'>&nbsp;</span></td></tr>
<tr><th>candidate 7</th><td><span class='bar' style='width:43px'>&nbsp;</span></td></tr>
<tr><th>&#8230;</th><td>&nbsp;</td><td>&nbsp;</td></tr>
</table>

<p>One portion of the problem description above merits clarification: I define &#8220;without sacrificing quality&#8221; to mean that, if we did not throw out any candidates early and instead waited until all the scores were computed fully and accurately, we would still get the same top <img src='http://s0.wp.com/latex.php?latex=n&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='n' title='n' class='latex' /> winners. This already gives us the key insight towards an appropriate solution: <em>we can only throw out a candidate when we know that it has no further chance of making it into the top <img src='http://s0.wp.com/latex.php?latex=n&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='n' title='n' class='latex' /> candidates.</em></p>

<p><span id="more-1850"></span></p>

<h3>Let&#8217;s get formal</h3>

<p>Let&#8217;s call <img src='http://s0.wp.com/latex.php?latex=S_%7Bi%7D%28t%29&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='S_{i}(t)' title='S_{i}(t)' class='latex' /> the score of candidate <img src='http://s0.wp.com/latex.php?latex=C_%7Bi%7D&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='C_{i}' title='C_{i}' class='latex' /> at time <img src='http://s0.wp.com/latex.php?latex=t&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='t' title='t' class='latex' /> in the derivation, and we&#8217;ll assume that the score derivations are done in parallel with a unique origin (<img src='http://s0.wp.com/latex.php?latex=t%3D0&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='t=0' title='t=0' class='latex' />).<sup id="fnref:2"><a href="#fn:2" rel="footnote">1</a></sup> We&#8217;ll use the notation <img src='http://s0.wp.com/latex.php?latex=S_%7Bi%7D%28%5Cinfty%29&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='S_{i}(&#92;infty)' title='S_{i}(&#92;infty)' class='latex' /> to represent the equilibrium or final score, equal to <img src='http://s0.wp.com/latex.php?latex=S_%7Bi%7D%28t%29&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='S_{i}(t)' title='S_{i}(t)' class='latex' /> for all <img src='http://s0.wp.com/latex.php?latex=t+%3E+&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='t &gt; ' title='t &gt; ' class='latex' /> a certain <img src='http://s0.wp.com/latex.php?latex=t%5E%7B%5Cprime%7D&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='t^{&#92;prime}' title='t^{&#92;prime}' class='latex' /> which exists for each candidate. This function <img src='http://s0.wp.com/latex.php?latex=S_%7Bi%7D&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='S_{i}' title='S_{i}' class='latex' /> thus defines a <a href="http://en.wikipedia.org/wiki/Time_series">time series</a> for each candidate.</p>

<p>Given a set of candidates <img src='http://s0.wp.com/latex.php?latex=%5Cleft%5C%7BC_1%2CC_2%2C%5Cldots%2CC_k%5Cright%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='&#92;left&#92;{C_1,C_2,&#92;ldots,C_k&#92;right&#92;}' title='&#92;left&#92;{C_1,C_2,&#92;ldots,C_k&#92;right&#92;}' class='latex' />, we want to find the best subset of <img src='http://s0.wp.com/latex.php?latex=n&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='n' title='n' class='latex' /> candidates; that is, <img src='http://s0.wp.com/latex.php?latex=%5Cleft%5C%7BC_%7Bi_1%7D%2CC_%7Bi_2%7D%2C%5Cldots%2CC_%7Bi_n%7D%5Cright%5C%7D&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='&#92;left&#92;{C_{i_1},C_{i_2},&#92;ldots,C_{i_n}&#92;right&#92;}' title='&#92;left&#92;{C_{i_1},C_{i_2},&#92;ldots,C_{i_n}&#92;right&#92;}' class='latex' /> such that</p>

<p><center><img src='http://s.wordpress.com/latex.php?latex=%5Cdisplaystyle%20%5Cforall_%7B%20i%5Cin%20%5C%7Bi_1%2C%5Cdots%2Ci_n%5C%7D%2C%20j%5Cin%20%5C%7B1%2C%5Cdots%2Ck%5C%7D%5Csetminus%5C%7Bi_1%2C%5Cldots%2Ci_n%5C%7D%7D%20S_%7Bi%7D%28%5Cinfty%29%20%5Cgeq%20S_%7Bj%7D%28%5Cinfty%29&#038;bg=ffffff&#038;fg=000000&#038;s=1' alt='\forall_{ i\in \{i_1,\dots,i_n\}, j\in \{1,\dots,k\}\setminus\{i_1,\ldots,i_n\}} S_{i}(\infty) \geq S_{j}(\infty)'/>.</center></p>
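
<p>In code, the target we are trying to hit is just a plain top-<em>n</em> selection over the final scores. Here is a minimal Python sketch of that baseline, i.e. what we would get by computing every score to the end before ranking. The function name <code>top_n</code> and the candidate labels are made up for illustration; the numbers are borrowed from the bar widths in the chart at the top of this post.</p>

```python
import heapq


def top_n(final_scores, n):
    """Return the names of the n candidates with the highest final scores.

    final_scores maps a candidate name to its final score S_i(infinity).
    """
    # Iterating a dict yields its keys; the key function looks up each score.
    return heapq.nlargest(n, final_scores, key=final_scores.get)


# Illustrative final scores, taken from the bar widths in the first chart.
scores = {"C1": 50, "C2": 166, "C3": 107, "C8": 180, "C9": 123}
print(top_n(scores, 3))
```

<p>Any scheme that throws candidates out early has to return exactly the same set as this exhaustive version.</p>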

<h3>Approach 1: A Threshold Model</h3>

<p>The key insight above would naturally give us what I call the threshold model. Here, we require the score sequences to be non-increasing: <img src='http://s0.wp.com/latex.php?latex=%5Cforall_%7Bt+%3C+t%5E%7B%5Cprime%7D%7D+S_%7Bi%7D%28t%29+%5Cgeq+S_%7Bi%7D%28t%5E%7B%5Cprime%7D%29&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='&#92;forall_{t &lt; t^{&#92;prime}} S_{i}(t) &#92;geq S_{i}(t^{&#92;prime})' title='&#92;forall_{t &lt; t^{&#92;prime}} S_{i}(t) &#92;geq S_{i}(t^{&#92;prime})' class='latex' />. This way, we can throw out candidates which have fallen below a certain threshold <img src='http://s0.wp.com/latex.php?latex=M&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='M' title='M' class='latex' /> (or attained a certain level of badness, you might say), since we can then be sure they will never recover.</p>

<p>For example, suppose the following diagram represents the scores of five different candidates after the first four time steps of the derivation. (The full gray bar marks the initial score (<img src='http://s0.wp.com/latex.php?latex=S_i%280%29&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='S_i(0)' title='S_i(0)' class='latex' />) and the arrows indicate the successive score differentials.) The vertical line marks the threshold, <img src='http://s0.wp.com/latex.php?latex=M&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='M' title='M' class='latex' />.</p>

<table border='0' class='mitchostable threshold'>
<tr><th>candidate 1</th><td><span class='bar' style='width:130px'>&nbsp;</span><span class='arrow-left' style='width:20px'>&nbsp;</span><span class='arrow-left' style='width:13px'>&nbsp;</span><span class='arrow-left' style='width:8px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th>candidate 2</th><td><span class='bar' style='width:80px'>&nbsp;</span><span class='arrow-left' style='width:50px'>&nbsp;</span><span class='arrow-left' style='width:3px'>&nbsp;</span><span class='arrow-left' style='width:20px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th>candidate 3</th><td><span class='bar' style='width:110px'>&nbsp;</span><span class='arrow-left' style='width:30px'>&nbsp;</span><span class='arrow-left' style='width:27px'>&nbsp;</span><span class='arrow-left' style='width:15px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th>candidate 4</th><td><span class='bar' style='width:53px'>&nbsp;</span><span class='arrow-left' style='width:20px'>&nbsp;</span><span class='arrow-left' style='width:50px'>&nbsp;</span><span class='arrow-left' style='width:15px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th>candidate 5</th><td><span class='bar' style='width:114px'>&nbsp;</span><span class='arrow-left' style='width:3px'>&nbsp;</span><span class='arrow-left' style='width:3px'>&nbsp;</span><span class='arrow-left' style='width:6px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th>&#8230;</th><td>&nbsp;</td><td>&nbsp;</td></tr>
</table>

<p>We can tell after four steps that candidates 2 and 4, given that the score sequences are non-increasing, have no chance to finish their derivation with a score <img src='http://s0.wp.com/latex.php?latex=%3E+M&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='&gt; M' title='&gt; M' class='latex' />. What is important to note, however, is that <em>candidate 4 already had no chance of beating the threshold after three steps.</em> <strong>There was no need to calculate the fourth derivation of the score of candidate 4</strong> (<img src='http://s0.wp.com/latex.php?latex=S_%7B4%7D%284%29&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='S_{4}(4)' title='S_{4}(4)' class='latex' />). In other words, after three steps, we could completely take candidate 4 out of the running and after another step, take candidate 2 out of the running.</p>

<table>
<tr><td colspan='2'><img src='http://s0.wp.com/latex.php?latex=t%3D2&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='t=2' title='t=2' class='latex' /></td><td colspan='2'><img src='http://s0.wp.com/latex.php?latex=t%3D3&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='t=3' title='t=3' class='latex' /></td><td colspan='2'><img src='http://s0.wp.com/latex.php?latex=t%3D4&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='t=4' title='t=4' class='latex' /></td></tr>
<tr>
<td><table border='0' class='mitchostable threshold2'>
<tr><th>C1</th><td><span class='bar' style='width:113px'>&nbsp;</span><span class='arrow-left' style='width:8px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th>C2</th><td><span class='bar' style='width:83px'>&nbsp;</span><span class='arrow-left' style='width:20px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th>C3</th><td><span class='bar' style='width:117px'>&nbsp;</span><span class='arrow-left' style='width:15px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th>C4</th><td><span class='bar' style='width:73px'>&nbsp;</span><span class='arrow-left' style='width:15px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th>C5</th><td><span class='bar' style='width:70px'>&nbsp;</span><span class='arrow-left' style='width:6px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th>&#8230;</th><td>&nbsp;</td><td>&nbsp;</td></tr>
</table></td><td>→</td>
<td><table border='0' class='mitchostable threshold2'>
<tr><th>C1</th><td><span class='bar' style='width:100px'>&nbsp;</span><span class='arrow-left' style='width:13px'>&nbsp;</span><span class='arrow-left' style='width:8px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th>C2</th><td><span class='bar' style='width:80px'>&nbsp;</span><span class='arrow-left' style='width:3px'>&nbsp;</span><span class='arrow-left' style='width:20px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th>C3</th><td><span class='bar' style='width:90px'>&nbsp;</span><span class='arrow-left' style='width:27px'>&nbsp;</span><span class='arrow-left' style='width:15px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th><strike>C4</strike></th><td><span class='bar' style='width:23px'>&nbsp;</span><span class='arrow-left' style='width:50px'>&nbsp;</span><span class='arrow-left' style='width:15px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th>C5</th><td><span class='bar' style='width:67px'>&nbsp;</span><span class='arrow-left' style='width:3px'>&nbsp;</span><span class='arrow-left' style='width:6px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th>&#8230;</th><td>&nbsp;</td><td>&nbsp;</td></tr>
</table></td><td>→</td>
<td><table border='0' class='mitchostable threshold2'>
<tr><th>C1</th><td><span class='bar' style='width:80px'>&nbsp;</span><span class='arrow-left' style='width:20px'>&nbsp;</span><span class='arrow-left' style='width:13px'>&nbsp;</span><span class='arrow-left' style='width:8px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th><strike>C2</strike></th><td><span class='bar' style='width:30px'>&nbsp;</span><span class='arrow-left' style='width:50px'>&nbsp;</span><span class='arrow-left' style='width:3px'>&nbsp;</span><span class='arrow-left' style='width:20px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th>C3</th><td><span class='bar' style='width:60px'>&nbsp;</span><span class='arrow-left' style='width:30px'>&nbsp;</span><span class='arrow-left' style='width:27px'>&nbsp;</span><span class='arrow-left' style='width:15px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th><strike>C4</strike></th><td>&nbsp;</td><td>&nbsp;</td></tr>
<tr><th>C5</th><td><span class='bar' style='width:64px'>&nbsp;</span><span class='arrow-left' style='width:3px'>&nbsp;</span><span class='arrow-left' style='width:3px'>&nbsp;</span><span class='arrow-left' style='width:6px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th>&#8230;</th><td>&nbsp;</td><td>&nbsp;</td></tr>
</table></td><td>→</td>
</tr>
</table>

<p>This non-increasing score approach was used in Ubiquity Parser 2 until just recently, and you can in fact still play with it on the <a href="http://mitcho.com/code/ubiquity/parser-demo/">online Ubiquity Parser TNG demo</a>. In that version, every parse started with an initial score of 1 and every score factor would be a value between 0 and 1. Every score factor was multiplied onto the previous score throughout the derivation, making it trivially non-increasing.</p>
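
<p>The threshold model with multiplicative scores can be sketched roughly as follows. This is an illustration under stated assumptions, not the Ubiquity implementation: <code>thresholdPrune</code>, the candidate objects, and all the numbers are made up, and the score factors are listed up front here only so the example is self-contained (in a real parser they would be computed step by step).</p>

```javascript
// Sketch of threshold pruning with multiplicative, non-increasing scores.
// All names and values are hypothetical.
function thresholdPrune(candidates, threshold, steps) {
  // Every candidate starts with a score of 1.
  let alive = candidates.map((c) => ({ name: c.name, factors: c.factors, score: 1 }));
  let evaluations = 0; // how many score factors were actually applied

  for (let t = 0; t < steps; t++) {
    for (const c of alive) {
      c.score *= c.factors[t]; // each factor lies between 0 and 1
      evaluations++;
    }
    // A score at or below the threshold can never recover, so drop it now.
    alive = alive.filter((c) => c.score > threshold);
  }
  return { survivors: alive.map((c) => c.name), evaluations };
}

const result = thresholdPrune(
  [
    { name: "C1", factors: [0.9, 0.9, 0.95] },
    { name: "C2", factors: [0.8, 0.5, 0.9] },
    { name: "C3", factors: [0.3, 0.9, 0.9] },
  ],
  0.5, // threshold M (made-up value)
  3
);
console.log(result.survivors, result.evaluations);
```

<p>With these numbers C3 is dropped after the first step and C2 after the second, so only six of the nine score factors are ever applied.</p>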

<p><strong>The problem with this approach</strong> is twofold: it is unclear how to choose a smart threshold, and, given a constant threshold, you may get a different number of results for every candidate set (i.e. parser query). If your score indicates a meaningful value with an a priori specified target of acceptable values, having a threshold makes sense. In the case of Ubiquity, however, the interface expects a certain number of suggestions to be returned.<sup id="fnref:1"><a href="#fn:1" rel="footnote">2</a></sup> If we plan to display five suggestions but the parser only returns four, even though there were other candidates, there must be a very good justification for that threshold value.</p>

<h3>Approach 2: Raising the Bar</h3>

<p>The problem with Approach 1 was that there was no way of guaranteeing that we would yield our predefined <img src='http://s0.wp.com/latex.php?latex=n&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='n' title='n' class='latex' /> winning candidates. Even if at some point in the derivation we are left with <img src='http://s0.wp.com/latex.php?latex=n&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='n' title='n' class='latex' /> candidates still above the threshold, as the only restriction we have is that our score series are non-increasing, there is still a possibility that those remaining <img src='http://s0.wp.com/latex.php?latex=n&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='n' title='n' class='latex' /> candidates&#8217; scores will drop below <img src='http://s0.wp.com/latex.php?latex=M&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='M' title='M' class='latex' /> later in the derivation.</p>

<p>We must instead at some point in the derivation identify <strong>(a)</strong> a set of at least <img src='http://s0.wp.com/latex.php?latex=n&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='n' title='n' class='latex' /> candidates which will not get &#8220;worse&#8221; in the derivation and <strong>(b)</strong> candidates which have no chance of overtaking the (a) candidates. In this situation we can safely throw out the (b) candidates.</p>

<p>One way to do this is to require that all the scores <img src='http://s0.wp.com/latex.php?latex=S_%7Bi%7D%28t%29&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='S_{i}(t)' title='S_{i}(t)' class='latex' /> are <strong>bounded and non-decreasing</strong>. By virtue of being non-decreasing, our top candidates at any point in our derivation will never get &#8220;worse&#8221; afterwards, satisfying condition (a). If relatively early in the computation we can compute a bound <img src='http://s0.wp.com/latex.php?latex=B_i&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='B_i' title='B_i' class='latex' />, we can identify candidates which will never surpass the top candidates in group (a) above, satisfying condition (b).</p>

<p>In the example below, <img src='http://s0.wp.com/latex.php?latex=n%3D2&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='n=2' title='n=2' class='latex' /> and the thin bars mark the upper bounds <img src='http://s0.wp.com/latex.php?latex=B_i&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='B_i' title='B_i' class='latex' />. At <img src='http://s0.wp.com/latex.php?latex=t%3D1&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='t=1' title='t=1' class='latex' /> we can identify candidates 2 and 4 as being our top two candidates. Note that there is one candidate, candidate 5, whose upper bound <img src='http://s0.wp.com/latex.php?latex=B_5&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='B_5' title='B_5' class='latex' /> is less than both <img src='http://s0.wp.com/latex.php?latex=S_2%281%29&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='S_2(1)' title='S_2(1)' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=S_4%281%29&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='S_4(1)' title='S_4(1)' class='latex' />. By definition, <img src='http://s0.wp.com/latex.php?latex=S_5%28%5Cinfty%29+%5Cleq+B_5&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='S_5(&#92;infty) &#92;leq B_5' title='S_5(&#92;infty) &#92;leq B_5' class='latex' /> and, because the scores are non-decreasing, <img src='http://s0.wp.com/latex.php?latex=S_2%281%29+%5Cleq+S_2%28%5Cinfty%29&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='S_2(1) &#92;leq S_2(&#92;infty)' title='S_2(1) &#92;leq S_2(&#92;infty)' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=S_4%281%29+%5Cleq+S_4%28%5Cinfty%29&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='S_4(1) &#92;leq S_4(&#92;infty)' title='S_4(1) &#92;leq S_4(&#92;infty)' class='latex' />. Therefore</p>

<p><center><img src='http://s.wordpress.com/latex.php?latex=S_5%28%5Cinfty%29%20%3C%20S_2%28%5Cinfty%29&#038;bg=ffffff&#038;fg=000000&#038;s=1' alt='S_5(\infty) < S_2(\infty)'/> and <img src='http://s.wordpress.com/latex.php?latex=S_5%28%5Cinfty%29%20%3C%20S_4%28%5Cinfty%29&#038;bg=ffffff&#038;fg=000000&#038;s=1' alt='S_5(\infty) < S_4(\infty)'/></center></p>

<p>and we can thus throw out candidate 5 at this point. By the same logic, after <img src='http://s0.wp.com/latex.php?latex=t%3D2&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='t=2' title='t=2' class='latex' /> we can throw candidate 2 out of the running.</p>

<table>
<tr><td colspan='2'><img src='http://s0.wp.com/latex.php?latex=t%3D1&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='t=1' title='t=1' class='latex' /></td><td colspan='2'><img src='http://s0.wp.com/latex.php?latex=t%3D2&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='t=2' title='t=2' class='latex' /></td><td colspan='2'><img src='http://s0.wp.com/latex.php?latex=t%3D3&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='t=3' title='t=3' class='latex' /></td></tr>
<tr>
<td><table border='0' class='mitchostable'>
<tr><th>C1</th><td><span class='bar' style='width:28px'>&nbsp;</span><span class='bound-right' style='width:70px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th>C2</th><td><span class='bar' style='width:59px'>&nbsp;</span><span class='bound-right' style='width:15px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th>C3</th><td><span class='bar' style='width:49px'>&nbsp;</span><span class='bound-right' style='width:40px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th>C4</th><td><span class='bar' style='width:83px'>&nbsp;</span><span class='bound-right' style='width:15px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th><strike>C5</strike></th><td><span class='bar' style='width:56px'>&nbsp;</span><span class='bound-right' style='width:6px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th>&#8230;</th><td>&nbsp;</td><td>&nbsp;</td></tr>
</table></td><td>→</td>
<td><table border='0' class='mitchostable'>
<tr><th>C1</th><td><span class='bar' style='width:28px'>&nbsp;</span><span class='arrow-right' style='width:56px'>&nbsp;</span><span class='bound-right' style='width:14px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th><strike>C2</strike></th><td><span class='bar' style='width:59px'>&nbsp;</span><span class='arrow-right' style='width:5px'>&nbsp;</span><span class='bound-right' style='width:10px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th>C3</th><td><span class='bar' style='width:49px'>&nbsp;</span><span class='arrow-right' style='width:20px'>&nbsp;</span><span class='bound-right' style='width:20px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th>C4</th><td><span class='bar' style='width:83px'>&nbsp;</span><span class='arrow-right' style='width:6px'>&nbsp;</span><span class='bound-right' style='width:9px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th><strike>C5</strike></th><td>&nbsp;</td><td>&nbsp;</td></tr>
<tr><th>&#8230;</th><td>&nbsp;</td><td>&nbsp;</td></tr>
</table></td><td>→</td>
<td><table border='0' class='mitchostable'>
<tr><th>C1</th><td><span class='bar' style='width:28px'>&nbsp;</span><span class='arrow-right' style='width:56px'>&nbsp;</span><span class='arrow-right' style='width:4px'>&nbsp;</span><span class='bound-right' style='width:10px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th><strike>C2</strike></th><td>&nbsp;</td><td>&nbsp;</td></tr>
<tr><th>C3</th><td><span class='bar' style='width:49px'>&nbsp;</span><span class='arrow-right' style='width:20px'>&nbsp;</span><span class='arrow-right' style='width:15px'>&nbsp;</span><span class='bound-right' style='width:5px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th>C4</th><td><span class='bar' style='width:83px'>&nbsp;</span><span class='arrow-right' style='width:6px'>&nbsp;</span><span class='arrow-right' style='width:6px'>&nbsp;</span><span class='bound-right' style='width:3px'>&nbsp;</span></td><td>&nbsp;</td></tr>
<tr><th><strike>C5</strike></th><td>&nbsp;</td><td>&nbsp;</td></tr>
<tr><th>&#8230;</th><td>&nbsp;</td><td>&nbsp;</td></tr>
</table></td><td>→</td>
</tr>
</table>

<p>Calling this the &#8220;raising the bar&#8221; method refers to the fact that, at any particular time <img src='http://s0.wp.com/latex.php?latex=t&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='t' title='t' class='latex' />, the &#8220;bar&#8221; is <img src='http://s0.wp.com/latex.php?latex=min%5Cleft%28%5Cleft%5C%7B%5Cmbox%7Bthe+%7Dn%5Cmbox%7B+greatest+%7DS_%7Bi%7D%28t%29%5Cmbox%7B+values%7D%5Cright%5C%7D%5Cright%29&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='min&#92;left(&#92;left&#92;{&#92;mbox{the }n&#92;mbox{ greatest }S_{i}(t)&#92;mbox{ values}&#92;right&#92;}&#92;right)' title='min&#92;left(&#92;left&#92;{&#92;mbox{the }n&#92;mbox{ greatest }S_{i}(t)&#92;mbox{ values}&#92;right&#92;}&#92;right)' class='latex' /> and every other candidate must have an upper bound <img src='http://s0.wp.com/latex.php?latex=B_j&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='B_j' title='B_j' class='latex' /> greater than the bar in order to not be thrown out of consideration. Because the component scores are non-decreasing, this &#8220;bar&#8221; is itself non-decreasing, so the number of surviving candidates can only shrink over time.</p>
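
<p>One pruning pass of the &#8220;raising the bar&#8221; method might look roughly like the sketch below. The function name <code>raiseTheBar</code>, the <code>score</code> and <code>bound</code> fields, and the numbers are all assumptions made for illustration; the values are chosen to follow the same story as the diagram above rather than read off from it.</p>

```javascript
// Sketch of one "raising the bar" pass. Each candidate carries its current
// (non-decreasing) score and a fixed upper bound on its final score.
// All names and values are hypothetical.
function raiseTheBar(candidates, n) {
  if (candidates.length <= n) {
    return { bar: -Infinity, survivors: candidates }; // nothing to prune yet
  }
  const sorted = candidates.map((c) => c.score).sort((a, b) => b - a);
  const bar = sorted[n - 1]; // the smallest of the n greatest current scores
  // Keep the current leaders, plus anyone whose bound still exceeds the bar.
  const survivors = candidates.filter((c) => c.score >= bar || c.bound > bar);
  return { bar, survivors };
}

// Step 1: the bar is 0.59, and C5 (bound 0.56) can no longer reach it.
const step1 = raiseTheBar(
  [
    { name: "C1", score: 0.28, bound: 0.98 },
    { name: "C2", score: 0.59, bound: 0.74 },
    { name: "C3", score: 0.49, bound: 0.89 },
    { name: "C4", score: 0.83, bound: 0.98 },
    { name: "C5", score: 0.5, bound: 0.56 },
  ],
  2
);

// Step 2: scores have risen, the bar is now 0.84, and C2 (bound 0.74) drops out.
const step2 = raiseTheBar(
  [
    { name: "C1", score: 0.84, bound: 0.98 },
    { name: "C2", score: 0.64, bound: 0.74 },
    { name: "C3", score: 0.69, bound: 0.89 },
    { name: "C4", score: 0.89, bound: 0.98 },
  ],
  2
);
console.log(step1.survivors.map((c) => c.name), step2.survivors.map((c) => c.name));
```

<p>Note that a candidate can be a current leader and still be eliminated later, as C2 is here: being above the bar at one step only guarantees that its score will not drop, not that others will fail to overtake it.</p>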

<p>In the case of <a href="http://mitcho.com/blog/projects/a-demonstration-of-ubiquity-parser-2/">the Ubiquity parser</a> we could build such a non-decreasing and bounded scoring model by using an additive model. As the main component of parser scoring is <a href="https://ubiquity.mozilla.com/trac/ticket/435">how well the parsed arguments match the verbs&#8217; specified nountypes</a>, we could simply add up the confidence scores of the nountype suggestions, each of which is a value between 0 and 1. This would trivially be non-decreasing. As each parse has a finite and known number of parsed arguments, we could easily determine a bound as well. For example, say a parse <img src='http://s0.wp.com/latex.php?latex=S_0&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='S_0' title='S_0' class='latex' /> has two arguments. Before we check each of the nountypes&#8217; match scores, we already know that <img src='http://s0.wp.com/latex.php?latex=S_0%28%5Cinfty%29+%5Cleq+2+%3D+B_0&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='S_0(&#92;infty) &#92;leq 2 = B_0' title='S_0(&#92;infty) &#92;leq 2 = B_0' class='latex' />.</p>

<p>Unfortunately, there are also other factors which we would like to consider in our parses which may not fit into this non-decreasing model so easily&#8230;</p>

<h3>Approach 2&#8217;: The Rising Sun Model<sup id="fnref:3"><a href="#fn:3" rel="footnote">3</a></sup></h3>

<p>One problem with both of the previous approaches is that they require the scoring schemes to be either non-increasing or non-decreasing across the derivation. There are many situations, however, where you would want different factors to affect the score both positively and negatively. In the case of the Ubiquity parser, here are some factors which could serve as positive and negative score factors in computing the score of each parse.</p>

<table>
<tr><th>positive factors</th><th>negative factors</th></tr>
<tr><td>the verb&#8217;s specified nountype matching the argument noun well</td><td>having to suggest the verb</td></tr>
<tr><td>the verb in the input matching the verb well</td><td>multiple arguments parsed for a single <a href='http://mitcho.com/blog/projects/writing-commands-with-semantic-roles/'>semantic role</a></td></tr>
<tr><td>the verb being used often</td><td>the verb missing some arguments</td></tr>
</table>

<p>As we see, there are both positive and negative factors which we hope to consider in scoring our possible Ubiquity parses. The key to making this work is noting that Approach 2 only requires that the scoring series be bounded and non-decreasing <em>after a certain known time in the derivation</em>. For example, even if a parse involves a number of decreases early in the parse derivation, if after a certain point we can be certain that it is non-decreasing and bounded, we can simply use that bound and start eliminating poor candidates at that time (in this example, after <img src='http://s0.wp.com/latex.php?latex=t%3D2&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='t=2' title='t=2' class='latex' />).</p>

<p><style type='text/css'>
.mitchostable2, .mitchostable2 tr, .mitchostable2 td, .mitchostable2 th {
  border:0;
  margin:0;
  padding:1px;
  text-align:left;
  vertical-align: bottom;
}
.mitchostable2 {
  background: transparent url(http://mitcho.com/i/000.png) repeat-x 0px 57px;
}
.mitchostable2 * {
  background: transparent;
}
.mitchostable2 span.bar { 
  background-color: #ccc;
  display: inline-block;
  width: 7px;
}
</style></p>

<table border='0' class='mitchostable2'>
<tr>
<td><span class='bar' style='height:150px'>&nbsp;</span></td><td><span class='bar' style='height:120px'>&nbsp;</span></td><td><span class='bar' style='height:90px'>&nbsp;</span></td><td><span class='bar' style='height:50px'>&nbsp;</span></td><td><span class='bar' style='height:60px'>&nbsp;</span></td><td><span class='bar' style='height:72px'>&nbsp;</span></td><td><span class='bar' style='height:80px'>&nbsp;</span></td><td><span class='bar' style='height:82px'>&nbsp;</span></td><td><span class='bar' style='height:90px'>&nbsp;</span></td><td><span class='bar' style='height:92px'>&nbsp;</span></td><td><span class='bar' style='height:92px'>&nbsp;</span></td><td><span class='bar' style='height:93px'>&nbsp;</span></td><td><span class='bar' style='height:93px'>&nbsp;</span></td><td><span class='bar' style='height:94px'>&nbsp;</span></td>
</tr>
<tr><td>0</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>5</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td colspan="3">10</td><td>&nbsp;</td></tr>
</table>

<p>This is very much possible in the Ubiquity parser as, given the <a href="https://wiki.mozilla.org/User:Mitcho/ParserTNG">Ubiquity Parser 2 design</a>, the negative factors such as whether the parse has a verb from the input or not (step 2), whether multiple arguments are identified with the same semantic role (step 4), and how many of the verb&#8217;s arguments are in the input (step 4) can be identified early on in the derivation, all before the very computationally intensive step of nountype detection (step 7) and argument suggestion (step 8). In this way, we can front-load all the negative factors in scoring and continue to use a version of Approach 2 to optimize our parsing.</p>

<p>We can moreover make the effect of the negative factors be felt across the entire derivation by figuring the negative factors into a factor between 0 and 1 and multiplying it onto each of the positive factors being added. In other words, we can compute all the negative factors into a single <strong>score multiplier</strong> <img src='http://s0.wp.com/latex.php?latex=%5Cmu_i+%5Cin+%5B0%2C1%5D&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='&#92;mu_i &#92;in [0,1]' title='&#92;mu_i &#92;in [0,1]' class='latex' /> earlier in the derivation and then afterwards when adding up each of the positive factors simply applying that score multiplier to the score derivation:</p>

<p><center><img src='http://s0.wp.com/latex.php?latex=%5Cmu_%7Bi%7D%28%5Cmbox%7Bpositive+factor+0%7D%29+%2B+%5Cmu_%7Bi%7D%28%5Cmbox%7Bpositive+factor+1%7D%29+%2B+%5Cldots+%2B+%5Cmu_%7Bi%7D%28%5Cmbox%7Bpositive+factor+%7Dm%29&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='&#92;mu_{i}(&#92;mbox{positive factor 0}) + &#92;mu_{i}(&#92;mbox{positive factor 1}) + &#92;ldots + &#92;mu_{i}(&#92;mbox{positive factor }m)' title='&#92;mu_{i}(&#92;mbox{positive factor 0}) + &#92;mu_{i}(&#92;mbox{positive factor 1}) + &#92;ldots + &#92;mu_{i}(&#92;mbox{positive factor }m)' class='latex' />.</center></p>

<p>This model is what is going on <a href="https://ubiquity.mozilla.com/hg/ubiquity-firefox/raw-file/2bc28033a723/ubiquity/index.html#modules/parser/tng/parser.js">under the hood</a> in <a href="http://mitcho.com/blog/projects/a-demonstration-of-ubiquity-parser-2/">Ubiquity Parser 2</a>. The <code>Parser.Parse</code> class has a property called <code>.scoreMultiplier</code> which contains the score multiplier <img src='http://s0.wp.com/latex.php?latex=%5Cmu_i&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='&#92;mu_i' title='&#92;mu_i' class='latex' /> as described above. A method called <code>.getMaxScore()</code> is implemented in addition to <code>.getScore()</code>. Even before all of the nountype suggestion scores have been computed (e.g., in the case of asynchronous suggestions), <code>.getMaxScore()</code> can be used as an upper bound <img src='http://s0.wp.com/latex.php?latex=B_i&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='B_i' title='B_i' class='latex' /> and compared to the in-progress scores of other candidates, so that lower candidates can be taken out of consideration earlier in the parse process.</p>
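
<p>To make the score multiplier idea concrete, here is a small sketch of how such a parse object could behave. It is not the actual <code>Parser.Parse</code> code: the class <code>ParseSketch</code>, its <code>argCount</code> field, and the <code>addArgScore</code> method are hypothetical, and the only positive factors assumed here are per-argument scores between 0 and 1. Only the names <code>scoreMultiplier</code>, <code>getScore</code>, and <code>getMaxScore</code> are borrowed from the description above.</p>

```javascript
// Hypothetical sketch of a parse with a front-loaded score multiplier.
class ParseSketch {
  constructor(scoreMultiplier, argCount) {
    this.scoreMultiplier = scoreMultiplier; // negative factors folded into [0, 1]
    this.argCount = argCount; // number of parsed arguments, known in advance
    this.argScores = []; // positive factors (each in [0, 1]) computed so far
  }

  addArgScore(score) {
    this.argScores.push(score);
  }

  // Current score: the multiplier applied to each positive factor seen so far.
  getScore() {
    const sum = this.argScores.reduce((total, s) => total + s, 0);
    return this.scoreMultiplier * sum;
  }

  // Upper bound: assume every argument not yet scored gets the maximum of 1.
  getMaxScore() {
    const pending = this.argCount - this.argScores.length;
    return this.getScore() + this.scoreMultiplier * pending;
  }
}

const strong = new ParseSketch(1, 2); // no penalties
strong.addArgScore(0.9); // first argument already scored

const weak = new ParseSketch(0.4, 2); // heavily penalized, nothing scored yet

// If only one suggestion is wanted, `weak` can be dropped right away:
// its best possible score is already below the current score of `strong`.
const canDropWeak = weak.getMaxScore() < strong.getScore();
console.log(strong.getScore(), weak.getMaxScore(), canDropWeak);
```

<p>Because the multiplier is fixed before any argument is scored, the bound is available before the expensive nountype detection even starts, which is the point of front-loading the negative factors.</p>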

<h3>Conclusion</h3>

<p>In this blog post I&#8217;ve laid out a few different iterations of approaches I&#8217;ve thought of on the problem of scoring and ranking Ubiquity suggestions in a smart way. While some of the basic mechanisms of front-loading the negative factors into a <code>scoreMultiplier</code> and the computation of the <code>maxScore</code> (or upper bound) have been implemented, the actual optimization algorithm described here of removing parses from consideration earlier in the parser query has yet to be implemented in Ubiquity Parser 2 and I look forward to seeing it in action. In addition, there are surely factors I haven&#8217;t considered in the scoring or further tricks to improve the optimized scoring algorithm. <strong>I&#8217;d love to get your feedback and ideas on this topic.</strong> Thanks!</p>

<div class="footnotes">
<hr />
<ol>

<li id="fn:2">
<p>In the case of Ubiquity Parser 2, we&#8217;ll let the &#8220;time&#8221; values <img src='http://s0.wp.com/latex.php?latex=t&#038;bg=ffffff&#038;fg=000&#038;s=0' alt='t' title='t' class='latex' /> refer to the &#8220;steps&#8221; in the derivation, as laid out in <a href="https://wiki.mozilla.org/User:Mitcho/ParserTNG">the Ubiquity Parser 2 design</a>. Note that these &#8220;steps&#8221; are currently done in parallel across all candidates in the current architecture, making the &#8220;time&#8221; analogy legitimate. I will thus use integer time values here, making this a <a href="http://en.wikipedia.org/wiki/discrete-time">discrete-time</a> model.&#160;<a href="#fnref:2" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:1">
<p>Every Ubiquity parser query takes as a parameter the maximum number of suggestions to be returned. See <a href="https://ubiquity.mozilla.com/trac/ticket/532">the latest parser query interface proposal</a> for details on this interface.&#160;<a href="#fnref:1" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:3">
<p>This naming is an homage to the <a href="http://en.wikipedia.org/wiki/rising sun lemma">rising sun lemma</a> of <a href="http://en.wikipedia.org/wiki/Frigyes Riesz">Frigyes Riesz</a> which uses a similar logic. The apparent connection to the fact that I am Japanese is purely coincidental.&#160;<a href="#fnref:3" rev="footnote">&#8617;</a></p>
</li>

</ol>
</div>
]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/observation/scoring-for-optimization/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Attachment Ambiguity—or—when is the gyudon cheap?</title>
		<link>http://mitcho.com/blog/observation/attachment-ambiguity/</link>
		<comments>http://mitcho.com/blog/observation/attachment-ambiguity/#comments</comments>
		<pubDate>Wed, 15 Apr 2009 06:17:05 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[life]]></category>
		<category><![CDATA[observation]]></category>
		<category><![CDATA[arguments]]></category>
		<category><![CDATA[attachment ambiguity]]></category>
		<category><![CDATA[food]]></category>
		<category><![CDATA[Japanese culture]]></category>
		<category><![CDATA[Japanese language]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[Mozilla Planet]]></category>
		<category><![CDATA[parser]]></category>
		<category><![CDATA[syntax]]></category>
		<category><![CDATA[Tokyo]]></category>
		<category><![CDATA[ubiquity]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=1815</guid>
		<description><![CDATA[Every day on the way to work I walk by a fine establishment known as Yoshinoya (吉野家), Japan&#8217;s largest gyudon (牛丼) chain restaurant. For those of you whose lives have yet to be graced by gyudon, it&#8217;s a bowl of rice topped with beef and onions stewed in a sweet-savory soy-based sauce. Loving gyudon and [...]
]]></description>
			<content:encoded><![CDATA[<p><img src="http://mitcho.com/blog/wp-content/uploads/2009/04/yoshinoya.jpg" alt="yoshinoya.jpg" border="0" width="650" height="328" /></p>

<p>Every day on the way to work I walk by a fine establishment known as <a href="http://en.wikipedia.org/wiki/Yoshinoya">Yoshinoya</a> (吉野家), Japan&#8217;s largest <em>gyudon</em> (牛丼) chain restaurant. For those of you whose lives have yet to be graced by <a href="http://en.wikipedia.org/wiki/gyudon">gyudon</a>, it&#8217;s a bowl of rice topped with beef and onions stewed in a sweet-savory soy-based sauce. Loving gyudon and being a cheapskate, I naturally noticed the recent 50 yen off gyudon promotion at Yoshinoya. The photo above shows part of that sign.</p>

<p>Part of this sign, though, made me think about our <a href="http://mitcho.com/blog/projects/foxkeh-demos-ubiquity-parser-the-next-generation/">new Ubiquity parser</a>. In particular, it was the <strong>attachment ambiguity</strong> in the end date of the promotion. The text in the photo above literally reads &#8220;April 15th (Wed.) 8PM until&#8221;. (Note that Japanese is a strongly head-final language, and that the &#8220;until&#8221; is a postposition.) There are two possible readings for this expression, as illustrated by the two <a href="http://en.wikipedia.org/wiki/principle of compositionality">composition</a> trees below.</p>

<p><span id="more-1815"></span></p>

<p><center><img src="http://mitcho.com/blog/wp-content/uploads/2009/04/yoshinoya-trees.jpg" alt="yoshinoya-trees.jpg" border="0" width="658" height="157" /></center></p>

<p>The first tree, on the left, represents the reading &#8220;until (April 15th 8PM)&#8221;, while the second represents two arguments: &#8220;on April 15th&#8221; and &#8220;until 8PM&#8221;. In other words, in the first reading, the promotion begins at some earlier date and extends until April 15th at 8PM while, in the second reading, the promotion is one day only, on April 15th, until 8PM. Such syntactic ambiguities are called &#8220;attachment ambiguities&#8221; in linguistics, as the ambiguity lies in where different arguments &#8220;attach&#8221; in a tree representation.</p>

<p>This attachment ambiguity was possible because there was no clear <a href="http://mitcho.com/blog/projects/three-ways-to-argue-over-arguments/">marker</a> on &#8220;April 15th,&#8221; which might have disambiguated it as &#8220;on April 15th&#8221;. In fact, in many languages this time position argument takes no case marker or preposition, or the marker is optional, which makes such arguments difficult to parse. If such a sentence is entered with spaces, the <a href="http://mitcho.com/blog/projects/foxkeh-demos-ubiquity-parser-the-next-generation/">Ubiquity Parser: The Next Generation</a> would try a parse where &#8220;8PM&#8221; is the &#8220;until&#8221; or <code>goal</code> argument and &#8220;April 15th&#8221; is an <code>object</code> argument, but it would only check the noun type of &#8220;April 15th&#8221; and would not put it in <a href="http://mitcho.com/blog/projects/rolling-out-the-roles/">the correct semantic role</a> (<code>position</code>). Perhaps this is something to think about in the future.</p>

<p>These types of situations will surely come up as we continue work on the Ubiquity parser, making it essential to look at different languages. <strong>Are there certain kinds of arguments in your language that do not have any word-external markers such as case or prepositions/postpositions?</strong></p>
<p>Related posts:</p><ol>
<li><a href='http://mitcho.com/blog/projects/user-aided-disambiguation-a-demo/' rel='bookmark' title='User-Aided Disambiguation: a demo'>User-Aided Disambiguation: a demo</a></li>
<li><a href='http://mitcho.com/blog/projects/ubiquity-in-firefox-japanese/' rel='bookmark' title='Ubiquity in Firefox: Focus on Japanese'>Ubiquity in Firefox: Focus on Japanese</a></li>
<li><a href='http://mitcho.com/blog/projects/talking-ubiquity-in-japan-%e6%8b%a1%e5%bc%b5%e6%a9%9f%e8%83%bd%e5%8b%89%e5%bc%b7%e4%bc%9a%e3%81%ab%e3%81%a6%e7%99%ba%e8%a1%a8/' rel='bookmark' title='Talking Ubiquity in Japan: 拡張機能勉強会にて発表'>Talking Ubiquity in Japan: 拡張機能勉強会にて発表</a></li>
</ol>
<p>Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.</p>]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/observation/attachment-ambiguity/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Scoring and Ranking Suggestions</title>
		<link>http://mitcho.com/blog/observation/scoring-and-ranking-suggestions/</link>
		<comments>http://mitcho.com/blog/observation/scoring-and-ranking-suggestions/#comments</comments>
		<pubDate>Tue, 07 Apr 2009 07:17:26 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[observation]]></category>
		<category><![CDATA[algorithm]]></category>
		<category><![CDATA[candidates]]></category>
		<category><![CDATA[code]]></category>
		<category><![CDATA[constraints]]></category>
		<category><![CDATA[JavaScript]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[Mozilla Planet]]></category>
		<category><![CDATA[Optimality Theory]]></category>
		<category><![CDATA[order]]></category>
		<category><![CDATA[parser]]></category>
		<category><![CDATA[ranking]]></category>
		<category><![CDATA[score]]></category>
		<category><![CDATA[suggestions]]></category>
		<category><![CDATA[ubiquity]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=1745</guid>
		<description><![CDATA[I just spent some time reviewing how Ubiquity currently ranks its suggestions in relation to Parser The Next Generation so I thought I&#8217;d put some of these thoughts down in writing. The issue of ranking Ubiquity suggestions can be restated as predicting an optimal output given a certain input and various conflicting considerations. Ubiquity [...]
Related posts:<ol>
<li><a href='http://mitcho.com/blog/projects/ubiquity-parser-the-next-generation-demo/' rel='bookmark' title='Ubiquity Parser: The Next Generation Demo'>Ubiquity Parser: The Next Generation Demo</a></li>
<li><a href='http://mitcho.com/blog/projects/ubiquity-in-firefox-japanese/' rel='bookmark' title='Ubiquity in Firefox: Focus on Japanese'>Ubiquity in Firefox: Focus on Japanese</a></li>
<li><a href='http://mitcho.com/blog/projects/ubiquity-commands-by-the-numbers/' rel='bookmark' title='Ubiquity Commands by The Numbers'>Ubiquity Commands by The Numbers</a></li>
</ol>

Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.]]></description>
			<content:encoded><![CDATA[<p>I just spent some time reviewing how Ubiquity currently ranks its suggestions in relation to <a href="https://wiki.mozilla.org/User:Mitcho/ParserTNG">Parser The Next Generation</a> so I thought I&#8217;d put some of these thoughts down in writing.</p>

<p>The issue of ranking Ubiquity suggestions can be restated as predicting an optimal output given a certain input and various conflicting considerations. Ubiquity (1.8, as of this writing) computes four &#8220;scores&#8221; for each suggestion:</p>

<p><span id="more-1745"></span></p>

<ol>
<li><code>duplicateDefaultMatchScore</code>: 100 by default—lowered if an unused argument gets multiple suggestions (in <a href="https://ubiquity.mozilla.com/hg/ubiquity-firefox/file/0aaeae361c33/ubiquity/modules/parser/parser.js#l558">the words of the code</a>: &#8220;reduce the match score so that multiple entries with the same verb are only shown if there are no other verbs.&#8221;)</li>
<li><code>frequencyMatchScore</code>: a score from the <code>suggestion memory</code> of the frequency of the suggestion&#8217;s verb, given the input verb (currently the first word) or nothing, in the case of noun-first suggestions</li>
<li><code>verbMatchScore</code>: float in [0,1]: (as described <a href="https://wiki.mozilla.org/Labs/Ubiquity/Parser_Documentation#Scoring_the_Quality_of_the_Verb_Match">here</a>)

<ul>
<li>0.75 is returned if it is a noun-first suggestion (by virtue of the fact that <code>String.indexOf('')==0</code>)</li>
<li>1 if the verb name is equivalent across input-output</li>
<li>in [0.75,1) if the input is a prefix of the suggestion verb name</li>
<li>in [0.5,0.75) if the input is a non-prefix substring of the suggestion verb</li>
<li>in [0.25,0.5) if the input is a prefix of one of the <code>synonyms</code></li>
<li>in [0,0.25) if the input is a non-prefix substring of one of the <code>synonyms</code></li>
</ul></li>
<li><code>argMatchScore</code>: the number of arguments with matching &#8220;specific&#8221; nountypes, where &#8220;specific&#8221; is designated by the nountype having property <code>rankLast=false</code>.</li>
</ol>
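
<p>The <code>verbMatchScore</code> tiers above can be sketched roughly as follows. This is only an illustration, not Ubiquity&#8217;s actual code: the function name <code>verbMatchBand</code> and the shape of the <code>verb</code> object are assumptions, and the sketch returns the band a score would fall into rather than guessing at the exact value computed inside each band.</p>

```javascript
// Hypothetical helper (not part of Ubiquity): returns the [low, high] band
// that verbMatchScore would fall into, following the tiers listed above.
// The exact value inside each band is deliberately not reproduced.
function verbMatchBand(input, verb) {
  const name = verb.name;
  const synonyms = verb.synonyms || [];

  // The input is identical to the verb name.
  if (input === name) return [1, 1];

  // Prefix of the verb name. An empty input (noun-first suggestion)
  // also lands here, because ''.indexOf('') is 0 for every string.
  const pos = name.indexOf(input);
  if (pos === 0) return [0.75, 1];

  // Non-prefix substring of the verb name.
  if (pos > 0) return [0.5, 0.75];

  // Prefix of one of the synonyms.
  if (synonyms.some((s) => s.indexOf(input) === 0)) return [0.25, 0.5];

  // Non-prefix substring of one of the synonyms.
  if (synonyms.some((s) => s.indexOf(input) > 0)) return [0, 0.25];

  // No match at all.
  return null;
}
```

<p>Checking the verb name before the synonyms mirrors the order of the tiers in the list; whether the real implementation checks them in the same order is an assumption.</p>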

<p>With the numeric scores for each of these criteria, a partial order of suggestions is constructed using a <a href="http://en.wikipedia.org/wiki/lexicographic order">lexicographic order</a>: that is, compare candidates first using <code>duplicateDefaultMatchScore</code>, break ties using <code>frequencyMatchScore</code>, if still tied break using <code>verbMatchScore</code>, and if still tied break using <code>argMatchScore</code>. This paradigm of constraints is called &#8220;strictly ranked&#8221;: lower constraints, no matter how well a candidate scores on them, can never overcome a loss at a higher constraint. A crucial corollary of this system is that a candidate&#8217;s scores on lower constraints need not be computed if a higher constraint already dooms it to a lower position.<sup id="fnref:1"><a href="#fn:1" rel="footnote">1</a></sup></p>
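
<p>A strictly ranked comparison like this one can be written as a sort comparator. The sketch below is a minimal illustration, assuming each suggestion is a plain object carrying the four scores under the names used above; the names <code>CRITERIA</code> and <code>compareSuggestions</code> are made up for this example and do not come from the Ubiquity source.</p>

```javascript
// The four criteria, from highest-ranked to lowest-ranked constraint.
const CRITERIA = [
  'duplicateDefaultMatchScore',
  'frequencyMatchScore',
  'verbMatchScore',
  'argMatchScore',
];

// Hypothetical comparator for Array.prototype.sort: better suggestions
// sort first. A lower-ranked criterion is consulted only when all
// higher-ranked criteria are tied.
function compareSuggestions(a, b) {
  for (const key of CRITERIA) {
    if (a[key] !== b[key]) return b[key] - a[key];
  }
  return 0; // tied on every criterion
}
```

<p>Calling <code>suggestions.sort(compareSuggestions)</code> would then put the best candidates first. Note that the loop returns at the first criterion on which two candidates differ, which is the comparator-level version of the observation that lower constraints cannot rescue a loss higher up.</p>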

<h3>Ranking in The Next Generation</h3>

<p>One of the goals of <a href="https://wiki.mozilla.org/User:Mitcho/ParserTNG">Parser The Next Generation</a> is to make noun/argument-first input a first-class citizen of Ubiquity, improving its suggestions, in particular to the benefit of <a href="http://mitcho.com/blog/projects/ubiquity-in-firefox-japanese/">verb-final languages</a>. Arguments will be split up and tested against different noun types before a verb is even entered into the input, in which case target verbs can be ranked according to the appropriateness of the input&#8217;s arguments. As such, I believe the <code>argMatchScore</code> criterion above should either be ranked higher in a strictly ranked model or be allowed to overtake lower scores for the higher constraints in a non-strictly ranked model.</p>

<p>The <a href="https://wiki.mozilla.org/User:Mitcho/ParserTNG">Parser The Next Generation</a> proposal and <a href="http://mitcho.com/code/ubiquity/parser-demo">demo</a> currently order suggestions using a product of various criteria&#8217;s scores, rather than a lexicographic order of strictly ranked constraints. The component factors are:</p>

<ol>
<li><code>0.5</code> for parses where the verb was suggested</li>
<li><code>0.5</code> for each extra (>1) <code>object</code> argument (essentially &#8220;unused words&#8221; in the previous parser)</li>
<li>the score of each argument against that semantic role&#8217;s target noun type</li>
<li><code>0.8</code> for each unset argument of that verb</li>
</ol>

<p>Each component score is a value in [0,1], so the score is always non-increasing across the derivation. This offers a natural way to optimize the candidate set creation: if a possible parse ever gets a score below a magic &#8220;threshold&#8221; value, it is immediately thrown away.</p>
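
<p>Because every factor is at most 1, the running product can only stay the same or shrink as factors are multiplied in, so a parse that drops below the threshold can never climb back above it. A minimal sketch of that pruning follows; the function name <code>scoreParse</code> and the threshold value <code>0.2</code> are assumptions for illustration only, not values from the parser.</p>

```javascript
// Hypothetical cutoff; the real "magic" threshold value is not given here.
const THRESHOLD = 0.2;

// Multiply the component factors (each assumed to lie in [0, 1]) one at a
// time. As soon as the running product falls below the threshold, stop and
// return null: later factors could only lower the score further.
function scoreParse(factors, threshold = THRESHOLD) {
  let score = 1;
  for (const factor of factors) {
    score *= factor;
    if (score < threshold) return null; // pruned from the candidate set
  }
  return score;
}
```

<p>For example, a parse with a suggested verb (<code>0.5</code>) and one unset argument (<code>0.8</code>) survives this cutoff, while a parse that collects three <code>0.5</code> factors is discarded after the third one without looking at any remaining factors.</p>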

<p>A possible problem with the current Parser TNG scoring model is that it will implicitly penalize verbs and parses with more arguments, as they could have more sub-1 noun type score factors. This consideration may be great enough that a weighted additive model should be considered over a multiplicative one.</p>
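
<p>To make the concern concrete, here is a small comparison of the two models. It is only a sketch under assumptions: <code>productScore</code> and <code>weightedScore</code> are hypothetical names, and normalizing the weighted sum by the total weight is one possible design among several, not something the proposal specifies.</p>

```javascript
// Multiplicative model: every sub-1 factor pulls the total down.
function productScore(scores) {
  return scores.reduce((acc, s) => acc * s, 1);
}

// One possible weighted additive model (an assumption): a weighted mean.
// With no weights given, every argument counts equally.
function weightedScore(scores, weights = scores.map(() => 1)) {
  let total = 0;
  let weightSum = 0;
  scores.forEach((s, i) => {
    total += weights[i] * s;
    weightSum += weights[i];
  });
  // With nothing to score, treat the parse as unpenalized.
  return weightSum === 0 ? 1 : total / weightSum;
}

// One argument that matches its noun type at 0.9, versus three such arguments.
const oneArg = [0.9];
const threeArgs = [0.9, 0.9, 0.9];
console.log(productScore(oneArg), productScore(threeArgs));   // second is lower
console.log(weightedScore(oneArg), weightedScore(threeArgs)); // roughly equal
```

<p>Under the product, the three-argument parse scores noticeably lower even though each argument matches just as well; under the weighted mean the two parses come out about the same.</p>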

<p><strong>How do you think we can make Ubiquity&#8217;s suggestion ranking smarter? What other factors should be considered, and what factors could be left out?</strong></p>

<div class="footnotes">
<hr />
<ol>

<li id="fn:1">
<p>For all the linguists in the audience, if this sounds like <a href="http://en.wikipedia.org/wiki/Optimality Theory">Optimality Theory</a>, you would be right—there&#8217;s a little bit of <a href="http://roa.rutgers.edu/view.php3?roa=537">Prince and Smolensky (1993)</a> hanging out <a href="http://ubiquity.mozilla.com">in your browser</a>.&#160;<a href="#fnref:1" rev="footnote">&#8617;</a></p>
</li>

</ol>
</div>
<p>Related posts:</p><ol>
<li><a href='http://mitcho.com/blog/projects/ubiquity-parser-the-next-generation-demo/' rel='bookmark' title='Ubiquity Parser: The Next Generation Demo'>Ubiquity Parser: The Next Generation Demo</a></li>
<li><a href='http://mitcho.com/blog/projects/ubiquity-in-firefox-japanese/' rel='bookmark' title='Ubiquity in Firefox: Focus on Japanese'>Ubiquity in Firefox: Focus on Japanese</a></li>
<li><a href='http://mitcho.com/blog/projects/ubiquity-commands-by-the-numbers/' rel='bookmark' title='Ubiquity Commands by The Numbers'>Ubiquity Commands by The Numbers</a></li>
</ol>
<p>Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.</p>]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/observation/scoring-and-ranking-suggestions/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Where&#8217;s The Verb?</title>
		<link>http://mitcho.com/blog/observation/wheres-the-verb/</link>
		<comments>http://mitcho.com/blog/observation/wheres-the-verb/#comments</comments>
		<pubDate>Wed, 25 Mar 2009 07:10:20 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[observation]]></category>
		<category><![CDATA[commands]]></category>
		<category><![CDATA[infinitive]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[Mozilla Planet]]></category>
		<category><![CDATA[parser]]></category>
		<category><![CDATA[subjunctive]]></category>
		<category><![CDATA[typology]]></category>
		<category><![CDATA[ubiquity]]></category>
		<category><![CDATA[verb-final]]></category>
		<category><![CDATA[verbs]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=1647</guid>
		<description><![CDATA[Ubiquity&#8217;s proposed new parser design is based on a principles and parameters philosophy: we can build an underlying universal parser and, for each individual language, we simply set some &#8220;parameters&#8221; to tell the parser how to act. As we consider the design&#8217;s pros and cons, it&#8217;s important to reflect back on the linguistic data and [...]
Related posts:<ol>
<li><a href='http://mitcho.com/blog/projects/ubiquity-i18n-questions-to-ask/' rel='bookmark' title='Ubiquity i18n: questions to ask'>Ubiquity i18n: questions to ask</a></li>
<li><a href='http://mitcho.com/blog/projects/ubiquity-parser-the-next-generation-demo/' rel='bookmark' title='Ubiquity Parser: The Next Generation Demo'>Ubiquity Parser: The Next Generation Demo</a></li>
<li><a href='http://mitcho.com/blog/projects/contribute-how-your-language-identifies-its-arguments/' rel='bookmark' title='Contribute: how your language identifies its arguments'>Contribute: how your language identifies its arguments</a></li>
</ol>

Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.]]></description>
			<content:encoded><![CDATA[<p>Ubiquity&#8217;s <a href="https://wiki.mozilla.org/User:Mitcho/ParserTNG">proposed new parser design</a> is based on a <a href="http://en.wikipedia.org/wiki/principles and parameters">principles and parameters</a> philosophy: we can build an underlying universal parser and, for each individual language, we simply set some &#8220;parameters&#8221; to tell the parser how to act. As we consider the design&#8217;s pros and cons, it&#8217;s important to reflect back on the linguistic data and see if this architecture can adequately handle the range of linguistic data attested in our languages.</p>

<p>Today I&#8217;ll highlight some disparate typological data to help us understand these questions: <strong>where&#8217;s the verb?</strong> and <strong>what does the verb look like?</strong>
<span id="more-1647"></span>
There are broadly three different verb forms taken in commands in different languages:<sup id="fnref:1"><a href="#fn:1" rel="footnote">1</a></sup></p>

<ol>
<li>the <a href="http://en.wikipedia.org/wiki/infinitive">infinitive</a>,</li>
<li><a href="http://en.wikipedia.org/wiki/subjunctive mood">subjunctive mood</a>, or</li>
<li>a special verb form such as <a href="http://en.wikipedia.org/wiki/imperative">imperative</a>, <a href="http://en.wikipedia.org/wiki/participial">participial</a>, or conjunctive (such as Japanese <a href="http://en.wikipedia.org/wiki/Japanese verb conjugations#Te_form">て form</a>)</li>
</ol>

<p>Let&#8217;s give an example of each:</p>

<p><strong>Infinitive</strong> (English):<sup id="fnref:2"><a href="#fn:2" rel="footnote">2</a></sup></p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="english" style="font-family:monospace;">Hit me!</pre></td></tr></table></div>


<p><strong>Subjunctive mood</strong> (Modern Greek): &#8220;Eat it all!&#8221;</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="english" style="font-family:monospace;">Na   to fas olo!
SUBJ it eat all</pre></td></tr></table></div>


<p><strong>Imperative form</strong> (French): &#8220;Eat it!&#8221;</p>


<div class="wp_syntax"><table><tr><td class="code"><pre class="french" style="font-family:monospace;">Mange   -le!
eat.IMP it</pre></td></tr></table></div>


<p>It&#8217;s important to note that some languages have <em>multiple forms available</em> for the same command. For example:</p>

<p><strong>Dutch</strong>: three ways to say &#8220;watch out!&#8221; with the same verb</p>

<ol>
<li>Infinitive: <code>Oppassen!</code></li>
<li>Imperative: <code>Pas op!</code></li>
<li>Participial: <code>Opgepast!</code></li>
</ol>

<p>Similarly, I received <a href="http://mitcho.com/blog/projects/ubiquity-i18n-questions-to-ask/#comment-974">a great comment by PhiliKON</a> on German and <a href="http://mitcho.com/blog/projects/ubiquity-i18n-questions-to-ask/#comment-980">associated data by Robert Kaiser</a> on my blog post yesterday:</p>

<p><strong>German</strong>: &#8220;search hello with google&#8221;</p>

<ol>
<li>Infinitive: <code>hello mit google suchen</code></li>
<li>Imperative: <code>suche hello mit google</code></li>
</ol>

<p>In addition, German and Dutch are interesting as they are <a href="http://en.wikipedia.org/wiki/V2 word order">verb second (V2)</a> languages, so the verb may surface at the beginning or the end of the sentence, depending on the form.</p>

<p>The <a href="https://wiki.mozilla.org/User:Mitcho/ParserTNG">new parser design</a> (which <a href="http://mitcho.com/code/ubiquity/parser-demo/">you can demo</a>) assumes for simplicity that the verb should be found at the beginning or the end of the input, which is consistent with the data I&#8217;ve seen (modulo <a href="http://en.wikipedia.org/wiki/Clitic#Clitics_in_Romance_languages">clitics</a>). Multiple verb forms could be accounted for by supporting &#8220;synonyms&#8221; of the verbs.</p>
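
<p>The verb-at-the-edge assumption, combined with &#8220;synonyms&#8221; for multiple verb forms, could be sketched as below. This is not the demo&#8217;s actual code: the <code>VERBS</code> table, the <code>forms</code> property, and the function <code>findVerb</code> are all hypothetical, and the sketch assumes the input is delimited by spaces.</p>

```javascript
// Hypothetical verb table: each verb lists the surface forms ("synonyms")
// it may appear as, here the German infinitive and imperative of "search".
const VERBS = [
  { name: 'search', forms: ['suchen', 'suche'] },
];

// Returns every candidate parse in which the first or the last word of
// the input is a known verb form. The remaining words are left for
// argument parsing.
function findVerb(input, verbs = VERBS) {
  const words = input.trim().split(/\s+/);
  const last = words.length - 1;
  const candidates = [];
  for (const verb of verbs) {
    if (verb.forms.includes(words[0])) {
      candidates.push({ verb: verb.name, position: 'initial', rest: words.slice(1) });
    }
    // Only check the end separately when there is more than one word.
    if (last > 0 && verb.forms.includes(words[last])) {
      candidates.push({ verb: verb.name, position: 'final', rest: words.slice(0, last) });
    }
  }
  return candidates;
}
```

<p>With this table, both <code>suche hello mit google</code> and <code>hello mit google suchen</code> resolve to the same <code>search</code> verb with the same remaining words, one verb-initial and one verb-final.</p>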

<p><strong>What are the different ways verbs are expressed in commands in your language? Is the verb always found at the beginning or the end of the sentence? Is it ever somewhere in the middle?</strong></p>

<div class="footnotes">
<hr />
<ol>

<li id="fn:1">
<p>Some of the data and theoretical support for this section comes from, among other sources, Sabine Iatridou&#8217;s <a href="http://web.mit.edu/linguistics/people/faculty/iatridou/de_modo_imperativo.pdf">De Modo Imperativo</a> lecture notes.&#160;<a href="#fnref:1" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:2">
<p>Many refer to this in English as an &#8220;imperative form,&#8221; but in Modern English this is arguably the same as the infinitive.&#160;<a href="#fnref:2" rev="footnote">&#8617;</a></p>
</li>

</ol>
</div>
<p>Related posts:</p><ol>
<li><a href='http://mitcho.com/blog/projects/ubiquity-i18n-questions-to-ask/' rel='bookmark' title='Ubiquity i18n: questions to ask'>Ubiquity i18n: questions to ask</a></li>
<li><a href='http://mitcho.com/blog/projects/ubiquity-parser-the-next-generation-demo/' rel='bookmark' title='Ubiquity Parser: The Next Generation Demo'>Ubiquity Parser: The Next Generation Demo</a></li>
<li><a href='http://mitcho.com/blog/projects/contribute-how-your-language-identifies-its-arguments/' rel='bookmark' title='Contribute: how your language identifies its arguments'>Contribute: how your language identifies its arguments</a></li>
</ol>
<p>Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.</p>]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/observation/wheres-the-verb/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Unnatural by design</title>
		<link>http://mitcho.com/blog/observation/unnatural-by-design/</link>
		<comments>http://mitcho.com/blog/observation/unnatural-by-design/#comments</comments>
		<pubDate>Sun, 01 Mar 2009 19:22:00 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[observation]]></category>
		<category><![CDATA[awkward]]></category>
		<category><![CDATA[design]]></category>
		<category><![CDATA[food]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[menu]]></category>
		<category><![CDATA[Mozilla Planet]]></category>
		<category><![CDATA[photo]]></category>
		<category><![CDATA[translation]]></category>
		<category><![CDATA[ubiquity]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=1533</guid>
		<description><![CDATA[I&#8217;m flying over the Pacific Ocean right now but a little bit of language caught my eye. Here&#8217;s a picture of the menu for this flight, in three languages: English, Japanese, Chinese. What caught my eye is the line &#8220;served with ご一緒に 配,&#8221; meant to be read as part of &#8220;Beef in BBQ sauce&#8230; served [...]
Related posts:<ol>
<li><a href='http://mitcho.com/blog/projects/three-ways-to-argue-over-arguments/' rel='bookmark' title='Three ways to argue over arguments'>Three ways to argue over arguments</a></li>
<li><a href='http://mitcho.com/blog/observation/testing-googles-language-detection/' rel='bookmark' title='Testing Google&#8217;s Language Detection'>Testing Google&#8217;s Language Detection</a></li>
<li><a href='http://mitcho.com/blog/projects/ubiquity-in-firefox-japanese/' rel='bookmark' title='Ubiquity in Firefox: Focus on Japanese'>Ubiquity in Firefox: Focus on Japanese</a></li>
</ol>

Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m flying over the Pacific Ocean right now but a little bit of language caught my eye. Here&#8217;s a picture of the menu for this flight, in three languages: English, Japanese, Chinese.</p>

<p><img src="http://mitcho.com/blog/wp-content/uploads/2009/03/menu1.jpg" alt="menu.jpg" border="0" width="650" height="459" /></p>

<p>What caught my eye is the line &#8220;served with ご一緒に 配,&#8221; meant to be read as part of &#8220;Beef in BBQ sauce&#8230; <strong>served with</strong> Pepsi&#8230;&#8221;. The Chinese 配 (<em>pèi</em>) is fine here, meaning &#8220;with,&#8221; but the Japanese &#8220;ご一緒に&#8221; (<em>goissho-ni</em>) seemed awkward to me.</p>

<p><span id="more-1533"></span></p>

<p>The issue is that this adverbial meaning &#8220;together&#8221; normally comes <em>after</em> the &#8220;what it&#8217;s with&#8221; in an order like (1) (glossed in (2)):</p>


<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>1
2
</pre></td><td class="code"><pre class="japanese" style="font-family:monospace;">A B-と       ご一緒に
A B-and/with together</pre></td></tr></table></div>


<p>In other words, where English and Chinese both would say &#8220;A with B&#8221;, it is most natural in Japanese to say the equivalent of &#8220;A B with (together)&#8221;.<sup id="fnref:1"><a href="#fn:1" rel="footnote">1</a></sup> This is the reason why it seems unnatural to have anything between the &#8220;Beef in BBQ sauce&#8230;&#8221; line and &#8220;Pepsi&#8230;&#8221; line.</p>

<p>Looking at the rest of the menu, it&#8217;s clear that this isn&#8217;t a case where a native speaker wasn&#8217;t involved with the writing of the menu—the rest of the Japanese is perfect. <em>The Japanese modifier was inserted there just for the sake of parallel design, to the detriment of the text&#8217;s naturalness.</em> <strong>When have you seen design conflict with the structure of your language?</strong></p>

<div class="footnotes">
<hr />
<ol>

<li id="fn:1">
<p>This can be generalized to a certain extent by noting that English and Chinese are both <a href="http://en.wikipedia.org/wiki/head-initial">head-initial</a> (aka &#8220;right branching&#8221;) languages, while Japanese is strongly <a href="http://en.wikipedia.org/wiki/head-final">head-final</a> (aka &#8220;left branching&#8221;).&#160;<a href="#fnref:1" rev="footnote">&#8617;</a></p>
</li>

</ol>
</div>
<p>Related posts:</p><ol>
<li><a href='http://mitcho.com/blog/projects/three-ways-to-argue-over-arguments/' rel='bookmark' title='Three ways to argue over arguments'>Three ways to argue over arguments</a></li>
<li><a href='http://mitcho.com/blog/observation/testing-googles-language-detection/' rel='bookmark' title='Testing Google&#8217;s Language Detection'>Testing Google&#8217;s Language Detection</a></li>
<li><a href='http://mitcho.com/blog/projects/ubiquity-in-firefox-japanese/' rel='bookmark' title='Ubiquity in Firefox: Focus on Japanese'>Ubiquity in Firefox: Focus on Japanese</a></li>
</ol>
<p>Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.</p>]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/observation/unnatural-by-design/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Gaba, Shame On You</title>
		<link>http://mitcho.com/blog/observation/gaba-shame-on-you/</link>
		<comments>http://mitcho.com/blog/observation/gaba-shame-on-you/#comments</comments>
		<pubDate>Mon, 12 Jan 2009 11:03:29 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[observation]]></category>
		<category><![CDATA[ads]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[Engrish]]></category>
		<category><![CDATA[Gaba]]></category>
		<category><![CDATA[Japan]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[train]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=1260</guid>
		<description><![CDATA[Here&#8217;s a picture of an ad for Gaba, a big English conversation school in Japan, I snapped on a train recently. I felt the English sentence about Gaba&#8217;s satisfaction was extremely awkward, so I put it up on twitter to check with some other native speakers. My friends concurred. What do you think? I personally [...]
Related posts:<ol>
<li><a href='http://mitcho.com/blog/observation/%e5%8f%8e%e9%9b%86-vs-%e5%9b%9e%e5%8f%8e-and-better-word-meanings-through-usage/' rel='bookmark' title='回収 vs. 収集 and Better Word Meanings Through Usage'>回収 vs. 収集 and Better Word Meanings Through Usage</a></li>
<li><a href='http://mitcho.com/blog/life/krashen-the-party/' rel='bookmark' title='Krashen The Party'>Krashen The Party</a></li>
<li><a href='http://mitcho.com/blog/observation/white-protestants-and-catholics-dont-frequently-attend-religious-services/' rel='bookmark' title='White Protestants and Catholics don&#8217;t frequently attend religious services'>White Protestants and Catholics don&#8217;t frequently attend religious services</a></li>
</ol>

Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.]]></description>
			<content:encoded><![CDATA[<p><img class='limages' width='600' height='800' src="http://mitcho.com/blog/wp-content/uploads/2009/01/img_0012.jpg" alt="A Gaba ad on a train" title="gaba" /></p>

<p>Here&#8217;s a picture of an ad for <a href="http://en.wikipedia.org/wiki/Gaba">Gaba</a>, a big English conversation school in Japan, I snapped on a train recently. I felt the English sentence about Gaba&#8217;s satisfaction was extremely awkward, so I put it up on <a href="http://twitter.com/mitchoyoshitaka">twitter</a> to check with some other native speakers. My friends concurred. What do you think?</p>

<p>I personally think the sentence would be improved by removing the &#8220;the&#8221; in &#8220;the satisfaction.&#8221; Others offered &#8220;continues to rise&#8221; as possibly preferable to &#8220;continually rise.&#8221; English articles, especially the definiteness of abstract nouns, are very difficult for many non-native speakers. That being said, it&#8217;s sad for a sentence of such questionable acceptability to come from a company which, in theory, prides itself on its English ability and surely hires many native speakers. Gaba, shame on you.</p>
<p>Related posts:</p><ol>
<li><a href='http://mitcho.com/blog/observation/%e5%8f%8e%e9%9b%86-vs-%e5%9b%9e%e5%8f%8e-and-better-word-meanings-through-usage/' rel='bookmark' title='回収 vs. 収集 and Better Word Meanings Through Usage'>回収 vs. 収集 and Better Word Meanings Through Usage</a></li>
<li><a href='http://mitcho.com/blog/life/krashen-the-party/' rel='bookmark' title='Krashen The Party'>Krashen The Party</a></li>
<li><a href='http://mitcho.com/blog/observation/white-protestants-and-catholics-dont-frequently-attend-religious-services/' rel='bookmark' title='White Protestants and Catholics don&#8217;t frequently attend religious services'>White Protestants and Catholics don&#8217;t frequently attend religious services</a></li>
</ol>
<p>Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.</p>]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/observation/gaba-shame-on-you/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>This is what a release looks like</title>
		<link>http://mitcho.com/blog/projects/release-downloads/</link>
		<comments>http://mitcho.com/blog/projects/release-downloads/#comments</comments>
		<pubDate>Wed, 10 Dec 2008 12:55:04 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[metablog]]></category>
		<category><![CDATA[observation]]></category>
		<category><![CDATA[projects]]></category>
		<category><![CDATA[download]]></category>
		<category><![CDATA[graph]]></category>
		<category><![CDATA[Mint]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[WordPress]]></category>
		<category><![CDATA[WordPress Planet]]></category>
		<category><![CDATA[YARPP]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=1157</guid>
		<description><![CDATA[This is what the latest release (2.1.6) of my Yet Another Related Posts Plugin looked like under Mint, using my WordPress plugin downloads pepper, which in turn gets its data from wordpress.org: It&#8217;s always interesting to see these release spikes in download traffic. Note that this release was on the Wednesday but that was during [...]
Related posts:<ol>
<li><a href='http://mitcho.com/blog/projects/yet-another-related-posts-plugin-20/' rel='bookmark' title='Yet Another Related Posts Plugin 2.0'>Yet Another Related Posts Plugin 2.0</a></li>
<li><a href='http://mitcho.com/blog/projects/keep-up-with-yet-another-related-posts-plugin-with-rss/' rel='bookmark' title='Keep up with Yet Another Related Posts Plugin with RSS!'>Keep up with Yet Another Related Posts Plugin with RSS!</a></li>
<li><a href='http://mitcho.com/blog/projects/modifiying-wordpress-plugin-activation-behavior/' rel='bookmark' title='Modifiying WordPress plugin activation behavior'>Modifiying WordPress plugin activation behavior</a></li>
</ol>

Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.]]></description>
			<content:encoded><![CDATA[<p>This is what the latest release (2.1.6) of my <a href="/code/yarpp">Yet Another Related Posts Plugin</a> looked like under <a href="http://www.haveamint.com/">Mint</a>, using my WordPress plugin downloads pepper, which in turn gets its data from <a href="http://wordpress.org/extend/plugins/yet-another-related-posts-plugin/stats/">wordpress.org</a>:</p>

<p><img src="/blog/wp-content/uploads/2008/12/mint-wordpress-downloads.png" alt="" title="YARPP downloads 2.1.6" /></p>

<p>It&#8217;s always interesting to see these release spikes in download traffic. Note that this release was on the Wednesday but that was during the day, so Wednesday&#8217;s traffic is still higher than the normal ~300/day level, while the big peak (by day) is on Thursday. Too bad wordpress.org doesn&#8217;t give me hourly stats, though I guess that would be a little ridiculous.</p>

<p>YARPP is just about at that 35k download mark. I&#8217;m looking forward to the next release. ^^</p>
<p>Related posts:</p><ol>
<li><a href='http://mitcho.com/blog/projects/yet-another-related-posts-plugin-20/' rel='bookmark' title='Yet Another Related Posts Plugin 2.0'>Yet Another Related Posts Plugin 2.0</a></li>
<li><a href='http://mitcho.com/blog/projects/keep-up-with-yet-another-related-posts-plugin-with-rss/' rel='bookmark' title='Keep up with Yet Another Related Posts Plugin with RSS!'>Keep up with Yet Another Related Posts Plugin with RSS!</a></li>
<li><a href='http://mitcho.com/blog/projects/modifiying-wordpress-plugin-activation-behavior/' rel='bookmark' title='Modifiying WordPress plugin activation behavior'>Modifiying WordPress plugin activation behavior</a></li>
</ol>
<p>Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.</p>]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/projects/release-downloads/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Bald Moves</title>
		<link>http://mitcho.com/blog/observation/bald-moves/</link>
		<comments>http://mitcho.com/blog/observation/bald-moves/#comments</comments>
		<pubDate>Fri, 24 Oct 2008 16:20:36 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[link]]></category>
		<category><![CDATA[observation]]></category>
		<category><![CDATA[bailout]]></category>
		<category><![CDATA[bald]]></category>
		<category><![CDATA[economics]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=896</guid>
		<description><![CDATA[On September 19th, Treasury Secretary Henry Paulson made a speech regarding the Troubled Assets Relief Program (TARP) to allay the fears of investors: I am convinced that this bald approach will cost American families far less than the alternative—a continuing series of financial institution failures and frozen credit markets unable to fund economic expansion. Unfortunately, [...]
No related posts.

Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.]]></description>
			<content:encoded><![CDATA[<p>On September 19th, Treasury Secretary Henry Paulson made a speech regarding the <a href="http://en.wikipedia.org/wiki/Troubled Assets Relief Program">Troubled Assets Relief Program</a> (TARP) to allay the fears of investors:</p>

<blockquote>
  <p>I am convinced that this <strong>bald approach</strong> will cost American families far less than the alternative—a continuing series of financial institution failures and frozen credit markets unable to fund economic expansion.</p>
</blockquote>

<p><object id="swfclipV3110753" width="421" height="376" type="application/x-shockwave-flash" data="http://www.thenewsroom.com/mash/swf/cube.swf?a=V3110753&amp;m=670372"><param name="movie" value="http://www.thenewsroom.com/mash/swf/cube.swf?a=V3110753&amp;m=670372"/><param name="allowScriptAccess" value="always"/><param name="base" value="." /><param name="wmode" value="transparent"/><param name="allowfullscreen" value="true"/></object></p>

<p>Unfortunately, the key phrase in this passage was widely mistranscribed in the media as a &#8220;bold approach.&#8221; But now that more details of the new Troubled Asset Relief Program have been released, Secretary Paulson&#8217;s true intentions are clear.</p>

<p>Chris Carey of <a href="http://bailoutsleuth.com/2008/10/a-new-appointment/">Bailout Sleuth</a> writes:</p>

<p><a href="http://mitcho.com/blog/wp-content/uploads/2008/10/thebaldteam.jpg" title="The Bald Team for a Bald Approach<br/>left to right: Neel Kashkari, James H. Lambright, and Henry M. Paulson, Jr." rel="lightbox[bald-moves]"><img class='limages' src="http://mitcho.com/blog/wp-content/uploads/2008/10/thebaldteam.jpg" title="The Bald Team" width="530" height="205" /></a></p>

<blockquote>
  <p>The Treasury Department tapped James H. Lambright [above center], head of the Export-Import Bank, as the interim chief investment officer for the $700 billion Troubled Asset Relief Program&#8230; The bailout program is being directed by Neel Kashkari [above left], who had been senior advisor to Treasury Secretary Henry M. Paulson Jr [above right].</p>
</blockquote>

<p>Will this new program stem the global credit crisis? Maybe. But at least we can all agree&#8230; it&#8217;s a bald move.</p>
<p>No related posts.</p>
<p>Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.</p>]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/observation/bald-moves/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>回収 vs. 収集 and Better Word Meanings Through Usage</title>
		<link>http://mitcho.com/blog/observation/%e5%8f%8e%e9%9b%86-vs-%e5%9b%9e%e5%8f%8e-and-better-word-meanings-through-usage/</link>
		<comments>http://mitcho.com/blog/observation/%e5%8f%8e%e9%9b%86-vs-%e5%9b%9e%e5%8f%8e-and-better-word-meanings-through-usage/#comments</comments>
		<pubDate>Thu, 18 Sep 2008 14:50:27 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[life]]></category>
		<category><![CDATA[observation]]></category>
		<category><![CDATA[Bailey]]></category>
		<category><![CDATA[cognitive linguistics]]></category>
		<category><![CDATA[corpora]]></category>
		<category><![CDATA[corpus]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[frame semantics]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Japanese language]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[language learning]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[synonymy]]></category>
		<category><![CDATA[translation]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=721</guid>
		<description><![CDATA[Bailey just asked me what the difference between 回収 (kaishū) and 収集(shūshū) is—two words that would both map to the English verb &#8220;collect.&#8221; I intuitively came up with a hypothesis to explain the distinction: 回収 may take things away from others when collecting while 収集 does not have that implication. Things that you 回収 may [...]
Related posts:<ol>
<li><a href='http://mitcho.com/blog/observation/testing-googles-language-detection/' rel='bookmark' title='Testing Google&#8217;s Language Detection'>Testing Google&#8217;s Language Detection</a></li>
<li><a href='http://mitcho.com/blog/life/taipei-find-a-dictionary-of-chinese-japanese-false-cognates/' rel='bookmark' title='Taipei find: a dictionary of Chinese-Japanese false cognates'>Taipei find: a dictionary of Chinese-Japanese false cognates</a></li>
<li><a href='http://mitcho.com/blog/life/the-most-beautiful-word/' rel='bookmark' title='The Most Beautiful Word'>The Most Beautiful Word</a></li>
</ol>

Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.]]></description>
			<content:encoded><![CDATA[<p><a href="http://bpick.tumblr.com/">Bailey</a> just asked me what the difference between 回収 (<em>kaishū</em>) and 収集(<em>shūshū</em>) is—two words that would both map to the English verb &#8220;collect.&#8221; I intuitively came up with a hypothesis to explain the distinction:</p>

<ul>
<li>回収 may take things away from others when collecting while 収集 does not have that implication.</li>
<li>Things that you 回収 may have been previously distributed by the actor themself while 収集 does not have that implication.<sup id="fnref:3"><a href="#fn:3" rel="footnote">1</a></sup></li>
</ul>

<p>Not content with armchair theorizing, however, I decided to take advantage of one of the largest corpora in the world: <a href="http://en.wikipedia.org/wiki/Google">Google</a>.<sup id="fnref:2"><a href="#fn:2" rel="footnote">2</a></sup> To test my hypothesis, I chose two &#8220;objects of collection&#8221;, one you can take away (and which is often distributed first) and one you can&#8217;t take away: アンケート (<em>ankēto</em> &#8220;survey,&#8221; from the French <em>enquête</em>) and 意見 (<em>iken</em> &#8220;opinion&#8221;). I then searched Google for the four resulting collocations<sup id="fnref:1"><a href="#fn:1" rel="footnote">3</a></sup> in quotes (&#8220;•&#8221;) and recorded how many hits each returned.</p>

<p><span id="more-721"></span></p>

<table style="margin-left: auto; margin-right: auto;">
<tr><th>&#8220;意見を収集&#8221;</th><th>&#8220;意見を回収&#8221;</th><th>&#8220;アンケートを収集&#8221;</th><th>&#8220;アンケートを回収&#8221;</th></tr>
<tr><td>218000</td><td>6200</td><td>784</td><td>169000</td></tr>
</table>

<p>A better way to organize this data is as follows:</p>

<table style="margin-left: auto; margin-right: auto;">
<tr><th>&#8220;↓を→&#8221;</th><th>回収</th><th>収集</th></tr>
<tr><th>アンケート</th><td>169000</td><td>784</td></tr>
<tr><th>意見</th><td>6200</td><td>218000</td></tr>
</table>

<p>This data clearly supports the hypothesis I laid out above: アンケート, which can be taken away from people and is often distributed first, occurs much more often with 回収 than with 収集. 意見, on the other hand, which crucially cannot be taken away when collected, occurs much more often with 収集 than with 回収.</p>
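
<p>As a rough sketch of that comparison (not part of the original test), the Python snippet below turns the hit counts from the first table into per-noun shares, i.e. what fraction of each noun&#8217;s hits use each verb. The <code>hits</code> dictionary and the <code>share</code> helper are names made up for this illustration, and raw Google hit counts are only approximate.</p>

```python
# Hit counts copied from the first table above (quoted Google searches).
# The dictionary layout and helper name are assumptions for illustration.
hits = {
    ("アンケート", "回収"): 169000,
    ("アンケート", "収集"): 784,
    ("意見", "回収"): 6200,
    ("意見", "収集"): 218000,
}


def share(noun, verb):
    """Fraction of a noun's hits (across both verbs) that use the given verb."""
    total = sum(count for (n, _), count in hits.items() if n == noun)
    return hits[(noun, verb)] / total


for noun in ("アンケート", "意見"):
    print(
        noun,
        "回収:", round(share(noun, "回収"), 3),
        "収集:", round(share(noun, "収集"), 3),
    )
```

<p>With these counts, roughly 99.5% of the アンケート hits use 回収, while roughly 97% of the 意見 hits use 収集.</p>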

<p>While this one example doesn&#8217;t <em>prove</em> anything in and of itself, it does help clarify with data a nuance between two near synonyms. While my hypothesis was borne out here, native speaker intuitions on word nuances and distinctions can be unreliable.<sup id="fnref:4"><a href="#fn:4" rel="footnote">4</a></sup> This type of quick test can be very helpful for language learners and instructors alike.</p>

<p>Languages very often have words which vary in very subtle ways. Just this Tuesday I went to a <a href="http://linguistic.meetup.com/58/">Tokyo Language Exchange Meetup</a>, a great <a href="http://en.wikipedia.org/wiki/meetup.com">meetup</a> which brought together various language learners and enthusiasts. A hot topic that night was words with very similar meanings—near synonyms. A few English learners were lamenting sets of words like {see, view, watch} and how difficult they are to learn. I myself have had the same experience studying Mandarin.</p>

<p>I noted that these difficulties in offering contrasting definitions often are due to the fact that word meanings are not just &#8220;what the word points to&#8221; but also the implication of &#8220;what it relates to&#8221;.<sup id="fnref:5"><a href="#fn:5" rel="footnote">5</a></sup> For example, &#8220;unborn baby&#8221; and &#8220;fetus&#8221; may point to the same thing, but are used in different contexts, in contrast to different other terms, for differing effect. Similarly &#8220;Death Tax&#8221; and &#8220;Estate Tax.&#8221; &#8220;Kneel&#8221; and &#8220;genuflect.&#8221;<sup id="fnref:6"><a href="#fn:6" rel="footnote">6</a></sup></p>

<p>The concept of word meanings being &#8220;what it points to&#8221; and &#8220;what it relates to&#8221; also helps explain why certain words are difficult to translate. Fillmore uses the Japanese example of ぬるい (<em>nurui</em>) which is the de facto translation of &#8220;lukewarm.&#8221; However, some Japanese speakers will only use ぬるい in contrast with &#8220;hot,&#8221; i.e., hot tea can become ぬるい over time but ice water does not become ぬるい. In contrast, English &#8220;lukewarm&#8221; can be used to describe things that are initially or prototypically hot or cold. &#8220;What the words point to&#8221; in this case is the same but &#8220;what it relates to&#8221; or, here, &#8220;what it contrasts with&#8221; is different, making it an imperfect (though very close) translation.</p>

<p>Every language has near synonyms which vary slightly in nuance but this nuance or &#8220;feeling&#8221; is borne out objectively in data. Looking at what words certain terms relate to <em>in real usage</em> is often the key to getting a richer understanding of vocabulary.</p>

<div class="footnotes">
<hr />
<ol>

<li id="fn:3">
<p>This second point could also be hypothesized based on the component meaning of 回, which in the verb 回る (<em>mawa=ru</em>) can mean &#8220;circle back.&#8221;&#160;<a href="#fnref:3" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:2">
<p>Google is of course a huge corpus but it has very limited search and can easily be misused and misunderstood, thus making Google an unreliable (unprofessional) source for statistical data. One Google alternative for some different statistics is the <a href="http://en.wikipedia.org/wiki/n-gram">n-gram</a> <a href="http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html">data they offer</a> for research.&#160;<a href="#fnref:2" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:1">
<p><a href="http://en.wikipedia.org/wiki/collocation">&#8220;Collocation&#8221; on Wikipedia</a> says: &#8220;Within the area of corpus linguistics, collocation is defined as a sequence of words or terms which co-occur more often than would be expected by chance.&#8221;&#160;<a href="#fnref:1" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:4">
<p>Hm&#8230; I just made a claim&#8230; looking for a citation.&#160;<a href="#fnref:4" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:5">
<p>&#8220;Relates to&#8221; here is not meant in an etymological sense. In <a href="http://en.wikipedia.org/wiki/frame semantics (linguistics)">frame semantics</a>, a part of <a href="http://en.wikipedia.org/wiki/cognitive linguistics">cognitive linguistics</a>, the &#8220;what the word points to&#8221; may be called a <strong>profile</strong> while the &#8220;what it relates to&#8221; is called the <strong>(semantic) frame</strong>. These distinctions are due to the work of <a href="http://en.wikipedia.org/wiki/Charles J. Fillmore">Fillmore</a> 1976.&#160;<a href="#fnref:5" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:6">
<p>The great examples in this section come from Bill Croft and D. Alan Cruse&#8217;s <em>Cognitive Linguistics</em>, 2004&#160;<a href="#fnref:6" rev="footnote">&#8617;</a></p>
</li>

</ol>
</div>
<p>Related posts:</p><ol>
<li><a href='http://mitcho.com/blog/observation/testing-googles-language-detection/' rel='bookmark' title='Testing Google&#8217;s Language Detection'>Testing Google&#8217;s Language Detection</a></li>
<li><a href='http://mitcho.com/blog/life/taipei-find-a-dictionary-of-chinese-japanese-false-cognates/' rel='bookmark' title='Taipei find: a dictionary of Chinese-Japanese false cognates'>Taipei find: a dictionary of Chinese-Japanese false cognates</a></li>
<li><a href='http://mitcho.com/blog/life/the-most-beautiful-word/' rel='bookmark' title='The Most Beautiful Word'>The Most Beautiful Word</a></li>
</ol>
<p>Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.</p>]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/observation/%e5%8f%8e%e9%9b%86-vs-%e5%9b%9e%e5%8f%8e-and-better-word-meanings-through-usage/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Oh Twitter, you&#8217;re so funny</title>
		<link>http://mitcho.com/blog/observation/oh-twitter-youre-so-funny/</link>
		<comments>http://mitcho.com/blog/observation/oh-twitter-youre-so-funny/#comments</comments>
		<pubDate>Wed, 25 Jun 2008 01:58:34 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[observation]]></category>
		<category><![CDATA[funny]]></category>
		<category><![CDATA[humor]]></category>
		<category><![CDATA[Twitter]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=458</guid>
		<description><![CDATA[I don&#8217;t think those two were related. Related posts: Oh Amazon, you&#8217;re so funny Related posts brought to you by Yet Another Related Posts Plugin.
Related posts:<ol>
<li><a href='http://mitcho.com/blog/life/oh-amazon-youre-so-funny/' rel='bookmark' title='Oh Amazon, you&#8217;re so funny'>Oh Amazon, you&#8217;re so funny</a></li>
</ol>

Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.]]></description>
			<content:encoded><![CDATA[<p><a href='http://mitcho.com/blog/wp-content/uploads/2008/06/ohtwitter.png'><img src="http://mitcho.com/blog/wp-content/uploads/2008/06/ohtwitter.png" alt="" title="Oh Twitter" width="285" height="126" class="alignnone size-medium wp-image-459" /></a></p>

<p>I don&#8217;t <em>think</em> those two were related.</p>
<p>Related posts:</p><ol>
<li><a href='http://mitcho.com/blog/life/oh-amazon-youre-so-funny/' rel='bookmark' title='Oh Amazon, you&#8217;re so funny'>Oh Amazon, you&#8217;re so funny</a></li>
</ol>
<p>Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.</p>]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/observation/oh-twitter-youre-so-funny/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Japanese Office</title>
		<link>http://mitcho.com/blog/observation/the-japanese-office/</link>
		<comments>http://mitcho.com/blog/observation/the-japanese-office/#comments</comments>
		<pubDate>Thu, 29 May 2008 17:22:00 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[observation]]></category>
		<category><![CDATA[funny]]></category>
		<category><![CDATA[humor]]></category>
		<category><![CDATA[Japanese culture]]></category>
		<category><![CDATA[Japanese language]]></category>
		<category><![CDATA[Mori no Ike]]></category>
		<category><![CDATA[parody]]></category>
		<category><![CDATA[The Office]]></category>
		<category><![CDATA[translation]]></category>
		<category><![CDATA[TV]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=297</guid>
		<description><![CDATA[I got hooked on The Office since I&#8217;ve been in Taiwan, which I watch at hulu.com via VPN. Checking for a new episode the other day, I found this clip from Steve Carell on Saturday Night Live this past weekend: The Japanese Office. I&#8217;ve been a fan of the SNL Digital Shorts since Lazy Sunday, [...]
Related posts:<ol>
<li><a href='http://mitcho.com/blog/life/taipei-find-a-dictionary-of-chinese-japanese-false-cognates/' rel='bookmark' title='Taipei find: a dictionary of Chinese-Japanese false cognates'>Taipei find: a dictionary of Chinese-Japanese false cognates</a></li>
<li><a href='http://mitcho.com/blog/life/bailey-won-the-japanese-language-speech-contest/' rel='bookmark' title='Bailey won the Japanese Language Speech Contest'>Bailey won the Japanese Language Speech Contest</a></li>
<li><a href='http://mitcho.com/blog/projects/mailplane-japanese-localization-available/' rel='bookmark' title='Mailplane Japanese localization available!'>Mailplane Japanese localization available!</a></li>
</ol>

Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.]]></description>
			<content:encoded><![CDATA[<p>I got hooked on <a href="http://www.amazon.com/gp/redirect.html?ie=UTF8&#038;location=http%3A%2F%2Fwww.amazon.com%2Fgp%2Fentity%2FThe-Office%2FB001CHC6NE%3Fie%3DUTF8%26%252AVersion%252A%3D1%26%252Aentries%252A%3D0&#038;tag=mitchocom-20&#038;linkCode=ur2&#038;camp=1789&#038;creative=390957">The Office</a><img src="https://www.assoc-amazon.com/e/ir?t=mitchocom-20&amp;l=ur2&amp;o=1" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" /> since I&#8217;ve been in Taiwan, which I watch at <a href="http://hulu.com">hulu.com</a> via <a href="http://en.wikipedia.org/wiki/VPN">VPN</a>. Checking for a new episode the other day, I found this clip from <a href="http://en.wikipedia.org/wiki/Steve Carell">Steve Carell</a> on <a href="http://en.wikipedia.org/wiki/Saturday Night Live">Saturday Night Live</a> this past weekend: <strong>The Japanese Office</strong>.</p>

<p><embed allowNetworking="all" allowScriptAccess="always" src="http://widgets.nbc.com/o/4727a250e66f9723/483ec1b834ea4542" width="650" height="478" quality="high" wmode="transparent" id="W483ec1b834ea4542" pluginspage="http://www.macromedia.com/go/getflashplayer" type="application/x-shockwave-flash"> </embed></p>

<p>I&#8217;ve been a fan of the <a href="http://en.wikipedia.org/wiki/SNL Digital Shorts">SNL Digital Shorts</a> since <a href="http://en.wikipedia.org/wiki/Lazy Sunday">Lazy Sunday</a>, but this is absolutely something else. It&#8217;s a brilliant piece of cross-cultural parody. Many on the <a href="http://www.hulu.com/watch/20337/saturday-night-live-snl-digital-short-the-japanese-office#s-p1-st-i1">associated Hulu page</a> had some questions, however, so I decided to write up a little explanation of what&#8217;s actually going on in this short, and why I love it so.<sup id="fnref:2"><a href="#fn:2" rel="footnote">1</a></sup></p>

<p><span id="more-297"></span></p>

<p>The Digital Short begins with the Japanese version of the intro sequence, including a shrine, a <a href="http://en.wikipedia.org/wiki/700 Series Shinkansen">700 series bullet train</a>, and the Scranton city sign now showing <a href="http://en.wikipedia.org/wiki/Amagasaki, Hyōgo">Amagasaki (尼崎市)</a>, a similarly industrial city near Osaka. <a href="http://en.wikipedia.org/wiki/Dwight Schrute">Dwight</a> shredding paper with Japanese text and <a href="http://en.wikipedia.org/wiki/Jim Halpert">Jim</a> eating noodles are nice touches. All the names, in case you were wondering, are possible Japanese names (modulo Jim&#8217;s actor&#8217;s name being in <a href="http://en.wikipedia.org/wiki/katakana">katakana</a>, and thus exclusively foreign). After <a href="http://en.wikipedia.org/wiki/Michael Scott">Michael Scott</a> with extra black hair readjusts his <a href="http://en.wikipedia.org/wiki/maneki neko">lucky cat</a> (<em>manekineko</em>, 招き猫), we get to the brilliant title card.</p>

<p><a rel="lightbox[the-japanese-office]" href='http://mitcho.com/blog/wp-content/uploads/2008/05/title1.png'><img class="images" src="http://mitcho.com/blog/wp-content/uploads/2008/05/title1-300x206.png" alt="" title="title1" width="300" height="206" class="alignnone size-medium wp-image-300" /></a><a rel="lightbox[the-japanese-office]" href='http://mitcho.com/blog/wp-content/uploads/2008/05/title2.png'><img class="images" src="http://mitcho.com/blog/wp-content/uploads/2008/05/title2-300x209.png" alt="" title="title2" width="300" height="209" class="alignnone size-medium wp-image-301" /></a></p>

<p>As Japanese lacks definite and indefinite articles, the word &#8220;The&#8221; is replaced with 「その」 <em>sono</em>, the demonstrative &#8216;that&#8217;, making the title literally &#8220;That Office.&#8221; Given the lack of a straightforward replacement for &#8220;The,&#8221; however, I find this a very cute artifact of overly faithful translation.</p>

<p>The short itself runs through the Japanese versions of a few key scenes from the first episode of The Office. In the first, <a href="http://en.wikipedia.org/wiki/Pam Beasly">Pam</a> is answering the phone and <a href="http://en.wikipedia.org/wiki/Michael Scott">Michael</a> interrupts her in his signature way, repeating her name (or, the Japanese equivalent of &#8220;Pam&#8221;, <em>pamu</em> パム) and then dropping <em>-san</em>, a personal name suffix—the equivalent of Mister or Miss—and smiling into the camera, content with his own cleverness. Pam says something indiscernible to Michael, referring to him as <em>Tanaka-san</em> (the &#8220;Mr. Smith&#8221; of Japan—even though his name plate accurately said &#8220;Michael Scott&#8221; マイケル・スコット), to which Michael mumbles 「そういうことです」, a phrase meaning &#8220;and that&#8217;s that,&#8221; or &#8220;and that is the case.&#8221; My guess is that this was the attempted translation of &#8220;that&#8217;s what she said.&#8221;</p>

<p><a rel="lightbox[the-japanese-office]" href='http://mitcho.com/blog/wp-content/uploads/2008/05/micahel1.png'><img class="images" src="http://mitcho.com/blog/wp-content/uploads/2008/05/micahel1-300x220.png" alt="" title="micahel1" width="300" height="220" class="alignnone size-medium wp-image-303" /></a><a rel="lightbox[the-japanese-office]" href='http://mitcho.com/blog/wp-content/uploads/2008/05/michael2.png'><img class="images" src="http://mitcho.com/blog/wp-content/uploads/2008/05/michael2-300x207.png" alt="" title="michael2" width="300" height="207" class="alignnone size-medium wp-image-304" /></a></p>

<p>Michael then goes on to say 「日本で一番面白いボスです」 (<em>nihon-de ichiban omoshiroi bosu desu</em>, &#8216;[I am] the most interesting (=funniest) boss in Japan&#8217;). Steve Carell&#8217;s snicker halfway through that line, in response to his trying really hard at producing it, is very cute. The mug itself says 「世界中で一番面白い社長」(&#8220;world&#8217;s funniest company president&#8221;). This reminds me of my dad when he speaks Japanese, in the best way possible. ^^</p>

<p><a rel="lightbox[the-japanese-office]" href='http://mitcho.com/blog/wp-content/uploads/2008/05/jim1.png'><img class="images" src="http://mitcho.com/blog/wp-content/uploads/2008/05/jim1-300x195.png" alt="" title="jim1" width="300" height="195" class="alignnone size-medium wp-image-305" /></a><a rel="lightbox[the-japanese-office]" href='http://mitcho.com/blog/wp-content/uploads/2008/05/jim2.png'><img class="images" src="http://mitcho.com/blog/wp-content/uploads/2008/05/jim2-300x207.png" alt="" title="jim2" width="300" height="207" class="alignnone size-medium wp-image-306" /></a></p>

<p>Next is of course the jello scene. Dwight picks up the phone with the phone-appropriate 「もしもし」 (<em>moshimoshi</em>) and Jim asks where the stapler is. Dwight yells back 「バカ！」 (<em>baka</em>, &#8216;stupid!&#8217;) and Pam laughs, though in the stereotypical Japanese female&#8217;s high-pitched manner, appropriately covering her mouth (though Pam also actually does this in the original). Michael walks in and they all apologize, 「ごめんなさい」 <em>gomennasai</em>. Although the bowing is a bit excessive in a classic SNL parody way, the traditionally hierarchical status quo of Japanese offices is very succinctly reflected here.</p>

<p><a rel="lightbox[the-japanese-office]" href='http://mitcho.com/blog/wp-content/uploads/2008/05/dwight1.png'><img class="images" src="http://mitcho.com/blog/wp-content/uploads/2008/05/dwight1-300x216.png" alt="" title="dwight1" width="300" height="216" class="alignnone size-medium wp-image-307" /></a><a rel="lightbox[the-japanese-office]" href='http://mitcho.com/blog/wp-content/uploads/2008/05/dwight2.png'><img class="images" src="http://mitcho.com/blog/wp-content/uploads/2008/05/dwight2-300x191.png" alt="" title="dwight2" width="300" height="191" class="alignnone size-medium wp-image-308" /></a></p>

<p>The next scene is also from the first episode of The Office, where Michael introduces himself, 「僕は君たちのリージョナル・マネージャ」 (<em>boku-wa kimitachi-no regional manager</em>, &#8216;I am your Regional Manager&#8217;). Dwight states that he is the &#8220;Assistant Regional Manager&#8221; (アシスタント・リージョナル・マネージャ) and then is corrected, 「リージョナル・マネージャのアシスタントです」 (<em>regional manager-<strong>no</strong> assistant desu</em>, &#8216;[you are] Assistant <strong>to</strong> the Regional Manager&#8217;). It&#8217;s very cool to see how this back and forth translates beautifully, and to see these actors execute it with the right timing and effect in a foreign language. Michael asks 「どうしてここにいるのだ」 (<em>doushite koko-ni irunoda</em>, &#8216;why are [you] here?&#8217;) and leads them in the <a href="http://en.wikipedia.org/wiki/Radio taiso">radio exercises</a> (ラジオ体操), saying 「じゃ、ラジオ体操をしましょう」 (<em>jya, rajio taisou-wo shimashou</em>, &#8216;well then, let&#8217;s do the radio exercises&#8217;).</p>

<p>Here&#8217;s another beautiful cultural point. These &#8220;radio exercises&#8221; are real, as every <a href="http://en.wikipedia.org/wiki/Mori no Ike">Mori no Ike</a> villager knows: they are broadcast over public radio and TV every morning and are often done <em>en masse</em> at schools and some businesses. Pam then notes 「田中さんはみんなの健康を心配しています」 (<em>Tanaka-san-wa minna-no kenkou-wo shinpai-shite-imasu</em>, &#8216;Mr. Tanaka (=Michael) is concerned for everyone&#8217;s health&#8217;) in a conference-room reflection, and we see Stanley doing the crossword again.</p>

<p><object width="425" height="355"><param name="movie" value="http://www.youtube.com/v/xS92XkVKM0Q&#038;hl=en"></param><param name="wmode" value="transparent"></param><embed src="http://www.youtube.com/v/xS92XkVKM0Q&#038;hl=en" type="application/x-shockwave-flash" wmode="transparent" width="425" height="355"></embed></object></p>

<p>After hurting his shoulder and cooling it with some <a href="http://en.wikipedia.org/wiki/oolong">oolong</a> tea—the same bottle that was on Pam&#8217;s counter in the first scene—Michael reflects:</p>

<p>「今日はいい日でした。」 (<em>kyou-wa ii hi deshita</em>, &#8216;Today was a good day.&#8217;)<br />
「いい仕事をした。」 (<em>ii shigoto-wo shita</em>, &#8216;[I] did good work.&#8217;)<br />
「そう思う&#8230;かな？ はいはいはい！」 (<em>sou omou&#8230; kana? hai hai hai!</em>, &#8216;I think this way&#8230;? Yes yes yes!&#8217;)</p>

<p>The last line there is beautifully translated, capturing the essence of Michael in Japanese. As Japanese is a verb-final language, you literally say &#8220;blah blah blah I think&#8221; to mean &#8220;I think blah blah blah&#8221;, which may help explain the last phrase, 「そう思う」. Finally, the 「かな」 thrown in at the end turns the entire sentence, which was declarative up till then, into a question, which the bobble-head then answers. Brilliant!</p>

<p>In the final scene, Michael&#8217;s singing <a href="http://en.wikipedia.org/wiki/karaoke">karaoke</a> and Dwight yells 「かんぱい！」 (<em>kanpai!</em>, &#8216;bottoms up!&#8217;). The final credits list Sarah Sawyer and Hanna(h) Sawyer as producers&#8230; I wonder if they were actually involved with this Short or if they are made up as well.</p>

<p>The details in the Short are great: the little Hello Kitties and origami, orchid plant on the reception desk (Japanese love orchids—or wait, maybe that&#8217;s just my grandfather), and all the copy paper that had 「コピー用紙」 (<em>kopii-youshi</em>, &#8216;copy paper&#8217;) pasted on. There&#8217;s a <a href="http://en.wikipedia.org/wiki/bonsai">bonsai</a> tree on Jim&#8217;s desk and his spoon is replaced with chopsticks.</p>

<p>If you want to get picky, of course, there are many rough edges&#8230; the incorrect use of 「ステープラー」 (how you would say &#8220;stapler&#8221; in Japanese) in lieu of 「ホッチキス」,<sup id="fnref:1"><a href="#fn:1" rel="footnote">2</a></sup> some text being poorly typeset, etc. But overall, this SNL Digital Short was obviously written by someone with a solid (albeit stereotypical) understanding of Japanese culture and strong intermediate Japanese skills. If the goal had been simply to play off of Japanese stereotypes, accurate Japanese wouldn&#8217;t even have been necessary, and so I really appreciate the effort that went into this. In addition, Steve Carell et al&#8217;s delivery in a language they don&#8217;t speak is, in my opinion, commendable.</p>

<p>&#8220;It&#8217;s funny because it&#8217;s racist,&#8221; in the best way possible. Bravo!</p>

<div class="footnotes">
<hr />
<ol>

<li id="fn:2">
<p>With the exception of the 「お姫様」(&#8216;princess&#8217;) tampon ad&#8230; this is obviously targeting Japanese ads with random foreigners, like the crazy <a href="http://en.wikipedia.org/wiki/Bob Sapp">Bob Sapp</a> pizza commercials (below), but I honestly don&#8217;t think this five-second &#8220;ad&#8221; is funny, and it simply distracts from the rest of the piece.<br /><embed id="VideoPlayback" style="width:400px;height:326px" flashvars="" src="http://video.google.com/googleplayer.swf?docid=-6501830897084806455&#038;hl=en" type="application/x-shockwave-flash"> </embed>&#160;<a href="#fnref:2" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:1">
<p>See <a href="http://en.wikipedia.org/wiki/Hotchkiss">Hotchkiss</a> for an explanation.&#160;<a href="#fnref:1" rev="footnote">&#8617;</a></p>
</li>

</ol>
</div>
<p>Related posts:</p><ol>
<li><a href='http://mitcho.com/blog/life/taipei-find-a-dictionary-of-chinese-japanese-false-cognates/' rel='bookmark' title='Taipei find: a dictionary of Chinese-Japanese false cognates'>Taipei find: a dictionary of Chinese-Japanese false cognates</a></li>
<li><a href='http://mitcho.com/blog/life/bailey-won-the-japanese-language-speech-contest/' rel='bookmark' title='Bailey won the Japanese Language Speech Contest'>Bailey won the Japanese Language Speech Contest</a></li>
<li><a href='http://mitcho.com/blog/projects/mailplane-japanese-localization-available/' rel='bookmark' title='Mailplane Japanese localization available!'>Mailplane Japanese localization available!</a></li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/observation/the-japanese-office/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Testing Google&#8217;s Language Detection</title>
		<link>http://mitcho.com/blog/observation/testing-googles-language-detection/</link>
		<comments>http://mitcho.com/blog/observation/testing-googles-language-detection/#comments</comments>
		<pubDate>Sat, 17 May 2008 09:47:04 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[observation]]></category>
		<category><![CDATA[Chinese]]></category>
		<category><![CDATA[Chinese characters]]></category>
		<category><![CDATA[code]]></category>
		<category><![CDATA[computational linguistics]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Japanese language]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[language detection]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[Mandarin]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=254</guid>
		<description><![CDATA[As Google adds ten more languages to its machine translation service, it seems to be on its way to becoming the most convenient universal translator of the world&#8217;s popular languages. Google&#8217;s handling of languages of course isn&#8217;t perfect, however—in particular, I&#8217;ve been complaining to friends for a while about the weaknesses of Google&#8217;s handling of [...]
Related posts:<ol>
<li><a href='http://mitcho.com/blog/life/taipei-find-a-dictionary-of-chinese-japanese-false-cognates/' rel='bookmark' title='Taipei find: a dictionary of Chinese-Japanese false cognates'>Taipei find: a dictionary of Chinese-Japanese false cognates</a></li>
<li><a href='http://mitcho.com/blog/life/bailey-won-the-japanese-language-speech-contest/' rel='bookmark' title='Bailey won the Japanese Language Speech Contest'>Bailey won the Japanese Language Speech Contest</a></li>
<li><a href='http://mitcho.com/blog/link/setting-language-research-to-music/' rel='bookmark' title='Setting Language Research to Music'>Setting Language Research to Music</a></li>
</ol>

]]></description>
			<content:encoded><![CDATA[<p><img src="http://mitcho.com/blog/wp-content/uploads/2008/05/google-code.png" alt="google code" title="google-code" width="156" height="57" /></p>

<p>As <a href="http://googleblog.blogspot.com/2008/05/google-translate-adds-10-new-languages.html">Google adds ten more languages to its machine translation service</a>, it seems to be on its way to becoming the most convenient <a href="http://en.wikipedia.org/wiki/universal translator">universal translator</a> of the world&#8217;s popular languages. Google&#8217;s handling of languages of course isn&#8217;t perfect, however—in particular, I&#8217;ve been complaining to friends for a while about the weaknesses of Google&#8217;s handling of queries in Chinese character (<a href="http://en.wikipedia.org/wiki/Chinese characters">漢字/汉字</a>) scripts. In this post, I run some tests using Google&#8217;s <a href="http://googleblog.blogspot.com/2008/03/new-google-ajax-language-api-tools-for.html">Language Detection service</a> to try to better understand its handling of Chinese character queries.</p>

<h3>Background</h3>

<p>Chinese characters have been used all across East Asia, most notably in Chinese, Japanese, Korean, and Vietnamese (the &#8220;CJKV&#8221;). Prescriptivist writing reforms in Communist China and in Japan have simplified many characters, though. Some characters were simplified in the same way in both countries, some in different ways, and some in only one country. For more information, see <a href="http://en.wikipedia.org/wiki/Chinese character">Wikipedia</a> or <a href="http://books.google.com/books?id=htlttpi1KOoC">Ken Lunde&#8217;s CJKV Information Processing</a>.</p>

<h3>The problem</h3>

<p>The issue comes up when you search for a word written in Chinese characters that clearly comes from one particular Chinese character-using language. In my experience, <strong>Google doesn&#8217;t use the query to work out which language you are searching in, and returns many results in other Chinese character-using languages as well.</strong><sup id="fnref:1"><a href="#fn:1" rel="footnote">1</a></sup></p>

<p><span id="more-254"></span></p>

<p>Take, for example, a query like &#8220;七面鳥&#8221;, meaning &#8216;turkey&#8217; in Japanese. While all characters are very common in traditional Chinese (鳥 is simplified to 鸟 in China), the combination &#8220;七面鳥&#8221; is quite rare in Chinese. However, when you search for &#8220;七面鳥,&#8221; many of the first results are in Chinese and only two of the first ten results are in Japanese.</p>

<p>Is this because Google&#8217;s corpus doesn&#8217;t identify &#8220;七面鳥&#8221; as a primarily Japanese word? No: Google&#8217;s own hit counts attest that it is. Searching for &#8220;七面鳥&#8221; and limiting results to one language at a time yields the hit counts below. A similar effect can be seen with Japanese words such as &#8220;芝生&#8221; (&#8216;grass&#8217;) or &#8220;泥棒&#8221; (&#8216;burglar&#8217;). The &#8220;Japanese on first page&#8221; column gives the number of first-page results that are in Japanese when the search is made from the US with no language specified.</p>

<table style="margin-left: auto; margin-right: auto;">
<tr><th>&nbsp;</th><th>Chinese (simplified)</th><th> Chinese (traditional)</th><th> Japanese </th><th>Japanese on<br />first page</th></tr>
<tr><th>七面鳥</th><td>786</td><td>926</td><td>395,000</td><td>2/10</td></tr>
<tr><th>芝生</th><td>55,600</td><td>216,000</td><td>2,230,000</td><td>0/10</td></tr>
<tr><th>泥棒</th><td>13,500</td><td>22,500</td><td>10,400,000</td><td>3/10</td></tr>
</table>

<p>In a perfect world, I would like Google to <strong>identify the language that the query is in</strong>, and then <strong>rank results in that language higher</strong> in the results list. So the issue comes down to one of <strong>language detection</strong>.</p>

<iframe src="http://www.google.com/uds/samples/language/detect.html" width='400px' height="200px"></iframe>

<p>There are broadly two different approaches to language detection and, indeed, all natural language processing problems: <em>parsing</em> and <em>counting</em>. In this case, parsing involves trying to break apart the query into words and then computing how likely such a string of <em>words</em> is in each given language. Counting simply takes an inventory of the characters given and compares them to their frequencies in each language, computing how likely such a string of <em>characters</em> is in each language. Parsing is the &#8220;smarter&#8221; approach, but more difficult and computationally intensive.</p>
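
<p>To make the <em>counting</em> approach concrete, here is a minimal Python sketch. It is not Google&#8217;s implementation: the language labels, the <code>CHAR_PROBS</code> table, the <code>UNSEEN</code> floor, and the function names are all made up for illustration, and a real detector would estimate its character frequencies from large corpora.</p>

```python
import math
from collections import Counter

# Hypothetical per-language character probabilities, invented for this sketch.
# A real detector would estimate these from large corpora.
CHAR_PROBS = {
    "ja": {"木": 0.01, "気": 0.02, "氣": 0.0001, "气": 0.0001},
    "zh-Hant": {"木": 0.01, "気": 0.0001, "氣": 0.02, "气": 0.0001},
    "zh-Hans": {"木": 0.01, "気": 0.0001, "氣": 0.0001, "气": 0.02},
}
UNSEEN = 1e-6  # assumed floor probability for characters missing from a table


def count_scores(text):
    """Return a log-likelihood per language, based on character counts only."""
    counts = Counter(ch for ch in text if not ch.isspace())
    scores = {}
    for lang, probs in CHAR_PROBS.items():
        scores[lang] = sum(
            n * math.log(probs.get(ch, UNSEEN)) for ch, n in counts.items()
        )
    return scores


def detect(text):
    """Pick the language whose character table makes the text most likely."""
    scores = count_scores(text)
    return max(scores, key=scores.get)
```

<p>With this made-up table, <code>detect("気")</code> picks <code>"ja"</code> and <code>detect("氣")</code> picks <code>"zh-Hant"</code>. Because only character counts enter the score, spaces and character order cannot change the answer, which is the behavior a pure counting method would be expected to show.</p>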

<p>Google was kind enough to give us a <a href="http://googleblog.blogspot.com/2008/03/new-google-ajax-language-api-tools-for.html">language detection AJAX service</a> so we can get a sense for how their language detection works. This service also gives a &#8220;confidence&#8221; value on the detection result. For the rest of this entry, we&#8217;ll test some hypotheses against this service and conclude at the end.</p>

<h3>Do spaces matter?</h3>

<p><strong>No.</strong> While spaces are sometimes used in Japanese and Chinese writing to represent word boundaries, especially around numbers and roman letters, they also are seen on the web to encourage line breaks. It would make sense for Google&#8217;s language detection service to ignore spaces in Chinese character queries, and that does seem to be the case. All tests I ran with Chinese character queries gave the same result with the same confidence, with or without spaces in random places.</p>

<h3>Does order matter?</h3>

<p><strong>No.</strong> This was slightly disappointing to see. I took the Japanese string &#8220;骨粗鬆症&#8221; (&#8216;osteoporosis&#8217;, if you&#8217;re curious), ran every permutation against the language detector, and got the same results, including the same confidence values. This is a clear indicator that Google uses only counting, not parsing, in its detector.</p>
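
<p>The permutation test can be sketched in Python as follows. The two &#8220;detectors&#8221; here are hypothetical stand-ins, not Google&#8217;s service: each simply returns a summary of what it can see in the query, one looking only at character counts and the other at adjacent character pairs. The helper name <code>distinct_results</code> is likewise an assumption of this sketch.</p>

```python
from collections import Counter
from itertools import permutations


def distinct_results(detect, query):
    """Run `detect` on every ordering of the query's characters and
    return the set of distinct results."""
    return {detect("".join(p)) for p in permutations(query)}


def counting_detector(text):
    """Hypothetical stand-in: sees only how often each character occurs."""
    return tuple(sorted(Counter(text).items()))


def bigram_detector(text):
    """Hypothetical stand-in: sees adjacent character pairs, so order matters."""
    return tuple(sorted(Counter(zip(text, text[1:])).items()))


query = "骨粗鬆症"  # the four-character string used in the test above
print(len(list(permutations(query))))                    # 24 orderings
print(len(distinct_results(counting_detector, query)))   # 1 distinct result
print(len(distinct_results(bigram_detector, query)))     # 24 distinct results
```

<p>A detector that gives one and the same result across all 24 orderings behaves like the counting stand-in; anything that looked at sequences of characters would be expected to vary with the order.</p>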

<h3>Does repetition matter?</h3>

<p><strong>Yes.</strong> Now that it seems that Google does not use any parsing and only uses character frequencies in identifying the source language, let&#8217;s see how repetition can affect the detection service.</p>

<p>First, I took some Chinese character strings and ran them through the detection service with different numbers of repetitions, e.g. &#8220;参加&#8221;, &#8220;参加参加&#8221;, &#8220;参加参加参加&#8221;, &#8220;参加参加参加参加&#8221;&#8230; The queries I used were the following:</p>

<table style="margin-left: auto; margin-right: auto;">
<tr><th>&nbsp;</th><th>Chinese (traditional)</th><th>Japanese</th><th>Chinese (simplified)</th></tr>
<tr><th>木</th><td>X</td><td>X</td><td>X</td></tr>
<tr><th>漢字</th><td>X</td><td>X</td><td>&nbsp;</td></tr>
<tr><th>氣</th><td>X</td><td>&nbsp;</td><td>&nbsp;</td></tr>
<tr><th>參加</th><td>X</td><td>&nbsp;</td><td>&nbsp;</td></tr>
<tr><th>参加</th><td>&nbsp;</td><td>X</td><td>X</td></tr>
<tr><th>気</th><td>&nbsp;</td><td>X</td><td>&nbsp;</td></tr>
<tr><th>气</th><td>&nbsp;</td><td>&nbsp;</td><td>X</td></tr>
</table>

<p>For each token type, the detection service made up its mind quite quickly. Its confidence, however, was more interesting.</p>

<p><center><img src="http://mitcho.com/blog/wp-content/uploads/2008/05/picture-7.png" alt="" title="repetition vs. confidence" /></center></p>

<p>Each of the confidence values dips sharply after three, five, or ten repetitions. Note, however, the length of the tokens which dipped at each of those points. I interpret this to mean that <strong>a different algorithm is used for strings of fewer than ten characters and for strings of ten or more characters.</strong> However, the detection service did not change its answer after this point on any of the tokens.</p>

<p>Second, I took two characters, &#8220;簡&#8221; and &#8220;体,&#8221; and crossed different numbers of them together to see how that would affect the language detected. Note that &#8220;簡&#8221; is used in traditional Chinese and Japanese, while &#8220;体&#8221; is used in simplified Chinese and Japanese.</p>

<p><style type="text/css">
table .zh { background-color: #e3d2d2; }
table .zh-Hant { background-color: #d3e3d2; }
table .ja { background-color: #d5d2e3; }
</style></p>

<table style="margin-left:auto;margin-right:auto;">
<tr><th>&nbsp;</th><th>簡x0</th><th>簡x1</th><th>簡x2</th><th>簡x3</th><th>簡x4</th><th>簡x5</th><th>簡x6</th><th>簡x7</th><th>簡x8</th><th>簡x9</th></tr>
<tr><th>体x0</th><td>&nbsp;</td> <td class='zh'>0.995</td> <td class='zh'>0.998</td> <td class='zh'>0.998</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> </tr>
<tr><th>体x1</th><td class='zh-Hant'>0.995</td> <td class='ja'>0.998</td> <td class='ja'>0.998</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> </tr>
<tr><th>体x2</th><td class='zh-Hant'>0.998</td> <td class='ja'>0.998</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='zh'>0.531</td> </tr>
<tr><th>体x3</th><td class='zh-Hant'>0.998</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.52</td> <td class='ja'>0.568</td> </tr>
<tr><th>体x4</th><td class='zh-Hant'>0.999</td> <td class='zh-Hant'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.516</td> <td class='ja'>0.565</td> <td class='ja'>0.613</td> </tr>
<tr><th>体x5</th><td class='zh-Hant'>0.999</td> <td class='zh-Hant'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.512</td> <td class='ja'>0.561</td> <td class='ja'>0.609</td> <td class='ja'>0.657</td> </tr>
<tr><th>体x6</th><td class='zh-Hant'>0.999</td> <td class='zh-Hant'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.507</td> <td class='ja'>0.556</td> <td class='ja'>0.605</td> <td class='ja'>0.653</td> <td class='ja'>0.702</td> </tr>
<tr><th>体x7</th><td class='zh-Hant'>0.999</td> <td class='zh-Hant'>0.999</td> <td class='zh-Hant'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.502</td> <td class='ja'>0.551</td> <td class='ja'>0.6</td> <td class='ja'>0.649</td> <td class='ja'>0.697</td> <td class='ja'>0.746</td> </tr>
<tr><th>体x8</th><td class='zh-Hant'>0.999</td> <td class='zh-Hant'>0.999</td> <td class='zh-Hant'>0.999</td> <td class='ja'>1</td> <td class='ja'>0.545</td> <td class='ja'>0.595</td> <td class='ja'>0.644</td> <td class='ja'>0.693</td> <td class='ja'>0.741</td> <td class='ja'>0.79</td> </tr>
<tr><th>体x9</th><td class='zh-Hant'>0.999</td> <td class='zh-Hant'>0.999</td> <td class='zh-Hant'>1</td> <td class='ja'>0.539</td> <td class='ja'>0.589</td> <td class='ja'>0.638</td> <td class='ja'>0.687</td> <td class='ja'>0.736</td> <td class='ja'>0.785</td> <td class='ja'>0.834</td> </tr>
</table>

<table style="margin-left:auto;margin-right:auto;">
<tr><td class="ja">Japanese</td><td class='zh-Hant'>Chinese (traditional)</td><td class='zh'>Chinese (simplified)</td></tr>
</table>
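
<p>For anyone who wants to rebuild the grid of test queries, here is a rough Python sketch. The function names are my own for this illustration, and putting all the &#8220;簡&#8221; before all the &#8220;体&#8221; is an assumption; given the permutation result above, the order should not affect a counting-based detector.</p>

```python
from collections import Counter


def build_grid(max_count=9):
    """Map (number of 簡, number of 体) to the query string for that cell."""
    grid = {}
    for n_kan in range(max_count + 1):
        for n_tai in range(max_count + 1):
            if n_kan == 0 and n_tai == 0:
                continue  # the empty query has no cell in the table
            grid[(n_kan, n_tai)] = "簡" * n_kan + "体" * n_tai
    return grid


def character_shares(query):
    """Fraction of the query made up by each character: the only thing
    a pure counting detector has to go on."""
    counts = Counter(query)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


grid = build_grid()
print(len(grid))                       # 99 non-empty cells
print(character_shares(grid[(3, 1)]))  # {'簡': 0.75, '体': 0.25}
```

<p>Each query string would then be sent to the detection service and the returned language and confidence recorded in the matching cell.</p>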

<h3>Conclusion</h3>

<p>For Chinese character-based languages, Google&#8217;s language detection algorithm uses simple counting rather than parsing, identifying languages by looking at the <em>frequency of characters</em> rather than the <em>frequency of words</em>. As such, the algorithm simply acts as a <strong>script detector, not a language detector.</strong> Moreover, as a simple counting method is used, duplicating characters used in one language but not another can very easily skew the resulting output.</p>

<p>As a trivial aside, it seems that Google&#8217;s algorithm is slightly different for strings shorter than ten characters, as can be seen in the dip and subsequent rise in confidence values after ten characters.</p>

<div class="footnotes">
<hr />
<ol>

<li id="fn:1">
<p>Just to complicate matters further, there&#8217;s also the issue of where you&#8217;re accessing Google from. For example, accessing from the US (or via my friend <a href="http://support.uchicago.edu/docs/network/vpn/">VPN</a>), a query for the Japanese-simplified &#8220;天気&#8221; seems to only return Japanese pages. However, accessing from Taiwan, Google assumes you may have meant the full-form &#8220;天氣&#8221;, giving you pages with both &#8220;天気&#8221; and &#8220;天氣&#8221;. As a result, Yahoo Japan weather is the first result from the US and third from Taiwan, while Yahoo Taiwan weather is first in Taiwan and doesn&#8217;t even show up from the US. This default character substitution in Taiwan is one of my least-favorite Google &#8220;features.&#8221;<br /><a rel="lightbox[google]" href='http://mitcho.com/blog/wp-content/uploads/2008/05/picture-1.png'><img class="images" src="http://mitcho.com/blog/wp-content/uploads/2008/05/picture-1-300x256.png" alt="" title="picture-1"/></a><a rel="lightbox[google]" href='http://mitcho.com/blog/wp-content/uploads/2008/05/picture-2.png'><img class="images" src="http://mitcho.com/blog/wp-content/uploads/2008/05/picture-2-300x256.png" alt="" title="picture-2"/></a><br />Similar effects can most likely be seen between the US and China. In the rest of this post, all queries will be made from the US.&#160;<a href="#fnref:1" rev="footnote">&#8617;</a></p>
</li>

</ol>
</div>
<p>Related posts:</p><ol>
<li><a href='http://mitcho.com/blog/life/taipei-find-a-dictionary-of-chinese-japanese-false-cognates/' rel='bookmark' title='Taipei find: a dictionary of Chinese-Japanese false cognates'>Taipei find: a dictionary of Chinese-Japanese false cognates</a></li>
<li><a href='http://mitcho.com/blog/life/bailey-won-the-japanese-language-speech-contest/' rel='bookmark' title='Bailey won the Japanese Language Speech Contest'>Bailey won the Japanese Language Speech Contest</a></li>
<li><a href='http://mitcho.com/blog/link/setting-language-research-to-music/' rel='bookmark' title='Setting Language Research to Music'>Setting Language Research to Music</a></li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/observation/testing-googles-language-detection/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>White Protestants and Catholics don&#8217;t frequently attend religious services</title>
		<link>http://mitcho.com/blog/observation/white-protestants-and-catholics-dont-frequently-attend-religious-services/</link>
		<comments>http://mitcho.com/blog/observation/white-protestants-and-catholics-dont-frequently-attend-religious-services/#comments</comments>
		<pubDate>Wed, 13 Feb 2008 02:27:34 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[observation]]></category>
		<category><![CDATA[election]]></category>
		<category><![CDATA[entailment]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[politics]]></category>
		<category><![CDATA[religion]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/2008/02/13/white-protestants-and-catholics-dont-frequently-attend-religious-services/</guid>
		<description><![CDATA[Breaking news from the Potomac Primaries: White Protestants and Catholics backed Mrs. Clinton, but Mr. Obama was strongly supported by voters who frequently attend religious services. Seeing as backing Mrs. Clinton and supporting Mr. Obama are, in terms of votes, mutually exclusive, this sentence entails that white Protestants and Catholics (the majority of ) are [...]
Related posts:<ol>
<li><a href='http://mitcho.com/blog/observation/obama-for-taiwan-2008/' rel='bookmark' title='Obama for Taiwan 2008'>Obama for Taiwan 2008</a></li>
</ol>

]]></description>
			<content:encoded><![CDATA[<p>Breaking news from <a href="http://thecaucus.blogs.nytimes.com/2008/02/12/live-blogging-the-potomac-primaries/">the Potomac Primaries</a>:</p>

<blockquote>
  <p>White Protestants and Catholics backed Mrs. Clinton, but Mr. Obama was strongly supported by voters who frequently attend religious services.</p>
</blockquote>

<p>Seeing as backing Mrs. Clinton and supporting Mr. Obama are, in terms of votes, mutually exclusive, this sentence entails that white Protestants and Catholics (the majority of ) are not a part of &#8220;voters who frequently attend religious services&#8221;, as is demonstrated by the infelicity of the following sentence:</p>

<p>&#8220;Group A did A, and Group B did not do A — but Group A is part of Group B.&#8221;</p>

<p>Well, that just settles it then.</p>
<p>Related posts:</p><ol>
<li><a href='http://mitcho.com/blog/observation/obama-for-taiwan-2008/' rel='bookmark' title='Obama for Taiwan 2008'>Obama for Taiwan 2008</a></li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/observation/white-protestants-and-catholics-dont-frequently-attend-religious-services/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>iTunes Movie Rentals: the movies you watch once?</title>
		<link>http://mitcho.com/blog/observation/itunes-movie-rentals-the-movies-you-watch-once/</link>
		<comments>http://mitcho.com/blog/observation/itunes-movie-rentals-the-movies-you-watch-once/#comments</comments>
		<pubDate>Thu, 17 Jan 2008 10:16:34 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[observation]]></category>
		<category><![CDATA[iTunes]]></category>
		<category><![CDATA[keynote]]></category>
		<category><![CDATA[movie]]></category>
		<category><![CDATA[Steve Jobs]]></category>
		<category><![CDATA[tech]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/2008/01/17/itunes-movie-rentals-the-movies-you-watch-once/</guid>
		<description><![CDATA[Yesterday Steve Jobs introduced, among other things, iTunes movie rentals. Rent a movie and download it over broadband. You then have 30 days to start the film, and then 24 hours to finish it before it turns into a pumpkin. A lot of people are complaining about the 24 hours, including some with good reason [...]
Related posts:<ol>
<li><a href='http://mitcho.com/blog/observation/great-news-you-can-opt-out-from-omnitures-1921681122o7net/' rel='bookmark' title='Great news! You can opt-out from Omniture&#8217;s 192.168.112.2o7.net'>Great news! You can opt-out from Omniture&#8217;s 192.168.112.2o7.net</a></li>
</ol>

]]></description>
			<content:encoded><![CDATA[<p>Yesterday Steve Jobs <a href="http://www.apple.com/pr/library/2008/01/15itunes.html?sr=hotnews">introduced, among other things, iTunes movie rentals</a>. Rent a movie and download it over broadband. You then have 30 days to start the film, and then 24 hours to finish it before it turns into a pumpkin. A lot of people are complaining about the 24 hours, including <a href="http://pogue.blogs.nytimes.com/2008/01/15/the-27-hour-day/">some with good reason</a> and apparently many who have kids.</p>

<p>So why rental? Thus spoke Steve: &#8220;Your favorite movie&#8230; most of us watch movies once&#8230; maybe a few times.&#8221;<sup id="fnref:1"><a href="#fn:1" rel="footnote">1</a></sup> Currently number eight on the top rentals is one of <a href="http://en.wikipedia.org/wiki/Paul Sally">Paul Sally&#8217;s</a> favorites, <a href="http://www.amazon.com/gp/product/B00000K0DT?ie=UTF8&#038;tag=mitchocom-20&#038;linkCode=as2&#038;camp=1789&#038;creative=390957&#038;creativeASIN=B00000K0DT"><em>The Usual Suspects</em></a><img src="http://www.assoc-amazon.com/e/ir?t=mitchocom-20&#038;l=as2&#038;o=1&#038;a=B00000K0DT" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" />. From the iTunes Store description:</p>

<blockquote>
  <p>There are a handful of movies that demand a second viewing—because they&#8217;re so good, or because a surprise ending gives every scene a new meaning when it&#8217;s watched a second time. <em>The Usual Suspects</em> is both.</p>
</blockquote>

<div class="footnotes">
<hr />
<ol>

<li id="fn:1">
<p>23:45 into the keynote.&#160;<a href="#fnref:1" rev="footnote">&#8617;</a></p>
</li>

</ol>
</div>
<p>Related posts:</p><ol>
<li><a href='http://mitcho.com/blog/observation/great-news-you-can-opt-out-from-omnitures-1921681122o7net/' rel='bookmark' title='Great news! You can opt-out from Omniture&#8217;s 192.168.112.2o7.net'>Great news! You can opt-out from Omniture&#8217;s 192.168.112.2o7.net</a></li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/observation/itunes-movie-rentals-the-movies-you-watch-once/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Patricks Nortons on Tekzillaz</title>
		<link>http://mitcho.com/blog/observation/patricks-nortons-on-tekzillaz/</link>
		<comments>http://mitcho.com/blog/observation/patricks-nortons-on-tekzillaz/#comments</comments>
		<pubDate>Wed, 09 Jan 2008 15:42:49 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[observation]]></category>
		<category><![CDATA[California]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[plural]]></category>
		<category><![CDATA[tech]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/2008/01/09/patricks-nortons-on-tekzillaz/</guid>
		<description><![CDATA[I just noticed something on the latest Tekzilla Daily: Patrick Norton, host of Tekzilla and former host of the Screen Savers says &#8220;there&#8217;s a lots to learn here&#8221; (1:28) and then later &#8220;the site you&#8217;re having troubles with&#8221; (1:39). While &#8220;having troubles with&#8230;&#8221; is fine, I believe &#8220;having trouble with&#8230;&#8221; is much more common. As [...]
]]></description>
			<content:encoded><![CDATA[<p>I just noticed something on the latest <a href="http://revision3.com/tekzilla/tzdaily/2008-01-09ping">Tekzilla Daily</a>: <a href="http://en.wikipedia.org/wiki/Patrick Norton">Patrick Norton</a>, host of <a href="http://www.tekzilla.com">Tekzilla</a> and former host of <a href="http://en.wikipedia.org/wiki/the Screen Savers">the Screen Savers</a> says &#8220;there&#8217;s a lot<em>s</em> to learn here&#8221; (1:28) and then later &#8220;the site you&#8217;re having trouble<em>s</em> with&#8221; (1:39). While &#8220;having troubles with&#8230;&#8221; is fine, I believe &#8220;having trouble with&#8230;&#8221; is much more common. As for &#8220;a lots to learn,&#8221; however, that&#8217;s definitely out. Is it hyperarticulation? I don&#8217;t know.</p>

<p>Wikipedia notes: &#8220;Norton grew up in the <a href="http://en.wikipedia.org/wiki/Midwest">Midwest</a>, but considers the <a href="http://en.wikipedia.org/wiki/Jersey Shore">Jersey Shore</a> his home&#8230; He currently lives in <a href="http://en.wikipedia.org/wiki/San Francisco, California">San Francisco, California</a>.&#8221; So, is this a Jersey Shore or California thing? I have no idea.</p>
]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/observation/patricks-nortons-on-tekzillaz/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

