<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>mitcho.com &#187; language</title>
	<atom:link href="http://mitcho.com/blog/tag/language/feed/" rel="self" type="application/rss+xml" />
	<link>http://mitcho.com</link>
	<description></description>
	<lastBuildDate>Fri, 10 Feb 2012 23:24:05 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.4-alpha-19719</generator>
		<item>
		<title>Ubiquity Localization Update</title>
		<link>http://mitcho.com/blog/projects/ubiquity-localization-update/</link>
		<comments>http://mitcho.com/blog/projects/ubiquity-localization-update/#comments</comments>
		<pubDate>Fri, 12 Jun 2009 10:35:23 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[projects]]></category>
		<category><![CDATA[code]]></category>
		<category><![CDATA[commands]]></category>
		<category><![CDATA[gettext]]></category>
		<category><![CDATA[i18n]]></category>
		<category><![CDATA[internationalization]]></category>
		<category><![CDATA[JavaScript]]></category>
		<category><![CDATA[l10n]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[localization]]></category>
		<category><![CDATA[Mozilla Planet]]></category>
		<category><![CDATA[ubiquity]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=2219</guid>
		<description><![CDATA[As we move closer and closer to shipping Ubiquity 0.5, there is still much work to be done, particularly in the area of localization. In a recent Ubiquity meeting we laid out the explicit localization goals and non-goals of 0.5 as follows: Goals for 0.5 Parser 2 (on by default) underlying support for localization of [...]
Related posts:<ol>
<li><a href='http://mitcho.com/blog/projects/big-issues-and-small-issues-with-parser-2/' rel='bookmark' title='Big Issues and Small Issues with Parser 2'>Big Issues and Small Issues with Parser 2</a></li>
<li><a href='http://mitcho.com/blog/projects/localizing-ubiquity-commands-and-nountypes/' rel='bookmark' title='Localizing Ubiquity: commands and nountypes'>Localizing Ubiquity: commands and nountypes</a></li>
<li><a href='http://mitcho.com/blog/projects/ubiquity-commands-by-the-numbers/' rel='bookmark' title='Ubiquity Commands by The Numbers'>Ubiquity Commands by The Numbers</a></li>
</ol>

Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.]]></description>
			<content:encoded><![CDATA[<p>As we move closer and closer to shipping Ubiquity 0.5, <a href="https://labs.mozilla.com/2009/05/ubiquity-05-call-for-participation/">there is still much work to be done</a>, particularly in the area of localization. <a href="https://wiki.mozilla.org/Labs/Ubiquity/Meetings/2009-05-27_Weekly_Meeting">In a recent Ubiquity meeting</a> we laid out the explicit localization goals and non-goals of 0.5 as follows:</p>

<ul>
<li>Goals for 0.5

<ul>
<li>Parser 2 (on by default)</li>
<li>underlying support for localization of commands</li>
<li>localization of standard feed commands for a few languages</li>
<li>Parser 2 language files for those same languages</li>
</ul></li>
<li>Nongoals for 0.5

<ul>
<li>distribution/sharing of localizations</li>
<li>localization of nountypes </li>
</ul></li>
</ul>

<p>The overall goal for this release of Ubiquity is to come up with a format and standard for localization. Localizations in Ubiquity 0.5 will only apply to commands bundled with Ubiquity, and the localization files themselves will be distributed with Ubiquity. In a future release we will tackle the problem of localizations for <a href="https://ubiquity.mozilla.com/herd/">commands in the wild</a> and truly crowd-source<sup id="fnref:1"><a href="#fn:1" rel="footnote">1</a></sup> this process.</p>

<p><span id="more-2219"></span></p>

<h3>Localization Architecture</h3>

<p>The localization of Ubiquity commands will use a <a href="http://en.wikipedia.org/wiki/gettext">gettext</a>-style approach where localization files list key-value pairs for different properties and messages of the commands. For Ubiquity 0.5, where we only deal with the standard command feeds bundled with Ubiquity, we can simply place all the localization files in <a href="https://ubiquity.mozilla.com/hg/ubiquity-firefox/file/tip/ubiquity/standard-feeds/localization"><code>ubiquity/standard-feeds/localization</code></a>. Localization files are organized by source feed, with one localization file per source feed, per language.</p>

<p>The localizable components of commands will include the <code>names</code>, <code>contributors</code>, and <code>help</code> properties, as well as any localizable strings in the command&#8217;s <code>preview()</code> and <code>execute()</code> methods. To make strings localizable in <code>preview()</code> and <code>execute()</code>, they must be wrapped in the localize function, <code>_()</code>.<sup id="fnref:2"><a href="#fn:2" rel="footnote">2</a></sup></p>
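<p>To make the idea of wrapping strings concrete, here is a minimal sketch of how a <code>_()</code>-style function could behave. It is not Ubiquity&#8217;s actual implementation: the <code>translations</code> table, the choice to key it by the original English string, and the sample Japanese string are all assumptions for illustration. The sketch falls back to the original string when no translation exists and handles only sequential <code>%S</code> placeholders.</p>

```javascript
// Hypothetical lookup table; a real one would be loaded from a localization file.
const translations = {
  'This is a %S.': 'これは%Sです。'
};

// Sketch of a localize function: look up the message, fall back to the
// original string, then fill each %S placeholder from args in order.
function _(message, args = []) {
  const hasTranslation = Object.prototype.hasOwnProperty.call(translations, message);
  const template = hasTranslation ? translations[message] : message;
  let next = 0;
  return template.replace(/%S/g, () =>
    next < args.length ? String(args[next++]) : '%S'
  );
}

const localized = _('This is a %S.', ['test']);
const untranslated = _('No translation for %S yet.', ['this']);
```

<p>Falling back to the original string means a command stays usable even when a localization file is incomplete.</p>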

<p>Other localizable components, like <code>names</code>, <code>contributors</code>, and <code>help</code>, will not need to be wrapped in the <code>_()</code> function. In addition, as the localization files can only hold string values, properties that take multiple values, such as names and contributors, can use the delimiter <code>|</code> to separate them.</p>

<p><pre code='javascript'>zoom.names=ズーム|ズームして|ズームする|ズームしろ</pre></p>
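<p>As a rough sketch of how such a line could be read back, the snippet below splits a <code>key=value</code> line and breaks multi-valued properties apart on <code>|</code>. The helper name <code>parseLocalizedLine</code> and the rule that only keys ending in <code>.names</code> or <code>.contributors</code> are split are assumptions, not part of Ubiquity.</p>

```javascript
// Hypothetical helper: parse one "key=value" line from a localization file.
// Keys ending in .names or .contributors are treated as "|"-delimited lists.
function parseLocalizedLine(line) {
  const eq = line.indexOf('=');
  if (eq === -1) return null; // not a key-value line
  const key = line.slice(0, eq).trim();
  const value = line.slice(eq + 1).trim();
  const multiValued = /\.(names|contributors)$/.test(key);
  return { key: key, value: multiValued ? value.split('|') : value };
}

const zoomNames = parseLocalizedLine('zoom.names=ズーム|ズームして|ズームする|ズームしろ');
const zoomHelp = parseLocalizedLine('zoom.help=Zooms the page in or out.');
```

<p>Here <code>zoomNames.value</code> is an array of four names, while <code>zoomHelp.value</code> stays a single string.</p>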

<h3>The Localization Experience</h3>

<p><a href="https://ubiquity.mozilla.com/trac/ticket/739">One tool we have planned</a> to help kickstart the localization process will automatically create a template of the strings that need localization in a user&#8217;s commands. I took a first stab at this tool today. Clicking on the &#8220;get localization template&#8221; link next to each feed in the <a href="chrome://ubiquity/content/cmdlist.html">Ubiquity command list</a> will give you a template which you can then copy into a text file:</p>

<p><a class='limages' href='http://mitcho.com/blog/wp-content/uploads/2009/06/localization-template.png' rel='lightbox'><img src="http://mitcho.com/blog/wp-content/uploads/2009/06/localization-template-smaller.png" alt="localization-template-smaller.png" border="0" width="600" height="437" /></a></p>

<p>Later, instructions will be added to this page explaining how and where to save localizations in order to test them. Alternatively, we could add a button that automatically saves the template in the right location.</p>

<h3>Open Questions</h3>

<h4>Localization file formats</h4>

<p>There are two kinds of file formats for localizations we are considering: <a href="https://developer.mozilla.org/En/XUL_Tutorial/Property_Files"><code>.properties</code></a> and <code>.po</code>, the native <a href="http://en.wikipedia.org/wiki/gettext">gettext</a> format. As an example, here is the same key-value pair in the two formats:</p>

<h5><code>.properties</code>:</h5>

<p><pre># This is a comment
welcomeMessage=Hello, world!</pre></p>

<h5><code>.po</code>:</h5>

<p><pre>#. This is a comment (the . is actually optional)
msgid "welcomeMessage"
msgstr "Hello, world!"</pre></p>
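<p>To show that both formats carry the same information, here is a simplified sketch that reads each one into a plain key-value object. The function names <code>parseProperties</code> and <code>parsePo</code> are hypothetical, and both parsers are deliberately minimal: they assume one entry per line and ignore escapes, multi-line strings, plural forms, and other features of the real formats.</p>

```javascript
// Minimal .properties reader: skips blank lines and comments,
// splits each remaining line at the first "=".
function parseProperties(text) {
  const entries = {};
  for (const raw of text.split('\n')) {
    const line = raw.trim();
    if (line === '' || line.startsWith('#') || line.startsWith('!')) continue;
    const eq = line.indexOf('=');
    if (eq === -1) continue;
    entries[line.slice(0, eq).trim()] = line.slice(eq + 1).trim();
  }
  return entries;
}

// Minimal .po reader: pairs each single-line msgid with the msgstr after it.
function parsePo(text) {
  const entries = {};
  let currentId = null;
  for (const raw of text.split('\n')) {
    const line = raw.trim();
    const idMatch = /^msgid\s+"(.*)"$/.exec(line);
    const strMatch = /^msgstr\s+"(.*)"$/.exec(line);
    if (idMatch) {
      currentId = idMatch[1];
    } else if (strMatch && currentId !== null) {
      entries[currentId] = strMatch[1];
      currentId = null;
    }
  }
  return entries;
}

const fromProperties = parseProperties(
  ['# This is a comment', 'welcomeMessage=Hello, world!'].join('\n')
);
const fromPo = parsePo(
  ['#. This is a comment', 'msgid "welcomeMessage"', 'msgstr "Hello, world!"'].join('\n')
);
```

<p>Both calls produce an object with the single key <code>welcomeMessage</code>, so the choice of format need not affect the rest of the localization code.</p>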

<p>The advantage of <code>.properties</code> over <code>.po</code> is that Mozilla natively supports this format with an XUL/XPCOM interface called <a href="https://developer.mozilla.org/En/XUL/Stringbundle">stringbundle</a>, which is also what Firefox itself uses to localize its JavaScript. We actually already have the <code>_()</code> localization function working with the properties file format, following <a href="http://www.xuldev.org/blog/?p=45">gomita&#8217;s great instructions</a> (Japanese) on how to load properties files using Mozilla&#8217;s native <a href="https://developer.mozilla.org/En/XUL/Stringbundle">stringbundle</a> tools via JavaScript.</p>

<p>The advantage of <code>.po</code> over <code>.properties</code> is that it is the de-facto standard in localization, particularly in the UNIX world. Lots of great tools have been written for it. The adoption of <code>.po</code> could make Ubiquity localization more accessible for more people. Another advantage is that <code>.po</code> files can have keys with spaces, as I note below.</p>

<p>If we do opt to work with <code>.po</code> files, the two libraries I see out in the wild for dealing with <code>.po</code> files are <a href="http://code.google.com/p/gettext-js/">gettext-js</a> (MIT) and <a href="http://jsgettext.berlios.de/">jsgettext</a> (LGPL). While I haven&#8217;t looked at the libraries in depth yet, so far jsgettext seems to be the winner, as some sections of gettext-js require the use of the <a href="http://www.prototypejs.org/">prototype.js</a> library.</p>

<h4>A &#8220;key&#8221; question</h4>

<p><img src="http://mitcho.com/blog/wp-content/uploads/2009/06/icanhaskeyplz.jpg" alt="icanhaskeyplz.jpg" border="0" width="650" height="416" /></p>

<p>In either file format, we need a unique way to refer to each localizable string—a key format. As each localization file refers to a command feed, the first collision we must avoid is the command name. With this in mind, we can come up with some trivial keys for the localizable properties: (here, consider the command <code>hello</code>)</p>

<ul>
<li><code>hello.names</code></li>
<li><code>hello.contributors</code></li>
<li><code>hello.help</code></li>
</ul>

<p>However, we run into difficulty when we try to come up with keys for the arbitrary text in <code>preview</code>s and <code>execute</code>s. For example, for a message like &#8220;Hello world!&#8221; in the preview, we could simply make the key <code>hello.preview.Hello world!</code> but this may be unwieldy and prone to typos. In addition, keys in <code>.properties</code> files cannot contain certain characters, like spaces, so we would have to make the key something like <code>hello.preview.Hello_world!</code> or, stripping symbols and standardizing case, <code>hello.preview.HELLO_WORLD</code>.</p>

<p>Keys could also get very long with this type of key format, although here again <code>.po</code> files may have an advantage as they can stay relatively more legible even with long keys. One option to deal with this would be to optionally supply a key argument to <code>_()</code> so that it is used instead of the automatic key. For example, suppose the <code>hello</code> command&#8217;s <code>preview()</code> included this code:</p>


<div class="wp_syntax"><div class="code"><pre class="javascript" style="font-family:monospace;">_<span style="color: #009900;">&#40;</span><span style="color: #3366CC;">'This is a really long greeting message. Hello there!'</span><span style="color: #339933;">,</span><span style="color: #3366CC;">'longmessage'</span><span style="color: #009900;">&#41;</span></pre></div></div>


<p>then a localizer would only have to refer to <code>hello.preview.longmessage</code>, not <code>hello.preview.THIS_IS_A_REALLY_LONG_GREETING_MESSAGE_HELLO_THERE</code>.</p>
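<p>The key scheme described above can be sketched as follows. The helper names <code>autoKey</code> and <code>makeKey</code> are hypothetical, and the normalization rule (uppercase, runs of non-alphanumeric characters collapsed to <code>_</code>) is just one way to get the example keys shown in this post. Note that this rule only suits ASCII source strings.</p>

```javascript
// Derive an automatic key from a message: uppercase it, collapse every run
// of characters other than A-Z and 0-9 into "_", and trim stray underscores.
function autoKey(message) {
  return message
    .toUpperCase()
    .replace(/[^A-Z0-9]+/g, '_')
    .replace(/^_+|_+$/g, '');
}

// Build the full key; an explicit customKey replaces the automatic one.
function makeKey(commandName, method, message, customKey) {
  return [commandName, method, customKey || autoKey(message)].join('.');
}

const longGreeting = 'This is a really long greeting message. Hello there!';
const shortKey = makeKey('hello', 'preview', 'Hello world!');
const automaticKey = makeKey('hello', 'preview', longGreeting);
const explicitKey = makeKey('hello', 'preview', longGreeting, 'longmessage');
```

<p>With these inputs, <code>shortKey</code> is <code>hello.preview.HELLO_WORLD</code> and <code>explicitKey</code> is <code>hello.preview.longmessage</code>.</p>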

<p><a href="http://twitter.com/m_satyr">satyr</a> points out that some commands use another function to incorporate similar actions and messages in both <code>preview()</code> and <code>execute()</code>. In this case, he argues, it wouldn&#8217;t make sense to have to keep both localizations (<code>hello.preview.</code>&#8230; and <code>hello.execute.</code>&#8230;). He suggests that optional keys (mentioned above) could be used without the <code>preview.</code> or <code>execute.</code> infixes, as in <code>hello.longmessage</code>. By taking out the <code>preview</code> and <code>execute</code> namespacing in the localization keys, though, it becomes the command author&#8217;s responsibility to avoid custom keys such as &#8220;names&#8221; or &#8220;help&#8221;, which would collide with the keys of the built-in properties and have unintended consequences.</p>
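<p>One way to soften that risk would be to reject reserved names when building un-namespaced keys. The sketch below is only an illustration of the idea: the function <code>sharedKey</code> and the exact reserved list are assumptions, not an agreed design.</p>

```javascript
// Property names that already have a meaning as localization keys.
const RESERVED_KEYS = ['names', 'contributors', 'help'];

// Build a key shared by preview() and execute(), e.g. "hello.longmessage",
// refusing custom keys that would collide with a built-in property.
function sharedKey(commandName, customKey) {
  if (RESERVED_KEYS.includes(customKey)) {
    throw new Error('"' + customKey + '" is reserved and cannot be used as a message key');
  }
  return commandName + '.' + customKey;
}

const greetingKey = sharedKey('hello', 'longmessage');
```

<p>Calling <code>sharedKey('hello', 'names')</code> throws, so the collision is caught when the command is written, not when a localizer runs into it.</p>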

<h3>Conclusion</h3>

<p>I hope that this blog post gives people an idea of the progress we&#8217;ve made in the localization area and gets people thinking about the challenges we still face. <strong>We&#8217;d love to get your feedback on the localization format and process in Ubiquity, as well as the open problems of the file format and keys.</strong></p>

<div class="footnotes">
<hr />
<ol>

<li id="fn:1">
<p>Or &#8220;cloud-source&#8221;&#8230; finally a Japanese accent joke that&#8217;s semantically stable!&#160;<a href="#fnref:1" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:2">
<p>This function currently also has the ability to do simple <a href="http://en.wikipedia.org/wiki/printf">printf</a>-formatted string replacements:<br/>
<pre code='javascript'>_('This is a %S.',['test'])</pre>
<br/>Whether this format will replace support for <code>CmdUtils.renderTemplate</code> remains to be seen and is definitely worthy of discussion. If we move away from <a href="https://developer.mozilla.org/En/XUL_Tutorial/Property_Files">properties files</a>, in particular, we may keep <code>renderTemplate()</code> in lieu of the <a href="http://en.wikipedia.org/wiki/printf">printf</a> format. Mozilla&#8217;s built-in <a href="https://developer.mozilla.org/En/XUL/Stringbundle">stringbundle handling</a> just gave us a fast and free implementation of <a href="http://en.wikipedia.org/wiki/printf">printf</a>-style replacement.&#160;<a href="#fnref:2" rev="footnote">&#8617;</a></p>
</li>

</ol>
</div>
]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/projects/ubiquity-localization-update/feed/</wfw:commentRss>
		<slash:comments>17</slash:comments>
		</item>
		<item>
		<title>Ubiquity presentation at Tokyo 2.0</title>
		<link>http://mitcho.com/blog/projects/ubiquity-presentation-at-tokyo-20/</link>
		<comments>http://mitcho.com/blog/projects/ubiquity-presentation-at-tokyo-20/#comments</comments>
		<pubDate>Wed, 10 Jun 2009 09:54:13 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[life]]></category>
		<category><![CDATA[projects]]></category>
		<category><![CDATA[bilingual]]></category>
		<category><![CDATA[code]]></category>
		<category><![CDATA[demo]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[events]]></category>
		<category><![CDATA[GoaP]]></category>
		<category><![CDATA[Japan]]></category>
		<category><![CDATA[Japanese language]]></category>
		<category><![CDATA[JavaScript]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[Mozilla Planet]]></category>
		<category><![CDATA[parser]]></category>
		<category><![CDATA[screencast]]></category>
		<category><![CDATA[Tokyo]]></category>
		<category><![CDATA[ubiquity]]></category>
		<category><![CDATA[video]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=2203</guid>
		<description><![CDATA[This past Monday I presented at Tokyo 2.0, Japan&#8217;s largest bilingual web/tech community. I presented as part of a session on The Web and Language, which I also helped organize. Other presenters included Junji Tomita from goo Labs, Shinjyou Sunao of Knowledge Creation, developers of the Voice Delivery System API, and Chris Salzberg of Global [...]
Related posts:<ol>
<li><a href='http://mitcho.com/blog/projects/changes-to-ubiquity-parser-2-and-the-playpen/' rel='bookmark' title='Changes to Ubiquity Parser 2 and the Playpen'>Changes to Ubiquity Parser 2 and the Playpen</a></li>
<li><a href='http://mitcho.com/blog/projects/foxkeh-demos-ubiquity-parser-the-next-generation/' rel='bookmark' title='Foxkeh demos Ubiquity Parser: The Next Generation'>Foxkeh demos Ubiquity Parser: The Next Generation</a></li>
<li><a href='http://mitcho.com/blog/life/notes-from-barcamp-tokyo-2009/' rel='bookmark' title='Notes from BarCamp Tokyo 2009'>Notes from BarCamp Tokyo 2009</a></li>
</ol>

Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.]]></description>
			<content:encoded><![CDATA[<p><img src="http://mitcho.com/blog/wp-content/uploads/2009/06/t2p01.png" alt="T2P0.PNG" border="0" width="211" height="120" /></p>

<p>This past Monday I presented at <a href="http://www.tokyo2point0.net/events/tokyo-20-25-the-web-language">Tokyo 2.0</a>, Japan&#8217;s largest bilingual web/tech community. I presented as part of a session on The Web and Language, which I also helped organize. Other presenters included Junji Tomita from <a href="http://labs.goo.ne.jp/intl/">goo Labs</a>, Shinjyou Sunao of <a href="http://www.knowlec.com/">Knowledge Creation</a>, developers of the <a href="http://www.vdsapi.ne.jp/">Voice Delivery System</a> API, and <a href="http://globalvoicesonline.org/author/chris-salzberg/">Chris Salzberg</a> of <a href="http://globalvoicesonline.org/">Global Voices Online</a> on community translation.</p>

<p>I just put together a video of my Ubiquity presentation, mixing <a href="http://www.ustream.tv/recorded/1625213">the audio recorded live</a> at the presentation together with a screencast of my slides for better visibility. The presentation is 10 minutes long and is bilingual, English and Japanese.</p>

<p><object width="649" height="365"><param name="allowfullscreen" value="true" /><param name="allowscriptaccess" value="always" /><param name="movie" value="http://vimeo.com/moogaloop.swf?clip_id=5091071&amp;server=vimeo.com&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=00ADEF&amp;fullscreen=1" /><embed src="http://vimeo.com/moogaloop.swf?clip_id=5091071&amp;server=vimeo.com&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=00ADEF&amp;fullscreen=1" type="application/x-shockwave-flash" allowfullscreen="true" allowscriptaccess="always" width="649" height="365"></embed></object><br /><a href="http://vimeo.com/5091071">Ubiquity: Command the Web with Language 言葉で操作する Web</a> from <a href="http://vimeo.com/mitchoyoshitaka">mitcho</a> on <a href="http://vimeo.com">Vimeo</a>.</p>

<p><span id="more-2203"></span>
The event also coincided with <a href="http://www.linkedin.com/in/davemcclure">Dave McClure&#8217;s</a> <a href="http://www.geeksonaplane.com/">Geeks on a Plane</a> Asia tour, attracting even more interest to the event. In the end it was the largest Tokyo 2.0 event ever.</p>

<p>As I <a href="http://twitter.com/mitchoyoshitaka/status/1980687478">leave Tokyo next month</a>, I&#8217;ll be sad that I can no longer be a part of Tokyo 2.0. I&#8217;ve met a lot of fascinating people and learned a lot at the monthly events. I&#8217;ll definitely make sure to fit them into my future travels back to Japan, and I highly recommend that any of you who travel to Tokyo do so as well.</p>
]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/projects/ubiquity-presentation-at-tokyo-20/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Changes to Ubiquity Parser 2 and the Playpen</title>
		<link>http://mitcho.com/blog/projects/changes-to-ubiquity-parser-2-and-the-playpen/</link>
		<comments>http://mitcho.com/blog/projects/changes-to-ubiquity-parser-2-and-the-playpen/#comments</comments>
		<pubDate>Fri, 05 Jun 2009 08:21:47 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[projects]]></category>
		<category><![CDATA[code]]></category>
		<category><![CDATA[JavaScript]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[Mozilla Planet]]></category>
		<category><![CDATA[parser]]></category>
		<category><![CDATA[screencast]]></category>
		<category><![CDATA[ubiquity]]></category>
		<category><![CDATA[video]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=2173</guid>
		<description><![CDATA[Here&#8217;s a quick screencast highlighting some of the changes to Parser 2 and the updated Parser 2 Playpen. This video should be particularly useful to people hoping to add their language to Parser 2. It&#8217;s also a good reference for Ubiquity core developers. Changes to Ubiquity Parser 2 + Playpen from mitcho on Vimeo. All [...]
Related posts:<ol>
<li><a href='http://mitcho.com/blog/projects/foxkeh-demos-ubiquity-parser-the-next-generation/' rel='bookmark' title='Foxkeh demos Ubiquity Parser: The Next Generation'>Foxkeh demos Ubiquity Parser: The Next Generation</a></li>
<li><a href='http://mitcho.com/blog/projects/a-demonstration-of-ubiquity-parser-2/' rel='bookmark' title='A Demonstration of Ubiquity Parser 2'>A Demonstration of Ubiquity Parser 2</a></li>
<li><a href='http://mitcho.com/blog/how-to/adding-your-language-to-ubiquity-parser-2/' rel='bookmark' title='Adding Your Language to Ubiquity Parser 2'>Adding Your Language to Ubiquity Parser 2</a></li>
</ol>

Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.]]></description>
			<content:encoded><![CDATA[<p>Here&#8217;s a quick screencast highlighting some of the changes to Parser 2 and the updated <a href="chrome://parser-demo/content/index.html">Parser 2 Playpen</a>. This video should be particularly useful to people hoping to <a href="http://mitcho.com/blog/how-to/adding-your-language-to-ubiquity-parser-2/">add their language to Parser 2</a>. It&#8217;s also a good reference for Ubiquity core developers.</p>

<p><object width="649" height="365"><param name="allowfullscreen" value="true" /><param name="allowscriptaccess" value="always" /><param name="movie" value="http://vimeo.com/moogaloop.swf?clip_id=5013787&amp;server=vimeo.com&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=00ADEF&amp;fullscreen=1" /><embed src="http://vimeo.com/moogaloop.swf?clip_id=5013787&amp;server=vimeo.com&amp;show_title=1&amp;show_byline=1&amp;show_portrait=0&amp;color=00ADEF&amp;fullscreen=1" type="application/x-shockwave-flash" allowfullscreen="true" allowscriptaccess="always" width="649" height="365"></embed></object><br /><a href="http://vimeo.com/5013787">Changes to Ubiquity Parser 2 + Playpen</a> from <a href="http://vimeo.com/mitchoyoshitaka">mitcho</a> on <a href="http://vimeo.com">Vimeo</a>.</p>

<p>All the features covered, as with all Parser 2 features, require that you <a href="https://wiki.mozilla.org/Labs/Ubiquity/Ubiquity_0.1_Development_Tutorial">get the latest Ubiquity code</a> from our Mercurial repository.</p>
]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/projects/changes-to-ubiquity-parser-2-and-the-playpen/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>The Hit List: Better Software Through Less UI</title>
		<link>http://mitcho.com/blog/link/the-hit-list-better-software-through-less-ui/</link>
		<comments>http://mitcho.com/blog/link/the-hit-list-better-software-through-less-ui/#comments</comments>
		<pubDate>Wed, 25 Mar 2009 12:48:32 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[link]]></category>
		<category><![CDATA[AppleScript]]></category>
		<category><![CDATA[design]]></category>
		<category><![CDATA[interface]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[Mac OS X]]></category>
		<category><![CDATA[Mozilla Planet]]></category>
		<category><![CDATA[natural syntax]]></category>
		<category><![CDATA[tasks]]></category>
		<category><![CDATA[thought process]]></category>
		<category><![CDATA[ubiquity]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=1658</guid>
		<description><![CDATA[The Hit List is a to-do list app for Mac OS X with a beautiful interface and some nice features. Creator Andy Kim&#8217;s latest blog post (Better Software Through Less UI) is excellent reading for the Ubiquity community. He describes the thought process behind the design of a new clean and &#8220;frictionless&#8221; interface for specifying [...]
Related posts:<ol>
<li><a href='http://mitcho.com/blog/projects/how-natural-should-a-natural-interface-be/' rel='bookmark' title='How natural should a natural interface be?'>How natural should a natural interface be?</a></li>
<li><a href='http://mitcho.com/blog/projects/friendlier-command-feed-subscription/' rel='bookmark' title='Friendlier command feed subscription'>Friendlier command feed subscription</a></li>
<li><a href='http://mitcho.com/blog/projects/user-aided-disambiguation-a-demo/' rel='bookmark' title='User-Aided Disambiguation: a demo'>User-Aided Disambiguation: a demo</a></li>
</ol>

Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.potionfactory.com/thehitlist/">The Hit List</a> is a to-do list app for Mac OS X with a beautiful interface and some nice features. Creator Andy Kim&#8217;s latest blog post (<a href="http://www.potionfactory.com/blog/2009/03/10/better-software-through-less-ui">Better Software Through Less UI</a>) is excellent reading for the Ubiquity community. He describes the thought process behind the design of a new clean and &#8220;frictionless&#8221; interface for specifying how tasks are repeated. After throwing out the regular combinations and templates of different input widgets, <em>his solution was to implement a partial natural language input interface:</em></p>

<p><img src="http://www.potionfactory.com/files/blog/2009/03/repeating_task_1.png"/></p>

<blockquote>
  <p>There is no myriad of buttons and fields to choose from. All the user has to do is directly type in what he wants.</p>
</blockquote>

<p>Here are a couple of other choice quotes which will ring true for the Ubiquity users and internationalization folks in the audience:</p>

<blockquote>
  <p>For this to work without driving the user mad, the natural language parser has to be near perfect. The last thing I want is for this to come out smelling like AppleScript.</p>
</blockquote>

<p><span></span></p>

<blockquote>
  <p><strong>Problems</strong><br/>This design isn&#8217;t perfect as it has two glaring problems. One is that the user has no easy way of discovering how complex the recurrence rules can be. This isn&#8217;t such a huge problem, but a way to solve this is to include a help button to show example rules or to include an accompanying iCal style UI to let the user setup the recurrence rule in a more typical fashion. I didn&#8217;t include these in the initial implementation though because I wanted to see how users would react to this kind of UI.<br/>Another problem is localization. Even if I write parsers for a few more popular languages, it won&#8217;t accommodate the rest of the users in the world. Again, the solution is an accompanying traditional UI, but for now, I&#8217;m leaving it the way it is until I get some feedback.</p>
</blockquote>

<p>There&#8217;s a trend in the wind, my friends: the incorporation of near-natural language for more <a href="http://humanized.com/weblog/2006/06/01/why_humane_is_a_better_word_than_usable/">humane</a> interfaces.</p>
]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/link/the-hit-list-better-software-through-less-ui/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>User-Aided Disambiguation: a demo</title>
		<link>http://mitcho.com/blog/projects/user-aided-disambiguation-a-demo/</link>
		<comments>http://mitcho.com/blog/projects/user-aided-disambiguation-a-demo/#comments</comments>
		<pubDate>Sat, 14 Mar 2009 06:08:24 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[projects]]></category>
		<category><![CDATA[ambiguity]]></category>
		<category><![CDATA[arguments]]></category>
		<category><![CDATA[code]]></category>
		<category><![CDATA[interface]]></category>
		<category><![CDATA[Japanese language]]></category>
		<category><![CDATA[JavaScript]]></category>
		<category><![CDATA[jQuery]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[Mozilla Planet]]></category>
		<category><![CDATA[natural syntax]]></category>
		<category><![CDATA[parser]]></category>
		<category><![CDATA[ubiquity]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=1572</guid>
		<description><![CDATA[A few weeks ago I made some visual mockups of how Ubiquity could look and act in Japanese. Part of this proposal was what I called &#8220;particle identification&#8221;: that is, immediate in-line identification of delimiters of arguments, which can be overridden by the user: The inspiration for this idea came from Aza&#8217;s blog post &#8220;Solving [...]
Related posts:<ol>
<li><a href='http://mitcho.com/blog/projects/how-natural-should-a-natural-interface-be/' rel='bookmark' title='How natural should a natural interface be?'>How natural should a natural interface be?</a></li>
<li><a href='http://mitcho.com/blog/projects/ubiquity-in-firefox-japanese/' rel='bookmark' title='Ubiquity in Firefox: Focus on Japanese'>Ubiquity in Firefox: Focus on Japanese</a></li>
<li><a href='http://mitcho.com/blog/projects/friendlier-command-feed-subscription/' rel='bookmark' title='Friendlier command feed subscription'>Friendlier command feed subscription</a></li>
</ol>

Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.]]></description>
			<content:encoded><![CDATA[<p>A few weeks ago I made some visual mockups of <a href="http://mitcho.com/blog/projects/ubiquity-in-firefox-japanese/">how Ubiquity could look and act in Japanese</a>. Part of this proposal was what I called &#8220;particle identification&#8221;: that is, immediate in-line identification of delimiters of arguments, which can be overridden by the user:</p>

<p><center><img src='http://mitcho.com/blog/wp-content/uploads/2009/02/particle-id.png'/></center></p>

<p>The inspiration for this idea came from Aza&#8217;s blog post <a href="http://www.azarask.in/blog/post/solving-the-it-problem/">&#8220;Solving the &#8216;it&#8217; problem&#8221;</a>, which advocates this type of quick feedback to the user in cases of ambiguity. Such a method would both help the user better understand what the system is interpreting and offer an opportunity to correct improper parses. I just tried mocking up such an input box using <a href="http://jquery.com">jQuery</a>.</p>

<h3>➔ <a href='http://mitcho.com/code/ubiquity/ambiguity-demo/'>Try the User-Aided Disambiguation Demo</a></h3>

<p>If you have any bugfixes to submit or want to play around with your own copy, the demo code is <a href="http://bitbucket.org/mitcho/ubiquity-parser-tng/">up on BitBucket</a>. ^^ Let me know what you think!</p>
<p>Related posts:</p><ol>
<li><a href='http://mitcho.com/blog/projects/how-natural-should-a-natural-interface-be/' rel='bookmark' title='How natural should a natural interface be?'>How natural should a natural interface be?</a></li>
<li><a href='http://mitcho.com/blog/projects/ubiquity-in-firefox-japanese/' rel='bookmark' title='Ubiquity in Firefox: Focus on Japanese'>Ubiquity in Firefox: Focus on Japanese</a></li>
<li><a href='http://mitcho.com/blog/projects/friendlier-command-feed-subscription/' rel='bookmark' title='Friendlier command feed subscription'>Friendlier command feed subscription</a></li>
</ol>
<p>Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.</p>]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/projects/user-aided-disambiguation-a-demo/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Contribute: how your language identifies its arguments</title>
		<link>http://mitcho.com/blog/projects/contribute-how-your-language-identifies-its-arguments/</link>
		<comments>http://mitcho.com/blog/projects/contribute-how-your-language-identifies-its-arguments/#comments</comments>
		<pubDate>Wed, 18 Feb 2009 09:37:18 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[projects]]></category>
		<category><![CDATA[arguments]]></category>
		<category><![CDATA[coding properties]]></category>
		<category><![CDATA[contribute]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[grammatical relations]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[Mozilla Planet]]></category>
		<category><![CDATA[ubiquity]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=1450</guid>
		<description><![CDATA[Earlier today I blogged on three different strategies languages use to mark the roles of different arguments: word order, marking on the arguments, and marking on the verbs. I gathered some data from the fantastic World Atlas of Language Structures to put together a survey of many of the languages on the Internet. For each [...]
Related posts:<ol>
<li><a href='http://mitcho.com/blog/projects/three-ways-to-argue-over-arguments/' rel='bookmark' title='Three ways to argue over arguments'>Three ways to argue over arguments</a></li>
<li><a href='http://mitcho.com/blog/observation/testing-googles-language-detection/' rel='bookmark' title='Testing Google&#8217;s Language Detection'>Testing Google&#8217;s Language Detection</a></li>
<li><a href='http://mitcho.com/blog/link/setting-language-research-to-music/' rel='bookmark' title='Setting Language Research to Music'>Setting Language Research to Music</a></li>
</ol>

Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.]]></description>
			<content:encoded><![CDATA[<p>Earlier today <a href="http://mitcho.com/blog/projects/three-ways-to-argue-over-arguments/">I blogged on three different strategies</a> languages use to mark the roles of different arguments: word order, marking on the arguments, and marking on the verbs.</p>

<p>I gathered some data from the fantastic <a href="http://wals.info/">World Atlas of Language Structures</a> to put together a survey of many of the languages on the Internet. For each language, I recorded the canonical word order and whether the language marks the roles of its arguments on the verb and/or on the arguments themselves.</p>

<iframe width='605' height='300' frameborder='0' src='http://spreadsheets.google.com/pub?key=pE-nN92qp_pa5P6YbUOw0HQ&#038;output=html'></iframe>

<p>As you can see, there are a number of data points that are still missing. <strong>Please contribute information on the languages you speak!</strong> You can <a href="http://spreadsheets.google.com/ccc?key=pE-nN92qp_pa5P6YbUOw0HQ">edit the spreadsheet on Google Docs</a>. Thanks!</p>
<p>Related posts:</p><ol>
<li><a href='http://mitcho.com/blog/projects/three-ways-to-argue-over-arguments/' rel='bookmark' title='Three ways to argue over arguments'>Three ways to argue over arguments</a></li>
<li><a href='http://mitcho.com/blog/observation/testing-googles-language-detection/' rel='bookmark' title='Testing Google&#8217;s Language Detection'>Testing Google&#8217;s Language Detection</a></li>
<li><a href='http://mitcho.com/blog/link/setting-language-research-to-music/' rel='bookmark' title='Setting Language Research to Music'>Setting Language Research to Music</a></li>
</ol>
<p>Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.</p>]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/projects/contribute-how-your-language-identifies-its-arguments/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
		<item>
		<title>How natural should a natural interface be?</title>
		<link>http://mitcho.com/blog/projects/how-natural-should-a-natural-interface-be/</link>
		<comments>http://mitcho.com/blog/projects/how-natural-should-a-natural-interface-be/#comments</comments>
		<pubDate>Mon, 16 Feb 2009 11:00:14 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[projects]]></category>
		<category><![CDATA[AppleScript]]></category>
		<category><![CDATA[code]]></category>
		<category><![CDATA[interface]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[Mozilla]]></category>
		<category><![CDATA[Mozilla Planet]]></category>
		<category><![CDATA[natural syntax]]></category>
		<category><![CDATA[tasks]]></category>
		<category><![CDATA[ubiquity]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=1382</guid>
		<description><![CDATA[I&#8217;m very happy to announce that, starting today, I will be working full-time on Ubiquity, a Mozilla Labs experiment to connect the web with language. I&#8217;ll be heading up research on different linguistic issues of import to a linguistic user interface and blogging about these topics here. If you&#8217;re interested, please subscribe to my blog&#8217;s [...]
Related posts:<ol>
<li><a href='http://mitcho.com/blog/projects/yarpp-3-templates/' rel='bookmark' title='Using Templates with YARPP 3'>Using Templates with YARPP 3</a></li>
<li><a href='http://mitcho.com/blog/observation/testing-googles-language-detection/' rel='bookmark' title='Testing Google&#8217;s Language Detection'>Testing Google&#8217;s Language Detection</a></li>
<li><a href='http://mitcho.com/blog/observation/%e5%8f%8e%e9%9b%86-vs-%e5%9b%9e%e5%8f%8e-and-better-word-meanings-through-usage/' rel='bookmark' title='回収 vs. 収集 and Better Word Meanings Through Usage'>回収 vs. 収集 and Better Word Meanings Through Usage</a></li>
</ol>

Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.]]></description>
			<content:encoded><![CDATA[<p><em>I&#8217;m very happy to announce that, starting today, I will be working full-time on <a href="http://ubiquity.mozilla.com">Ubiquity</a>, a <a href="http://labs.mozilla.com">Mozilla Labs</a> experiment to connect the web with language. I&#8217;ll be heading up research on different linguistic issues of import to a linguistic user interface and blogging about these topics here. If you&#8217;re interested, please subscribe to <a href="http://mitcho.com/blog/feed/blog-only/">my blog&#8217;s RSS feed</a> or <a href="http://mitcho.com/blog/tag/ubiquity/feed/">the RSS feed for only Ubiquity-related items</a>. Commenting is encouraged! ^^</em></p>

<p>Every day, more users are trying out Ubiquity, the Mozilla Labs experiment that lets users accomplish common Internet tasks faster through a natural language interface. As we live more and more of our lives on the web, there is a huge appeal to, and need for, a faster way to access and mash up our information.</p>

<p>But what exactly do we mean by a &#8220;natural language interface&#8221;? Is it just another programming language with lots of English keywords? Should the final goal be a computer that understands everything we tell it?</p>

<p><img src="http://mitcho.com/blog/wp-content/uploads/2009/02/ubiqhal2.jpg" alt="Ubiquity is not HAL" title="I'm sorry Dave, I'm afraid I can't do that." width="650" height="220" /></p>

<p>As we think about the future directions and possibilities of Ubiquity, we need to go back to our roots and understand the project&#8217;s motivations. With that in mind, here are some initial thoughts on the advantages of a natural language interface. The ultimate goal here is to refine the notion of natural language interface and to come up with a set of principles that we can follow in pushing Ubiquity further, into other languages and beyond.</p>

<p><span id="more-1382"></span></p>

<h3>Why language?</h3>

<p>In his <a href="http://doi.acm.org/10.1145/1330526.1330535">2008 article in interactions</a>, <a href="http://azarask.in">Aza</a> describes a clear need for modern UI to move beyond monolithic do-everything apps into efficient, granular commands that can be connected to accomplish tasks. Hierarchical menus with an application&#8217;s every possible function are great for discoverability, but slow and inefficient as they grow. Aza advocates for the use of a familiar subset of natural language to this end. In his own words,</p>

<blockquote>
  <p>Words can capture abstractions that pictures cannot because language has an immense amount of descriptive and differentiating power. Abstract thoughts are exactly represented by the words that give them names. It is this power that comes to the rescue in specifying functionality.</p>
</blockquote>

<p>In other words, language gives us the descriptive power to succinctly and creatively express our will, far faster than a series of menus, and with more freedom than a series of shortcuts or gestures. In addition, by tapping into the lexicon of our everyday language, we make a direct attack on the learnability problem.<sup id="fnref:1"><a href="#fn:1" rel="footnote">1</a></sup></p>

<h3>The natural syntax test</h3>

<p>The ability to string different commands together is not a novel one—indeed, this is what more traditional command lines and programming languages offer. However, these technologies present a huge barrier to the layperson, even for languages with many keywords from English or English-like syntax.</p>

<p>Programming languages can be such teases in this way. Often the first bits of code in a language look remarkably similar to natural language (<a href="http://en.wikipedia.org/wiki/Python">Python</a>):</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;Hello World&quot;</span></pre></div></div>


<p>&#8230;but the young coder is quickly disappointed:</p>


<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #008000;">map</span><span style="color: black;">&#40;</span><span style="color: #ff7700;font-weight:bold;">lambda</span> x: x<span style="color: #66cc66;">*</span><span style="color: #ff4500;">2</span>, <span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>,<span style="color: #ff4500;">2</span>,<span style="color: #ff4500;">3</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span></pre></div></div>


<p><a href="http://en.wikipedia.org/wiki/AppleScript">AppleScript</a> is a language which tries to take this idea further and, indeed, sometimes AppleScript code constitutes readable English.</p>


<div class="wp_syntax"><div class="code"><pre class="applescript" style="font-family:monospace;"><span style="color: #0066ff;">print</span> pages <span style="color: #000000;">1</span> <span style="color: #ff0033;">thru</span> <span style="color: #000000;">5</span> <span style="color: #ff0033; font-weight: bold;">of</span> <span style="color: #0066ff;">document</span> <span style="color: #000000;">2</span></pre></div></div>


<p>Dig a little deeper, though, and AppleScript also fails the &#8220;natural syntax&#8221; test. In fact, it can be argued that a language that <em>looks</em> like a natural language but differs in some important details can be even more difficult to use than one that is completely novel. Bill Cook, one of the original developers of <a href="http://en.wikipedia.org/wiki/AppleScript">AppleScript</a>, makes this point in <a href="http://www.cs.utexas.edu/~wcook/Drafts/2006/ashopl.pdf">his history of AppleScript</a>: &#8220;in hindsight, it is not clear whether it is easier for novice users to work with a scripting language that resembles natural language, with all its special cases and idiosyncrasies.&#8221;</p>

<p><strong>If the interface&#8217;s syntax is too restrictive or, worse, conflicts with a user&#8217;s natural intuitions about their natural language, it immediately fails to be &#8220;natural&#8221;, no matter how similar the keywords or grammar are.</strong><sup id="fnref:2"><a href="#fn:2" rel="footnote">2</a></sup></p>

<h3>Towards a natural (and forgiving) syntax</h3>

<p>Aza similarly laments the relegation of text-based interfaces to the higher echelons of geekdom in his 2008 paper: &#8220;if commands were memorable, and their syntax <em>forgiving</em>, perhaps we wouldn&#8217;t be so scared to reconsider these interface paradigms.&#8221;</p>

<p>The key word &#8220;forgiving&#8221; above (emphasis mine) is ambiguous between two senses, and we want a natural language interface to be forgiving in both:</p>

<ol>
<li><em>Forgiving</em> as in &#8220;not difficult to learn and remember&#8221;: the syntax must be easy and natural for the user, encouraging experimentation and intuitive application;</li>
<li><em>Forgiving</em> as in &#8220;not correcting or prescriptive&#8221;: the system should try its darndest to accept the user&#8217;s input, even if it&#8217;s not the most &#8220;well-formed.&#8221;</li>
</ol>

<p>From an implementation point of view, (2) above can also be an advantage. There are many grammatical restrictions in natural language which, as long as the command is unambiguous, Ubiquity need not enforce on the user. Take, for example, the two statements:</p>


<div class="wp_syntax"><div class="code"><pre class="ubiquity" style="font-family:monospace;">print two copy
print two copies</pre></div></div>


<p>I feel that Ubiquity should execute both of these statements with equal ease. The numeral &#8220;two&#8221; makes the user&#8217;s intent very clear, even though the plural of &#8220;copy&#8221; should indeed be &#8220;copies.&#8221; It need not be the job of the interface to decide whether a sentence is &#8220;correct English.&#8221; By assuming that the user is trying to communicate a valid and possible task, rather than throwing up an error, the system will be more flexible and more forgiving in the inevitable case of human error. <strong>The ultimate goal should be to help the user accomplish their task.</strong></p>
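<p>To make this concrete, here is a minimal sketch of a forgiving parse for the two statements above. It is not Ubiquity&#8217;s parser: the <code>parse_print</code> function, the tiny <code>NUMERALS</code> table and the naive <code>singularize</code> helper are all hypothetical names invented for this illustration. The only idea it demonstrates is that the numeral carries the count, so the noun&#8217;s number marking can be normalized away instead of rejected.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">import re

# Hypothetical numeral table; a real system would need far more coverage.
NUMERALS = {"one": 1, "two": 2, "three": 3}

def singularize(noun):
    """Very naive English singularizer, for illustration only."""
    if noun.endswith("ies"):
        return noun[:-3] + "y"
    if noun.endswith("s"):
        return noun[:-1]
    return noun

def parse_print(command):
    """Parse 'print NUMERAL NOUN'; return None if the shape is not recognized."""
    match = re.fullmatch(r"print\s+(\w+)\s+(\w+)", command.strip().lower())
    if match is None or match.group(1) not in NUMERALS:
        return None
    return {
        "verb": "print",
        "count": NUMERALS[match.group(1)],
        "object": singularize(match.group(2)),
    }

# Both inputs yield the same parse: the numeral settles the count.
print(parse_print("print two copy"))
print(parse_print("print two copies"))</pre></div></div>

<p>Note that the sketch never decides whether the input is &#8220;correct English&#8221;; it only asks whether the intended task can be recovered.</p>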

<h3>Conclusion</h3>

<p>By developing a language interface which truly <em>feels</em> natural to the user, we can successfully bring the power of text-based interfaces to the masses. I feel the key to this &#8220;natural-ness&#8221; is a less restrictive and in fact <em>forgiving</em> syntax. This goal, akin to <a href="http://en.wikipedia.org/wiki/natural language programming">natural language programming</a>, may be daunting from an implementation angle, and it may indeed prove impossible. But as long as the goal is to execute simple imperative commands, the scope of the target syntactic structures is limited.</p>

<p>Ubiquity as it stands is many different things for many people. The natural language guidelines above may feel too restrictive to many current developers for whom Ubiquity is simply a convenient new way to extend Firefox.<sup id="fnref:3"><a href="#fn:3" rel="footnote">3</a></sup> This discussion also seems orthogonal to the <a href="http://www.azarask.in/blog/post/can-ubiquity-be-used-only-with-the-mouse/">mouse-based Ubiquity experiments</a>. <strong>As users and developers, how do you feel about the potential benefits and downsides of these natural syntax guidelines?</strong> In the coming days I&#8217;ll look at some concrete examples of what this &#8220;forgiving&#8221; syntax would demand of Ubiquity.</p>

<div class="footnotes">
<hr />
<ol>

<li id="fn:1">
<p>The learnability problem of a linguistic interface, particularly in light of the usability vs. discoverability paradigm, is a topic for a future post. <img src='http://mitcho.com/blog/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> &#160;<a href="#fnref:1" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:2">
<p>It&#8217;s important to note that the &#8220;restrictions&#8221; I&#8217;m concerned with here are syntactic ones, not lexical ones. That is, if either of the Ubiquity commands below fails because we don&#8217;t have a &#8220;pass&#8221; verb, that&#8217;s fine. But if Ubiquity accepts one string but not the other, that&#8217;s a syntactic restriction which goes against our English intuition.


<div class="wp_syntax"><div class="code"><pre class="ubiquity" style="font-family:monospace;">pass Jono the ball
pass the ball to Jono</pre></div></div>




I&#8217;ll cover this in a future post.&#160;<a href="#fnref:2" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:3">
<p>In fact, I myself am also guilty of this&#8230; my <a href="http://mitcho.com/code/select/">select command</a> for SQL queries clearly does not encourage a natural language-compatible syntax.&#160;<a href="#fnref:3" rev="footnote">&#8617;</a></p>
</li>

</ol>
</div>
<p>Related posts:</p><ol>
<li><a href='http://mitcho.com/blog/projects/yarpp-3-templates/' rel='bookmark' title='Using Templates with YARPP 3'>Using Templates with YARPP 3</a></li>
<li><a href='http://mitcho.com/blog/observation/testing-googles-language-detection/' rel='bookmark' title='Testing Google&#8217;s Language Detection'>Testing Google&#8217;s Language Detection</a></li>
<li><a href='http://mitcho.com/blog/observation/%e5%8f%8e%e9%9b%86-vs-%e5%9b%9e%e5%8f%8e-and-better-word-meanings-through-usage/' rel='bookmark' title='回収 vs. 収集 and Better Word Meanings Through Usage'>回収 vs. 収集 and Better Word Meanings Through Usage</a></li>
</ol>
<p>Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.</p>]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/projects/how-natural-should-a-natural-interface-be/feed/</wfw:commentRss>
		<slash:comments>17</slash:comments>
		</item>
		<item>
		<title>回収 vs. 収集 and Better Word Meanings Through Usage</title>
		<link>http://mitcho.com/blog/observation/%e5%8f%8e%e9%9b%86-vs-%e5%9b%9e%e5%8f%8e-and-better-word-meanings-through-usage/</link>
		<comments>http://mitcho.com/blog/observation/%e5%8f%8e%e9%9b%86-vs-%e5%9b%9e%e5%8f%8e-and-better-word-meanings-through-usage/#comments</comments>
		<pubDate>Thu, 18 Sep 2008 14:50:27 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[life]]></category>
		<category><![CDATA[observation]]></category>
		<category><![CDATA[Bailey]]></category>
		<category><![CDATA[cognitive linguistics]]></category>
		<category><![CDATA[corpora]]></category>
		<category><![CDATA[corpus]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[frame semantics]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Japanese language]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[language learning]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[synonymy]]></category>
		<category><![CDATA[translation]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=721</guid>
		<description><![CDATA[Bailey just asked me what the difference between 回収 (kaishū) and 収集(shūshū) is—two words that would both map to the English verb &#8220;collect.&#8221; I intuitively came up with a hypothesis to explain the distinction: 回収 may take things away from others when collecting while 収集 does not have that implication. Things that you 回収 may [...]
Related posts:<ol>
<li><a href='http://mitcho.com/blog/observation/testing-googles-language-detection/' rel='bookmark' title='Testing Google&#8217;s Language Detection'>Testing Google&#8217;s Language Detection</a></li>
<li><a href='http://mitcho.com/blog/life/taipei-find-a-dictionary-of-chinese-japanese-false-cognates/' rel='bookmark' title='Taipei find: a dictionary of Chinese-Japanese false cognates'>Taipei find: a dictionary of Chinese-Japanese false cognates</a></li>
<li><a href='http://mitcho.com/blog/life/the-most-beautiful-word/' rel='bookmark' title='The Most Beautiful Word'>The Most Beautiful Word</a></li>
</ol>

Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.]]></description>
			<content:encoded><![CDATA[<p><a href="http://bpick.tumblr.com/">Bailey</a> just asked me what the difference is between 回収 (<em>kaishū</em>) and 収集 (<em>shūshū</em>), two words that would both map to the English verb &#8220;collect.&#8221; I intuitively came up with a hypothesis to explain the distinction:</p>

<ul>
<li>回収 may take things away from others when collecting while 収集 does not have that implication.</li>
<li>Things that you 回収 may have been previously distributed by the actor themself while 収集 does not have that implication.<sup id="fnref:3"><a href="#fn:3" rel="footnote">1</a></sup></li>
</ul>

<p>Not content with armchair theorizing, however, I decided to take advantage of one of the largest corpora in the world: <a href="http://en.wikipedia.org/wiki/Google">Google</a>.<sup id="fnref:2"><a href="#fn:2" rel="footnote">2</a></sup> To test my hypothesis, I chose two &#8220;objects of collection&#8221;: one you can take away (and which is often distributed first) and one you can&#8217;t take away, namely アンケート (<em>ankēto</em> &#8220;survey,&#8221; from the French <em>enquête</em>) and 意見 (<em>iken</em> &#8220;opinion&#8221;). I then searched Google for each of the four resulting collocations<sup id="fnref:1"><a href="#fn:1" rel="footnote">3</a></sup> in quotes (&#8220;•&#8221;) and recorded the number of hits.</p>

<p><span id="more-721"></span></p>

<table style="margin-left: auto; margin-right: auto;">
<tr><th>&#8220;意見を収集&#8221;</th><th>&#8220;意見を回収&#8221;</th><th>&#8220;アンケートを収集&#8221;</th><th>&#8220;アンケートを回収&#8221;</th></tr>
<tr><td>218000</td><td>6200</td><td>784</td><td>169000</td></tr>
</table>

<p>A better way to organize this data is as follows:</p>

<table style="margin-left: auto; margin-right: auto;">
<tr><th>&#8220;↓を→&#8221;</th><th>回収</th><th>収集</th></tr>
<tr><th>アンケート</th><td>169000</td><td>784</td></tr>
<tr><th>意見</th><td>6200</td><td>218000</td></tr>
</table>

<p>This data clearly supports the hypothesis I laid out above: アンケート, which can be taken away from people and is often distributed first, is much more likely to occur with 回収 than with 収集. 意見, on the other hand, which crucially cannot be taken away when collected, is much more likely to occur with 収集 than with 回収.</p>
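<p>To put rough numbers on this contrast, the hit counts can be turned into shares per noun. The snippet below is only a sketch: the counts are copied from the first table above, while the <code>hits</code> dictionary and the <code>share</code> helper are names made up for this illustration, and raw Google hit counts are at best a noisy proxy for real frequencies.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"># Hit counts as recorded in the first table of this post.
hits = {
    ("アンケート", "回収"): 169000,
    ("アンケート", "収集"): 784,
    ("意見", "回収"): 6200,
    ("意見", "収集"): 218000,
}

def share(noun, verb):
    """Fraction of the hits for this noun that use the given verb."""
    total = sum(count for (n, _), count in hits.items() if n == noun)
    return hits[(noun, verb)] / total

# Print each noun/verb pairing with its share of that noun's hits.
for noun in ("アンケート", "意見"):
    for verb in ("回収", "収集"):
        print(noun, verb, f"{share(noun, verb):.1%}")</pre></div></div>

<p>With these counts, each noun pairs with its preferred verb in well over nine out of ten hits.</p>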

<p>While this one example doesn&#8217;t <em>prove</em> anything in and of itself, it does help clarify with data a nuance between two near synonyms. While my hypothesis was borne out here, native speaker intuitions on word nuances and distinctions can be unreliable.<sup id="fnref:4"><a href="#fn:4" rel="footnote">4</a></sup> This type of quick test can be very helpful for language learners and instructors alike.</p>

<p>Languages very often have words which vary in very subtle ways. Just this Tuesday I went to a <a href="http://linguistic.meetup.com/58/">Tokyo Language Exchange Meetup</a>, a great <a href="http://en.wikipedia.org/wiki/meetup.com">meetup</a> which brought together various language learners and enthusiasts. A hot topic that night was words with very similar meanings—near synonyms. A few English learners were lamenting sets of words like {see, view, watch} and how difficult they are to learn. I myself have had the same experience studying Mandarin.</p>

<p>I noted that these difficulties in offering contrasting definitions are often due to the fact that word meanings are not just &#8220;what the word points to&#8221; but also the implication of &#8220;what it relates to&#8221;.<sup id="fnref:5"><a href="#fn:5" rel="footnote">5</a></sup> For example, &#8220;unborn baby&#8221; and &#8220;fetus&#8221; may point to the same thing, but are used in different contexts, in contrast to different other terms, for differing effect. The same goes for &#8220;Death Tax&#8221; and &#8220;Estate Tax,&#8221; or &#8220;kneel&#8221; and &#8220;genuflect.&#8221;<sup id="fnref:6"><a href="#fn:6" rel="footnote">6</a></sup></p>

<p>The concept of word meanings being &#8220;what it points to&#8221; and &#8220;what it relates to&#8221; also helps explain why certain words are difficult to translate. Fillmore uses the Japanese example of ぬるい (<em>nurui</em>), which is the de facto translation of &#8220;lukewarm.&#8221; However, some Japanese speakers will only use ぬるい in contrast with &#8220;hot,&#8221; i.e., hot tea can become ぬるい over time but ice water does not become ぬるい. In contrast, English &#8220;lukewarm&#8221; can be used to describe things that are initially or prototypically hot or cold. &#8220;What the words point to&#8221; in this case is the same, but &#8220;what they relate to&#8221; or, here, &#8220;what they contrast with&#8221; is different, making ぬるい an imperfect (though very close) translation of &#8220;lukewarm.&#8221;</p>

<p>Every language has near synonyms which vary slightly in nuance but this nuance or &#8220;feeling&#8221; is borne out objectively in data. Looking at what words certain terms relate to <em>in real usage</em> is often the key to getting a richer understanding of vocabulary.</p>

<div class="footnotes">
<hr />
<ol>

<li id="fn:3">
<p>This second point could also be hypothesized based on the component meaning of 回, which in the verb 回る (<em>mawa=ru</em>) can mean &#8220;circle back.&#8221;&#160;<a href="#fnref:3" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:2">
<p>Google is of course a huge corpus but it has very limited search and can easily be misused and misunderstood, thus making Google an unreliable (unprofessional) source for statistical data. One Google alternative for some different statistics is the <a href="http://en.wikipedia.org/wiki/n-gram">n-gram</a> <a href="http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html">data they offer</a> for research.&#160;<a href="#fnref:2" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:1">
<p><a href="http://en.wikipedia.org/wiki/collocation">&#8220;Collocation&#8221; on Wikipedia</a> says: &#8220;Within the area of corpus linguistics, collocation is defined as a sequence of words or terms which co-occur more often than would be expected by chance.&#8221;&#160;<a href="#fnref:1" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:4">
<p>Hm&#8230; I just made a claim&#8230; looking for a citation.&#160;<a href="#fnref:4" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:5">
<p>&#8220;Relates to&#8221; here is not meant in an etymological sense. In <a href="http://en.wikipedia.org/wiki/frame semantics (linguistics)">frame semantics</a>, a part of <a href="http://en.wikipedia.org/wiki/cognitive linguistics">cognitive linguistics</a>, the &#8220;what the word points to&#8221; may be called a <strong>profile</strong> while the &#8220;what it relates to&#8221; is called the <strong>(semantic) frame</strong>. These distinctions are due to the work of <a href="http://en.wikipedia.org/wiki/Charles J. Fillmore">Fillmore</a> 1976.&#160;<a href="#fnref:5" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:6">
<p>The great examples in this section come from Bill Croft and D. Alan Cruse&#8217;s <em>Cognitive Linguistics</em>, 2004.&#160;<a href="#fnref:6" rev="footnote">&#8617;</a></p>
</li>

</ol>
</div>
<p>Related posts:</p><ol>
<li><a href='http://mitcho.com/blog/observation/testing-googles-language-detection/' rel='bookmark' title='Testing Google&#8217;s Language Detection'>Testing Google&#8217;s Language Detection</a></li>
<li><a href='http://mitcho.com/blog/life/taipei-find-a-dictionary-of-chinese-japanese-false-cognates/' rel='bookmark' title='Taipei find: a dictionary of Chinese-Japanese false cognates'>Taipei find: a dictionary of Chinese-Japanese false cognates</a></li>
<li><a href='http://mitcho.com/blog/life/the-most-beautiful-word/' rel='bookmark' title='The Most Beautiful Word'>The Most Beautiful Word</a></li>
</ol>
<p>Related posts brought to you by <a href='http://yarpp.org'>Yet Another Related Posts Plugin</a>.</p>]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/observation/%e5%8f%8e%e9%9b%86-vs-%e5%9b%9e%e5%8f%8e-and-better-word-meanings-through-usage/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Testing Google&#8217;s Language Detection</title>
		<link>http://mitcho.com/blog/observation/testing-googles-language-detection/</link>
		<comments>http://mitcho.com/blog/observation/testing-googles-language-detection/#comments</comments>
		<pubDate>Sat, 17 May 2008 09:47:04 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[observation]]></category>
		<category><![CDATA[Chinese]]></category>
		<category><![CDATA[Chinese characters]]></category>
		<category><![CDATA[code]]></category>
		<category><![CDATA[computational linguistics]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Japanese language]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[language detection]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[Mandarin]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=254</guid>
		<description><![CDATA[As Google adds ten more languages to its machine translation service, it seems to be on its way to becoming the most convenient universal translator of the world&#8217;s popular languages. Google&#8217;s handling of languages of course isn&#8217;t perfect, however—in particular, I&#8217;ve been complaining to friends for a while about the weaknesses of Google&#8217;s handling of [...]
Related posts:<ol>
<li><a href='http://mitcho.com/blog/life/taipei-find-a-dictionary-of-chinese-japanese-false-cognates/' rel='bookmark' title='Taipei find: a dictionary of Chinese-Japanese false cognates'>Taipei find: a dictionary of Chinese-Japanese false cognates</a></li>
<li><a href='http://mitcho.com/blog/life/bailey-won-the-japanese-language-speech-contest/' rel='bookmark' title='Bailey won the Japanese Language Speech Contest'>Bailey won the Japanese Language Speech Contest</a></li>
<li><a href='http://mitcho.com/blog/link/setting-language-research-to-music/' rel='bookmark' title='Setting Language Research to Music'>Setting Language Research to Music</a></li>
</ol>

]]></description>
			<content:encoded><![CDATA[<p><img src="http://mitcho.com/blog/wp-content/uploads/2008/05/google-code.png" alt="google code" title="google-code" width="156" height="57" /></p>

<p>As <a href="http://googleblog.blogspot.com/2008/05/google-translate-adds-10-new-languages.html">Google adds ten more languages to its machine translation service</a>, it seems to be on its way to becoming the most convenient <a href="http://en.wikipedia.org/wiki/universal translator">universal translator</a> for the world&#8217;s popular languages. Google&#8217;s handling of languages isn&#8217;t perfect, however. In particular, I&#8217;ve been complaining to friends for a while about how poorly Google handles queries in Chinese character (<a href="http://en.wikipedia.org/wiki/Chinese characters">漢字/汉字</a>) scripts. In this post, I run some tests using Google&#8217;s <a href="http://googleblog.blogspot.com/2008/03/new-google-ajax-language-api-tools-for.html">Language Detection service</a> to try to better understand its handling of Chinese character queries.</p>

<h3>Background</h3>

<p>Chinese characters have been used all across East Asia, most notably in Chinese, Japanese, Korean, and Vietnamese (the &#8220;CJKV&#8221;). Prescriptivist writing reforms in Communist China and Japan have simplified many characters, though. Some characters were simplified in the same way, some in different ways, and some in only one country but not the other. For more information, there&#8217;s <a href="http://en.wikipedia.org/wiki/Chinese character">Wikipedia</a> or <a href="http://books.google.com/books?id=htlttpi1KOoC">Ken Lunde&#8217;s CJKV Information Processing</a>.</p>

<h3>The problem</h3>

<p>The issue comes up when you search for a word written in Chinese characters that clearly belongs to one particular Chinese character-using language. In my experience, <strong>Google doesn&#8217;t use the query to work out which language you are searching in, and returns many results in other Chinese character-using languages as well.</strong><sup id="fnref:1"><a href="#fn:1" rel="footnote">1</a></sup></p>

<p><span id="more-254"></span></p>

<p>Take, for example, a query like &#8220;七面鳥&#8221;, meaning &#8216;turkey&#8217; in Japanese. While all characters are very common in traditional Chinese (鳥 is simplified to 鸟 in China), the combination &#8220;七面鳥&#8221; is quite rare in Chinese. However, when you search for &#8220;七面鳥,&#8221; many of the first results are in Chinese and only two of the first ten results are in Japanese.</p>

<p>Does Google&#8217;s corpus not identify &#8220;七面鳥&#8221; as a primarily Japanese word? It does: searching for &#8220;七面鳥&#8221; and restricting the results to one language at a time yields the hit counts below. A similar effect can be seen with other Japanese words such as &#8220;芝生&#8221; (&#8216;grass&#8217;) or &#8220;泥棒&#8221; (&#8216;burglar&#8217;). The &#8220;Japanese on first page&#8221; column gives the number of Japanese results among the first ten in a language-unspecified search from the US.</p>

<table style="margin-left: auto; margin-right: auto;">
<tr><th>&nbsp;</th><th>Chinese (simplified)</th><th> Chinese (traditional)</th><th> Japanese </th><th>Japanese on<br />first page</th></tr>
<tr><th>七面鳥</th><td>786</td><td>926</td><td>395,000</td><td>2/10</td></tr>
<tr><th>芝生</th><td>55,600</td><td>216,000</td><td>2,230,000</td><td>0/10</td></tr>
<tr><th>泥棒</th><td>13,500</td><td>22,500</td><td>10,400,000</td><td>3/10</td></tr>
</table>

<p>In a perfect world, I would like Google to <strong>identify the language that the query is in</strong>, and then <strong>rank results that are in that language higher</strong> in the results list. So the issue comes down to one of <strong>language detection</strong>.</p>

<iframe src="http://www.google.com/uds/samples/language/detect.html" width='400px' height="200px"></iframe>

<p>There are broadly two different approaches to language detection and, indeed, all natural language processing problems: <em>parsing</em> and <em>counting</em>. In this case, parsing involves trying to break apart the query into words and then computing how likely such a string of <em>words</em> is in each given language. Counting simply takes an inventory of the characters given and compares them to their frequencies in each language, computing how likely such a string of <em>characters</em> is in each language. Parsing is the &#8220;smarter&#8221; approach, but more difficult and computationally intensive.</p>
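<p>To make the counting approach concrete, here is a rough Python sketch. It is not Google&#8217;s implementation: the frequency table, the smoothing parameters, and the function names are all made up for illustration. It scores a string by adding up a smoothed log-probability for each character under each language and picks the best-scoring language.</p>

```python
import math
from collections import Counter

# Toy per-language character counts. These numbers are hypothetical and
# exist only to illustrate the idea; a real table would come from a corpus.
TOY_COUNTS = {
    "ja": Counter({"気": 50, "参": 20, "加": 20, "漢": 10, "字": 10, "木": 15}),
    "zh-Hant": Counter({"氣": 50, "參": 20, "加": 20, "漢": 10, "字": 10, "木": 15}),
    "zh-Hans": Counter({"气": 50, "参": 20, "加": 20, "汉": 10, "字": 10, "木": 15}),
}


def log_likelihood(text, counts, alpha=1.0, vocab_size=10000):
    """Sum of smoothed log-probabilities of each character in text.

    alpha and vocab_size are assumed add-alpha smoothing parameters.
    Whitespace is skipped, and character order plays no role.
    """
    total = sum(counts.values())
    score = 0.0
    for ch in text:
        if ch.isspace():
            continue
        score += math.log((counts[ch] + alpha) / (total + alpha * vocab_size))
    return score


def detect_by_counting(text):
    """Return the language whose character counts best explain text."""
    scores = {lang: log_likelihood(text, counts) for lang, counts in TOY_COUNTS.items()}
    return max(scores, key=scores.get)
```

<p>Because the score is just a sum over characters, a detector like this cannot tell a real word from the same characters shuffled, which is exactly the kind of behavior the tests below probe for.</p>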

<p>Google was kind enough to give us a <a href="http://googleblog.blogspot.com/2008/03/new-google-ajax-language-api-tools-for.html">language detection AJAX service</a> so we can get a sense of how their language detection works. This service also gives a &#8220;confidence&#8221; value for each detection result. For the rest of this entry, we&#8217;ll test some hypotheses against this service and conclude at the end.</p>

<h3>Do spaces matter?</h3>

<p><strong>No.</strong> While spaces are sometimes used in Japanese and Chinese writing to mark word boundaries, especially around numbers and roman letters, they are also seen on the web to encourage line breaks. It would make sense for Google&#8217;s language detection service to ignore spaces in Chinese character queries, and that does seem to be the case. All tests I ran with Chinese character queries gave the same result with the same confidence with and without spaces in random places.</p>

<h3>Does order matter?</h3>

<p><strong>No.</strong> This was slightly disappointing to see. I took the Japanese string &#8220;骨粗鬆症&#8221; (&#8216;osteoporosis&#8217;, if you&#8217;re curious), ran every permutation against the language detector, and got the same results, including the same confidence values. This is a clear indicator that Google uses only counting, not parsing, in their detector.</p>
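<p>A permutation test like this is easy to script. The sketch below is only an illustration: <code>detect</code> is a placeholder for whatever function wraps the detection service, and <code>bag_of_characters_detector</code> is a hypothetical stand-in, not the real API.</p>

```python
from itertools import permutations


def all_orderings(text):
    """Return every distinct ordering of the characters in text."""
    return sorted({"".join(p) for p in permutations(text)})


def is_order_invariant(detect, text):
    """True if detect() returns the same value for every ordering of text."""
    results = {detect(candidate) for candidate in all_orderings(text)}
    return len(results) == 1


# Hypothetical stand-in for the remote service: it only looks at which
# characters are present, so it is order-invariant by construction.
def bag_of_characters_detector(text):
    return ("ja", tuple(sorted(set(text))))
```

<p>A four-character string such as &#8220;骨粗鬆症&#8221; has 24 orderings, so checking all of them is cheap.</p>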

<h3>Does repetition matter?</h3>

<p><strong>Yes.</strong> Now that it seems that Google does not use any parsing and only uses character frequencies in identifying the source language, let&#8217;s see how repetition can affect the detection service.</p>

<p>First, I took some Chinese character strings and ran them through the detection service with different numbers of repetitions, e.g. &#8220;参加&#8221;, &#8220;参加参加&#8221;, &#8220;参加参加参加&#8221;, &#8220;参加参加参加参加&#8221;&#8230; The queries I used were the following:</p>

<table style="margin-left: auto; margin-right: auto;">
<tr><th>&nbsp;</th><th>Chinese (traditional)</th><th>Japanese</th><th>Chinese (simplified)</th></tr>
<tr><th>木</th><td>X</td><td>X</td><td>X</td></tr>
<tr><th>漢字</th><td>X</td><td>X</td><td>&nbsp;</td></tr>
<tr><th>氣</th><td>X</td><td>&nbsp;</td><td>&nbsp;</td></tr>
<tr><th>參加</th><td>X</td><td>&nbsp;</td><td>&nbsp;</td></tr>
<tr><th>参加</th><td>&nbsp;</td><td>X</td><td>X</td></tr>
<tr><th>気</th><td>&nbsp;</td><td>X</td><td>&nbsp;</td></tr>
<tr><th>气</th><td>&nbsp;</td><td>&nbsp;</td><td>X</td></tr>
</table>

<p>For each token type, the detection service made up its mind quite quickly. Its confidence, however, was more interesting.</p>

<p><center><img src="http://mitcho.com/blog/wp-content/uploads/2008/05/picture-7.png" alt="" title="repetition vs. confidence" /></center></p>

<p>Each of the confidence values dips sharply after three, five, or ten repetitions. Note, however, the length of the tokens that dipped at each of those points. I interpret this to mean that <strong>one algorithm handles strings of fewer than ten characters and another handles strings of ten or more.</strong> However, the detection service did not change its answer after this point on any of the tokens.</p>
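<p>For anyone who wants to repeat the experiment, here is a small Python sketch of how the repeated queries can be built, and of how many repetitions a token needs before it reaches a given length. The ten-character threshold is my inference from the graph, not a documented value, and the function names are made up for illustration.</p>

```python
import math


def repetitions(token, max_repeats=10):
    """Return token repeated once, twice, and so on up to max_repeats times."""
    return [token * n for n in range(1, max_repeats + 1)]


def repeats_to_reach(token, threshold=10):
    """Smallest repetition count whose total length is at least threshold.

    threshold=10 is an assumed cutoff inferred from the confidence plot.
    """
    return math.ceil(threshold / len(token))
```

<p>Under that assumption, a one-character token such as &#8220;木&#8221; reaches the threshold at ten repetitions, and a two-character token such as &#8220;参加&#8221; at five.</p>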

<p>Second, I took two characters, &#8220;簡&#8221; and &#8220;体,&#8221; and crossed different numbers of them together to see how that would affect the language detected. Note that &#8220;簡&#8221; is used in traditional Chinese and Japanese, while &#8220;体&#8221; is used in simplified Chinese and Japanese.</p>

<p><style type="text/css">
table .zh { background-color: #e3d2d2; }
table .zh-Hant { background-color: #d3e3d2; }
table .ja { background-color: #d5d2e3; }
</style></p>

<table style="margin-left:auto;margin-right:auto;">
<tr><th>&nbsp;</th><th>簡x0</th><th>簡x1</th><th>簡x2</th><th>簡x3</th><th>簡x4</th><th>簡x5</th><th>簡x6</th><th>簡x7</th><th>簡x8</th><th>簡x9</th></tr>
<tr><th>体x0</th><td>&nbsp;</td> <td class='zh'>0.995</td> <td class='zh'>0.998</td> <td class='zh'>0.998</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> </tr>
<tr><th>体x1</th><td class='zh-Hant'>0.995</td> <td class='ja'>0.998</td> <td class='ja'>0.998</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> </tr>
<tr><th>体x2</th><td class='zh-Hant'>0.998</td> <td class='ja'>0.998</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='zh'>0.531</td> </tr>
<tr><th>体x3</th><td class='zh-Hant'>0.998</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.52</td> <td class='ja'>0.568</td> </tr>
<tr><th>体x4</th><td class='zh-Hant'>0.999</td> <td class='zh-Hant'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.516</td> <td class='ja'>0.565</td> <td class='ja'>0.613</td> </tr>
<tr><th>体x5</th><td class='zh-Hant'>0.999</td> <td class='zh-Hant'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.512</td> <td class='ja'>0.561</td> <td class='ja'>0.609</td> <td class='ja'>0.657</td> </tr>
<tr><th>体x6</th><td class='zh-Hant'>0.999</td> <td class='zh-Hant'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.507</td> <td class='ja'>0.556</td> <td class='ja'>0.605</td> <td class='ja'>0.653</td> <td class='ja'>0.702</td> </tr>
<tr><th>体x7</th><td class='zh-Hant'>0.999</td> <td class='zh-Hant'>0.999</td> <td class='zh-Hant'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.502</td> <td class='ja'>0.551</td> <td class='ja'>0.6</td> <td class='ja'>0.649</td> <td class='ja'>0.697</td> <td class='ja'>0.746</td> </tr>
<tr><th>体x8</th><td class='zh-Hant'>0.999</td> <td class='zh-Hant'>0.999</td> <td class='zh-Hant'>0.999</td> <td class='ja'>1</td> <td class='ja'>0.545</td> <td class='ja'>0.595</td> <td class='ja'>0.644</td> <td class='ja'>0.693</td> <td class='ja'>0.741</td> <td class='ja'>0.79</td> </tr>
<tr><th>体x9</th><td class='zh-Hant'>0.999</td> <td class='zh-Hant'>0.999</td> <td class='zh-Hant'>1</td> <td class='ja'>0.539</td> <td class='ja'>0.589</td> <td class='ja'>0.638</td> <td class='ja'>0.687</td> <td class='ja'>0.736</td> <td class='ja'>0.785</td> <td class='ja'>0.834</td> </tr>
</table>

<table style="margin-left:auto;margin-right:auto;">
<tr><td class="ja">Japanese</td><td class='zh-Hant'>Chinese (traditional)</td><td class='zh'>Chinese (simplified)</td></tr>
</table>
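<p>The queries behind a grid like this can be generated mechanically. The following Python sketch is only an illustration: the function name is made up, and putting all copies of one character before the other is an assumption, which should not matter if order really is ignored.</p>

```python
def crossed_queries(a, b, max_count=9):
    """Map (count_a, count_b) to a query with that many copies of each character."""
    grid = {}
    for count_b in range(max_count + 1):
        for count_a in range(max_count + 1):
            if count_a == 0 and count_b == 0:
                continue  # skip the empty query, like the blank corner of the table
            grid[(count_a, count_b)] = a * count_a + b * count_b
    return grid
```

<p>Calling <code>crossed_queries("簡", "体")</code> gives the 99 non-empty combinations, each of which can then be sent to the detection service.</p>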

<h3>Conclusion</h3>

<p>For Chinese character-based languages, Google&#8217;s language detection algorithm uses simple counting rather than parsing, identifying languages by looking at the <em>frequency of characters</em> rather than the <em>frequency of words</em>. As such, the algorithm simply acts as a <strong>script detector, not a language detector.</strong> Moreover, as a simple counting method is used, duplicating characters used in one language but not another can very easily skew the resulting output.</p>

<p>As a trivial aside, it seems that Google&#8217;s algorithm is slightly different for strings shorter than ten characters, as can be seen in the dip and then rise of confidence values around the ten-character mark.</p>

<div class="footnotes">
<hr />
<ol>

<li id="fn:1">
<p>Just to complicate matters further, there&#8217;s also the issue of where you&#8217;re accessing Google from. For example, accessing from the US (or via my friend <a href="http://support.uchicago.edu/docs/network/vpn/">VPN</a>), a query for the Japanese-simplified &#8220;天気&#8221; seems to only return Japanese pages. However, accessing from Taiwan, Google assumes you may have meant the full-form &#8220;天氣&#8221;, giving you pages with both &#8220;天気&#8221; and &#8220;天氣&#8221;. As a result, Yahoo Japan weather is the first result from the US and third from Taiwan, while Yahoo Taiwan weather is first in Taiwan and doesn&#8217;t even show up from the US. This default character substitution in Taiwan is one of my least-favorite Google &#8220;features.&#8221;<br /><a rel="lightbox[google]" href='http://mitcho.com/blog/wp-content/uploads/2008/05/picture-1.png'><img class="images" src="http://mitcho.com/blog/wp-content/uploads/2008/05/picture-1-300x256.png" alt="" title="picture-1"/></a><a rel="lightbox[google]" href='http://mitcho.com/blog/wp-content/uploads/2008/05/picture-2.png'><img class="images" src="http://mitcho.com/blog/wp-content/uploads/2008/05/picture-2-300x256.png" alt="" title="picture-2"/></a><br />Similar effects can most likely be seen between the US and China. In the rest of this post, all queries will be made from the US.&#160;<a href="#fnref:1" rev="footnote">&#8617;</a></p>
</li>

</ol>
</div>
<p>Related posts:</p><ol>
<li><a href='http://mitcho.com/blog/life/taipei-find-a-dictionary-of-chinese-japanese-false-cognates/' rel='bookmark' title='Taipei find: a dictionary of Chinese-Japanese false cognates'>Taipei find: a dictionary of Chinese-Japanese false cognates</a></li>
<li><a href='http://mitcho.com/blog/life/bailey-won-the-japanese-language-speech-contest/' rel='bookmark' title='Bailey won the Japanese Language Speech Contest'>Bailey won the Japanese Language Speech Contest</a></li>
<li><a href='http://mitcho.com/blog/link/setting-language-research-to-music/' rel='bookmark' title='Setting Language Research to Music'>Setting Language Research to Music</a></li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/observation/testing-googles-language-detection/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Sign language SuperBowl ad</title>
		<link>http://mitcho.com/blog/link/sign-language-superbowl-ad/</link>
		<comments>http://mitcho.com/blog/link/sign-language-superbowl-ad/#comments</comments>
		<pubDate>Mon, 04 Feb 2008 16:40:50 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[link]]></category>
		<category><![CDATA[ads]]></category>
		<category><![CDATA[joke]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[sign language]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/2008/02/04/sign-language-superbowl-ad/</guid>
		<description><![CDATA[I don&#8217;t care much for the game, but always love checking out the SuperBowl ads every year&#8230; this year there was something really cool&#8230; a sign language ad by a deaf group at PepsiCo.1 Very cool. The crew has their own website at Pepsi too: Bob&#8217;s House. &#8220;ad&#8221;, used loosely&#8230; does this ad sell anything?&#160;&#8617; [...]
Related posts:<ol>
<li><a href='http://mitcho.com/blog/link/setting-language-research-to-music/' rel='bookmark' title='Setting Language Research to Music'>Setting Language Research to Music</a></li>
<li><a href='http://mitcho.com/blog/link/concordia-language-villages-twin-cities-expansion/' rel='bookmark' title='Concordia Language Villages&#8217; Twin Cities expansion'>Concordia Language Villages&#8217; Twin Cities expansion</a></li>
</ol>

]]></description>
			<content:encoded><![CDATA[<p>I don&#8217;t care much for the game, but always love checking out the SuperBowl ads every year&#8230; this year there was something really cool&#8230; a sign language ad by a deaf group at PepsiCo.<sup id="fnref:1"><a href="#fn:1" rel="footnote">1</a></sup> Very cool.</p>

<p><object width="425" height="355"><param name="movie" value="http://www.youtube.com/v/ffrq6cUoE5A&#038;rel=1"></param><param name="wmode" value="transparent"></param><embed src="http://www.youtube.com/v/ffrq6cUoE5A&#038;rel=1" type="application/x-shockwave-flash" wmode="transparent" width="425" height="355"></embed></object></p>

<p>The crew has their own website at Pepsi too: <a href="http://www.pepsi.com/bobshouse/">Bob&#8217;s House</a>.</p>

<div class="footnotes">
<hr />
<ol>

<li id="fn:1">
<p>&#8220;ad&#8221;, used loosely&#8230; does this ad sell anything?&#160;<a href="#fnref:1" rev="footnote">&#8617;</a></p>
</li>

</ol>
</div>
<p>Related posts:</p><ol>
<li><a href='http://mitcho.com/blog/link/setting-language-research-to-music/' rel='bookmark' title='Setting Language Research to Music'>Setting Language Research to Music</a></li>
<li><a href='http://mitcho.com/blog/link/concordia-language-villages-twin-cities-expansion/' rel='bookmark' title='Concordia Language Villages&#8217; Twin Cities expansion'>Concordia Language Villages&#8217; Twin Cities expansion</a></li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/link/sign-language-superbowl-ad/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Patricks Nortons on Tekzillaz</title>
		<link>http://mitcho.com/blog/observation/patricks-nortons-on-tekzillaz/</link>
		<comments>http://mitcho.com/blog/observation/patricks-nortons-on-tekzillaz/#comments</comments>
		<pubDate>Wed, 09 Jan 2008 15:42:49 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[observation]]></category>
		<category><![CDATA[California]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[plural]]></category>
		<category><![CDATA[tech]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/2008/01/09/patricks-nortons-on-tekzillaz/</guid>
		<description><![CDATA[I just noticed something on the latest Tekzilla Daily: Patrick Norton, host of Tekzilla and former host of the Screen Savers says &#8220;there&#8217;s a lots to learn here&#8221; (1:28) and then later &#8220;the site you&#8217;re having troubles with&#8221; (1:39). While &#8220;having troubles with&#8230;&#8221; is fine, I believe &#8220;having trouble with&#8230;&#8221; is much more common. As [...]
]]></description>
			<content:encoded><![CDATA[<p>I just noticed something on the latest <a href="http://revision3.com/tekzilla/tzdaily/2008-01-09ping">Tekzilla Daily</a>: <a href="http://en.wikipedia.org/wiki/Patrick Norton">Patrick Norton</a>, host of <a href="http://www.tekzilla.com">Tekzilla</a> and former host of <a href="http://en.wikipedia.org/wiki/the Screen Savers">the Screen Savers</a> says &#8220;there&#8217;s a lot<em>s</em> to learn here&#8221; (1:28) and then later &#8220;the site you&#8217;re having trouble<em>s</em> with&#8221; (1:39). While &#8220;having troubles with&#8230;&#8221; is fine, I believe &#8220;having trouble with&#8230;&#8221; is much more common. As for &#8220;a lots to learn,&#8221; however, that&#8217;s definitely out. Is it hyperarticulation? I don&#8217;t know.</p>

<p>Wikipedia notes: &#8220;Norton grew up in the <a href="http://en.wikipedia.org/wiki/Midwest">Midwest</a>, but considers the <a href="http://en.wikipedia.org/wiki/Jersey Shore">Jersey Shore</a> his home&#8230; He currently lives in <a href="http://en.wikipedia.org/wiki/San Francisco, California">San Francisco, California</a>.&#8221; So, is this a Jersey Shore or California thing? I have no idea.</p>
]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/observation/patricks-nortons-on-tekzillaz/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Setting Language Research to Music</title>
		<link>http://mitcho.com/blog/link/setting-language-research-to-music/</link>
		<comments>http://mitcho.com/blog/link/setting-language-research-to-music/#comments</comments>
		<pubDate>Mon, 24 Dec 2007 08:35:51 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[link]]></category>
		<category><![CDATA[art]]></category>
		<category><![CDATA[babies]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[language acquisition]]></category>
		<category><![CDATA[music]]></category>
		<category><![CDATA[University of Chicago]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/2007/12/24/setting-language-research-to-music/</guid>
		<description><![CDATA[Via LinguistList: &#8216;Setting Language Research to Music&#8217; is a Newcastle University project whose aim is to compose orchestra and choral music to demonstrate infant perception and production. The first piece of music to emerge from the project, &#8216;Swing Cycle&#8217;, mimics babies&#8217; experience of discovering word boundaries, taking work by Peter Jusczyk and colleagues as a [...]
Related posts:<ol>
<li><a href='http://mitcho.com/blog/link/concordia-language-villages-twin-cities-expansion/' rel='bookmark' title='Concordia Language Villages&#8217; Twin Cities expansion'>Concordia Language Villages&#8217; Twin Cities expansion</a></li>
<li><a href='http://mitcho.com/blog/life/krashen-the-party/' rel='bookmark' title='Krashen The Party'>Krashen The Party</a></li>
</ol>

]]></description>
			<content:encoded><![CDATA[<p>Via <a href="http://www.linguistlist.org">LinguistList</a>:</p>

<blockquote>
  <p>&#8216;Setting Language Research to Music&#8217; is a Newcastle University project whose aim 
  is to compose orchestra and choral music to demonstrate infant perception and 
  production. The first piece of music to emerge from the project, &#8216;Swing Cycle&#8217;, 
  mimics babies&#8217; experience of discovering word boundaries, taking work by Peter 
  Jusczyk and colleagues as a starting point.</p>
</blockquote>

<p>It&#8217;s the craziest thing I&#8217;ve seen in a long while&#8230; it reminds me of the Music: Materials and Design course I took a couple of years ago. My final project was an electronic composition building a rhythm with political speech samples and echoes and cracking noises, representing the hollowness of political rhetoric. It was one of my academic low points at Chicago, for sure.</p>

<p>Maybe it&#8217;s because I&#8217;m an artist, but I&#8217;ve never understood the drive for modern art, including compositions like these. I would much rather listen to some music and read about language acquisition separately&#8230; the motivation to combine the two eludes me.</p>

<p>You can listen to The Swing Cycle and read the lyrics (or their approximation) on the <a href="http://www.ncl.ac.uk/elll/news/item?setting-language-research-to-music-the-swing-cycle">Setting Language Research to Music website</a>.</p>
<p>Related posts:</p><ol>
<li><a href='http://mitcho.com/blog/link/concordia-language-villages-twin-cities-expansion/' rel='bookmark' title='Concordia Language Villages&#8217; Twin Cities expansion'>Concordia Language Villages&#8217; Twin Cities expansion</a></li>
<li><a href='http://mitcho.com/blog/life/krashen-the-party/' rel='bookmark' title='Krashen The Party'>Krashen The Party</a></li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/link/setting-language-research-to-music/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Eats, shoots, and leaves</title>
		<link>http://mitcho.com/blog/observation/eats-shoots-and-leaves/</link>
		<comments>http://mitcho.com/blog/observation/eats-shoots-and-leaves/#comments</comments>
		<pubDate>Mon, 17 Dec 2007 08:13:41 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[observation]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[punctuation]]></category>
		<category><![CDATA[The West Wing]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/2007/12/17/the-damned-comma/</guid>
		<description><![CDATA[I just read Clause and Effect (via DF), a great editorial discussing commas in the second amendment and their effects on interpretation of the law. I found this timely as Bailey and I just watched Institutional Memory, the penultimate episode of The West Wing, where Toby Ziegler discusses a comma in the fifth amendment&#8217;s takings [...]
]]></description>
			<content:encoded><![CDATA[<p>I just read <a href="http://www.nytimes.com/2007/12/16/opinion/16freedman.html">Clause and Effect</a> (via <a href="http://www.daringfireball.net">DF</a>), a great editorial discussing commas in <a href="http://en.wikipedia.org/wiki/Second_Amendment_to_the_United_States_Constitution">the second amendment</a> and their effects on interpretation of the law. I found this timely as Bailey and I just watched <a href="http://en.wikipedia.org/wiki/Institutional_Memory">Institutional Memory</a>, the penultimate episode of <a href="http://www.amazon.com/gp/redirect.html?ie=UTF8&#038;location=http%3A%2F%2Fwww.amazon.com%2Fgp%2Fentity%2FThe-West-Wing%2FB001CFVZI8%3Fie%3DUTF8%26%252AVersion%252A%3D1%26%252Aentries%252A%3D0&#038;tag=mitchocom-20&#038;linkCode=ur2&#038;camp=1789&#038;creative=390957">The West Wing</a>, where Toby Ziegler discusses a comma in the <a href="http://en.wikipedia.org/wiki/Fifth_Amendment_to_the_United_States_Constitution">fifth amendment</a>&#8217;s takings clause: &#8220;nor shall private property be taken for public use[,] without just compensation.&#8221; BBC&#8217;s H2G2 has <a href="http://www.bbc.co.uk/dna/h2g2/A28880382">a pretty good write-up</a> and there&#8217;s <a href="http://westwing.bewarne.com/whowhatwhere/comma.html">a listing of relevant links</a> as well.</p>

<p>The funny thing about all of this is that we don&#8217;t <em>speak</em> commas. They are used to graphically represent pauses in speech, but are often placed according to certain artificial rules which, when applied systematically, aim to help the reader parse the sentence or disambiguate between different readings.<sup id="fnref:1"><a href="#fn:1" rel="footnote">1</a></sup></p>

<p>I&#8217;m surprised <a href="http://itre.cis.upenn.edu/~myl/languagelog/">Language Log</a> hasn&#8217;t picked up this new piece yet. UPDATE: Yup, they got to it. <a href="http://itre.cis.upenn.edu/~myl/languagelog/archives/005229.html">Great coverage</a>, as always.</p>

<div class="footnotes">
<hr />
<ol>

<li id="fn:1">
<p>We use pauses in spoken language to do this too, but not necessarily in the same places that we place commas in &#8220;good&#8221; written language.&#160;<a href="#fnref:1" rev="footnote">&#8617;</a></p>
</li>

</ol>
</div>
]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/observation/eats-shoots-and-leaves/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

