<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>mitcho.com &#187; Mandarin</title>
	<atom:link href="http://mitcho.com/blog/tag/mandarin/feed/" rel="self" type="application/rss+xml" />
	<link>http://mitcho.com</link>
	<description></description>
	<lastBuildDate>Thu, 29 Jul 2010 19:14:00 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Spring is for Speaking: JSConf, WordCamp SF, IACL</title>
		<link>http://mitcho.com/blog/projects/spring-is-for-speaking/</link>
		<comments>http://mitcho.com/blog/projects/spring-is-for-speaking/#comments</comments>
		<pubDate>Sat, 20 Mar 2010 04:37:04 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[life]]></category>
		<category><![CDATA[projects]]></category>
		<category><![CDATA[Boston]]></category>
		<category><![CDATA[Chinese language]]></category>
		<category><![CDATA[code]]></category>
		<category><![CDATA[conference]]></category>
		<category><![CDATA[harvard]]></category>
		<category><![CDATA[JavaScript]]></category>
		<category><![CDATA[Jetpack]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[Mandarin]]></category>
		<category><![CDATA[Mozilla Planet]]></category>
		<category><![CDATA[San Francisco]]></category>
		<category><![CDATA[talk]]></category>
		<category><![CDATA[Washington D.C.]]></category>
		<category><![CDATA[WordCamp]]></category>
		<category><![CDATA[WordPress]]></category>
		<category><![CDATA[WordPress Planet]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=3448</guid>
		<description><![CDATA[I recently confirmed three different very exciting speaking gigs which I&#8217;ll be doing this spring: JSConf.us: I&#8217;ll be putting my Mozilla Jetpack Ambassador hat on to represent Mozilla Labs&#8217; Jetpack project at the premier Javascript conference in North America, JSConf.us, which this year will be April 17-18 in Washington D.C. and has a pirate theme.1 [...]


Related posts:<ol><li><a href='http://mitcho.com/blog/life/wordcamp-boston-2010/' rel='bookmark' title='Permanent Link: WordCamp Boston 2010'>WordCamp Boston 2010</a></li>
<li><a href='http://mitcho.com/blog/life/travel/linguistics-in-%e5%98%89%e7%be%a9/' rel='bookmark' title='Permanent Link: Linguistics in 嘉義'>Linguistics in 嘉義</a></li>
<li><a href='http://mitcho.com/blog/projects/mashing-up-the-browser-in-maine/' rel='bookmark' title='Permanent Link: Mashing up the browser in Maine'>Mashing up the browser in Maine</a></li>
</ol>

Related posts brought to you by <a href='http://mitcho.com/code/yarpp/'>Yet Another Related Posts Plugin</a>.]]></description>
			<content:encoded><![CDATA[<p>I recently confirmed three different very exciting speaking gigs which I&#8217;ll be doing this spring:</p>

<p><span id="more-3448"></span></p>

<p><strong>JSConf.us</strong>:</p>

<p>I&#8217;ll be putting my Mozilla Jetpack Ambassador hat on to represent Mozilla Labs&#8217; <a href="https://jetpack.mozillalabs.com/">Jetpack project</a> at the premier JavaScript conference in North America, <a href="http://jsconf.us/2010/">JSConf.us</a>, which this year will be April 17-18 in Washington D.C. and has a pirate theme.<sup id="fnref:2"><a href="#fn:2" rel="footnote">1</a></sup> I&#8217;ll be giving a short talk in the main session and will also lead a hands-on Jetpack workshop in the hacker lounge. I&#8217;ve heard that JSConf is a lot of fun and I&#8217;m really looking forward to it! <img src='http://mitcho.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>

<p><strong>WordCamp San Francisco</strong>:</p>

<p>I&#8217;m honored to have been invited to give a talk at <a href="http://2010.sf.wordcamp.org/">WordCamp San Francisco 2010</a>. WordCamps are community-organized events for the <a href="http://wordpress.org">WordPress</a> community, and the San Francisco WordCamp is the original and biggest. WordCamp SF will be at the Mission Bay Conference Center on May 1st. <a href="https://2010.sf.wordcamp.org/tickets/">Tickets available</a>.</p>

<p>My talk is tentatively titled &#8220;Abstract Your Code.&#8221;<sup id="fnref:1"><a href="#fn:1" rel="footnote">2</a></sup> WordPress is a great platform to build amazing content-rich applications on, and many of us have written new functionality in the form of plugins. I hope to encourage developers to make their code more portable and reusable after the project is done—or, ideally, to even start with abstraction in mind—to add to the &#8220;life&#8221; of the code and to consider then open-sourcing that functionality.</p>

<p>Hope to see you there!</p>

<p><strong>International Association of Chinese Linguistics (IACL) 18</strong>:</p>

<p>Finally, I&#8217;m thrilled to say that I got a paper accepted to the <a href="http://www.fas.harvard.edu/~iacl18/Site/index.html">annual meeting of the International Association of Chinese Linguistics</a> which this year is at Harvard on May 20-22. IACL is <em>the</em> big conference for Chinese linguistics, with about <a href="http://www.fas.harvard.edu/~IACL18/AcceptList.pdf">180 papers presenting</a>. I&#8217;ll be presenting <em>Two</em> Only<em>s in Mandarin Chinese</em>, my recent work on the formal syntax/semantics of two <em>only</em> words in Chinese: <em>zhǐ</em> (只) and <em>éryǐ</em> (而已). I&#8217;ve put up <a href="http://mitcho.com/academic/handout-20100226.pdf">a handout</a> of some of this material in work-in-progress form which I recently presented at <a href="http://people.fas.harvard.edu/~nicolae/SNEWS_2010/Welcome.html">SNEWS</a>.</p>

<div class="footnotes">
<hr />
<ol>

<li id="fn:2">
<p>I&#8217;ll <a href="http://beijinghuar.com">fit right in</a>.&#160;<a href="#fnref:2" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:1">
<p>Sexier title suggestions welcome.&#160;<a href="#fnref:1" rev="footnote">&#8617;</a></p>
</li>

</ol>
</div>


<p>Related posts:<ol><li><a href='http://mitcho.com/blog/life/wordcamp-boston-2010/' rel='bookmark' title='Permanent Link: WordCamp Boston 2010'>WordCamp Boston 2010</a></li>
<li><a href='http://mitcho.com/blog/life/travel/linguistics-in-%e5%98%89%e7%be%a9/' rel='bookmark' title='Permanent Link: Linguistics in 嘉義'>Linguistics in 嘉義</a></li>
<li><a href='http://mitcho.com/blog/projects/mashing-up-the-browser-in-maine/' rel='bookmark' title='Permanent Link: Mashing up the browser in Maine'>Mashing up the browser in Maine</a></li>
</ol></p>
<p>Related posts brought to you by <a href='http://mitcho.com/code/yarpp/'>Yet Another Related Posts Plugin</a>.</p>]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/projects/spring-is-for-speaking/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Exploring Command Chaining in Ubiquity: Part 2</title>
		<link>http://mitcho.com/blog/projects/exploring-command-chaining-in-ubiquity-part-2/</link>
		<comments>http://mitcho.com/blog/projects/exploring-command-chaining-in-ubiquity-part-2/#comments</comments>
		<pubDate>Sun, 23 Aug 2009 23:14:07 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[projects]]></category>
		<category><![CDATA[Chinese]]></category>
		<category><![CDATA[code]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[Japanese language]]></category>
		<category><![CDATA[Mandarin]]></category>
		<category><![CDATA[Mozilla Planet]]></category>
		<category><![CDATA[natural syntax]]></category>
		<category><![CDATA[serial verb construction]]></category>
		<category><![CDATA[syntax]]></category>
		<category><![CDATA[ubiquity]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=2799</guid>
		<description><![CDATA[Introduction I recently have begun giving serious thought to what command chaining might look like in Ubiquity and the various considerations which must be made to make it happen. The &#8220;command chaining,&#8221; or &#8220;piping,&#8221; described here always involves (at least) two verbs acting sequentially on a passed target—that is, the first command performs some action [...]


Related posts:<ol><li><a href='http://mitcho.com/blog/projects/exploring-command-chaining-in-ubiquity-part-1/' rel='bookmark' title='Permanent Link: Exploring Command Chaining in Ubiquity: Part 1'>Exploring Command Chaining in Ubiquity: Part 1</a></li>
<li><a href='http://mitcho.com/blog/link/command-chaining-with-oni/' rel='bookmark' title='Permanent Link: Command Chaining with Oni?'>Command Chaining with Oni?</a></li>
<li><a href='http://mitcho.com/blog/link/ubiquity-in-italian/' rel='bookmark' title='Permanent Link: Ubiquity in Italian'>Ubiquity in Italian</a></li>
</ol>

Related posts brought to you by <a href='http://mitcho.com/code/yarpp/'>Yet Another Related Posts Plugin</a>.]]></description>
			<content:encoded><![CDATA[<h3>Introduction</h3>

<p>I recently have begun giving serious thought to what <strong>command chaining</strong> might look like in Ubiquity and the various considerations which must be made to make it happen. The &#8220;command chaining,&#8221; or &#8220;piping,&#8221; described here always involves (at least) two verbs acting sequentially on a passed target—that is, the first command performs some action or lookup and the second command acts on the first command&#8217;s output.</p>

<p>A few days ago I penned some initial <a href="http://mitcho.com/blog/projects/exploring-command-chaining-in-ubiquity-part-1/">technical considerations regarding command chaining</a>. In this post I&#8217;ll point out some linguistic considerations involved in supporting a <a href="http://mitcho.com/blog/projects/how-natural-should-a-natural-interface-be/">natural syntax</a> for chaining.</p>

<p><span id="more-2799"></span></p>

<h3>Simple syntaxes: sequential vs embedding strategies</h3>

<p>When it comes to creating a natural language interface, there&#8217;s always a decision to make between requiring a certain kind of input, or working a little harder to understand the user&#8217;s natural input. From an implementation point of view, adopting certain programmatic conventions is of course simpler and to this end, there have been a couple different &#8220;unnatural&#8221; command chaining syntaxes suggested. While these both go against Ubiquity&#8217;s basic tenet of <a href="http://mitcho.com/blog/projects/how-natural-should-a-natural-interface-be/">natural syntax</a> — that is, to not introduce rules which contradict the user&#8217;s natural language — which gives Ubiquity its strengths of usability and memorability, I&#8217;ll entertain them here as they illustrate two different structural relationships that we will want to consider.</p>

<p><a href='http://www.threadless.com/product/543/This_is_not_a_Pipe?streetteam=mitcho'><img src="http://mitcho.com/blog/wp-content/uploads/2009/08/not-pipe.gif" alt="not-pipe.gif" border="0" width="480" height="329" /></a></p>

<p>The first suggestion is to adopt the <a href="http://en.wikipedia.org/wiki/Pipeline_(Unix)">shell pipe</a> (|), which would lead to input such as</p>


<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>1
</pre></td><td class="code"><pre class="ubiquity" style="font-family:monospace;">translate hello to Spanish | email to Jono</pre></td></tr></table></div>


<p>While this itself is pretty unnatural unless you speak shell, note that this syntax is similar to the more natural &#8220;, and&#8221; syntax, yielding <code>translate hello to Spanish, and email to Jono</code>, which we will consider below. I&#8217;ll refer to this strategy as the <strong>sequential</strong> strategy.</p>
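<p>As a rough illustration of the sequential strategy with an explicit delimiter, here is a minimal Python sketch. The function name <code>split_chain</code> and the choice of delimiters are assumptions made for this example; nothing here is part of Ubiquity itself.</p>

```python
import re

# Hypothetical delimiters for this sketch: the shell-style pipe
# and the more natural ", and".
DELIMITER = re.compile(r"\s*\|\s*|\s*,\s*and\s+")


def split_chain(text: str) -> list[str]:
    """Split a chained input into its individual commands, in order."""
    return [part.strip() for part in DELIMITER.split(text) if part.strip()]


piped = split_chain("translate hello to Spanish | email to Jono")
natural = split_chain("translate hello to Spanish, and email to Jono")
```

<p>Both inputs come out as the same two commands, in execution order, which is what makes the pipe and the &#8220;, and&#8221; syntax structurally equivalent.</p>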

<p>Another <a href="http://www.croczilla.com/blog/16">very interesting proposal</a> by Alex Fritze is to embed each subordinate computation into an argument position, marked by parentheses. This could also be parsed relatively straightforwardly by writing a noun type which first checks for parentheses and then runs the content of the argument through another <a href="http://ubiquity.mozilla.com/trac/ticket/532">ParseQuery</a>.</p>


<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>2
</pre></td><td class="code"><pre class="ubiquity" style="font-family:monospace;">email (translate hello to Spanish) to Jono</pre></td></tr></table></div>


<p>I&#8217;ll refer to this pattern as the <strong>embedding</strong> strategy.</p>
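<p>To make the embedding strategy concrete, here is a minimal Python sketch that turns parenthesised input into a nested structure. It is only an illustration under my own assumptions: <code>parse_embedded</code> is a made-up name, it splits on spaces, and it does not reflect how a Ubiquity noun type or ParseQuery actually works.</p>

```python
def parse_embedded(text: str) -> list:
    """Parse 'email (translate hello to Spanish) to Jono' into a nested list.

    Each parenthesised span becomes a nested list, standing in for a
    sub-command whose output would fill that argument position.
    """
    stack = [[]]
    token = ""
    for ch in text:
        if ch in "() ":
            # A space or parenthesis ends the current word, if any.
            if token:
                stack[-1].append(token)
                token = ""
            if ch == "(":
                stack.append([])
            elif ch == ")":
                if len(stack) == 1:
                    raise ValueError("unbalanced ')'")
                inner = stack.pop()
                stack[-1].append(inner)
        else:
            token += ch
    if token:
        stack[-1].append(token)
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]


tree = parse_embedded("email (translate hello to Spanish) to Jono")
```

<p>The inner list would have to be evaluated first, and its output substituted into the outer command, before the outer command could run.</p>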

<h3>Sequential and embedding strategies in natural language</h3>

<p>What&#8217;s interesting about the two proposals above is that both strategies are seen in natural language. The sequential strategy could correspond to the following linguistic phenomena:</p>

<ol>
<li><a href="http://en.wikipedia.org/wiki/coordination (linguistics)">coordination</a>: a non-hierarchical joining of two or more <a href="http://en.wikipedia.org/wiki/clauses (linguistics)">clauses</a>, often marked by a <a href="http://en.wikipedia.org/wiki/conjunction">conjunction</a>. Here&#8217;s an example from English:

<ul>
<li>&#8220;[I made a sandwich] and [you will eat it]&#8221; where [] represent clause boundaries. Here, &#8220;and&#8221; is the conjunction.</li>
</ul></li>
<li><a href="http://en.wikipedia.org/wiki/serial verb construction">serial verb</a> and <a href="http://en.wikipedia.org/wiki/converb">converb</a> constructions: a joining of multiple verbs or verb phrases within a single clause, with shared subject and tense/aspect values, with no particular conjugation or delimiter between them. Such constructions are common in many African and east Asian languages. Here are two examples:<sup id="fnref:1"><a href="#fn:1" rel="footnote">1</a></sup><br/></li>
</ol>

<ul><li>A converbal construction in Japanese:<br/>


<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>3
4
5
</pre></td><td class="code"><pre class="ja" style="font-family:monospace;">僕は     サンドイッチを 作って     食べる
boku-wa sandiʔchi-o  tsuku-ʔte tabe-ru
I-TOP   sandwich-ACC make-CON  eat</pre></td></tr></table></div>


<br/>&#8220;I (will) make a sandwich and eat [it].&#8221; (Here, <code>TOP</code> = topic, <code>ACC</code> = accusative, <code>CON</code> = converbal ending)</li>
<li>A serial verb construction in Mandarin Chinese:<br/>


<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>6
7
8
</pre></td><td class="code"><pre class="zh" style="font-family:monospace;">我 作   三明治      吃
wǒ zuò  sānmíngzhì chī
I  make sandwich   eat</pre></td></tr></table></div>


<br/>&#8220;I (will) make a sandwich and eat [it]&#8221; or &#8220;I (will) make a sandwich [in order to] eat [it].&#8221;</li></ul>

<p><br/>Note that in both the converb and serial verb constructions, the second verb (eat) shares its object (sandwich) with the first verb, so there is no need for a pronoun such as &#8220;it&#8221; to introduce that argument, as there is with coordination above.<sup id="fnref:2"><a href="#fn:2" rel="footnote">2</a></sup></p>

<p>The embedding strategy is observed in natural language as well, in the form of the following phenomena:</p>

<ol>
<li><a href="http://en.wikipedia.org/wiki/embedded clauses">embedded clauses</a>: a sentence is itself the argument of another verb. Example:<br/></li>
</ol>

<ul><li>&#8220;John says [he likes sandwiches].&#8221;</li></ul>

<p><br/>Embedded clauses, however, clearly have no relation to command chaining and do not require our attention.</p>

<ol start="2">
<li><a href="http://en.wikipedia.org/wiki/relative clauses">relative clauses</a>: a partial sentence<sup id="fnref:5"><a href="#fn:5" rel="footnote">3</a></sup> is attached to a noun in order to describe it or distinguish it from other possible referents. Example:<br/></li>
</ol>

<ul><li>&#8220;You ate the sandwich that I made&#8221; where &#8220;sandwich&#8221; is called the &#8220;head&#8221; of the relative clause, and &#8220;I made&#8221; is what I here call the &#8220;partial sentence&#8221; (see footnote). The &#8220;relative clause&#8221; is used here to distinguish &#8220;the sandwich that I made&#8221; from other sandwiches.</li></ul>

<h3>The natural syntax of chaining</h3>

<p>So <strong>which strategy is used in complex natural language commands:</strong> the sequential strategy or the embedding strategy? Both the sequential strategy and embedding strategy can be involved with commands:</p>


<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>9
10
</pre></td><td class="code"><pre class="en" style="font-family:monospace;">[Make a sandwich] and [eat it]!
Eat (the sandwich that I made)!</pre></td></tr></table></div>


<p>These two commands do not mean the same thing, though, and only (9) is the kind of command we would want to give Ubiquity. The problem with a relative clause, as in (10), is that it <em><a href="http://en.wikipedia.org/wiki/presupposition">presupposes</a> the existence of the sandwich in the context</em>. If we both know you just made a sandwich, saying (10) is perfectly appropriate, but out of the blue it doesn&#8217;t make sense. For this reason, <strong>only the sequential strategy is used in the natural syntax of chaining.</strong></p>

<h3>Parsing the sequential strategy</h3>

<p>In natural language, unlike the initial simple proposals laid out above, there is often no clear delimiter marking the boundary between the two parts in a sequential relation (e.g. examples (3) and (6) above, particularly given that neither Japanese nor Chinese normally separates words with spaces). <strong>How would we parse a sequential string of commands?</strong></p>

<p>Let&#8217;s assume for our purposes here that we can find all verbs within the input string.<sup id="fnref:4"><a href="#fn:4" rel="footnote">4</a></sup> Parsing a sequential strategy string is not particularly difficult if we can also assume that commands in any particular language are either always verb-initial or always verb-final. Let&#8217;s look at both cases:</p>

<ul>
<li>Always verb-initial: Mandarin Chinese:<br/></li>
</ul>


<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>11
12
13
</pre></td><td class="code"><pre class="zh" style="font-family:monospace;">翻譯       hello 到  西班牙語   送    給  Juanito
fānyì     hello dào xībānyáyǔ sòng gěi Juanito
translate hello to  Spanish   send to  Juanito</pre></td></tr></table></div>


<p><br/>
&#8220;Translate hello to Spanish [and] send [it] to Juanito&#8221;</p>

<ol>
<li>find every possible verb:<br/><strong>翻譯</strong>hello到西班牙語<strong>送</strong>給Juanito</li>
<li>as every verb marks the beginning of a sentence, we now have our two commands: &#8220;<strong>翻譯</strong>hello到西班牙語&#8221; (translate hello to Spanish) and &#8220;<strong>送</strong>給Juanito&#8221; (send to Juanito).</li>
</ol>

<ul>
<li>Always verb-final: Japanese:<br/></li>
</ul>


<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>14
15
16
</pre></td><td class="code"><pre class="ja" style="font-family:monospace;">helloを スペイン語に 訳して Juanitoに 送って
hello-o supeingo-ni yakus-ite Juanito-ni oku-ʔte
hello-ACC Spanish-DAT translate-CON Juanito-DAT send-CON</pre></td></tr></table></div>


<p><br/>
&#8220;Translate hello to Spanish [and] send [it] to Juanito&#8221;</p>

<ol>
<li>find every possible verb:<br/>helloをスペイン語に<strong>訳して</strong>Juanitoに<strong>送って</strong></li>
<li>as every verb marks the end of a sentence, we now have our two commands: &#8220;helloをスペイン語に<strong>訳して</strong>&#8221; (translate hello to Spanish) and &#8220;Juanitoに<strong>送って</strong>&#8221; (send to Juanito).</li>
</ol>
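<p>The two walkthroughs above can be sketched in a few lines of Python. This is a naive illustration under stated assumptions: the list of known verbs is supplied by hand, <code>split_on_verbs</code> and its parameters are names invented for this example (not Parser 2&#8217;s API), and a verb string that happens to occur inside a noun would be split incorrectly.</p>

```python
import re


def split_on_verbs(text: str, verbs: list[str], verb_final: bool) -> list[str]:
    """Cut a delimiter-free command chain at the known verbs.

    verb_final=False: every verb starts a command (e.g. Mandarin).
    verb_final=True:  every verb ends a command (e.g. Japanese).
    """
    # Try longer verbs first so a long verb wins over a shorter prefix.
    ordered = sorted(verbs, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(v) for v in ordered))
    matches = list(pattern.finditer(text))
    if not matches:
        return [text]
    if verb_final:
        # Cut right after every verb except the last one.
        inner = [m.end() for m in matches[:-1]]
    else:
        # Cut right before every verb except the first one.
        inner = [m.start() for m in matches[1:]]
    bounds = [0] + inner + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]


mandarin = split_on_verbs("翻譯hello到西班牙語送給Juanito", ["翻譯", "送"], verb_final=False)
japanese = split_on_verbs("helloをスペイン語に訳してJuanitoに送って", ["訳して", "送って"], verb_final=True)
```

<p>In both cases the result is the same pair of commands identified in step 2 of each walkthrough.</p>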

<p>For languages where there is a clear conjunction between the two commands, such as English &#8220;and&#8221;, we can use that conjunction as a delimiter as well. We then simply execute the first command and then execute the second with the first command&#8217;s output in its interpolation context. This way the output of the first command will be picked up whether the second command contains an overt pronoun such as &#8220;it&#8221; or not, as in the Chinese and Japanese examples above.<sup id="fnref:6"><a href="#fn:6" rel="footnote">5</a></sup></p>
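<p>Here is a minimal Python sketch of that execution step: each command receives the previous command&#8217;s output as its context and uses it when its object is &#8220;it&#8221; or is missing altogether. Everything in it is hypothetical, including the toy <code>translate</code> and <code>email</code> commands, the one-entry dictionary, and the assumption that arguments are separated by the word &#8220;to&#8221;.</p>

```python
from typing import Optional

# Toy data standing in for a real translation service.
TOY_DICTIONARY = {("hello", "Spanish"): "hola"}


def split_object(args: str) -> tuple[str, str]:
    """Split 'hello to Spanish' into ('hello', 'Spanish'); the object may be empty."""
    words = args.split()
    i = words.index("to")
    return " ".join(words[:i]), " ".join(words[i + 1:])


def resolve_object(obj: str, context: Optional[str]) -> str:
    """Fill the object from the context when it is 'it' or left out."""
    if obj in ("", "it") and context is not None:
        return context
    return obj


def translate(args: str, context: Optional[str]) -> str:
    obj, language = split_object(args)
    obj = resolve_object(obj, context)
    return TOY_DICTIONARY.get((obj, language), obj)


def email(args: str, context: Optional[str]) -> str:
    obj, recipient = split_object(args)
    return f"email to {recipient}: {resolve_object(obj, context)}"


def run_chain(steps) -> Optional[str]:
    """Run (command, args) pairs in order, passing each output to the next."""
    context = None
    for command, args in steps:
        context = command(args, context)
    return context


without_pronoun = run_chain([(translate, "hello to Spanish"), (email, "to Jono")])
with_pronoun = run_chain([(translate, "hello to Spanish"), (email, "it to Jono")])
```

<p>Both chains end with the translated word being emailed, so the overt pronoun and the empty object position are handled by the same mechanism.</p>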

<p>The only potential problem with this approach is in the case of languages where some commands are verb-initial while others are verb-final. I noted in a previous blog post, <a href="http://mitcho.com/blog/observation/wheres-the-verb/">Where&#8217;s The Verb</a>, that such languages do exist. In these languages, commands can be expressed by more than one verb form (such as infinitive, imperative, subjunctive, etc.) and some of those verb forms are sentence-initial while others are sentence-final. Here&#8217;s one such example from German:</p>

<p>&#8220;search hello with google&#8221; (German)</p>

<ol>
<li>Infinitive: hello mit google suchen</li>
<li>Imperative: suche hello mit google</li>
</ol>

<p>Here the verb for &#8220;search&#8221; is &#8220;suchen&#8221; (infinitive) or &#8220;suche&#8221; (imperative). I know that this same type of phenomenon occurs in other Germanic languages such as Dutch, with infinitive and imperative forms, and also in other languages such as Modern Greek, with infinitive and subjunctive forms. <strong>If you are a speaker of one of these languages (German, Dutch, Greek, etc.) I would love to know whether you can chain verb-final and verb-initial commands together.</strong></p>

<h3>Conclusion</h3>

<p>In this blog post I examined command chaining in natural language, focusing on data from English, Mandarin Chinese, and Japanese, which exhibit three linguistically different approaches to chaining. What we found is that the sequential strategy—that of listing the commands one by one, in order of execution—is what is used in natural languages, rather than any sort of embedding. This fact, combined with the fact that our parser can recognize every available verb, offers a simple approach to doing a naive parse of natural command chains in most languages, even without explicit delimiters.</p>

<p>In a final installment of this series on &#8220;exploring command chaining,&#8221; I hope to consider how the Ubiquity interface itself could present command chains and aid in their input.</p>

<div class="footnotes">
<hr />
<ol>

<li id="fn:1">
<p>The distinction between serial verb and converb constructions (as well as other forms of complex predication) hinges on structural distinctions which are not of importance for our purposes.&#160;<a href="#fnref:1" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:2">
<p>Some people (<a href="http://www.jstor.org/stable/4178644">Baker 1989</a> and others), in fact, list this object sharing as a necessary part of the notion of a &#8220;serial verb construction.&#8221;&#160;<a href="#fnref:2" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:5">
<p>&#8220;Partial sentence&#8221; is used in a descriptive sense here to reflect that the relative clause, such as &#8220;I made&#8221; in the example given, cannot stand as its own sentence, as the verb&#8217;s argument is clearly missing. This type of pattern is also seen in questions (&#8220;What did [you make]?&#8221;) and topicalization (&#8220;That sandwich, [I made].&#8221;) and is a great focus of theoretical linguistics research. See <a href="http://en.wikipedia.org/wiki/wh-movement">wh-movement</a> on Wikipedia for more examples and information on theoretical approaches to such constructions.&#160;<a href="#fnref:5" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:4">
<p>We don&#8217;t do this right now as there hasn&#8217;t been a use for it: at the moment <a href="https://wiki.mozilla.org/Labs/Ubiquity/Parser_2">Parser 2</a> simply looks for known verbs at the beginning and end of the input. The parser does build a nice regular expression to find known verbs, however, so finding verbs input-medially would also be easy to do.&#160;<a href="#fnref:4" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:6">
<p>Note that even though the linguistic relation between the two commands is non-hierarchical, we interpret the sentences to mean &#8220;translate hello to Spanish and <em>then</em> email it to Juanito&#8221;, rather than &#8220;translate hello to Spanish and email it (hello) to Juanito <em>at the same time</em>.&#8221; This observed universal property that, ceteris paribus, the linear speech order of verbs reflects the conceptual order of events is known as the Temporal Iconicity Condition (<a href="http://www.jstor.org/pss/416696">Li 1993</a> and others).&#160;<a href="#fnref:6" rev="footnote">&#8617;</a></p>
</li>

</ol>
</div>


<p>Related posts:<ol><li><a href='http://mitcho.com/blog/projects/exploring-command-chaining-in-ubiquity-part-1/' rel='bookmark' title='Permanent Link: Exploring Command Chaining in Ubiquity: Part 1'>Exploring Command Chaining in Ubiquity: Part 1</a></li>
<li><a href='http://mitcho.com/blog/link/command-chaining-with-oni/' rel='bookmark' title='Permanent Link: Command Chaining with Oni?'>Command Chaining with Oni?</a></li>
<li><a href='http://mitcho.com/blog/link/ubiquity-in-italian/' rel='bookmark' title='Permanent Link: Ubiquity in Italian'>Ubiquity in Italian</a></li>
</ol></p>
<p>Related posts brought to you by <a href='http://mitcho.com/code/yarpp/'>Yet Another Related Posts Plugin</a>.</p>]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/projects/exploring-command-chaining-in-ubiquity-part-2/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Three ways to argue over arguments</title>
		<link>http://mitcho.com/blog/projects/three-ways-to-argue-over-arguments/</link>
		<comments>http://mitcho.com/blog/projects/three-ways-to-argue-over-arguments/#comments</comments>
		<pubDate>Wed, 18 Feb 2009 03:26:05 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[projects]]></category>
		<category><![CDATA[agreement]]></category>
		<category><![CDATA[ambiguity]]></category>
		<category><![CDATA[Ancient Greek]]></category>
		<category><![CDATA[argument structure]]></category>
		<category><![CDATA[arguments]]></category>
		<category><![CDATA[case]]></category>
		<category><![CDATA[Chinese]]></category>
		<category><![CDATA[coding properties]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[grammatical relations]]></category>
		<category><![CDATA[Hungarian]]></category>
		<category><![CDATA[Japanese language]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[Mandarin]]></category>
		<category><![CDATA[Mozilla Planet]]></category>
		<category><![CDATA[ubiquity]]></category>
		<category><![CDATA[verbs]]></category>
		<category><![CDATA[word order]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=1413</guid>
		<description><![CDATA[UPDATE: Contribute information on how your language identifies its arguments here. When we execute a command in Ubiquity, in very simple terms, we&#8217;re hoping to do something (a verb) to some arguments (the nouns). Every sentence in every language uses some method to encode which arguments correspond to which roles of the verb. Here are [...]


Related posts:<ol><li><a href='http://mitcho.com/blog/projects/contribute-how-your-language-identifies-its-arguments/' rel='bookmark' title='Permanent Link: Contribute: how your language identifies its arguments'>Contribute: how your language identifies its arguments</a></li>
<li><a href='http://mitcho.com/blog/projects/writing-commands-with-semantic-roles/' rel='bookmark' title='Permanent Link: Writing commands with semantic roles'>Writing commands with semantic roles</a></li>
<li><a href='http://mitcho.com/blog/projects/ubiquity-in-firefox-japanese/' rel='bookmark' title='Permanent Link: Ubiquity in Firefox: Focus on Japanese'>Ubiquity in Firefox: Focus on Japanese</a></li>
</ol>

Related posts brought to you by <a href='http://mitcho.com/code/yarpp/'>Yet Another Related Posts Plugin</a>.]]></description>
			<content:encoded><![CDATA[<p><em>UPDATE: Contribute information on how your language identifies its arguments <a href="http://mitcho.com/blog/projects/contribute-how-your-language-identifies-its-arguments/">here</a>.</em></p>

<p>When we execute a command in Ubiquity, in very simple terms, we&#8217;re hoping to do something (a verb) to some arguments (the nouns). Every sentence in every language uses some method to encode which arguments correspond to which roles of the verb. Here are a couple examples:</p>


<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>1
2
</pre></td><td class="code"><pre class="english" style="font-family:monospace;">He sees Mary.
彼が Maryを 見る。 (Kare-ga Mary-o miru.)</pre></td></tr></table></div>


<p>As speakers of English, you can read sentence (1) above and know exactly who is doing the seeing and who is being seen and speakers of Japanese can get the same information from (2). <strong>How do different languages code for arguments in different roles?</strong> There are, broadly speaking, three different ways:</p>

<p><center><img src="http://mitcho.com/blog/wp-content/uploads/2009/02/threeways.png" alt="three ways to code for arguments in different roles" border="0" width="536" height="284" /></center></p>

<p>We&#8217;ll take a brief look today at these three different strategies, all of which <a href="http://www.azarask.in/blog/post/scaling-ubiquity-to-60-languages-we-need-your-help/">a localizable natural language interface</a> will surely encounter.</p>

<p><span id="more-1413"></span></p>

<h3>Word order</h3>

<p>In many languages, the position of the arguments relative to one another and to the verb determines the roles which each argument will play. Mandarin Chinese is a good example of such a language:</p>


<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>3
4
</pre></td><td class="code"><pre class="chinese" style="font-family:monospace;">他 喜欢 Mary (Ta xihuan Mary)
Mary 喜欢 他 (Mary xihuan ta)</pre></td></tr></table></div>


<p>Here, sentence (3) says &#8220;he likes Mary&#8221; while sentence (4) says &#8220;Mary likes him&#8221;. Simply reversing the positions of &#8220;he/him&#8221; and &#8220;Mary&#8221; we&#8217;re able to flip the roles that they fill in the sentence: that of the person who does the liking and the person who is being liked. Now take a look at sentence (5) which means &#8220;John says &#8216;hello&#8217; to Mary.&#8221;</p>


<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>5
</pre></td><td class="code"><pre class="chinese" style="font-family:monospace;">John 告诉 Mary &quot;你 好&quot; (John gaosu Mary &quot;ni hao&quot;)</pre></td></tr></table></div>


<p>We note here that, while in English we used a different strategy of marking one argument (we marked the &#8220;Mary&#8221; argument with &#8220;to&#8221;), Chinese doesn&#8217;t mark either of the arguments. There is, however, a clearly defined order to the arguments, which you might encode this way:</p>


<div class="wp_syntax"><div class="code"><pre class="code" style="font-family:monospace;">say [who you're speaking to] [what you're saying]</pre></div></div>


<p>If you swap the order of the two objects in this sentence, it becomes ungrammatical. (<strong>Note:</strong> the asterisk * here means the sentence is <em>ungrammatical</em>.)</p>


<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>5
</pre></td><td class="code"><pre class="chinese" style="font-family:monospace;">* John 告诉 &quot;你 好&quot; Mary (John gaosu &quot;ni hao&quot; Mary)</pre></td></tr></table></div>


<p>Here, the word order dictates that &#8220;你好&#8221; must be &#8220;who you&#8217;re speaking to&#8221; and &#8220;Mary&#8221; must be &#8220;what you&#8217;re saying,&#8221; but that doesn&#8217;t make sense, so the sentence is ungrammatical.</p>
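<p>A word-order strategy amounts to reading roles off positions. The Python sketch below shows the idea for the frame given above; the role names and the function <code>assign_roles_by_position</code> are assumptions for this example, and the input is assumed to be already split into words.</p>

```python
# Hypothetical frame for a verb like 告诉 (gaosu):
# speaker, verb, who you're speaking to, what you're saying.
SAY_FRAME = ["speaker", "verb", "recipient", "message"]


def assign_roles_by_position(words: list[str], frame: list[str]) -> dict[str, str]:
    """Pair each word with the role its position dictates."""
    if len(words) != len(frame):
        raise ValueError("the sentence does not fit the frame")
    return dict(zip(frame, words))


good = assign_roles_by_position(["John", "告诉", "Mary", "你好"], SAY_FRAME)
swapped = assign_roles_by_position(["John", "告诉", "你好", "Mary"], SAY_FRAME)
```

<p>In the swapped order the recipient slot is filled by &#8220;你好&#8221; and the message slot by &#8220;Mary&#8221;, which is exactly the nonsensical reading described above: the position alone decides the role.</p>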

<h3>Marking the arguments</h3>

<p>Another possible strategy is to mark each argument (or some of the arguments) so that each argument&#8217;s role is clear. In many languages this is done with <a href="http://en.wikipedia.org/wiki/case marking">case marking</a>. Take for example this Ancient Greek sentence with its English gloss on line (6). Here, NOM refers to <a href="http://en.wikipedia.org/wiki/nominative case">nominative case</a> and ACC refers to <a href="http://en.wikipedia.org/wiki/accusative case">accusative case</a>.<sup id="fnref:2"><a href="#fn:2" rel="footnote">1</a></sup></p>


<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>5
6
</pre></td><td class="code"><pre class="ancient-greek" style="font-family:monospace;">ho  didaskal-os  paideuei to  paidi-on  (SVO)
the teacher -NOM teaches  the boy  -ACC</pre></td></tr></table></div>


<p>This sentence means &#8220;the teacher instructs the boy.&#8221; While sentence (5) is in Subject-Verb-Object order, all six possible orderings of {subject, verb, object} are grammatical and mean the same thing:<sup id="fnref:1"><a href="#fn:1" rel="footnote">2</a></sup></p>


<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>7
8
9
10
11
</pre></td><td class="code"><pre class="ancient-greek" style="font-family:monospace;">ho didaskalos to paidion paideuei (SOV)
paideuei ho didaskalos to paidion (VSO)
paideuei to paidion ho didaskalos (VOS)
to paidion ho didaskalos paideuei (OSV)
to paidion paideuei ho didaskalos (OVS)</pre></td></tr></table></div>
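
<p>To make the contrast with word order concrete, here is a small Python sketch that recovers the roles from case alone. The lexicon simply follows the gloss in (6) and is an assumption of this example; a real morphological analyzer would inspect the endings rather than look up whole words, and <code>roles_by_case</code> is a hypothetical name.</p>

```python
# Toy lexicon following the gloss in (6); invented for this illustration.
CASE = {"didaskalos": "NOM", "paidion": "ACC"}
ROLE_OF_CASE = {"NOM": "subject", "ACC": "object"}
ARTICLES = {"ho", "to"}

def roles_by_case(sentence):
    """Read roles off case marking, ignoring where each word sits in the sentence."""
    roles = {}
    for word in sentence.split():
        if word in ARTICLES:
            continue
        if word in CASE:
            roles[ROLE_OF_CASE[CASE[word]]] = word
        else:
            roles["verb"] = word  # anything unmarked is treated as the verb
    return roles

orders = [
    "ho didaskalos paideuei to paidion",   # SVO
    "ho didaskalos to paidion paideuei",   # SOV
    "paideuei ho didaskalos to paidion",   # VSO
    "paideuei to paidion ho didaskalos",   # VOS
    "to paidion ho didaskalos paideuei",   # OSV
    "to paidion paideuei ho didaskalos",   # OVS
]
results = [roles_by_case(s) for s in orders]
print(results[0])
```

<p>All six orderings produce the same role assignment, because nothing in the function depends on position.</p>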


<p>Many languages also use <a href="http://en.wikipedia.org/wiki/adposition">adpositions</a> (prepositions and/or postpositions) to further clarify the role of an argument in addition to case (like English does) or in lieu of case marking altogether. The idea is the same, though: you want to clarify the roles of the arguments so you morphologically mark the arguments with their roles.</p>

<h3>Marking the verb</h3>

<p>Many languages mark the verb with some information about the argument in a certain role, so that we can properly identify each argument&#8217;s role. This kind of phenomenon is called <em>agreement</em>.</p>

<p>The most common type of verbal agreement is subject agreement, where the verb is marked by a specific form depending on some features of the subject. Anyone who&#8217;s taken French 101 will recognize this verb conjugation paradigm:</p>

<table>
<tr><th></th><th>subject</th><th>être (to be)</th></tr>
<tr><td rowspan='3'>singular</td><td>je (I)</td><td>suis</td></tr>
<tr><td>tu (you)</td><td>es</td></tr>
<tr><td>il/elle (he/she)</td><td>est</td></tr>
<tr><td rowspan='3'>plural</td><td>nous (we)</td><td>sommes</td></tr>
<tr><td>vous (plural you)</td><td>êtes</td></tr>
<tr><td>ils (they)</td><td>sont</td></tr>
</table>

<p>With this paradigm, if you hear or see &#8220;suis&#8221; in a French sentence, you immediately know that &#8220;je&#8221; (<em>I</em>) must be the subject and if you see &#8220;sommes,&#8221; &#8220;nous&#8221; (<em>we</em>) is the subject, etc. <a href="http://en.wikipedia.org/wiki/Standard Average European">Standard Average European</a> languages tend to exhibit this sort of subject-verb agreement.</p>

<p>Features of the subject position aren&#8217;t the only thing that can be marked on the verb, though. Hungarian, for example, has a type of object agreement. Specifically, the verb marks whether the object is definite or not (in linguistics lingo, &#8220;the verb agrees with the object&#8217;s definiteness feature&#8221;).</p>


<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>12
13
14
15
</pre></td><td class="code"><pre class="hungarian" style="font-family:monospace;">John lát  egy almát.
John sees an  apple
John látja az  almát.
John sees  the apple</pre></td></tr></table></div>


<p>Notice that in sentence (12) (glossed in (13)) the verb for &#8220;see&#8221; is realized as &#8220;lát,&#8221; while in (14) it&#8217;s &#8220;látja.&#8221; A listener can use that agreement to tell whether the object is definite or not, and thus narrow down the possible object arguments among the nouns in the sentence.</p>

<h3>All of the above</h3>

<p><a href='http://www.qwantz.com/'><img src="http://mitcho.com/blog/wp-content/uploads/2009/02/whom.gif" alt="whom.gif" border="0" width="650" height="442" /></a></p>

<p>Most languages do not use only one of these strategies, but a combination of them. English is a very good example. In a sentence like (12) below, the main coding of grammatical roles seems to be word order alone. By reversing the word order into (13), we can effectively swap the arguments&#8217; roles.</p>


<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>12
13
</pre></td><td class="code"><pre class="english" style="font-family:monospace;">John likes Mary.
Mary likes John.</pre></td></tr></table></div>


<p>However, this doesn&#8217;t work with pronominal arguments. Swapping the arguments in (14) yields (15), which is ungrammatical due to the case marking on the pronouns.</p>


<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>14
15
</pre></td><td class="code"><pre class="english" style="font-family:monospace;">He likes her.
* Her likes he.</pre></td></tr></table></div>


<p>In addition, the verb in English must agree with the subject&#8217;s number (singular or plural):</p>


<div class="wp_syntax"><table><tr><td class="line_numbers"><pre>16
17
18
</pre></td><td class="code"><pre class="english" style="font-family:monospace;">John likes them.
* They likes John.
They like John.</pre></td></tr></table></div>


<p>In this way, English exhibits all three strategies: word order, case marking, and agreement, although often only word order is actively used to disambiguate the roles of arguments.</p>

<p><strong>Question:</strong> What strategies are used by your language to mark the roles of different arguments?</p>

<div class="footnotes">
<hr />
<ol>

<li id="fn:2">
<p>The following example is from <a href="http://www.personal.uni-jena.de/~x4diho/LingTyp%20Grammatical%20relations.ppt">Holger Diessel</a>.&#160;<a href="#fnref:2" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:1">
<p>&#8220;Mean the same thing&#8221; here means that the teacher is always instructing and the boy is always being instructed. The sentences may differ in when or how they are used depending on which argument is being talked about or what the implications of the utterance are. The formal notion is <em>truth-conditional equivalence</em>.&#160;<a href="#fnref:1" rev="footnote">&#8617;</a></p>
</li>

</ol>
</div>


]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/projects/three-ways-to-argue-over-arguments/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>Testing Google&#8217;s Language Detection</title>
		<link>http://mitcho.com/blog/observation/testing-googles-language-detection/</link>
		<comments>http://mitcho.com/blog/observation/testing-googles-language-detection/#comments</comments>
		<pubDate>Sat, 17 May 2008 09:47:04 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[observation]]></category>
		<category><![CDATA[Chinese]]></category>
		<category><![CDATA[Chinese characters]]></category>
		<category><![CDATA[code]]></category>
		<category><![CDATA[computational linguistics]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Japanese language]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[language detection]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[Mandarin]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=254</guid>
		<description><![CDATA[As Google adds ten more languages to its machine translation service, it seems to be on its way to becoming the most convenient universal translator of the world&#8217;s popular languages. Google&#8217;s handling of languages of course isn&#8217;t perfect, however—in particular, I&#8217;ve been complaining to friends for a while about the weaknesses of Google&#8217;s handling of [...]


]]></description>
			<content:encoded><![CDATA[<p><img src="http://mitcho.com/blog/wp-content/uploads/2008/05/google-code.png" alt="google code" title="google-code" width="156" height="57" /></p>

<p>As <a href="http://googleblog.blogspot.com/2008/05/google-translate-adds-10-new-languages.html">Google adds ten more languages to its machine translation service</a>, it seems to be on its way to becoming the most convenient <a href="http://en.wikipedia.org/wiki/universal translator">universal translator</a> of the world&#8217;s popular languages. Google&#8217;s handling of languages of course isn&#8217;t perfect, however—in particular, I&#8217;ve been complaining to friends for a while about the weaknesses of Google&#8217;s handling of queries in Chinese character (<a href="http://en.wikipedia.org/wiki/Chinese characters">漢字/汉字</a>) scripts. In this post, I run some tests using Google&#8217;s <a href="http://googleblog.blogspot.com/2008/03/new-google-ajax-language-api-tools-for.html">Language Detection service</a> to try to better understand its handling of Chinese character queries.</p>

<h3>Background</h3>

<p>Chinese characters have been used all across East Asia, most notably in Chinese, Japanese, Korean, and Vietnamese (the &#8220;CJKV&#8221;). Prescriptivist writing reforms in Communist China and Japan have simplified many characters, though. Some characters were simplified in the same way, some in different ways, and some in only one country but not the other. For more information, there&#8217;s <a href="http://en.wikipedia.org/wiki/Chinese character">Wikipedia</a> or <a href="http://books.google.com/books?id=htlttpi1KOoC">Ken Lunde&#8217;s CJKV Information Processing</a>.</p>

<h3>The problem</h3>

<p>The issue comes up when you search for a word in Chinese characters that clearly comes from one particular Chinese character-using language. From my experience, <strong>Google doesn&#8217;t infer from the query which language you are using, and returns many results in other Chinese character-using languages as well.</strong><sup id="fnref:1"><a href="#fn:1" rel="footnote">1</a></sup></p>

<p><span id="more-254"></span></p>

<p>Take, for example, a query like &#8220;七面鳥&#8221;, meaning &#8216;turkey&#8217; in Japanese. While all characters are very common in traditional Chinese (鳥 is simplified to 鸟 in China), the combination &#8220;七面鳥&#8221; is quite rare in Chinese. However, when you search for &#8220;七面鳥,&#8221; many of the first results are in Chinese and only two of the first ten results are in Japanese.</p>

<p>Does Google&#8217;s corpus not identify &#8220;七面鳥&#8221; as a primarily Japanese word? It does: searching for &#8220;七面鳥&#8221; while limiting results to a given language yields the hit counts below. A similar effect can be seen with Japanese words such as &#8220;芝生&#8221; (&#8216;grass&#8217;) or &#8220;泥棒&#8221; (&#8216;burglar&#8217;). The &#8220;Japanese on first page&#8221; column gives the number of Japanese-language results among the first ten in a language-unspecified search from the US.</p>

<table style="margin-left: auto; margin-right: auto;">
<tr><th>&nbsp;</th><th>Chinese (simplified)</th><th> Chinese (traditional)</th><th> Japanese </th><th>Japanese on<br />first page</th></tr>
<tr><th>七面鳥</th><td>786</td><td>926</td><td>395,000</td><td>2/10</td></tr>
<tr><th>芝生</th><td>55,600</td><td>216,000</td><td>2,230,000</td><td>0/10</td></tr>
<tr><th>泥棒</th><td>13,500</td><td>22,500</td><td>10,400,000</td><td>3/10</td></tr>
</table>

<p>In a perfect world, I would like Google to <strong>identify the language that the query is in</strong>, and then <strong>rank results that are in that language higher</strong> in the results list. So the issue comes down to one of <strong>language detection</strong>.</p>

<iframe src="http://www.google.com/uds/samples/language/detect.html" width='400px' height="200px"></iframe>

<p>There are broadly two different approaches to language detection and, indeed, all natural language processing problems: <em>parsing</em> and <em>counting</em>. In this case, parsing involves trying to break apart the query into words and then computing how likely such a string of <em>words</em> is in each given language. Counting simply takes an inventory of the characters given and compares them to their frequencies in each language, computing how likely such a string of <em>characters</em> is in each language. Parsing is the &#8220;smarter&#8221; approach, but more difficult and computationally intensive.</p>
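
<p>To make the counting approach concrete, here is a minimal Python sketch. The frequency table <code>CHAR_FREQ</code> holds made-up numbers for a handful of characters, and the names <code>score</code> and <code>detect</code> are hypothetical; this is an illustration of the general idea, not a description of how Google&#8217;s service is implemented.</p>

```python
import math
from collections import Counter

# Made-up relative character frequencies per language, for illustration only.
CHAR_FREQ = {
    "ja": {"気": 0.02, "参": 0.01, "加": 0.01, "木": 0.01},
    "zh-Hant": {"氣": 0.02, "參": 0.01, "加": 0.01, "木": 0.01},
    "zh": {"气": 0.02, "参": 0.01, "加": 0.01, "木": 0.01},
}
UNSEEN = 1e-6  # small floor so unseen characters do not give log(0)

def score(text, lang):
    """Sum of log-frequencies of the characters; word boundaries and order are ignored."""
    counts = Counter(ch for ch in text if not ch.isspace())
    table = CHAR_FREQ[lang]
    return sum(n * math.log(table.get(ch, UNSEEN)) for ch, n in counts.items())

def detect(text):
    """Return the language whose character frequencies best fit the text."""
    return max(CHAR_FREQ, key=lambda lang: score(text, lang))

print(detect("天気"), detect("氣"), detect("气"))
```

<p>A parsing approach would instead have to segment the query into words first and score the word sequence, which is where the extra cost comes from.</p>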

<p>Google was kind enough to give us a <a href="http://googleblog.blogspot.com/2008/03/new-google-ajax-language-api-tools-for.html">language detection AJAX service</a> so we can get a sense for how their language detection works. This service also gives a &#8220;confidence&#8221; value on the detection result. For the rest of this entry, we&#8217;ll test some hypotheses against this service and conclude at the end.</p>

<h3>Do spaces matter?</h3>

<p><strong>No.</strong> While spaces are sometimes used in Japanese and Chinese writing to represent word boundaries, especially around numbers and roman letters, they are also seen on the web to encourage line breaks. It would make sense for Google&#8217;s language detection service to ignore spaces in Chinese character queries, and that does seem to be the case. All tests I ran with Chinese character queries gave the same result with the same confidence, with or without spaces in random places.</p>

<h3>Does order matter?</h3>

<p><strong>No.</strong> This was slightly disappointing to see. I took the Japanese string &#8220;骨粗鬆症&#8221; (&#8216;osteoporosis&#8217;, if you&#8217;re curious) and ran every permutation against the language detector and got the same results, including the same confidence values. This is a clear indicator that Google uses only counting, not parsing, in their detector.</p>
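
<p>The permutation test is easy to set up with <code>itertools.permutations</code>. The sketch below only generates the queries and shows why a counting method cannot distinguish them; <code>char_profile</code> is a hypothetical stand-in for whatever a counting detector looks at, and no call to Google&#8217;s service is made.</p>

```python
from collections import Counter
from itertools import permutations

def char_profile(text):
    """A counting-style summary of a query: how many times each character occurs."""
    return Counter(text)

word = "骨粗鬆症"
queries = ["".join(p) for p in permutations(word)]
profiles = [char_profile(q) for q in queries]

# Every ordering has the same character counts, so a detector that only
# looks at counts cannot tell the orderings apart.
print(len(queries), all(p == profiles[0] for p in profiles))
```

<p>A detector that parsed the query into words would be expected to treat at least some of these orderings differently, since most of them are not words in any language.</p>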

<h3>Does repetition matter?</h3>

<p><strong>Yes.</strong> Now that it seems that Google does not use any parsing and only uses character frequencies in identifying the source language, let&#8217;s see how repetition can affect the detection service.</p>

<p>First, I took some Chinese character strings and ran them through the detection service with different numbers of repetitions, e.g. &#8220;参加&#8221;, &#8220;参加参加&#8221;, &#8220;参加参加参加&#8221;, &#8220;参加参加参加参加&#8221;&#8230; The queries I used were the following:</p>

<table style="margin-left: auto; margin-right: auto;">
<tr><th>&nbsp;</th><th>Chinese (traditional)</th><th>Japanese</th><th>Chinese (simplified)</th></tr>
<tr><th>木</th><td>X</td><td>X</td><td>X</td></tr>
<tr><th>漢字</th><td>X</td><td>X</td><td>&nbsp;</td></tr>
<tr><th>氣</th><td>X</td><td>&nbsp;</td><td>&nbsp;</td></tr>
<tr><th>參加</th><td>X</td><td>&nbsp;</td><td>&nbsp;</td></tr>
<tr><th>参加</th><td>&nbsp;</td><td>X</td><td>X</td></tr>
<tr><th>気</th><td>&nbsp;</td><td>X</td><td>&nbsp;</td></tr>
<tr><th>气</th><td>&nbsp;</td><td>&nbsp;</td><td>X</td></tr>
</table>

<p>For each token type, the detection service made up its mind quite quickly. Its confidence, however, was more interesting.</p>

<p><center><img src="http://mitcho.com/blog/wp-content/uploads/2008/05/picture-7.png" alt="" title="repetition vs. confidence" /></center></p>

<p>Each of the confidence values dips sharply after three, five, or ten repetitions. Note, however, the length of the tokens which dipped at each of those points. I interpret this to mean that <strong>a different algorithm is used for queries of fewer than ten characters than for queries of ten or more characters.</strong> However, the detection service did not change its answer after this point on any of the tokens.</p>

<p>Second, I took two characters, &#8220;簡&#8221; and &#8220;体,&#8221; and crossed different numbers of them together to see how that would affect the language detected. Note that &#8220;簡&#8221; is used in traditional Chinese and Japanese, while &#8220;体&#8221; is used in simplified Chinese and Japanese.</p>

<p><style type="text/css">
table .zh { background-color: #e3d2d2; }
table .zh-Hant { background-color: #d3e3d2; }
table .ja { background-color: #d5d2e3; }
</style></p>

<table style="margin-left:auto;margin-right:auto;">
<tr><th>&nbsp;</th><th>簡x0</th><th>簡x1</th><th>簡x2</th><th>簡x3</th><th>簡x4</th><th>簡x5</th><th>簡x6</th><th>簡x7</th><th>簡x8</th><th>簡x9</th></tr>
<tr><th>体x0</th><td>&nbsp;</td> <td class='zh'>0.995</td> <td class='zh'>0.998</td> <td class='zh'>0.998</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> </tr>
<tr><th>体x1</th><td class='zh-Hant'>0.995</td> <td class='ja'>0.998</td> <td class='ja'>0.998</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> <td class='zh'>0.999</td> </tr>
<tr><th>体x2</th><td class='zh-Hant'>0.998</td> <td class='ja'>0.998</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='zh'>0.531</td> </tr>
<tr><th>体x3</th><td class='zh-Hant'>0.998</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.52</td> <td class='ja'>0.568</td> </tr>
<tr><th>体x4</th><td class='zh-Hant'>0.999</td> <td class='zh-Hant'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.516</td> <td class='ja'>0.565</td> <td class='ja'>0.613</td> </tr>
<tr><th>体x5</th><td class='zh-Hant'>0.999</td> <td class='zh-Hant'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.512</td> <td class='ja'>0.561</td> <td class='ja'>0.609</td> <td class='ja'>0.657</td> </tr>
<tr><th>体x6</th><td class='zh-Hant'>0.999</td> <td class='zh-Hant'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.507</td> <td class='ja'>0.556</td> <td class='ja'>0.605</td> <td class='ja'>0.653</td> <td class='ja'>0.702</td> </tr>
<tr><th>体x7</th><td class='zh-Hant'>0.999</td> <td class='zh-Hant'>0.999</td> <td class='zh-Hant'>0.999</td> <td class='ja'>0.999</td> <td class='ja'>0.502</td> <td class='ja'>0.551</td> <td class='ja'>0.6</td> <td class='ja'>0.649</td> <td class='ja'>0.697</td> <td class='ja'>0.746</td> </tr>
<tr><th>体x8</th><td class='zh-Hant'>0.999</td> <td class='zh-Hant'>0.999</td> <td class='zh-Hant'>0.999</td> <td class='ja'>1</td> <td class='ja'>0.545</td> <td class='ja'>0.595</td> <td class='ja'>0.644</td> <td class='ja'>0.693</td> <td class='ja'>0.741</td> <td class='ja'>0.79</td> </tr>
<tr><th>体x9</th><td class='zh-Hant'>0.999</td> <td class='zh-Hant'>0.999</td> <td class='zh-Hant'>1</td> <td class='ja'>0.539</td> <td class='ja'>0.589</td> <td class='ja'>0.638</td> <td class='ja'>0.687</td> <td class='ja'>0.736</td> <td class='ja'>0.785</td> <td class='ja'>0.834</td> </tr>
</table>

<table style="margin-left:auto;margin-right:auto;">
<tr><td class="ja">Japanese</td><td class='zh-Hant'>Chinese (traditional)</td><td class='zh'>Chinese (simplified)</td></tr>
</table>
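
<p>For anyone who wants to rerun this kind of experiment, the grid of queries can be generated in a few lines of Python. The function name <code>crossed_queries</code> is hypothetical, and the choice to put all copies of the first character before the second is an assumption of this sketch; since order made no difference in the earlier test, it should not matter to a counting detector.</p>

```python
def crossed_queries(a="簡", b="体", max_count=9):
    """Build every query with i copies of a followed by j copies of b."""
    grid = {}
    for j in range(max_count + 1):        # copies of b
        for i in range(max_count + 1):    # copies of a
            if i == 0 and j == 0:
                continue  # the empty query is skipped
            grid[(i, j)] = a * i + b * j
    return grid

grid = crossed_queries()
print(len(grid), grid[(2, 1)])
```

<p>Each value in <code>grid</code> would then be sent to the detection service and the returned language and confidence recorded in the corresponding cell.</p>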

<h3>Conclusion</h3>

<p>For Chinese character-based languages, Google&#8217;s language detection algorithm uses simple counting rather than parsing, identifying languages by looking at the <em>frequency of characters</em> rather than the <em>frequency of words</em>. As such, the algorithm simply acts as a <strong>script detector, not a language detector.</strong> Moreover, as a simple counting method is used, duplicating characters used in one language but not another can very easily skew the resulting output.</p>

<p>As a trivial aside, it seems that Google&#8217;s algorithm is slightly different for strings of fewer than ten characters, as can be seen in the dip and subsequent rise in confidence values once a query reaches ten characters.</p>

<div class="footnotes">
<hr />
<ol>

<li id="fn:1">
<p>Just to complicate matters further, there&#8217;s also the issue of where you&#8217;re accessing Google from. For example, accessing from the US (or via my friend <a href="http://support.uchicago.edu/docs/network/vpn/">VPN</a>), a query for the Japanese-simplified &#8220;天気&#8221; seems to only return Japanese pages. However, accessing from Taiwan, Google assumes you may have meant the full-form &#8220;天氣&#8221;, giving you pages with both &#8220;天気&#8221; and &#8220;天氣&#8221;. As a result, Yahoo Japan weather is the first result from the US and third from Taiwan, while Yahoo Taiwan weather is first in Taiwan and doesn&#8217;t even show up from the US. This default character substitution in Taiwan is one of my least-favorite Google &#8220;features.&#8221;<br /><a rel="lightbox[google]" href='http://mitcho.com/blog/wp-content/uploads/2008/05/picture-1.png'><img class="images" src="http://mitcho.com/blog/wp-content/uploads/2008/05/picture-1-300x256.png" alt="" title="picture-1"/></a><a rel="lightbox[google]" href='http://mitcho.com/blog/wp-content/uploads/2008/05/picture-2.png'><img class="images" src="http://mitcho.com/blog/wp-content/uploads/2008/05/picture-2-300x256.png" alt="" title="picture-2"/></a><br />Similar effects can most likely be seen between the US and China. In the rest of this post, all queries will be made from the US.&#160;<a href="#fnref:1" rev="footnote">&#8617;</a></p>
</li>

</ol>
</div>


]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/observation/testing-googles-language-detection/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Linguistics in 嘉義</title>
		<link>http://mitcho.com/blog/life/travel/linguistics-in-%e5%98%89%e7%be%a9/</link>
		<comments>http://mitcho.com/blog/life/travel/linguistics-in-%e5%98%89%e7%be%a9/#comments</comments>
		<pubDate>Tue, 13 May 2008 15:24:32 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[travelogue]]></category>
		<category><![CDATA[Chiayi]]></category>
		<category><![CDATA[conference]]></category>
		<category><![CDATA[food]]></category>
		<category><![CDATA[friends]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[Mandarin]]></category>
		<category><![CDATA[Taiwan]]></category>
		<category><![CDATA[train]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=240</guid>
		<description><![CDATA[A couple weeks ago I went to Chiayi (嘉義, pinyin: Jiāyì) to present a paper at the Linguistic Society of Taiwan&#8217;s National Conference on Linguistics. I got a chance to meet some wonderful and kind Taiwanese linguists, make friends with some linguistics students, as well as explore the city of Chiayi. Chiayi is a medium-sized [...]


]]></description>
			<content:encoded><![CDATA[<p>A couple weeks ago I went to Chiayi (嘉義, pinyin: Jiāyì) to present a paper at the <a href="http://www.linguist.tw">Linguistic Society of Taiwan&#8217;s</a> <a href="http://web.ncyu.edu.tw/~wujs/NCL2008/NCL2008_English.htm">National Conference on Linguistics</a>.<sup id="fnref:1"><a href="#fn:1" rel="footnote">1</a></sup> I got a chance to meet some wonderful and kind Taiwanese linguists, make friends with some linguistics students, as well as explore the city of Chiayi.</p>

<p><span id="more-240"></span></p>

<p>Chiayi is a medium-sized city (270k people, so still way bigger than Luodong or Yilan) on the plains of southwestern Taiwan. The good news about getting to Chiayi is that there is a <a href="http://en.wikipedia.org/wiki/Taiwan High Speed Rail">high speed rail</a> station; the bad news is that the station is actually about half an hour east of the city by car. I took a taxi into the city Thursday night, but took the free shuttle service on Sunday.<sup id="fnref:2"><a href="#fn:2" rel="footnote">2</a></sup> As is the case with most cities on the west coast, it personally took me more time to go from Nanao to Taipei than to then take the high speed rail down to whatever city&#8230; such is life on the east coast: sans high speed rail.</p>

<p>The conference itself was Friday and Saturday. This particular conference was limited to speakers who were current students or recent MA or PhD graduates, so many of the talks were exploratory and less developed. They were still a lot of fun for me to see, though, especially as many of them were on Mandarin or Taiwanese, so there was a lot of data and phenomena that I&#8217;d never even considered. It was also great to see Professor Luther Liu, an eminent researcher of Chinese comparatives, whom I met in 2006 at the <a href="http://humanities.uchicago.edu/depts/linguistics/chinese/">Chicago Workshop on Chinese Linguistics</a>, as well as many other friendly professors. I gave my talk on Saturday and received an award for my paper.</p>

<p>You can see Luther Liu and me talking in this first picture below&#8230; try to find us!<sup id="fnref:4"><a href="#fn:4" rel="footnote">3</a></sup></p>

<p><a rel='lightbox[linguistics-in-%e5%98%89%e7%be%a9]' href='http://mitcho.com/photos/taiwan/chiayi/image/1000/chiayi1.jpg' alt='zenphoto image'><img class='images' src='http://mitcho.com/photos/taiwan/chiayi/image/thumb/chiayi1.jpg' /></a><a rel='lightbox[linguistics-in-%e5%98%89%e7%be%a9]' href='http://mitcho.com/photos/taiwan/chiayi/image/1000/chiayi7.jpg' alt='zenphoto image'><img class='images' src='http://mitcho.com/photos/taiwan/chiayi/image/thumb/chiayi7.jpg' /></a><a rel='lightbox[linguistics-in-%e5%98%89%e7%be%a9]' href='http://mitcho.com/photos/taiwan/chiayi/image/1000/chiayi8.jpg' alt='zenphoto image'><img class='images' src='http://mitcho.com/photos/taiwan/chiayi/image/thumb/chiayi8.jpg' /></a></p>

<p>Each talk at the conference was followed by prepared constructive criticism from a &#8220;commentator,&#8221; a professor with similar research interests. As a result, while all the speakers at the conference were younger, a good number (30+) of professors from all around the island were in attendance as well. I believe this annual conference is an excellent opportunity for linguistics students in Taiwan to have their work known and criticized by professors outside of their own departments, and also to get to know others in their field. It fosters a sense of community among young researchers outside of their own schools; I&#8217;d love to see more such activities back in the US.<sup id="fnref:3"><a href="#fn:3" rel="footnote">4</a></sup></p>

<p>On Saturday evening after the conference I went out with some MA students from <a href="http://en.wikipedia.org/wiki/National Tsing Hua University">Tsinghua University</a> (國立清華大學, or simply 清大). As one was originally from Chiayi and another went to school there, I was in good hands for finding the best local food. We first hit up a stand to get some 火雞肉飯 (turkey rice) which is a Chiayi delicacy&#8230; it&#8217;s so simple yet so delicious!</p>

<p><a rel='lightbox[linguistics-in-%e5%98%89%e7%be%a9]' href='http://mitcho.com/photos/taiwan/chiayi/image/1000/chiayi2.jpg' alt='zenphoto image'><img class='images' src='http://mitcho.com/photos/taiwan/chiayi/image/thumb/chiayi2.jpg' /></a><a rel='lightbox[linguistics-in-%e5%98%89%e7%be%a9]' href='http://mitcho.com/photos/taiwan/chiayi/image/1000/chiayi3.jpg' alt='zenphoto image'><img class='images' src='http://mitcho.com/photos/taiwan/chiayi/image/thumb/chiayi3.jpg' /></a><a rel='lightbox[linguistics-in-%e5%98%89%e7%be%a9]' href='http://mitcho.com/photos/taiwan/chiayi/image/1000/chiayi4.jpg' alt='zenphoto image'><img class='images' src='http://mitcho.com/photos/taiwan/chiayi/image/thumb/chiayi4.jpg' /></a><a rel='lightbox[linguistics-in-%e5%98%89%e7%be%a9]' href='http://mitcho.com/photos/taiwan/chiayi/image/1000/chiayi5.jpg' alt='zenphoto image'><img class='images' src='http://mitcho.com/photos/taiwan/chiayi/image/thumb/chiayi5.jpg' /></a></p>

<p>Afterwards we walked around in their night market, eating some&#8230; Finally, here&#8217;s a photo we took in front of the traffic circle which is a Chiayi landmark. Thanks for the good times!</p>

<p><a rel='lightbox[linguistics-in-%e5%98%89%e7%be%a9]' href='http://mitcho.com/photos/taiwan/chiayi/image/1000/chiayi6.jpg' alt='zenphoto image'><img class='images' src='http://mitcho.com/photos/taiwan/chiayi/image/thumb/chiayi6.jpg' /></a><a rel='lightbox[linguistics-in-%e5%98%89%e7%be%a9]' href='http://mitcho.com/photos/taiwan/chiayi/image/1000/chiayi9.jpg' alt='zenphoto image'><img class='images' src='http://mitcho.com/photos/taiwan/chiayi/image/thumb/chiayi9.jpg' /></a></p>

<p>Next up is the <a href="http://www.fl.nctu.edu.tw/~IsCLL/">International Symposium on Chinese Languages and Linguistics (IsCLL)</a> that I&#8217;ll be attending (but not presenting at) in a couple weeks, so I look forward to seeing some of my new linguist friends there again!</p>

<div class="footnotes">
<hr />
<ol>

<li id="fn:1">
<p>&#8220;The Verbal Nature of Mandarin Comparative <em>bi</em>&#8221;. Check out the <a href="/academic/erlewine-ncl2008-preprint.pdf">paper</a> or the <a href="/academic/handout-20080503.pdf">handout</a>.&#160;<a href="#fnref:1" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:2">
<p>Every twenty minutes, from the back of Chiayi train station.&#160;<a href="#fnref:2" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:4">
<p>Thanks to Claudia for most of the photos here!&#160;<a href="#fnref:4" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:3">
<p>Chicago had a similar program, in the form of the &#8220;professionalism seminar&#8221; (which I took with <a href="http://home.uchicago.edu/~giannaki/">Anastasia Giannakidou</a>) and related &#8220;Graduate Student Mini-conference,&#8221; and I&#8217;m sure other schools in the US have similar opportunities for their MA and PhD students. The environment in Taiwan is different, however: the field of formal linguistics there is even smaller than in the US, so in some ways community-building across programs is both more important and easier to accomplish.&#160;<a href="#fnref:3" rev="footnote">&#8617;</a></p>
</li>

</ol>
</div>


<p>Related posts:<ol><li><a href='http://mitcho.com/blog/life/travel/%e6%96%b0%e5%b9%b4%e5%bf%ab%e6%a8%82-chinese-new-year-with-andy/' rel='bookmark' title='Permanent Link: 新年快樂! Chinese New Year with Andy'>新年快樂! Chinese New Year with Andy</a></li>
<li><a href='http://mitcho.com/blog/projects/spring-is-for-speaking/' rel='bookmark' title='Permanent Link: Spring is for Speaking: JSConf, WordCamp SF, IACL'>Spring is for Speaking: JSConf, WordCamp SF, IACL</a></li>
<li><a href='http://mitcho.com/blog/life/co-schooling-in-dongshan/' rel='bookmark' title='Permanent Link: Co-schooling in Dongshan'>Co-schooling in Dongshan</a></li>
</ol></p>
]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/life/travel/linguistics-in-%e5%98%89%e7%be%a9/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>I&#8217;m Busy to Die</title>
		<link>http://mitcho.com/blog/observation/im-busy-to-die/</link>
		<comments>http://mitcho.com/blog/observation/im-busy-to-die/#comments</comments>
		<pubDate>Tue, 30 Oct 2007 08:25:19 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[life]]></category>
		<category><![CDATA[observation]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[Mandarin]]></category>
		<category><![CDATA[Taiwan]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/2007/10/30/transfer-of-the-mandarin-resultative/</guid>
		<description><![CDATA[Today at work: the military guy who has quite good English told me that he was very busy as our school is being observed next week by administrators. He then told me, &#8220;I&#8217;m busy to die.&#8221; While I originally thought he might have mispronounced &#8220;today,&#8221; he obviously knows that word&#8230; I believe he was trying [...]



]]></description>
			<content:encoded><![CDATA[<p>Today at work: the <a href="http://en.wikipedia.org/wiki/Conscription_in_the_Republic_of_China">military guy</a> who has quite good English told me that he was very busy as our school is being observed next week by administrators. He then told me, &#8220;I&#8217;m busy to die.&#8221;</p>

<p>While I originally thought he might have mispronounced &#8220;today,&#8221; he obviously knows that word&#8230; I believe he was trying to say “<abbr title="I">我</abbr><abbr title="busy">忙</abbr><abbr title="die (dead)">死</abbr><abbr title="Aspect">了</abbr>,” a Mandarin resultative construction which could be translated as &#8220;I&#8217;m busy to the extent that I will die.&#8221; Obviously this is not literal&#8230; V+<abbr title="die (dead)">死</abbr><abbr title="Aspect">了</abbr> compounds are a common form of exaggeration. It was a neat instance of <a href="http://en.wikipedia.org/wiki/Language_transfer">grammatical transfer</a>, though.</p>



]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/observation/im-busy-to-die/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
