<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>mitcho.com &#187; data</title>
	<atom:link href="http://mitcho.com/blog/tag/data/feed/" rel="self" type="application/rss+xml" />
	<link>http://mitcho.com</link>
	<description></description>
	<lastBuildDate>Tue, 07 Feb 2012 02:04:41 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.4-alpha-19719</generator>
		<item>
		<title>Ubiquity Commands by The Numbers</title>
		<link>http://mitcho.com/blog/projects/ubiquity-commands-by-the-numbers/</link>
		<comments>http://mitcho.com/blog/projects/ubiquity-commands-by-the-numbers/#comments</comments>
		<pubDate>Wed, 01 Apr 2009 03:11:55 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[projects]]></category>
		<category><![CDATA[arguments]]></category>
		<category><![CDATA[code]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[herd]]></category>
		<category><![CDATA[localization]]></category>
		<category><![CDATA[Mozilla Planet]]></category>
		<category><![CDATA[nountypes]]></category>
		<category><![CDATA[parser]]></category>
		<category><![CDATA[ubiquity]]></category>
		<category><![CDATA[verbs]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=1718</guid>
		<description><![CDATA[Recent work in the Ubiquity internationalization realm has focused on the upcoming Ubiquity parser which will bring some great new features to Ubiquity, including support for overlord verbs and semi-automatic localization of commands via semantic roles. It&#8217;s possible, though, that these new features will break backwards compatibility of the current command specification and noun types. [...]
]]></description>
			<content:encoded><![CDATA[<p>Recent work in the Ubiquity internationalization realm has focused on the upcoming Ubiquity parser which will bring some great new features to Ubiquity, including support for <a href="http://jonoscript.wordpress.com/2009/01/24/overlord-verbs-a-proposal/">overlord verbs</a> and <a href="http://mitcho.com/blog/projects/writing-commands-with-semantic-roles/">semi-automatic localization of commands via semantic roles</a>. It&#8217;s possible, though, that these new features will break backwards compatibility of the current command specification and noun types. <a href="http://en.wikipedia.org/wiki/Creative destruction">Creative destruction</a> for the win.</p>

<p>As we look to <a href="http://groups.google.com/group/ubiquity-i18n/browse_thread/thread/22fa223f43ef6262">move forward</a> with incorporating <a href="http://mitcho.com/code/ubiquity/parser-demo/">the next generation parser</a> into Ubiquity proper, it becomes important to look at the current command ecosystem to see how disruptive this move might be. To this end, last night I wrote a quick Perl script to scrape the commands cached on <a href="http://ubiquity.mozilla.com/herd/">the herd</a> and get some quantitative answers to my questions.</p>

<p><span id="more-1718"></span></p>

<p>(1577 different verbs were analyzed. None of the computations below are weighted by feed popularity.)</p>

<h3>Q: Are there a lot of commands which use more than one argument?</h3>

<p>A: The vast majority (>85%) of commands take one or no arguments, requiring no modifiers. Only the remaining commands (fewer than 15%) will require a switch to referring to different arguments by <a href="http://mitcho.com/blog/projects/writing-commands-with-semantic-roles/">semantic role</a>.</p>

<p><center><img src="http://mitcho.com/blog/wp-content/uploads/2009/03/herdcommands.png" alt="herdcommands.png" border="0" width="500" height="355" /></center></p>

<h3>Q: Do many commands introduce custom noun types?</h3>

<p>A: 147 different noun types (lumping anonymous inline objects as one type) were detected. The vast majority of all <code>takes</code> (direct object) arguments were of type <code>noun_arb_text</code>, although many <code>modifiers</code> arguments used custom noun types. The other standard (built-in) noun types are also well represented, with <code>noun_type_language</code> coming in second. Here&#8217;s a chart with all the noun types that had more than one use.</p>

<div style='overflow-y: auto; max-height: 300px;'><center><img src="http://mitcho.com/blog/wp-content/uploads/2009/03/herdnountypes1.png" alt="herdnountypes.png" border="0" width="550" height="846" /></center></div>

<h3>Q: Are commands with <code>modifiers</code> using natural-language delimiters?</h3>

<p>A: Most of the modifiers detected were English prepositions such as &#8220;from&#8221;, &#8220;to&#8221;, &#8220;as&#8221;, and &#8220;with&#8221;, but other words such as &#8220;title&#8221;, &#8220;type&#8221;, &#8220;username&#8221;, and &#8220;message&#8221; were also seen, as were a handful of commands using symbols such as &#8220;@&#8221;, &#8220;>&#8221;, or &#8220;#&#8221;.</p>
]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/projects/ubiquity-commands-by-the-numbers/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Automating the Linguist&#8217;s Job</title>
		<link>http://mitcho.com/blog/projects/automating-the-linguists-job/</link>
		<comments>http://mitcho.com/blog/projects/automating-the-linguists-job/#comments</comments>
		<pubDate>Tue, 24 Mar 2009 08:57:58 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[projects]]></category>
		<category><![CDATA[analogy]]></category>
		<category><![CDATA[automation]]></category>
		<category><![CDATA[code]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[deduction]]></category>
		<category><![CDATA[Dutch]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[Mozilla Planet]]></category>
		<category><![CDATA[parser]]></category>
		<category><![CDATA[patterns]]></category>
		<category><![CDATA[ubiquity]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=1634</guid>
		<description><![CDATA[At the end of my blog post yesterday I hinted at an exciting possible approach to Ubiquity&#8217;s localization: In the future we ideally could build a web-based system to collect these &#8220;utterances.&#8221; We could &#8230; generate parser parameters based on those sentences. That would essentially reduce the parser-construction process to a more run-of-the-mill string translation [...]
]]></description>
			<content:encoded><![CDATA[<p>At the end of <a href="http://mitcho.com/blog/projects/ubiquity-i18n-questions-to-ask/">my blog post yesterday</a> I hinted at an exciting possible approach to Ubiquity&#8217;s localization:</p>

<blockquote>
  <p>In the future we ideally could build a web-based system to collect these &#8220;utterances.&#8221; We could &#8230; generate parser parameters based on those sentences. That would essentially reduce the parser-construction process to a more run-of-the-mill string translation process.</p>
</blockquote>

<p>If we build this type of &#8220;command-bank&#8221; of common Ubiquity input translated into various languages, we could build a tool to learn various features of each language and generate each parser, essentially <em>learning the language based on data</em>. Today I&#8217;ll elaborate on how I believe this could be possible, by analogy to another language learning device: <strong>the human</strong>.</p>

<p><span id="more-1634"></span></p>

<h3>Step 1: learning words</h3>

<p>How does a human learn language? Without getting into any <a href="http://en.wikipedia.org/wiki/language acquisition">details or theory</a>, we can say that the input for a language learner is always a combination of <em>linguistic input and a referent</em>. In the case of a child, this could be a pairing of linguistic input with <em>real world stimulus</em>:</p>

<p><center></p>

<table style='border:none;'><tr><th>input</th><th>referent</th></tr>
<tr><td style='font-size:2em;color:orange;font-weight:bold;text-align:center;'>“taiyaki!”</td><td><img src='http://farm4.static.flickr.com/3543/3357452751_977fcce70c.jpg?v=0' width='300'/><br/>
by <a href='http://www.flickr.com/photos/makitani/3357452751/'>makitani</a> via <a href='http://creativecommons.org'>creative commons</a>.</td></tr>
<tr><td style='font-size:2em;color:orange;font-weight:bold;width:50%;text-align:center;'>“cat!”</td><td><img src='http://farm4.static.flickr.com/3285/2387513295_2768ddf662.jpg?v=0' width='300'/><br />
by <a href='http://www.flickr.com/photos/victoriachan/2387513295/in/set-72157604986983169/'>victoriachan</a> via <a href='http://creativecommons.org'>creative commons</a>.</td></tr>
</table>

<p></center></p>

<p>The human child will hear &#8220;cat&#8221; while looking at the cat and, with time and repetition, learn that that thing is called a &#8220;cat,&#8221; and <a href="http://en.wikipedia.org/wiki/taiyaki">some other thing</a> is called &#8220;taiyaki.&#8221;</p>

<p>Similarly, we could take single-verb data points from our command-bank to match new words with a known referent (in this case, the base English string). Here&#8217;s an example from <a href="http://jan.moesen.nu/">Jan&#8217;s</a> comment on <a href="http://mitcho.com/blog/projects/ubiquity-i18n-questions-to-ask/">yesterday&#8217;s sample survey</a>.</p>

<p><center></p>

<table style='border:none;'><tr><th>input (Dutch)</th><th>referent (English)</th></tr>
<tr><td style='font-size:2em;color:orange;font-weight:bold;text-align:center;'>zoek</td><td style='font-size:2em;color:blue;font-weight:bold;text-align:center;'>search</td></tr>
</table>

<p></center></p>

<h3>Step 2: deduction</h3>

<p>Now suppose we know some single words like &#8220;taiyaki&#8221; and &#8220;cat.&#8221; Consider the two situations below. Given the first sentence and its referent, &#8220;mitcho&#8217;s eating a taiyaki,&#8221; the child could intuit the appropriate linguistic representation for the second situation.</p>

<p><center></p>

<table style='border:none;'><tr><th>input</th><th>referent</th></tr>
<tr><td style='font-size:2em;color:orange;font-weight:bold;width:50%;text-align:center;'>“mitcho&#8217;s eating a taiyaki!”</td><td><img src="http://mitcho.com/blog/wp-content/uploads/2009/03/eattaiyaki.jpg" alt="eattaiyaki.jpg" border="0" width="300" height="225" /></td></tr>
<tr><td style='font-size:2em;color:red;font-weight:bold;text-align:center;'>???</td><td><img src="http://mitcho.com/blog/wp-content/uploads/2009/03/eatcat.jpg" alt="eatcat.jpg" border="0" width="300" height="225" /></td></tr>
</table>

<p></center></p>

<p>The process is simple. First note that there is only one variable changed between the two situations: the taiyaki has been replaced by a cat head. You can then construct the correct utterance <em>by analogy</em>, replacing &#8220;taiyaki&#8221; with &#8220;cat,&#8221; yielding &#8220;mitcho&#8217;s eating a cat!&#8221;<sup id="fnref:2"><a href="#fn:2" rel="footnote">1</a></sup></p>

<p>Similarly, we could build a tool to analyze the data in a translated command-bank to identify particular features of each language, generating at least basic parsers for each language. Such a task would require a number of <em><a href="http://en.wikipedia.org/wiki/minimal pairs">minimal pairs</a></em> in our data set—here&#8217;s one such example from yesterday&#8217;s survey (with Dutch data from <a href="http://jan.moesen.nu/">Jan</a>):</p>

<p><center></p>

<table style='border:none;'><tr><th>input (Dutch)</th><th>referent (English)</th></tr>
<tr><td style='font-size:1.5em;color:orange;font-weight:bold;text-align:center;'>zoek HELLO met Google</td><td>
<span style='font-size:1.5em;color:blue;font-weight:bold;'>search HELLO with Google</span><br/>
<code>
<pre>Parse {
  verb:      'search',
  arguments: {
    object:  ['HELLO'],
    service: 'Google'
  }
}</pre>
</code></td></tr>
<tr><td style='font-size:1.5em;color:orange;font-weight:bold;text-align:center;'>zoek dit met Google</td><td>
<span style='font-size:1.5em;color:blue;font-weight:bold;'>search this with Google</span><br/>
<code>
<pre>Parse {
  verb:      'search',
  arguments: {
    object:  ['this'],
    service: 'Google'
  }
}</pre>
</code></td></tr></table>

<p></center></p>

<p>A simple string analysis<sup id="fnref:3"><a href="#fn:3" rel="footnote">2</a></sup> would tell us that the text <code>HELLO</code> was replaced by <code>dit</code> in the latter Dutch sentence. Meanwhile, since the English reference sentence is chosen manually, we also know the appropriate parses for each of those sentences. An object difference operation would note that the <code>object</code> property was changed from a value of <code>'HELLO'</code> to <code>'this'</code>. We could then map <code>dit</code> to the English <code>this</code>. We&#8217;ve now learned one (of perhaps many) Dutch deictic pronouns (aka &#8220;magic words&#8221;).</p>
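<p>As a rough sketch of that idea, the two comparisons could look something like the following JavaScript. This is not actual Ubiquity code: <code>tokenDiff</code> and <code>objectDiff</code> are hypothetical helper names, and the sketch assumes the two sentences differ by a one-for-one word substitution.</p>

```javascript
// Sketch only: tokenDiff and objectDiff are hypothetical helpers, not Ubiquity APIs.

// Return the word substitutions between two sentences with the same word count.
function tokenDiff(a, b) {
  const x = a.split(/\s+/);
  const y = b.split(/\s+/);
  if (x.length !== y.length) return null; // not a simple one-for-one substitution
  const changes = [];
  x.forEach((tok, i) => {
    if (tok !== y[i]) changes.push({ from: tok, to: y[i] });
  });
  return changes;
}

// Return the argument roles whose values differ between two parses.
function objectDiff(p, q) {
  const changes = [];
  for (const role of Object.keys(p.arguments)) {
    const before = String(p.arguments[role]);
    const after = String(q.arguments[role]);
    if (before !== after) changes.push({ role, from: before, to: after });
  }
  return changes;
}

// The minimal pair from the table above.
const pair = [
  { input: 'zoek HELLO met Google',
    parse: { verb: 'search', arguments: { object: ['HELLO'], service: 'Google' } } },
  { input: 'zoek dit met Google',
    parse: { verb: 'search', arguments: { object: ['this'], service: 'Google' } } },
];

const surface = tokenDiff(pair[0].input, pair[1].input);
const semantic = objectDiff(pair[0].parse, pair[1].parse);

// If exactly one word and exactly one role changed, pair them up.
const learned = {};
if (surface && surface.length === 1 && semantic.length === 1) {
  learned[surface[0].to] = semantic[0].to;
}
console.log(learned); // { dit: 'this' }
```

<p>Requiring exactly one change on each side is a deliberately conservative choice: pairs that differ in more than one place are simply skipped rather than guessed at.</p>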

<p>Given <a href="http://mitcho.com/code/ubiquity/parser-demo/">an adequately universal but customizable parser design</a>, we can then develop tests for various parameters by constructing appropriate <a href="http://en.wikipedia.org/wiki/minimal pairs">minimal pairs</a> in the base sentences and having them translated.<sup id="fnref:1"><a href="#fn:1" rel="footnote">3</a></sup> As noted yesterday, such a system could reduce the laborious task of writing individual parsers to a task of string translation, which <a href="https://wiki.mozilla.org/L10n:Home_Page">our community does exceedingly well</a>. <strong>I&#8217;m eager to hear what others think of this approach. What concerns would you have about it? What potential benefits do you see?</strong></p>

<div class="footnotes">
<hr />
<ol>

<li id="fn:2">
<p>I mean no offense to human children with this simplified example. Surely you can learn more than just string replacements.&#160;<a href="#fnref:2" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:3">
<p>I started building some string analysis toys in JavaScript today, such as a <a href="http://mitcho.com/code/ubiquity/levenshtein/">Levenshtein difference demo</a>.&#160;<a href="#fnref:3" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:1">
<p>The linguists in the audience may note that this parser&#8217;s modular design is indeed in the spirit of the <a href="http://en.wikipedia.org/wiki/principles and parameters">principles and parameters</a> framework.&#160;<a href="#fnref:1" rev="footnote">&#8617;</a></p>
</li>

</ol>
</div>
]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/projects/automating-the-linguists-job/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Ubiquity i18n: questions to ask</title>
		<link>http://mitcho.com/blog/projects/ubiquity-i18n-questions-to-ask/</link>
		<comments>http://mitcho.com/blog/projects/ubiquity-i18n-questions-to-ask/#comments</comments>
		<pubDate>Mon, 23 Mar 2009 10:13:37 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[projects]]></category>
		<category><![CDATA[collaboration]]></category>
		<category><![CDATA[commands]]></category>
		<category><![CDATA[contribute]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[Mozilla Planet]]></category>
		<category><![CDATA[parser]]></category>
		<category><![CDATA[survey]]></category>
		<category><![CDATA[ubiquity]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=1611</guid>
		<description><![CDATA[I recently have traveled a fair deal and have met many people excited about the Ubiquity project and its localization efforts. &#8220;I want to help,&#8221; say the people, but many are unsure where to start. As a linguist, studying a language involves looking at instances of that language as data. To this end, we as [...]
]]></description>
			<content:encoded><![CDATA[<p>I have recently traveled a fair deal and have met many people excited about <a href="http://ubiquity.mozilla.com">the Ubiquity project</a> and <a href="https://wiki.mozilla.org/Labs/Ubiquity/i18n">its localization efforts</a>. &#8220;I want to help,&#8221; say the people, but many are unsure where to start.</p>

<p>For a linguist, studying a language involves looking at instances of that language as data. To this end, we as Ubiquity internationalizers need to get at some examples of <em>target utterances</em>. Here&#8217;s an example survey which could be a good starting point for native speakers who want to contribute information on their language, based on <a href="https://wiki.mozilla.org/Taskfox/Verbs">Blair&#8217;s list of common Ubiquity verbs</a>.</p>

<p><span id="more-1611"></span></p>

<hr style='border-top: 2px gray dashed; height: 0; color: white;'/>

<h2>A survey for Ubiquity localization</h2>

<h3>Instructions</h3>

<p>How would you express the following commands in your language? The words in CAPITAL LETTERS do not need to be translated. Feel free to give multiple possible answers for each command.</p>

<p>Try to express the same command rather than forcing a &#8220;literal translation&#8221;; for example, if there&#8217;s no &#8220;map&#8221; verb in your language, you could translate example (8) as <code>lookup a map of PLACE</code>. Please keep in mind that the <a href="http://en.wikipedia.org/wiki/addressee">addressee</a> is a computer.</p>

<h3>Basic word order / argument structure</h3>

<ol>
<li><code>search HELLO</code></li>
<li><code>search HELLO with google</code></li>
<li><code>translate HELLO from English to French</code></li>
<li><code>lookup the weather for PLACE</code></li>
<li><code>shop for SHOES with Amazon</code></li>
<li><code>email HELLO to Bill</code></li>
<li><code>email HELLO to ADDRESS</code></li>
<li><code>map PLACE</code></li>
<li><code>find HELLO</code></li>
<li><code>tab to HELLO</code> or <code>switch to HELLO tab</code></li>
</ol>

<p>&#8230;</p>

<h3>Pronominal/deictic arguments (aka &#8220;magic words&#8221;)</h3>

<ol>
<li><code>search this with google</code></li>
<li><code>translate this to French</code></li>
<li><code>bookmark this tab</code></li>
</ol>

<p>&#8230;</p>

<hr style='border-top: 2px gray dashed; height: 0; color: white;'/>

<h3>How this data is used</h3>

<p>Responses to these surveys would be used to identify certain salient features of the language, such as <a href="http://mitcho.com/blog/projects/three-ways-to-argue-over-arguments/">how the language codes for its arguments</a> (for example using <a href="http://en.wikipedia.org/wiki/adpositions">adpositions</a>, <a href="http://en.wikipedia.org/wiki/case marking">case marking</a>, or word order) and whether the commands tend to be verb-initial or verb-final. Individual case markings can be identified by comparing <em><a href="http://en.wikipedia.org/wiki/minimal pairs">minimal pairs</a></em>. For example, by comparing items (1) and (2), we can learn how <code>google</code> in an instrumental role is marked, and by comparing example (2) with the &#8220;magic word&#8221; example (1), we can identify the appropriate &#8220;magic word&#8221; and determine whether the language uses any <a href="http://en.wikipedia.org/wiki/clitics">clitics</a>.</p>
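<p>To make the comparison of items (1) and (2) concrete, here is a minimal JavaScript sketch. The helper name <code>findMarker</code> is hypothetical, the Dutch rendering of item (1) as <code>zoek HELLO</code> is my assumption, and the sketch assumes the marker is a separate whitespace-delimited word (it would not handle case affixes or clitics attached to the noun).</p>

```javascript
// Hypothetical helper: given a sentence without an argument and one with it,
// find the single extra word that marks the added argument.
function findMarker(shorter, longer, argument) {
  const base = new Set(shorter.toLowerCase().split(/\s+/));
  const tokens = longer.toLowerCase().split(/\s+/);
  const arg = argument.toLowerCase();
  const argIndex = tokens.indexOf(arg);
  if (argIndex === -1) return null; // the argument is not in the longer sentence

  // Words that are new in the longer sentence, other than the argument itself.
  const added = tokens
    .map((tok, i) => ({ tok, i }))
    .filter(({ tok }) => !base.has(tok) && tok !== arg);
  if (added.length !== 1) return null; // ambiguous; leave it to a human

  return {
    marker: added[0].tok,
    position: added[0].i < argIndex ? 'preposition' : 'postposition',
  };
}

// Assumed Dutch answers to survey items (1) and (2).
const result = findMarker('zoek HELLO', 'zoek HELLO met Google', 'Google');
console.log(result); // { marker: 'met', position: 'preposition' }
```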

<h3>Data collection</h3>

<p>In the future we ideally could build a web-based system to collect these &#8220;utterances.&#8221; We could also use such a system to automatically test our parsers in different languages against the sentences in the <em>command-bank,</em> or ultimately even generate parser parameters based on those sentences. That would essentially reduce the parser-construction process to a more run-of-the-mill string translation process.</p>
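<p>A test run against such a command-bank might be sketched as follows in JavaScript. Everything here is hypothetical: the shape of the <code>commandBank</code> entries, the <code>runBank</code> function, and especially the toy <code>nl</code> parser, which stands in for a real parser only so that the example runs.</p>

```javascript
// Hypothetical command-bank: each entry pairs a translated input with the
// parse of its English reference sentence.
const commandBank = [
  { lang: 'nl', input: 'zoek HELLO met Google',
    expected: { verb: 'search', arguments: { object: ['HELLO'], service: 'Google' } } },
  { lang: 'nl', input: 'zoek dit met Google',
    expected: { verb: 'search', arguments: { object: ['this'], service: 'Google' } } },
];

// Stand-in parser for the demo only; a real parser would be far more general.
const parsers = {
  nl(input) {
    const match = /^zoek (.+) met (.+)$/.exec(input);
    if (!match) return null;
    return { verb: 'search', arguments: { object: [match[1]], service: match[2] } };
  },
};

// Return the entries whose actual parse does not match the expected parse.
function runBank(bank, parsers) {
  return bank.filter(({ lang, input, expected }) => {
    const actual = parsers[lang] ? parsers[lang](input) : null;
    return JSON.stringify(actual) !== JSON.stringify(expected);
  });
}

const failures = runBank(commandBank, parsers);
console.log(`${commandBank.length - failures.length}/${commandBank.length} passed`);
failures.forEach((f) => console.log('FAILED:', f.input));
```

<p>With this toy parser the second sentence fails, because the parser does not yet know that <code>dit</code> should map to <code>this</code>. That is exactly the kind of gap such a test run is meant to surface.</p>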
]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/projects/ubiquity-i18n-questions-to-ask/feed/</wfw:commentRss>
		<slash:comments>21</slash:comments>
		</item>
		<item>
		<title>Contribute: how your language identifies its arguments</title>
		<link>http://mitcho.com/blog/projects/contribute-how-your-language-identifies-its-arguments/</link>
		<comments>http://mitcho.com/blog/projects/contribute-how-your-language-identifies-its-arguments/#comments</comments>
		<pubDate>Wed, 18 Feb 2009 09:37:18 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[projects]]></category>
		<category><![CDATA[arguments]]></category>
		<category><![CDATA[coding properties]]></category>
		<category><![CDATA[contribute]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[grammatical relations]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[Mozilla Planet]]></category>
		<category><![CDATA[ubiquity]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=1450</guid>
		<description><![CDATA[Earlier today I blogged on three different strategies languages use to mark the roles of different arguments: word order, marking on the arguments, and marking on the verbs. I gathered some data from the fantastic World Atlas of Language Structures to put together a survey of many of the languages on the Internet. For each [...]
]]></description>
			<content:encoded><![CDATA[<p>Earlier today <a href="http://mitcho.com/blog/projects/three-ways-to-argue-over-arguments/">I blogged on three different strategies</a> languages use to mark the roles of different arguments: word order, marking on the arguments, and marking on the verbs.</p>

<p>I gathered some data from the fantastic <a href="http://wals.info/">World Atlas of Language Structures</a> to put together a survey of many of the languages on the Internet. For each of the languages, I got the canonical word order and whether the language marks the role of its argument on the verb and/or the arguments themselves.</p>

<iframe width='605' height='300' frameborder='0' src='http://spreadsheets.google.com/pub?key=pE-nN92qp_pa5P6YbUOw0HQ&#038;output=html'></iframe>

<p>As you can see, there are a number of data points that are still missing. <strong>Please contribute information on the languages you speak!</strong> You can <a href="http://spreadsheets.google.com/ccc?key=pE-nN92qp_pa5P6YbUOw0HQ">edit the spreadsheet on Google Docs</a>. Thanks!</p>
]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/projects/contribute-how-your-language-identifies-its-arguments/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
		<item>
		<title>回収 vs. 収集 and Better Word Meanings Through Usage</title>
		<link>http://mitcho.com/blog/observation/%e5%8f%8e%e9%9b%86-vs-%e5%9b%9e%e5%8f%8e-and-better-word-meanings-through-usage/</link>
		<comments>http://mitcho.com/blog/observation/%e5%8f%8e%e9%9b%86-vs-%e5%9b%9e%e5%8f%8e-and-better-word-meanings-through-usage/#comments</comments>
		<pubDate>Thu, 18 Sep 2008 14:50:27 +0000</pubDate>
		<dc:creator>mitcho</dc:creator>
				<category><![CDATA[life]]></category>
		<category><![CDATA[observation]]></category>
		<category><![CDATA[Bailey]]></category>
		<category><![CDATA[cognitive linguistics]]></category>
		<category><![CDATA[corpora]]></category>
		<category><![CDATA[corpus]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[frame semantics]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Japanese language]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[language learning]]></category>
		<category><![CDATA[linguistics]]></category>
		<category><![CDATA[synonymy]]></category>
		<category><![CDATA[translation]]></category>

		<guid isPermaLink="false">http://mitcho.com/blog/?p=721</guid>
		<description><![CDATA[Bailey just asked me what the difference between 回収 (kaishū) and 収集(shūshū) is—two words that would both map to the English verb &#8220;collect.&#8221; I intuitively came up with a hypothesis to explain the distinction: 回収 may take things away from others when collecting while 収集 does not have that implication. Things that you 回収 may [...]
]]></description>
			<content:encoded><![CDATA[<p><a href="http://bpick.tumblr.com/">Bailey</a> just asked me what the difference is between 回収 (<em>kaishū</em>) and 収集 (<em>shūshū</em>), two words that would both map to the English verb &#8220;collect.&#8221; I intuitively came up with a hypothesis to explain the distinction:</p>

<ul>
<li>回収 may take things away from others when collecting while 収集 does not have that implication.</li>
<li>Things that you 回収 may have been previously distributed by the actor themself while 収集 does not have that implication.<sup id="fnref:3"><a href="#fn:3" rel="footnote">1</a></sup></li>
</ul>

<p>Not content with armchair theorizing, however, I decided to take advantage of one of the largest corpora in the world: <a href="http://en.wikipedia.org/wiki/Google">Google</a>.<sup id="fnref:2"><a href="#fn:2" rel="footnote">2</a></sup> To test my hypothesis, I chose two &#8220;objects of collection&#8221;, one you can take away (and which is often distributed first) and one you can&#8217;t take away: アンケート (<em>ankēto</em> &#8220;survey,&#8221; from the French <em>enquête</em>) and 意見 (<em>iken</em> &#8220;opinion&#8221;). I then searched Google for each of the four resulting collocations<sup id="fnref:1"><a href="#fn:1" rel="footnote">3</a></sup> in quotes (&#8220;•&#8221;) and recorded how many hits each returned.</p>

<p><span id="more-721"></span></p>

<table style="margin-left: auto; margin-right: auto;">
<tr><th>&#8220;意見を収集&#8221;</th><th>&#8220;意見を回収&#8221;</th><th>&#8220;アンケートを収集&#8221;</th><th>&#8220;アンケートを回収&#8221;</th></tr>
<tr><td>218000</td><td>6200</td><td>784</td><td>169000</td></tr>
</table>

<p>A better way to organize this data is as follows:</p>

<table style="margin-left: auto; margin-right: auto;">
<tr><th>&#8220;↓を→&#8221;</th><th>回収</th><th>収集</th></tr>
<tr><th>アンケート</th><td>169000</td><td>784</td></tr>
<tr><th>意見</th><td>6200</td><td>218000</td></tr>
</table>

<p>This data clearly supports the hypothesis I laid out above: アンケート, which can be taken away from people and is often distributed first, occurs much more often with 回収 than with 収集. 意見, on the other hand, which crucially cannot be taken away when collected, occurs much more often with 収集 than with 回収.</p>
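
<p>To put rough numbers on &#8220;much more often,&#8221; here is a minimal Python sketch that turns the hit counts from the first table into per-noun shares and a single odds ratio. The helper names <code>share</code> and <code>odds_ratio</code> are made up for illustration, and since Google hit counts are only estimates, the results should be read as ballpark figures rather than real statistics.</p>

```python
# Google hit counts for each quoted phrase, copied from the first table above.
# These are rough estimates reported by the search engine, not exact frequencies.
counts = {
    ("アンケート", "回収"): 169000,
    ("アンケート", "収集"): 784,
    ("意見", "回収"): 6200,
    ("意見", "収集"): 218000,
}


def share(noun, verb):
    """Fraction of a noun's total hits that occur with the given verb."""
    total = sum(n for (other_noun, _), n in counts.items() if other_noun == noun)
    return counts[(noun, verb)] / total


def odds_ratio():
    """Odds of 回収 (vs. 収集) for アンケート, divided by the same odds for 意見."""
    a = counts[("アンケート", "回収")]
    b = counts[("アンケート", "収集")]
    c = counts[("意見", "回収")]
    d = counts[("意見", "収集")]
    return (a * d) / (b * c)


for noun in ("アンケート", "意見"):
    for verb in ("回収", "収集"):
        print(f"{noun}を{verb}: {share(noun, verb):.1%}")
print(f"odds ratio: {odds_ratio():.0f}")
```

<p>An odds ratio far above 1 means that 回収 is strongly preferred with アンケート relative to 意見, which is the same pattern the table shows at a glance.</p>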

<p>While this one example doesn&#8217;t <em>prove</em> anything in and of itself, it does use data to help clarify a nuance between two near synonyms. While my hypothesis was borne out here, native speaker intuitions on word nuances and distinctions can be unreliable.<sup id="fnref:4"><a href="#fn:4" rel="footnote">4</a></sup> This type of quick test can be very helpful for language learners and instructors alike.</p>

<p>Languages very often have words which vary in very subtle ways. Just this Tuesday I went to a <a href="http://linguistic.meetup.com/58/">Tokyo Language Exchange Meetup</a>, a great <a href="http://en.wikipedia.org/wiki/meetup.com">meetup</a> which brought together various language learners and enthusiasts. A hot topic that night was words with very similar meanings—near synonyms. A few English learners were lamenting sets of words like {see, view, watch} and how difficult they are to learn. I myself have had the same experience studying Mandarin.</p>

<p>I noted that these difficulties in offering contrasting definitions are often due to the fact that word meanings are not just &#8220;what the word points to&#8221; but also the implication of &#8220;what it relates to&#8221;.<sup id="fnref:5"><a href="#fn:5" rel="footnote">5</a></sup> For example, &#8220;unborn baby&#8221; and &#8220;fetus&#8221; may point to the same thing, but are used in different contexts, in contrast to different other terms, for differing effect. The same goes for &#8220;Death Tax&#8221; and &#8220;Estate Tax,&#8221; or &#8220;kneel&#8221; and &#8220;genuflect.&#8221;<sup id="fnref:6"><a href="#fn:6" rel="footnote">6</a></sup></p>

<p>The concept of word meanings being &#8220;what it points to&#8221; and &#8220;what it relates to&#8221; also helps explain why certain words are difficult to translate. Fillmore uses the Japanese example of ぬるい (<em>nurui</em>) which is the de facto translation of &#8220;lukewarm.&#8221; However, some Japanese speakers will only use ぬるい in contrast with &#8220;hot,&#8221; i.e., hot tea can become ぬるい over time but ice water does not become ぬるい. In contrast, English &#8220;lukewarm&#8221; can be used to describe things that are initially or prototypically hot or cold. &#8220;What the words point to&#8221; in this case is the same but &#8220;what it relates to&#8221; or, here, &#8220;what it contrasts with&#8221; is different, making it an imperfect (though very close) translation.</p>

<p>Every language has near synonyms which vary slightly in nuance, but this nuance or &#8220;feeling&#8221; can be borne out objectively in data. Looking at what words certain terms relate to <em>in real usage</em> is often the key to getting a richer understanding of vocabulary.</p>

<div class="footnotes">
<hr />
<ol>

<li id="fn:3">
<p>This second point could also be hypothesized based on the component meaning of 回, which in the verb 回る (<em>mawa=ru</em>) can mean &#8220;circle back.&#8221;&#160;<a href="#fnref:3" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:2">
<p>Google is of course a huge corpus but it has very limited search and can easily be misused and misunderstood, thus making Google an unreliable (unprofessional) source for statistical data. One Google alternative for some different statistics is the <a href="http://en.wikipedia.org/wiki/n-gram">n-gram</a> <a href="http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html">data they offer</a> for research.&#160;<a href="#fnref:2" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:1">
<p><a href="http://en.wikipedia.org/wiki/collocation">&#8220;Collocation&#8221; on Wikipedia</a> says: &#8220;Within the area of corpus linguistics, collocation is defined as a sequence of words or terms which co-occur more often than would be expected by chance.&#8221;&#160;<a href="#fnref:1" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:4">
<p>Hm&#8230; I just made a claim&#8230; looking for a citation.&#160;<a href="#fnref:4" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:5">
<p>&#8220;Relates to&#8221; here is not meant in an etymological sense. In <a href="http://en.wikipedia.org/wiki/frame semantics (linguistics)">frame semantics</a>, a part of <a href="http://en.wikipedia.org/wiki/cognitive linguistics">cognitive linguistics</a>, the &#8220;what the word points to&#8221; may be called a <strong>profile</strong> while the &#8220;what it relates to&#8221; is called the <strong>(semantic) frame</strong>. These distinctions are due to the work of <a href="http://en.wikipedia.org/wiki/Charles J. Fillmore">Fillmore</a> 1976.&#160;<a href="#fnref:5" rev="footnote">&#8617;</a></p>
</li>

<li id="fn:6">
<p>The great examples in this section come from Bill Croft and D. Alan Cruse&#8217;s <em>Cognitive Linguistics</em>, 2004.&#160;<a href="#fnref:6" rev="footnote">&#8617;</a></p>
</li>

</ol>
</div>
]]></content:encoded>
			<wfw:commentRss>http://mitcho.com/blog/observation/%e5%8f%8e%e9%9b%86-vs-%e5%9b%9e%e5%8f%8e-and-better-word-meanings-through-usage/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

