Automating the Linguist’s Job

At the end of [my blog post yesterday][1] I hinted at an exciting possible approach to Ubiquity’s localization:

> In the future we ideally could build a web-based system to collect these “utterances.” We could … generate parser parameters based on those sentences. That would essentially reduce the parser-construction process to a more run-of-the-mill string translation process.

If we build this type of “command-bank” of common Ubiquity input translated into various languages, we could build a tool to learn various features of each language and generate each parser, essentially learning the language based on data. Today I’ll elaborate on how I believe this could be possible, by analogy to another language learning device: the human.

Step 1: learning words

How does a human learn language? Without getting into any [[language acquisition details or theory]], we can say that the input for a language learner is always a combination of linguistic input and a referent. In the case of a child, this could be a pairing of linguistic input with real world stimulus:
| input | referent |
|---|---|
| “taiyaki!” | (photo of taiyaki, by makitani via Creative Commons) |
| “cat!” | (photo of a cat, by victoriachan via Creative Commons) |

The human child will hear “cat” while looking at the cat and, with time and repetition, learn that that thing is called a “cat,” and [[taiyaki|some other thing]] is called “taiyaki.”

Similarly, we could take single-verb data points from our command-bank to match new words with a known referent—in this case, the base English string. Here’s an example from Jan’s comment on yesterday’s sample survey.

| input (Dutch) | referent (English) |
|---|---|
| zoek | search |
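To make this concrete, here’s a rough sketch in JavaScript of how single-word entries like this could be folded into a verb lookup table. The `commandBank` structure and the function names are mine, purely for illustration; nothing here is actual Ubiquity code.

```javascript
// Illustrative sketch only, not actual Ubiquity code.
// Each command-bank entry pairs a localized utterance with its English referent.
var commandBank = [
  { input: 'zoek', referent: 'search' }  // Dutch, from Jan's survey data
];

// Single-word inputs are unambiguous: the whole input names the verb.
function learnVerbs(entries) {
  var verbs = {};
  entries.forEach(function (entry) {
    if (entry.input.split(/\s+/).length === 1)
      verbs[entry.input] = entry.referent;
  });
  return verbs;
}

var dutchVerbs = learnVerbs(commandBank);
// dutchVerbs.zoek === 'search'
```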

Step 2: deduction

Now suppose we know some single words like “taiyaki” and “cat,” and consider the two situations below. Given the first sentence, “mitcho’s eating a taiyaki!”, and its referent, the child could intuit the appropriate linguistic representation for the second situation.

| input | referent |
|---|---|
| “mitcho’s eating a taiyaki!” | eattaiyaki.jpg |
| ??? | eatcat.jpg |

The process is simple. First, note that only one variable has changed between the two situations: the taiyaki has been replaced by a cat head. You can then construct the correct utterance by analogy, replacing “taiyaki” with “cat,” yielding “mitcho’s eating a cat!”

Similarly, we could build a tool to analyze the data in a translated command-bank to identify particular features of each language, generating at least basic parsers for each language. Such a task would require a number of [[minimal pairs]] in our data set—here’s one such example from yesterday’s survey (with Dutch data from Jan):

| input (Dutch) | referent (English) | parse |
|---|---|---|
| zoek HELLO met Google | search HELLO with Google | `Parse { verb: 'search', arguments: { object: ['HELLO'], service: 'Google' } }` |
| zoek dit met Google | search this with Google | `Parse { verb: 'search', arguments: { object: ['this'], service: 'Google' } }` |

A simple string analysis would tell us that the text HELLO was replaced by dit in the latter Dutch sentence. Meanwhile, since the English reference sentence is chosen manually, we also know the appropriate parses for each of those sentences. An object difference operation would note that the object property was changed from a value of 'HELLO' to 'this'. We could then map dit to the English this. We’ve now learned one (of perhaps many) Dutch deictic pronouns (aka “magic words”).
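Here’s a rough JavaScript sketch of that comparison. Again, the function and variable names are mine and purely illustrative; this is just one way such a string and object difference could work.

```javascript
// Illustrative sketch only, not actual Ubiquity code.

// Return the single token that differs between two sentences of the same
// length, or null if they differ in zero or more than one position.
function findSubstitution(sentenceA, sentenceB) {
  var a = sentenceA.split(/\s+/), b = sentenceB.split(/\s+/);
  if (a.length !== b.length) return null;
  var diff = null;
  for (var i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) {
      if (diff) return null;            // more than one difference
      diff = { from: a[i], to: b[i] };
    }
  }
  return diff;
}

// The Dutch minimal pair from the table above:
var sub = findSubstitution('zoek HELLO met Google', 'zoek dit met Google');
// sub => { from: 'HELLO', to: 'dit' }

// The English reference parses, which we know because the English
// sentences were chosen by hand:
var parseA = { verb: 'search', arguments: { object: ['HELLO'], service: 'Google' } };
var parseB = { verb: 'search', arguments: { object: ['this'],  service: 'Google' } };

// The object property changed from 'HELLO' to 'this'; since 'HELLO' is also
// the token that was replaced on the Dutch side, map the new Dutch token
// onto the English value.
var magicWords = {};
if (sub && sub.from === parseA.arguments.object[0]) {
  magicWords[sub.to] = parseB.arguments.object[0];
}
// magicWords => { dit: 'this' }
```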

Given an adequately universal but customizable parser design, we can then develop tests for various parameters by constructing appropriate [[minimal pairs]] in the base sentences and having them translated. As noted yesterday, such a system could reduce the laborious task of writing individual parsers to a task of string translation, which our community does exceedingly well. I’m eager to hear what others think of this approach: what concerns would you have, and what potential benefits do you see?

I mean no offense to human children with this simplified example. Surely you can learn more than just string replacements.

I started building some string analysis toys in JavaScript today, such as a Levenshtein difference demo.
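For the curious, the core of such a toy is just the textbook dynamic-programming algorithm; here’s a bare-bones version (not the actual demo):

```javascript
// A standard dynamic-programming Levenshtein distance, for reference.
function levenshtein(a, b) {
  var d = [], i, j;
  for (i = 0; i <= a.length; i++) d[i] = [i];
  for (j = 0; j <= b.length; j++) d[0][j] = j;
  for (i = 1; i <= a.length; i++) {
    for (j = 1; j <= b.length; j++) {
      var cost = (a.charAt(i - 1) === b.charAt(j - 1)) ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,         // deletion
        d[i][j - 1] + 1,         // insertion
        d[i - 1][j - 1] + cost   // substitution
      );
    }
  }
  return d[a.length][b.length];
}

// levenshtein('zoek', 'zoeken') === 2
```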

The linguists in the audience may note that this parser’s modular design is indeed in the spirit of the [[principles and parameters]] framework.

[1]: http://mitcho.com/blog/projects/ubiquity-i18n-questions-to-ask/