Automating the Linguist’s Job
At the end of my blog post yesterday I hinted at an exciting possible approach to Ubiquity’s localization:
In the future we ideally could build a web-based system to collect these “utterances.” We could … generate parser parameters based on those sentences. That would essentially reduce the parser-construction process to a more run-of-the-mill string translation process.
If we build this type of “command-bank” of common Ubiquity input translated into various languages, we could build a tool to learn various features of each language and generate each parser, essentially learning the language based on data. Today I’ll elaborate on how I believe this could be possible, by analogy to another language learning device: the human.
Step 1: learning words
How does a human learn language? Without getting into any details or theory, we can say that the input for a language learner is always a combination of linguistic input and a referent. In the case of a child, this could be a pairing of linguistic input with real world stimulus:
| input | referent |
|---|---|
| “taiyaki!” | (photo of a taiyaki, by makitani via Creative Commons) |
| “cat!” | (photo of a cat, by victoriachan via Creative Commons) |
The human child will hear “cat” while looking at the cat and, with time and repetition, learn that that thing is called a “cat,” and some other thing is called “taiyaki.”
Similarly, we could take single-verb data points from our command-bank to match new words with a known referent: in this case, the base English string. Here’s an example from Jan’s comment on yesterday’s sample survey.
| input (Dutch) | referent (English) |
|---|---|
| zoek | search |
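This first step could be sketched in Python roughly as follows. It is only an illustration under stated assumptions: the command-bank is taken to be a list of (translated input, English referent) pairs, and both that layout and the `learn_single_words` helper are hypothetical, not part of Ubiquity.

```python
def learn_single_words(command_bank):
    """Map each one-word translation to its one-word English referent."""
    lexicon = {}
    for translated, english in command_bank:
        t_tokens = translated.split()
        e_tokens = english.split()
        # Only single-word pairs give an unambiguous word-to-word mapping.
        if len(t_tokens) == 1 and len(e_tokens) == 1:
            lexicon[t_tokens[0]] = e_tokens[0]
    return lexicon


# Hypothetical command-bank entries: (Dutch input, English referent).
command_bank = [
    ("zoek", "search"),
    ("zoek HELLO met Google", "search HELLO with Google"),
]

dutch_lexicon = learn_single_words(command_bank)
print(dutch_lexicon)
```

Longer sentences are skipped here on purpose: without further analysis there is no way to tell which word in them corresponds to which.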
Step 2: deduction
Now suppose we know some single words like “taiyaki” and “cat.” Consider the two situations below. Given the first sentence and its referent, “mitcho’s eating a taiyaki,” the child could intuit the appropriate linguistic representation for the second situation.
| input | referent |
|---|---|
| “mitcho’s eating a taiyaki!” | (image: mitcho eating a taiyaki) |
| ??? | (image: mitcho eating a cat head) |
The process is simple. First note that there is only one variable changed between the two situations: the taiyaki has been replaced by a cat head. You can then construct the correct utterance by analogy, replacing “taiyaki” with “cat,” yielding “mitcho’s eating a cat!”[1]
Similarly, we could build a tool to analyze the data in a translated command-bank to identify particular features of each language, generating at least basic parsers for each language. Such a task would require a number of minimal pairs in our data set—here’s one such example from yesterday’s survey (with Dutch data from Jan):
| input (Dutch) | referent (English) |
|---|---|
| zoek HELLO met Google | search HELLO with Google |
| zoek dit met Google | search this with Google |
A simple string analysis[2] would tell us that the text HELLO was replaced by dit in the second Dutch sentence. Meanwhile, since the English reference sentence is chosen manually, we also know the appropriate parses for each of those sentences. An object difference operation would note that the object property was changed from a value of 'HELLO' to 'this'. We could then map dit to the English this. We’ve now learned one (of perhaps many) Dutch deictic pronouns (aka “magic words”).
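The minimal-pair comparison could look something like the Python sketch below. Several things here are assumptions made for illustration: the parse objects and their `verb` and `instrument` keys are invented (only the `object` property is named above), the helper names are hypothetical, and the string comparison uses the standard library's `difflib.SequenceMatcher` on whole tokens rather than a Levenshtein-style difference.

```python
from difflib import SequenceMatcher


def token_diff(a, b):
    """Return (old_tokens, new_tokens) for each replaced span between two sentences."""
    a_tokens, b_tokens = a.split(), b.split()
    matcher = SequenceMatcher(a=a_tokens, b=b_tokens, autojunk=False)
    return [
        (a_tokens[i1:i2], b_tokens[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag == "replace"
    ]


def object_diff(old, new):
    """Return {key: (old_value, new_value)} for keys whose values differ."""
    return {
        key: (old.get(key), new.get(key))
        for key in old.keys() | new.keys()
        if old.get(key) != new.get(key)
    }


def learn_from_minimal_pair(pair_a, pair_b):
    """Each pair is (translated sentence, parse of its English referent)."""
    (sent_a, parse_a), (sent_b, parse_b) = pair_a, pair_b
    string_changes = token_diff(sent_a, sent_b)
    parse_changes = object_diff(parse_a, parse_b)
    # Only a true minimal pair (exactly one change on each side) is unambiguous.
    if len(string_changes) != 1 or len(parse_changes) != 1:
        return None
    _, new_tokens = string_changes[0]
    role, (_, new_value) = next(iter(parse_changes.items()))
    return {"role": role, "word": " ".join(new_tokens), "means": new_value}


# Hypothetical parses of the English reference sentences.
pair_hello = ("zoek HELLO met Google",
              {"verb": "search", "object": "HELLO", "instrument": "Google"})
pair_this = ("zoek dit met Google",
             {"verb": "search", "object": "this", "instrument": "Google"})

learned = learn_from_minimal_pair(pair_hello, pair_this)
print(learned)
```

Returning `None` when more than one thing changes keeps the tool from guessing: a pair that differs in two places is no longer a minimal pair, and the mapping between the string change and the parse change becomes ambiguous.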
Given an adequately universal but customizable parser design, we can then develop tests for various parameters by constructing appropriate minimal pairs in the base sentences and having them translated.[3] As noted yesterday, such a system could reduce the laborious task of writing individual parsers to a task of string translation, which our community does exceedingly well. I’m eager to hear what others think of this approach. What concerns would you have? What potential benefits do you see?
[1] I mean no offense to human children with this simplified example. Surely you can learn more than just string replacements.

[2] I started building some string analysis toys in JavaScript today, such as a Levenshtein difference demo.

[3] The linguists in the audience may note that this parser’s modular design is indeed in the spirit of the principles and parameters framework.

March 29th, 2009 at 11:03 pm
I can see it would have to be designed carefully to allow for languages that change other words based on the ones that can be substituted. A simplistic example, because I can't think of a better one off the top of my head: take "search for the X with Google". In English you can put pretty much any noun in there, but in Dutch the article depends on the noun: "zoek voor de kat met Google" becomes, when you replace 'cat' with 'office', "zoek voor het kantoor met Google".
In this example, it's easily fixed by making X be 'article noun', but I wonder if there are languages where the gender, number, or whatever of the noun phrase affects the conjugation of the verb, which would make a simple replacement harder.
Just thinking aloud really, the stuff in this post seems like a good idea.
October 14th, 2009 at 1:03 pm
It is very difficult to automate any linguist's job as it is a rather complicated process.
November 16th, 2009 at 6:21 am
In Romanian it is different: you don't merely replace X with 'office', for example, because the gender differs. Compare 'cauta X-ul cu Google' (masculine) with 'cauta cartea cu Google' (cartea, 'book', is feminine). OK, 'office' is also masculine, so I couldn't use it here.