mitcho Michael 芳貴 Erlewine

Linguist. Fifth year PhD student at MIT.


Posts Tagged ‘language’

Ubiquity Localization Update

Friday, June 12th, 2009

As we move closer and closer to shipping Ubiquity 0.5, there is still much work to be done, particularly in the area of localization. In a recent Ubiquity meeting we laid out the explicit localization goals and non-goals of 0.5 as follows:

  • Goals for 0.5
    • Parser 2 (on by default)
    • underlying support for localization of commands
    • localization of standard feed commands for a few languages
    • Parser 2 language files for those same languages
  • Nongoals for 0.5
    • distribution/sharing of localizations
    • localization of nountypes

The overall goal for this release of Ubiquity is to come up with a format and standard for localization. Localizations in Ubiquity 0.5 will only apply to commands bundled with Ubiquity, and the localization files themselves will be distributed with Ubiquity. In a future release we will tackle the problem of localizations for commands in the wild and truly crowd-source1 this process.


  1. Or “cloud-source”… finally a Japanese accent joke that’s semantically stable! 

Ubiquity presentation at Tokyo 2.0

Wednesday, June 10th, 2009


This past Monday I presented at Tokyo 2.0, Japan’s largest bilingual web/tech community. I presented as part of a session on The Web and Language, which I also helped organize. Other presenters included Junji Tomita from goo Labs, Shinjyou Sunao of Knowledge Creation, developers of the Voice Delivery System API, and Chris Salzberg of Global Voices Online on community translation.

I just put together a video of my Ubiquity presentation, mixing the audio recorded live at the presentation together with a screencast of my slides for better visibility. The presentation is 10 minutes long and is bilingual, English and Japanese.

Ubiquity: Command the Web with Language 言葉で操作する Web from mitcho on Vimeo.


Changes to Ubiquity Parser 2 and the Playpen

Friday, June 5th, 2009

Here’s a quick screencast highlighting some of the changes to Parser 2 and the updated Parser 2 Playpen. This video should be particularly useful to people hoping to add their language to Parser 2. It’s also a good reference for Ubiquity core developers.

Changes to Ubiquity Parser 2 + Playpen from mitcho on Vimeo.

All the features covered, as with all Parser 2 features, require that you get the latest Ubiquity code from our Mercurial repository.

The Hit List: Better Software Through Less UI

Wednesday, March 25th, 2009

The Hit List is a to-do list app for Mac OS X with a beautiful interface and some nice features. Creator Andy Kim’s latest blog post (Better Software Through Less UI) is excellent reading for the Ubiquity community. He describes the thought process behind the design of a new clean and “frictionless” interface for specifying how tasks are repeated. After throwing out the regular combinations and templates of different input widgets, his solution was to implement a partial natural language input interface:

There is no myriad of buttons and fields to choose from. All the user has to do is directly type in what he wants.

Here are a couple other choice quotes which will ring true for the Ubiquity users and internationalization folks in the audience:

For this to work without driving the user mad, the natural language parser has to be near perfect. The last thing I want is for this to come out smelling like AppleScript.

This design isn’t perfect as it has two glaring problems. One is that the user has no easy way of discovering how complex the recurrence rules can be. This isn’t such a huge problem, but a way to solve this is to include a help button to show example rules or to include an accompanying iCal style UI to let the user setup the recurrence rule in a more typical fashion. I didn’t include these in the initial implementation though because I wanted to see how users would react to this kind of UI.
Another problem is localization. Even if I write parsers for a few more popular languages, it won’t accommodate the rest of the users in the world. Again, the solution is an accompanying traditional UI, but for now, I’m leaving it the way it is until I get some feedback.

There’s a trend in the wind, my friends: the incorporation of near-natural language for more humane interfaces.

User-Aided Disambiguation: a demo

Saturday, March 14th, 2009

A few weeks ago I made some visual mockups of how Ubiquity could look and act in Japanese. Part of this proposal was what I called “particle identification”: that is, immediate in-line identification of delimiters of arguments, which can be overridden by the user:

The inspiration for this idea came from Aza’s blog post “Solving the ‘it’ problem” which advocates for this type of quick feedback to the user in cases of ambiguity. Such a method would both help the user better understand what the system is interpreting and offer an opportunity for the user to correct improper parses. I just tried mocking up such an input box using jQuery.
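To give a rough idea of the logic, here is a minimal sketch in Python. It is not the demo's jQuery code. The particle list and the `identify_particles` helper are hypothetical and chosen only for illustration, and a real implementation would need to handle particles that also occur inside ordinary words.

```python
# Hypothetical, incomplete list of Japanese delimiters (particles).
# Longer particles come first so that "から" is tried before any
# single-character particle.
PARTICLES = ("から", "まで", "を", "に", "で", "へ")


def identify_particles(text, overrides=frozenset()):
    """Return (index, particle) pairs for candidate argument delimiters.

    `overrides` holds start indices the user has marked as "not a
    delimiter"; matches at those positions are skipped.
    """
    found = []
    i = 0
    while i < len(text):
        # First particle (if any) that starts at position i.
        match = next((p for p in PARTICLES if text.startswith(p, i)), None)
        if match is None:
            i += 1
            continue
        if i not in overrides:
            found.append((i, match))
        # Skip past the whole match, whether or not it was overridden.
        i += len(match)
    return found


sample = "東京から京都まで"
automatic = identify_particles(sample)
corrected = identify_particles(sample, overrides={6})
```

Here `automatic` marks both から and まで as delimiters, while `corrected` shows the user overriding the second one, so only から remains marked.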

Try the User-Aided Disambiguation Demo

If you have any bugfixes to submit or want to play around with your own copy, the demo code is up on BitBucket. ^^ Let me know what you think!

Contribute: how your language identifies its arguments

Wednesday, February 18th, 2009

Earlier today I blogged on three different strategies languages use to mark the roles of different arguments: word order, marking on the arguments, and marking on the verbs.

I gathered some data from the fantastic World Atlas of Language Structures to put together a survey of many of the languages on the Internet. For each of the languages, I got the canonical word order and whether the language marks the roles of its arguments on the verb and/or the arguments themselves.

As you can see, there are a number of data points that are still missing. Please contribute information on the languages you speak! You can edit the spreadsheet on Google Docs. Thanks!

How natural should a natural interface be?

Monday, February 16th, 2009

I’m very happy to announce that, starting today, I will be working full-time on Ubiquity, a Mozilla Labs experiment to connect the web with language. I’ll be heading up research on different linguistic issues of import to a linguistic user interface and blogging about these topics here. If you’re interested, please subscribe to my blog’s RSS feed or the RSS feed for only Ubiquity-related items. Commenting is encouraged! ^^

Every day, more users are trying out Ubiquity, the Mozilla Labs experiment that lets users accomplish common Internet tasks faster through a natural language interface. As we live more and more of our lives on the web, there is a huge appeal to—and need for—a faster way to access and mashup our information.

But what exactly do we mean by a “natural language interface”? Is it just another programming language with lots of English keywords? Should the final goal be a computer that understands everything we tell it?

Ubiquity is not HAL

As we think about the future directions and possibilities of Ubiquity, we need to go back to our roots and understand the project’s motivations. With that in mind, here are some initial thoughts on the advantages of a natural language interface. The ultimate goal here is to refine the notion of natural language interface and to come up with a set of principles that we can follow in pushing Ubiquity further, into other languages and beyond.


回収 vs. 収集 and Better Word Meanings Through Usage

Thursday, September 18th, 2008

Bailey just asked me what the difference is between 回収 (kaishū) and 収集 (shūshū), two words that would both map to the English verb “collect.” I intuitively came up with a hypothesis to explain the distinction:

  • 回収 may take things away from others when collecting while 収集 does not have that implication.
  • Things that you 回収 may have been previously distributed by the actor themself while 収集 does not have that implication.1

Not content with armchair theorizing, however, I decided to take advantage of one of the largest corpora in the world: Google.2 To test my hypothesis, I chose two “objects of collection,” one you can take away (and which is often distributed first) and one you can’t: アンケート (ankēto “survey,” from the French enquête) and 意見 (iken “opinion”). I then searched Google for each of the four resulting collocations3 in quotes and recorded how many hits there were.
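To make the setup concrete, here is a minimal Python sketch of how the four quoted queries could be built. Joining object and verb with the particle を is an assumption for illustration, since the exact search strings are not spelled out above, and `quoted_queries` is a hypothetical helper. It only builds the query strings; it does not contact Google or produce any hit counts.

```python
from itertools import product

VERBS = ["回収", "収集"]        # the two "collect" verbs being compared
OBJECTS = ["アンケート", "意見"]  # "survey" (can be taken away) and "opinion" (cannot)


def quoted_queries(objects, verbs, particle="を"):
    """Build one exact-phrase query per (object, verb) pair.

    The object-marking particle is an assumed way of joining the two
    words into a collocation; pass particle="" to join them directly.
    """
    return [f'"{obj}{particle}{verb}"' for obj, verb in product(objects, verbs)]


queries = quoted_queries(OBJECTS, VERBS)
for q in queries:
    print(q)
```

Wrapping each phrase in double quotes asks the search engine for the exact sequence, which is what makes the hit counts comparable across the four combinations.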


  1. This second point could also be hypothesized based on the component meaning of 回, which in the verb 回る (mawa=ru) can mean “circle back.” 

  2. Google is of course a huge corpus but it has very limited search and can easily be misused and misunderstood, thus making Google an unreliable (unprofessional) source for statistical data. One Google alternative for some different statistics is the n-gram data they offer for research. 

  3. “Collocation” on Wikipedia says: “Within the area of corpus linguistics, collocation is defined as a sequence of words or terms which co-occur more often than would be expected by chance.” 

Testing Google’s Language Detection

Saturday, May 17th, 2008


As Google adds ten more languages to its machine translation service, it seems to be on its way to becoming the most convenient universal translator of the world’s popular languages. Google’s handling of languages isn’t perfect, however. In particular, I’ve been complaining to friends for a while about the weaknesses of Google’s handling of queries in Chinese character (漢字/汉字) scripts. In this post, I run some tests using Google’s Language Detection service to try to better understand its handling of Chinese character queries.


Chinese characters have been used all across East Asia, most notably in Chinese, Japanese, Korean, and Vietnamese (the “CJKV”). Prescriptivist writing reforms in Communist China and Japan have simplified many characters, though. Some characters were simplified in the same way, some in different ways, and some in only one country but not the other. For more information, there’s Wikipedia or Ken Lunde’s CJKV Information Processing.

The problem

The issue comes up when you try to search for a word in Chinese characters which clearly came from one Chinese character-using language. In my experience, Google doesn’t infer from the query which language you use, and returns many results in other Chinese character-using languages as well.[^1]


Sign language SuperBowl ad

Monday, February 4th, 2008

I don’t care much for the game, but always love checking out the SuperBowl ads every year… this year there was something really cool… a sign language ad by a deaf group at PepsiCo.1 Very cool.

The crew has their own website at Pepsi too: Bob’s House.

  1. “ad”, used loosely… does this ad sell anything? 

Patricks Nortons on Tekzillaz

Wednesday, January 9th, 2008

I just noticed something on the latest Tekzilla Daily: Patrick Norton, host of Tekzilla and former host of the Screen Savers, says “there’s a lots to learn here” (1:28) and then later “the site you’re having troubles with” (1:39). While “having troubles with…” is fine, I believe “having trouble with…” is much more common. As for “a lots to learn,” however, that’s definitely out. Is it hyperarticulation? I don’t know.

Wikipedia notes: “Norton grew up in the Midwest, but considers the Jersey Shore his home… He currently lives in San Francisco, California.” So, is this a Jersey Shore or California thing? I have no idea.

Setting Language Research to Music

Monday, December 24th, 2007

Via LinguistList:

‘Setting Language Research to Music’ is a Newcastle University project whose aim is to compose orchestra and choral music to demonstrate infant perception and production. The first piece of music to emerge from the project, ‘Swing Cycle’, mimics babies’ experience of discovering word boundaries, taking work by Peter Jusczyk and colleagues as a starting point.

It’s the craziest thing I’ve seen in a long while… it reminds me of the Music: Materials and Design course I took a couple years ago. My final project was an electronic composition building a rhythm with political speech samples and echoes and cracking noises, representing the hollowness of political rhetoric. It was one of my academic low points at Chicago, for sure.

Maybe it’s because I’m an artist, but I’ve never understood the drive for modern art, including compositions like these. I would much rather listen to some music and read about language acquisition separately… the motivation to combine the two eludes me.

You can listen to The Swing Cycle and read the lyrics (or their approximation) on the Setting Language Research to Music website.

Eats, shoots, and leaves

Monday, December 17th, 2007

I just read Clause and Effect (via DF), a great editorial discussing commas in the Second Amendment and their effects on interpretation of the law. I found this timely as Bailey and I just watched Institutional Memory, the penultimate episode of The West Wing, where Toby Ziegler discusses a comma in the Fifth Amendment’s takings clause: “nor shall private property be taken for public use[,] without just compensation.” BBC’s H2G2 has a pretty good write-up and there’s a listing of relevant links as well.

The funny thing about all of these is that we don’t speak commas. They graphically represent pauses in speech, but are often placed according to certain artificial rules which, when applied systematically, aim to help the reader parse the sentence or disambiguate between different readings.1

I’m surprised Language Log hasn’t picked up this new piece yet. UPDATE: Yup, they got to it. Great coverage, as always.

  1. We use pauses in spoken language to do this too, but not necessarily in the same places that we place commas in “good” written language.