mitcho Michael 芳貴 Erlewine

Linguist. Fifth year PhD student at MIT.


Posts Tagged ‘parser’

Mashing up the browser in Maine

Saturday, December 19th, 2009

Last week I was invited to give a talk at the TechMaine annual conference in Portland, Maine.

Since I had a longer time slot than I have previously used to talk about Ubiquity, I decided to dedicate a good portion of the talk to Jetpack. Having been outside of Mozilla for the past few months, I took this as an opportunity to get reacquainted with the Jetpack APIs. I myself was impressed by how easy it was to develop a quick Jetpack. I ended up preparing two to live-code during the talk: one called Helvetica which, with one click, replaces all fonts on the current page with Helvetica; and You Are Here, which uses an open API from IPinfoDB to display the physical location of the domain you are currently visiting in the status bar. Both are now on the Jetpack Gallery.

Unfortunately there was a bit of a snowstorm leading up to the event, but there was still a nice turnout and I got to meet some fantastic people there. Ken Shoemake of slerp and quaternion fame came up to me after my talk and said “the Ubiquity parser reminded me of the dancing bear… it’s less surprising that it works well as that it works at all.” :) I also enjoyed the other great presentations in the technology track, covering the virtues of REST and basic iPhone development.

Mashup the Browser with Ubiquity and Jetpack

Performance vs Responsiveness —or— How I Made the Parser Twice As Fast in One Day

Thursday, August 13th, 2009

Since we launched Ubiquity 0.5, the issue of Parser 2 performance has been brought up over and over within the community. By virtue of having a more flexible and localizable design, Parser 2 was expected to be slower than our original parser, but its current implementation felt noticeably—perhaps unnecessarily—slow compared to Parser 1. Parser 2 performance has been identified as one of the blockers for pushing Ubiquity 0.5+ to all of our 0.1.x users, and has thus been one of my recent foci.

The short story:

Inspired by some comments by Blair, yesterday I was able to roughly double Parser 2's performance, so that parses now take 40-60% less time, depending on the query. This change has been committed and will be released as part of our forthcoming minor update, Ubiquity 0.5.4. Yay!


Converting your Ubiquity command to Ubiquity 0.5

Tuesday, July 21st, 2009

Converting your Ubiquity command to Ubiquity 0.5 from mitcho on Vimeo.

This video walks through the process of converting your Ubiquity commands to Ubiquity 0.5 with Parser 2. For more information, please consult the command conversion tutorial.

Ubiquity Localization: What’s New, What’s Next

Thursday, July 9th, 2009

Yesterday we released Ubiquity 0.5, a major update to the already popular Ubiquity platform. Among numerous other features, Ubiquity 0.5 includes the first fruit of months of research on building a multilingual parser and natural language interface. In this blog post I’ll give a quick overview of new internationalization-related features in Ubiquity 0.5 as well as a quick roadmap of future considerations.

Of course, one of the best ways to learn about the new features is to experience them… try Ubiquity 0.5 now!



Ubiquity 0.5 Japanese introduction video

Thursday, July 2nd, 2009

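In preparation for Ubiquity 0.5, the latest version of Ubiquity, which will be released tonight, I put together a Ubiquity screencast in Japanese. Ubiquity 0.5 is a release with a particular emphasis on multilingual support, and Ubiquity's built-in commands can now be used in Japanese and Danish. Please try installing it!

P.S.: As of July 3, the release of Ubiquity 0.5 is being delayed, so unfortunately it will not be released today. Please do install it once it is released.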

Ubiquity 0.5 日本語紹介ビデオ from mitcho on Vimeo.

As Ubiquity 0.5 will be released soon (Thursday morning in Mountain View), I decided it was a good time to put together a screencast in Japanese demoing the use of the new Japanese parser and commands.

Ubiquity presentation at Tokyo 2.0

Wednesday, June 10th, 2009


This past Monday I presented at Tokyo 2.0, Japan’s largest bilingual web/tech community. I presented as part of a session on The Web and Language, which I also helped organize. Other presenters included Junji Tomita from goo Labs, Shinjyou Sunao of Knowledge Creation, developers of the Voice Delivery System API, and Chris Salzberg of Global Voices Online on community translation.

I just put together a video of my Ubiquity presentation, mixing the audio recorded live at the presentation together with a screencast of my slides for better visibility. The presentation is 10 minutes long and is bilingual, English and Japanese.

Ubiquity: Command the Web with Language 言葉で操作する Web from mitcho on Vimeo.


Changes to Ubiquity Parser 2 and the Playpen

Friday, June 5th, 2009

Here’s a quick screencast highlighting some of the changes to Parser 2 and the updated Parser 2 Playpen. This video should be particularly useful to people hoping to add their language to Parser 2. It’s also a good reference for Ubiquity core developers.

Changes to Ubiquity Parser 2 + Playpen from mitcho on Vimeo.

All the features covered, as with all Parser 2 features, require that you get the latest Ubiquity code from our Mercurial repository.

Big Issues and Small Issues with Parser 2

Wednesday, May 20th, 2009

Jono and I had a good conversation this morning on IRC about the remaining Big Issues which are blocking the release of Parser 2 as the default parser for Ubiquity. Here are our Top 4 Big Issues:

  1. Some commands’ previews and executes are not working properly (trac #652). This could be an underlying issue with some pipes not being rerouted correctly in Parser 2, or it could be that the commands have not been rewritten correctly to take advantage of Parser 2.
  2. Flesh out how to localize resources, like commands and nountypes. We started a conversation on this subject a few weeks ago but we never reached a resolution. This blocks issues 3 and 4 below.
  3. We need to standardize a format for commands for Parser 2. As noted in last week’s meeting (among other places), Parser 2 will require at least some modification to all commands. Jono and I came up with a simple hybrid format for commands which specifies takes and modifiers for Parser 1 and arguments for Parser 2, but until we figure out how exactly the localization of commands will work, we can’t write a definitive standard.
  4. Enable nountype localization. While the most popular nountypes used are those that ship with Ubiquity, it is important to come up with a localization process which can apply to custom nountypes as well. Nountype localizations need the ability to either (1) replace the _name only, or (2) replace both the _name and the suggest() logic, as both cases will be necessary.

Given that Big Issue 3 and Big Issue 4 are both dependent on Big Issue 2, there clearly needs to be a continued public discussion of how we should make these resources localizable. I look forward to this discussion taking place at tomorrow’s joint (general + i18n) Ubiquity meeting.
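To make the two cases in Big Issue 4 concrete, here is a minimal sketch of what overriding a nountype might look like. Only `_name` and `suggest()` come from the discussion above; the `localizeNountype` helper, the sample nountype, and the simplified one-argument `suggest(text)` signature are assumptions for illustration and are not Ubiquity's actual API.

```javascript
// Hypothetical base nountype. The suggest() signature is simplified for
// illustration and is not the real Ubiquity signature.
const baseLanguageNountype = {
  _name: 'language',
  suggest(text) {
    const known = ['English', 'Chinese'];
    const prefix = text.toLowerCase();
    return known.filter((name) => name.toLowerCase().startsWith(prefix));
  }
};

// Hypothetical helper: copy the base nountype, then apply whichever
// properties the localization chooses to override.
function localizeNountype(base, overrides) {
  return Object.assign({}, base, overrides);
}

// Case (1): replace the _name only, keeping the original suggest() logic.
const nameOnly = localizeNountype(baseLanguageNountype, { _name: 'lingua' });

// Case (2): replace both the _name and the suggest() logic.
const nameAndLogic = localizeNountype(baseLanguageNountype, {
  _name: 'lingua',
  suggest(text) {
    const known = ['inglese', 'cinese'];
    const prefix = text.toLowerCase();
    return known.filter((name) => name.startsWith(prefix));
  }
});
```

Whatever format we settle on, a custom nountype author should be able to ship either kind of override without touching the original nountype.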

In other news, here are some Small Issues:

  1. Add a switch for parser version and language settings: Jono’s already made a space for this in the new “Settings and Skins” page in about:ubiquity. He’s on it. Like a bonnet.
  2. Magic word (anaphor) substitution is not yet working properly. This needs to work both when there is an explicit magic word and when there are simply missing arguments.
  3. The position of suggested verbs is always sentence-initial (trac #655). This also requires that we can specify whether verb name localizations are sentence-initial forms or sentence-final forms.1

Let’s hit the code!

  1. German, Dutch, and Greek, for example, are all languages that have both sentence-initial and sentence-final command verb forms. 

Notes from BarCamp Tokyo 2009

Monday, May 18th, 2009

This past Saturday was Tokyo BarCamp 2009 at Sun’s Yoga offices. I of course gave a presentation on Ubiquity and our recent localization efforts, including Parser 2. As you can see, I signed up quickly:

CC-BY-NC iMorpheus

Here are the slides I used in that session. There are two “demo” sections in the slides… the first was a simple demo of Ubiquity 0.1.x showing off the translate, map, and edit-page commands. The second was a demo of Ubiquity Parser 2, showing off how little code it takes to add your language to Ubiquity with Parser 2.


Ubiquity in Italian!

Monday, May 18th, 2009

Thanks to the great work of Sandro Della Giustina, we now have a preliminary Italian parser for use with Ubiquity Parser 2. Sandro brought up a good point, however, about Italian prepositions which contract with the article and the head noun. For example,

traduci   dall'inglese     al     cinese
translate from=the=English to=the Chinese

One current solution is to add zero-width spaces after these contracted articles, all’ and dall’.1 The appropriate way to add this to the parser is by defining a custom wordBreaker() method.

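// Match a contracted preposition (all' or dall') at the start of a word.
it._patternCache.contractionMatcher = new RegExp('(^| )(all\'|dall\')','g');
it.wordBreaker = function(input) {
  // Insert a zero-width space (U+200B) after each contraction.
  return input.replace(this._patternCache.contractionMatcher,'$1$2\u200b');
};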
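To see the effect outside of Ubiquity, here is a small self-contained sketch of the same idea. The `italianParser` object is a hypothetical stand-in for the real language parser object, and splitting on spaces and zero-width spaces at the end is only my assumption about how a later step might consume the result.

```javascript
// Hypothetical stand-in for the Italian parser object (not the real one).
const italianParser = { _patternCache: {} };

// Same pattern as above: all' or dall' at the start of a word.
italianParser._patternCache.contractionMatcher = /(^| )(all'|dall')/g;

italianParser.wordBreaker = function (input) {
  // Append a zero-width space (U+200B) after each contraction.
  return input.replace(this._patternCache.contractionMatcher, '$1$2\u200b');
};

const broken = italianParser.wordBreaker("traduci dall'inglese al cinese");

// Assumed downstream step: treat both spaces and zero-width spaces as breaks.
const words = broken.split(/[ \u200b]/);
// words is ["traduci", "dall'", "inglese", "al", "cinese"]
```

The visible text is unchanged, since the zero-width space does not render, but dall' now stands as its own word.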

Grazie Sandro!

  1. As John Daggett pointed out to me, in the future we may have to add an intermediate shallow parse instead of adding characters (in this case, the zero-width space) to the modified input. 

Inside the Argument

Wednesday, May 13th, 2009

Here’s a little picture of the different sections of text in a single parsed argument and which properties of the resultant argument object they are assigned to.


You’ll see, from left to right, outerSpace, modifier, innerSpace, inactivePrefix, input/data, inactiveSuffix.

The example text is from the Catalan example, “compra mitjons amb el Google,” meaning “buy socks with Google.” You’ll notice the argument “amb el Google” is literally “with the Google.” The normalizeArgument() method of the Catalan parser, as I described earlier this week, strips the article “el ” and puts it in the inactivePrefix property of the argument.
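As a rough sketch of how those properties line up for the Catalan argument, the snippet below splits the raw text into the named sections. The property names are the ones listed above; the `splitArgument` function and its delimiter and article lists are hypothetical and are not the parser's own normalizeArgument() code.

```javascript
// Hypothetical helper: split one raw argument chunk into the named sections.
function splitArgument(raw, delimiters, articles) {
  // leading space, delimiter word, following space, everything else
  const parts = /^(\s*)(\S+)(\s+)(.*)$/.exec(raw);
  if (!parts || !delimiters.includes(parts[2])) {
    return null; // no recognised delimiter at the start of this chunk
  }
  const [, outerSpace, modifier, innerSpace, rest] = parts;
  // Strip a leading article (if any) into inactivePrefix.
  const article = articles.find((a) => rest.startsWith(a)) || '';
  return {
    outerSpace,
    modifier,
    innerSpace,
    inactivePrefix: article,
    input: rest.slice(article.length),
    inactiveSuffix: ''
  };
}

const arg = splitArgument(' amb el Google', ['amb'], ['el ']);
// arg.modifier is "amb", arg.inactivePrefix is "el ", arg.input is "Google"
```

Concatenating the six properties in order gives back the original text, which is the point of keeping the inactive pieces around rather than throwing them away.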

I’m going to spend the rest of the day updating the Parser 2 design doc and related documentation so they match these and other recent developments in the parser.

Solving Another Romantic Problem: Weak Pronouns

Tuesday, May 12th, 2009

Yesterday I blogged on how to deal with portmanteau’ed prepositions in Ubiquity Parser 2, a common problem in various Romance languages. Today I’ll propose an approach to another romantic problem.

The problem:

Weak pronouns in Romance languages (as well as some other languages) have a special property: they cliticize to the verb, moving from their regular argument position to a position next to the verb. For example, in French, we have an imperative like (1) with gloss as (2):

  1. Envoyez  la  lettre à  Pierre!
  2. send.IMP the letter to Pierre

If we replace la lettre or à Pierre with a pronoun (la, “it”, or lui, “to him”, respectively), those weak pronouns move next to the verb; in particular, (5) exemplifies the change in word order. Replacing both arguments with pronouns creates the stacked clitic form of (7).1

  3. Envoyez-la à  Pierre!
  4. send   -it to Pierre
  5. Envoyez-lui la  lettre!
  6. send   -him the letter
  7. Envoyez-la-lui!
  8. send   -it-him

The fact that these weak pronouns are attached to the verb and lack separate delimiters means that we will need a separate mechanism to parse these arguments: indeed, this functionality has been planned in Ubiquity Parser 2 as “step 3”. Here I’ll examine some data and discuss a strategy for the parsing of weak pronouns.
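As a rough illustration of what such a step has to do, here is a toy sketch that peels hyphen-attached weak pronouns off a French imperative. It is only meant to show the shape of the problem and is not necessarily the strategy Parser 2 uses; the `splitClitics` function, the clitic table, and the role names are all assumptions of mine.

```javascript
// Hypothetical clitic table: weak pronoun -> the semantic role it fills.
const CLITIC_ROLES = new Map([
  ['le', 'object'],
  ['la', 'object'],
  ['lui', 'goal']
]);

// Hypothetical helper: separate the verb from any clitics attached to it.
function splitClitics(input) {
  const words = input.trim().replace(/!$/, '').split(/\s+/);
  const [first, ...restWords] = words;
  const [verb, ...attached] = first.split('-');
  const rest = restWords.join(' ');
  const clitics = [];
  for (const piece of attached) {
    const role = CLITIC_ROLES.get(piece.toLowerCase());
    if (!role) {
      // Unknown suffix: treat the whole first word as the verb.
      return { verb: first, clitics: [], rest };
    }
    clitics.push({ text: piece, role });
  }
  return { verb, clitics, rest };
}

const single = splitClitics('Envoyez-la à Pierre!');
const stacked = splitClitics('Envoyez-le-lui!');
```

Here `single` keeps "à Pierre" as an ordinary delimited argument while the clitic supplies the object, and `stacked` gets both of its arguments from clitics alone.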


  1. Note that the reverse order of “Envoyez-lui-le” is ungrammatical… fortunately we most likely will not have to deal with multiple clitics… see footnote two below. 

Solving a Romantic Problem: Portmanteau’ed Prepositions

Monday, May 11th, 2009

The problem:

In many Romance languages, prepositions and articles often form portmanteau morphs, combining to form a single word.1 Some examples include (French) à + le > au, de + le > du, (Catalan) a + el > al, de + els > dels, per + el > pel. Italian has a particularly productive system of portmanteau’ed prepositions and articles… I refer you to the contraction article on Wikipedia.

As I noted a couple weeks ago, however, some combinations do not form portmanteaus.2


  1. Thanks to Jeremy O’Brien for helping me figure out how to refer to this phenomenon. 

  2. This also relates to the issue of parsing multi-word delimiters, though the argument normalization strategy covered here should reduce the necessity of multi-word delimiters. 

In Case of Case…

Wednesday, May 6th, 2009

A recent hot topic of discussion in the Ubiquity i18n realm has been how to deal with strongly case-marking languages. As we continue to make steady progress, this is one of the remaining open questions which we must decide as a community how to tackle in Parser 2.


Grammatical case is a marking on nouns that expresses grammatical function. Not all languages exhibit case. In many of the Indo-European languages we hope to bring Ubiquity to, case is realized as a suffix.1

Here’s a classic example of case from Latin. (Line 2 is the gloss of 1, line 4 of 3.)

  1. canis      virum      momordit
  2. dog=sg.NOM man=sg.ACC bite=3sg.perfect
  3. vir        canem      momordit
  4. man=sg.NOM dog=sg.ACC bite=3sg.perfect

Example (1) is “the dog bit the man,” while example (3) is “the man bit the dog.” The only difference, as you see in the gloss, is that the nouns canis and vir are marked with different case endings in the two sentences. By marking the nouns with different cases (here, nominative and accusative), their semantic roles in the sentence (which is the biter and which is the bitee) can be identified unambiguously. (Their positions are also switched in these examples, but in reality Latin has a very free word order: the same sentences with other word orders, including OSV or VSO, are also common.)

At first glance, strongly case-marked languages may look like a godsend for identifying the semantic roles of arguments.2 If we can easily and unambiguously recognize arguments’ cases to put them in their appropriate semantic roles, this could simplify processing as well as make Ubiquity input follow a natural syntax for such languages. Unfortunately, there are some significant challenges which must be overcome in order to make the processing of case-markers worthwhile.
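To show why this looks so attractive at first, here is a toy sketch that assigns semantic roles to the Latin examples purely from case, ignoring word order. The lexicon of whole word forms, the role names, and the `assignRoles` function are all hypothetical; a real implementation would need actual morphological analysis rather than a lookup table, which is exactly where the challenges start.

```javascript
// Hypothetical toy lexicon: each inflected form maps to its lemma and case.
const FORMS = new Map([
  ['canis', { lemma: 'dog', case: 'NOM' }],
  ['canem', { lemma: 'dog', case: 'ACC' }],
  ['vir', { lemma: 'man', case: 'NOM' }],
  ['virum', { lemma: 'man', case: 'ACC' }]
]);

// Assumed mapping from case to semantic role.
const CASE_TO_ROLE = { NOM: 'subject', ACC: 'object' };

function assignRoles(sentence) {
  const roles = {};
  for (const word of sentence.toLowerCase().split(/\s+/)) {
    const entry = FORMS.get(word);
    if (entry) {
      roles[CASE_TO_ROLE[entry.case]] = entry.lemma;
    } else {
      roles.verb = word; // anything not in the noun lexicon is the verb here
    }
  }
  return roles;
}

const svo = assignRoles('canis virum momordit');
const osv = assignRoles('virum canis momordit');
// Both give subject "dog" and object "man", regardless of word order.
```

Swapping the case endings, as in “vir canem momordit,” swaps the roles even though nothing else about the sentence changes.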


  1. Note that when linguists talk about “case,” they could be referring to two different (though related) concepts: case (lowercase) is the observed pattern of affixes on nouns which indicate grammatical function, while Case (uppercase) refers to a theoretical (formal) feature of syntactic objects: certain lexical items “assign Case” or “receive Case,” and Case mismatches were ruled out in Government and Binding (GB) syntax by the Case Filter. You’ll find GB linguistics papers referring to “case” when discussing Mandarin Chinese, for example, a language that doesn’t have any overt case (lowercase), and you’ll know immediately that this usage is an uppercase Case case. In this blog post I’ll be dealing primarily with the former descriptive notion. 

  2. When I refer to “strongly case-marking languages,” I am referring to languages with a non-trivial inventory of cases (not just nominative, accusative, and genitive) and where a noun phrase’s case is not reflected on determiners. For example, German is excluded by this definition, as case is realized exclusively on articles and there is no need to find and parse the noun head itself to identify its case; more information on German is in the section “finding the edges.” 

Adding Your Language to Ubiquity Parser 2

Wednesday, April 29th, 2009

NOTE: This blog post has now been added to the Ubiquity wiki and is updated there. Please disregard this article and instead follow these instructions.

You’ve seen the video. You speak another language. And you’re wondering, “how hard is it to add my language to Ubiquity with Parser 2?” The answer: not that hard. With a little bit of JavaScript and knowledge of and interest in your own language, you’ll be able to get at least rudimentary Ubiquity functionality in your language. Follow along in this step by step guide and please submit your (even incomplete) language files!

As Ubiquity Parser 2 evolves, there is a chance that this specification will change in the future. Keep abreast of such changes on the Ubiquity Planet and/or this blog (RSS).