
Adding Your Language to Ubiquity Parser 2

NOTE: This blog post has now been added to the Ubiquity wiki and is updated there. Please disregard this article and instead follow these instructions.

You’ve seen the video. You speak another language. And you’re wondering, “how hard is it to add my language to Ubiquity with Parser 2?” The answer: not that hard. With a little bit of JavaScript and knowledge of and interest in your own language, you’ll be able to get at least rudimentary Ubiquity functionality in your language. Follow along in this step by step guide and please submit your (even incomplete) language files!

As Ubiquity Parser 2 evolves, there is a chance that this specification will change in the future. Keep abreast of such changes on the Ubiquity Planet and/or this blog (RSS).

Set up your environment

If you’re new to Ubiquity core development, you’ll want to first read the Ubiquity 0.1 Development Tutorial to learn how to get a live copy of the Ubiquity repository using Mercurial. Once you’ve set up your Firefox profile to use this development version, make sure to try changing the extensions.ubiquity.parserVersion value to 2 in about:config (as seen in this demo video) to verify that Parser 2 is working for you.

As you read, you may find it helpful to consult the language files currently included in Parser 2: English, Japanese, Portuguese, and Swedish (and the incomplete Chinese and French).

The structure of the language file

Each language in Parser 2 gets its own file, which acts as a JavaScript module. You’ll need to look up the ISO 639-1 code for your language… Here we’ll use English (code en) as an example: the JavaScript language file would then be called en.js and go in the /ubiquity/modules/parser/new/ directory of the repository.

Here is the basic template for a Ubiquity Parser 2 language file:

var EXPORTED_SYMBOLS = ["makeEnParser"];
 
if ((typeof window) == 'undefined') // kick it chrome style
  Components.utils.import("resource://ubiquity/modules/parser/new/parser.js");
 
function makeEnParser() {
  var en = new Parser('en');
 
...
 
  return en;
}

After the first four lines, which set up the JavaScript module, everything else is wrapped in a factory function named after the language code: makeEnParser for English (en), makeFrParser for French (fr), makeLaParser for Latin (la), and so on. This function initializes the new Parser object with the appropriate language code (the var en = new Parser('en') line), sets a handful of parameters (elided above), and returns the object. That’s it!

Now let’s walk through some of the parameters you must set to get your language working. For reference, the properties the language parser object is required to have are: branching, anaphora, and roles.
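While you draft your language file, a quick sanity check on those three properties can save a restart or two. The following is only a sketch: missingParserProperties is a hypothetical helper name, not something Ubiquity provides, and a plain object stands in for the real Parser instance.

```javascript
// Hypothetical helper, not part of Ubiquity: lists which of the three
// required properties are still unset on a parser-like object.
var REQUIRED_PROPERTIES = ['branching', 'anaphora', 'roles'];

function missingParserProperties(parser) {
  return REQUIRED_PROPERTIES.filter(function (name) {
    return parser[name] === undefined || parser[name] === null;
  });
}

// A plain object stands in for the result of new Parser('en').
var draft = { branching: 'right', roles: [] };

// Reports that only the anaphora property has not been set yet.
console.log(missingParserProperties(draft));
```

An empty result means all three required properties are present; it says nothing about whether their values are correct for your language.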

Identifying your branching parameter

  en.branching = 'right'; // or 'left'

One of the first things you’ll have to set for your parser is the branching parameter. Ubiquity Parser 2 uses it to decide in which direction to look for an argument after finding a delimiter or “role marker” (most often, these are prepositions or postpositions). For example, in English “from” is a delimiter for the source role and its argument is on its right.

   
to Mary from John

So “John” is a possible argument for the source role, but “Mary” should not be. Ubiquity can figure this out because English has the property en.branching = 'right'.

In Japanese, on the other hand, the argument of a delimiter like から (“from”) is found on the left of that delimiter, so the Japanese parser uses .branching = 'left'.

   
メアリー-から ジョン-に
Mary-from John-to

In general, if your language has prepositions, you should use .branching = 'right' and if your language has postpositions, you can use .branching = 'left'.
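To make the direction idea concrete, here is a toy illustration. It is not Parser 2’s actual algorithm (which is more involved), and candidateArgument is a hypothetical name; the sketch only shows which side of a delimiter gets searched for each branching value.

```javascript
// Illustrative only: starting from a delimiter, walk in the branching
// direction and collect words until the next delimiter or the edge of input.
function candidateArgument(words, delimiterIndex, branching, delimiters) {
  var step = branching === 'right' ? 1 : -1;
  var collected = [];
  for (var i = delimiterIndex + step; i >= 0 && i < words.length; i += step) {
    if (delimiters.indexOf(words[i]) !== -1) break; // hit another delimiter
    if (step === 1) collected.push(words[i]);
    else collected.unshift(words[i]); // keep original word order when going left
  }
  return collected.join(' ');
}

// English, right-branching: the argument of "from" is to its right.
var enWords = ['to', 'Mary', 'from', 'John'];
console.log(candidateArgument(enWords, 2, 'right', ['to', 'from'])); // John

// Japanese, left-branching: the argument of から is to its left.
var jaWords = ['メアリー', 'から', 'ジョン', 'に'];
console.log(candidateArgument(jaWords, 1, 'left', ['から', 'に'])); // メアリー
```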


Defining your roles

  en.roles = [
    {role: 'goal', delimiter: 'to'},
    {role: 'source', delimiter: 'from'},
    {role: 'position', delimiter: 'at'},
    {role: 'position', delimiter: 'on'},
    {role: 'alias', delimiter: 'as'},
    {role: 'instrument', delimiter: 'using'},
    {role: 'instrument', delimiter: 'with'}
  ];

The second required property is roles: the list of delimiters your language uses, each paired with a role from the inventory of semantic roles. Note that this mapping can be many-to-many, i.e., each role can have multiple possible delimiters and different roles can share a delimiter. Try to cover all of the roles in the inventory of semantic roles.
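If you want to double-check the mapping you’ve written, a couple of small lookups can help. The helpers below are hypothetical (they are not Ubiquity functions) and the roles list is simply the English example copied into a standalone variable so the sketch runs on its own.

```javascript
// The English roles list from this post, copied so the sketch is standalone.
var roles = [
  {role: 'goal', delimiter: 'to'},
  {role: 'source', delimiter: 'from'},
  {role: 'position', delimiter: 'at'},
  {role: 'position', delimiter: 'on'},
  {role: 'alias', delimiter: 'as'},
  {role: 'instrument', delimiter: 'using'},
  {role: 'instrument', delimiter: 'with'}
];

// Hypothetical helper: every delimiter that can mark a given role.
function delimitersForRole(roles, role) {
  return roles
    .filter(function (entry) { return entry.role === role; })
    .map(function (entry) { return entry.delimiter; });
}

// Hypothetical helper: every role a given delimiter can mark.
function rolesForDelimiter(roles, delimiter) {
  return roles
    .filter(function (entry) { return entry.delimiter === delimiter; })
    .map(function (entry) { return entry.role; });
}

console.log(delimitersForRole(roles, 'position')); // the two delimiters: at, on
console.log(rolesForDelimiter(roles, 'with'));     // the single role: instrument
```

A role for which delimitersForRole returns an empty list is one you have not covered yet.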


Entering your anaphora (“magic words”)

  en.anaphora = ["this", "that", "it", "selection", "him", "her", "them"];

The final required property is the anaphora property which takes a list of “magic words”. Currently there is no distinction between all the different deictic anaphora which might refer to different things.
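As a rough picture of what a “magic word” is for, here is a sketch that assumes magic words stand in for the user’s current selection. That assumption, the resolveAnaphor name, and the matching rule (whole argument, case-insensitive) are all mine for illustration; this post does not describe how Parser 2 performs the substitution internally.

```javascript
// The English magic words from this post.
var anaphora = ["this", "that", "it", "selection", "him", "her", "them"];

// Hypothetical helper: if the whole argument is a magic word and there is a
// selection, return the selection; otherwise return the argument unchanged.
function resolveAnaphor(argument, anaphora, selection) {
  var isMagicWord = anaphora.indexOf(argument.trim().toLowerCase()) !== -1;
  return isMagicWord && selection ? selection : argument;
}

console.log(resolveAnaphor('this', anaphora, 'bonjour le monde')); // bonjour le monde
console.log(resolveAnaphor('Mary', anaphora, 'bonjour le monde')); // Mary
```

Note that every magic word is treated identically here, which matches the point above that the different anaphora are not yet distinguished.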

Special cases

Some special language features can be handled by overriding the default behavior from Parser. Many of these features are still in the works, however, so we’d love to get your comments!

Languages with no spaces

If your language does not delimit arguments (or words, more generally) with spaces, you will need to write a custom wordBreaker() function and set usespaces = false and joindelimiter = ''. For an example, please take a look at the Japanese or Chinese language files.
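Here is a minimal sketch of the idea, not the code from the bundled Japanese file. The property names usespaces and joindelimiter come from this post, but the wordBreaker signature (string in, array of words out) is an assumption, the particle list is limited to the two from the example above, and a plain object stands in for the Parser instance.

```javascript
// Stand-in for a Parser instance; the real object comes from new Parser(...).
var ja = { usespaces: false, joindelimiter: '' };

// Assumed signature: takes the raw input and returns an array of words.
// Splits around known particles so each particle becomes its own word.
ja.wordBreaker = function (input) {
  var particles = ['から', 'に'];
  // The capturing group makes split() keep the particles in the result.
  var pattern = new RegExp('(' + particles.join('|') + ')');
  return input.split(pattern).filter(function (piece) {
    return piece !== '';
  });
};

var words = ja.wordBreaker('メアリーからジョンに');
console.log(words);                        // four pieces: メアリー, から, ジョン, に
console.log(words.join(ja.joindelimiter)); // joins back into the original input
```

A real word breaker has to cope with particles that also occur inside ordinary words, which this sketch ignores.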

Case marking languages

If you have a strongly case-marked language, you’ll have to write some rules to identify those different cases in wordBreaker() and then add some extra roles for these case markers, but for a number of languages the current design does not allow an elegant solution for parsing such arguments. Updates to this issue will be posted to this trac ticket.

In the meantime, however, if you could write a parser even with only the prepositions/postpositions in your language, that would be a great help in getting your language started. UPDATE: a proposal on how to deal with strongly case-marked languages has been written here: In Case of Case….

Stripping articles

In some languages, certain delimiters combine with articles. For example, in French, the preposition “à” combines with the masculine definite article “le” but not with the feminine “la”:

  1. à + la = à la
  2. à + le = au

You can add both “à” and “au” as delimiters of the goal role, but then you will get feminine arguments back with the determiner (e.g. “la table”) while masculine arguments would be parsed without a determiner (e.g. “chat”).

  1. “à la table” = “to the table”
  2. “au chat” = “to the cat”

One possible solution to this is to write a custom cleanArgument() method. After arguments have been parsed and placed in their appropriate roles, each argument text (say, “la table” or “chat”) is passed to cleanArgument(). You can simply write a cleanArgument() that strips any leading “la ” from its input and returns the result, so that both example inputs yield normalized arguments: “table” and “chat”, respectively. UPDATE: For more up-to-date information on how to deal with these types of articles, please see Solving a Romance Problem.
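A minimal sketch of such a method might look like the following. The signature (one argument’s text in, cleaned text out) is assumed from the description above, and a plain object stands in for the French Parser instance.

```javascript
// Stand-in for a Parser instance; the real object comes from new Parser(...).
var fr = {};

// Assumed signature: receives one argument's text and returns the cleaned text.
fr.cleanArgument = function (word) {
  // Strip a leading "la" followed by whitespace; leave everything else alone.
  return word.replace(/^la\s+/i, '');
};

console.log(fr.cleanArgument('la table')); // table
console.log(fr.cleanArgument('chat'));     // chat
console.log(fr.cleanArgument('lapin'));    // lapin (no space after "la", so untouched)
```

Requiring whitespace after “la” keeps words that merely begin with those letters intact.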

Test your parser

Now you can go into about:config, change extensions.ubiquity.language to your language code, and restart. All the verbs and nountypes at this point will remain the same as in the English version, but the parser should obey the argument structure (the word order and delimiters) of your language (see footnote 1). If you run into any trouble, feel free to ask for help on the Ubiquity i18n listhost or find me on the Ubiquity IRC channel (mitcho @ irc.mozilla.org#ubiquity). Of course, once you’re at a good stopping point, please contribute your language file to Ubiquity!

More to come…

At this point, you’ve only localized the argument structure of your language… additional work will be required to localize the nountypes and verb names, which is the subject of ongoing discussion. Join the Google Group to get in on it!


  1. At this point in time it’s also possible to test your parser at chrome://parser-demo/content/index.html if you make a couple other changes to your code… for more information, watch the Foxkeh demos Ubiquity Parser TNG video. This option gives you more debug info as well. 




© 2006-2008 mitcho (Michael 芳貴 Erlewine).
The views expressed on these pages are mine alone and do not
reflect those of my employers and clients, past and present.