Adding Your Language to Ubiquity Parser 2

NOTE: This blog post has now been added to the Ubiquity wiki and is updated there. Please disregard this article and instead follow [these instructions][1].

You’ve [seen the video][2]. You speak another language. And you’re wondering, “how hard is it to add my language to Ubiquity with Parser 2?” The answer: not that hard. With a little bit of JavaScript and some knowledge of (and interest in) your own language, you’ll be able to get at least rudimentary Ubiquity functionality in your language. Follow along in this step-by-step guide and please [submit your (even incomplete) language files][3]!

As Ubiquity Parser 2 evolves, there is a chance that this specification will change in the future. Keep abreast of such changes on the [Ubiquity Planet][4] and/or [this blog][5] ([RSS][6]).

Set up your environment

If you’re new to Ubiquity core development, you’ll want to first read the [Ubiquity 0.1 Development Tutorial][7] to learn how to get a live copy of the Ubiquity repository using [[Mercurial]]. Once you’ve set up your Firefox profile to use this development version, make sure to try changing the extensions.ubiquity.parserVersion value to 2 in about:config (as seen in [this demo video][8]) to verify that Parser 2 is working for you.

As you read, you may find it helpful to follow along in the language files currently included in Parser 2: [English][9], [Japanese][10], [Portuguese][11], and [Swedish][12] (and the incomplete [Chinese][13] and [French][14]).

The structure of the language file

Each language in Parser 2 gets its own file which acts as a [JavaScript module][15]. You’ll need to look up the [[List of ISO 639-1 codes|ISO 639-1 code for your language]]. Here we’ll use English (code en) as an example; the JavaScript language file would then be called en.js and go in the /ubiquity/modules/parser/new/ directory of the repository.

Here is the basic template for a Ubiquity Parser 2 language file:

  var EXPORTED_SYMBOLS = ["makeEnParser"];

  if ((typeof window) == 'undefined') // kick it chrome style
    Components.utils.import("resource://ubiquity/modules/parser/new/parser.js");

  function makeEnParser() {
    var en = new Parser('en');

    // ...

    return en;
  };

After lines 1-4 which set up the JavaScript module, everything else is wrapped in a factory function called makeLaParser (for Latin) or makeEnParser (for English, en) or makeFrParser (for French, fr), etc. This function initializes the new Parser object (line 7) with the appropriate language code, sets a bunch of parameters (elided above) and returns it. That’s it!

Now let’s walk through some of the parameters you must set to get your language working. For reference, the properties the language parser object is required to have are: branching, anaphora, and roles.
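
To make the shape of the finished object concrete, here is a minimal sketch of a factory function that sets all three required properties. The Parser function in it is only a stand-in stub so the snippet runs outside Firefox, and missingProperties is a hypothetical helper that is not part of Ubiquity. The property values are small English samples.

```javascript
// Stand-in stub for the real Parser from parser.js (assumption: the real
// constructor does far more than remember the language code).
function Parser(lang) {
  this.lang = lang;
}

function makeEnParser() {
  var en = new Parser('en');
  en.branching = 'right';
  en.anaphora = ['this', 'that', 'it'];
  en.roles = [
    {role: 'goal', delimiter: 'to'},
    {role: 'source', delimiter: 'from'}
  ];
  return en;
}

// Hypothetical helper: list the required properties a parser object lacks.
function missingProperties(parser) {
  var required = ['branching', 'anaphora', 'roles'];
  return required.filter(function (name) {
    return parser[name] === undefined;
  });
}

console.log(missingProperties(makeEnParser()));   // nothing missing
console.log(missingProperties(new Parser('xx'))); // all three missing
```

A check like this can be handy while a language file is still half written, since it tells you which of the required properties you haven’t gotten to yet.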

Identifying your branching parameter

  en.branching = 'right'; // or 'left'

One of the first things you’ll have to set for your parser is the branching parameter. Ubiquity Parser 2 uses the branching parameter to decide which direction to look for an argument after finding a delimiter or “role marker” (most often, these are [[adposition|prepositions or postpositions]]). For example, in English “from” is a delimiter for the source role and its argument is on its right.

to Mary from John

So “John” is a possible argument for the source role, but “Mary” should not be. Ubiquity can figure this out because English has the property en.branching = 'right'.

In Japanese, on the other hand, the argument of a delimiter like から (“from”) is found on the left of that delimiter, so ja.branching = 'left'.

メアリー -から ジョン -に
Mary from John to

In general, if your language has prepositions, you should use .branching = 'right' and if your language has postpositions, you can use .branching = 'left'.
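
The idea can be sketched in a few lines. This is a simplified illustration, not the actual Parser 2 code: argumentCandidates is a hypothetical helper, it assumes the input is already split into words, and it ignores any other delimiters that may sit on the same side.

```javascript
// Hypothetical helper, not part of Parser 2: given a list of words and the
// index of a delimiter, return the words on the side where the branching
// parameter says the argument should be.
function argumentCandidates(words, delimiterIndex, branching) {
  if (branching === 'right')
    return words.slice(delimiterIndex + 1); // argument follows the delimiter
  return words.slice(0, delimiterIndex);    // argument precedes the delimiter
}

// English: the argument of "from" is on its right.
var english = ['to', 'Mary', 'from', 'John'];
console.log(argumentCandidates(english, 2, 'right'));

// Japanese: the argument of から is on its left.
var japanese = ['メアリー', 'から', 'ジョン', 'に'];
console.log(argumentCandidates(japanese, 1, 'left'));
```

With 'right', only “John” is a candidate for “from”; with 'left', only “メアリー” is a candidate for から.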

For more info:

  • see [[Branching (linguistics)|branching]] on Wikipedia.

Defining your roles

  en.roles = [
    {role: 'goal', delimiter: 'to'},
    {role: 'source', delimiter: 'from'},
    {role: 'position', delimiter: 'at'},
    {role: 'position', delimiter: 'on'},
    {role: 'alias', delimiter: 'as'},
    {role: 'instrument', delimiter: 'using'},
    {role: 'instrument', delimiter: 'with'}
  ];

The second required property is the inventory of semantic roles and their corresponding delimiters. Each entry has a role from the inventory of semantic roles and a corresponding delimiter. Note that this mapping can be [[many-to-many (data model)|many-to-many]], i.e., each role can have multiple possible delimiters and different roles can have shared delimiters. Try to make sure to cover all of the roles in the inventory of semantic roles.
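
To see what many-to-many means in practice, here is a small sketch that looks the list up in both directions. The two lookup functions are hypothetical helpers, not part of Parser 2, and the last entry in the list is invented purely to show a shared delimiter.

```javascript
var roles = [
  {role: 'goal', delimiter: 'to'},
  {role: 'source', delimiter: 'from'},
  {role: 'position', delimiter: 'at'},
  {role: 'position', delimiter: 'on'},
  {role: 'instrument', delimiter: 'with'},
  // Invented entry for illustration only: a delimiter shared by two roles.
  {role: 'goal', delimiter: 'on'}
];

// Hypothetical helper: every role a delimiter could mark.
function rolesForDelimiter(roles, delimiter) {
  return roles
    .filter(function (entry) { return entry.delimiter === delimiter; })
    .map(function (entry) { return entry.role; });
}

// Hypothetical helper: every delimiter that can mark a role.
function delimitersForRole(roles, role) {
  return roles
    .filter(function (entry) { return entry.role === role; })
    .map(function (entry) { return entry.delimiter; });
}

console.log(rolesForDelimiter(roles, 'on'));       // position and goal
console.log(delimitersForRole(roles, 'position')); // at and on
```

When a delimiter is shared like this, the parser has more than one possible reading of the same input, so only list a delimiter under several roles when your language really uses it that way.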

Entering your anaphora (“magic words”)

  en.anaphora = ["this", "that", "it", "selection", "him", "her", "them"];

The final required property is the anaphora property which takes a list of “magic words”. Currently there is no distinction between all the different [[deixis|deictic]] [[anaphora (linguistics)|anaphora]] which might refer to different things.
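
As a rough illustration of what the list is for, the sketch below assumes that a magic word stands in for the user’s current selection. resolveAnaphor is a hypothetical helper, not a Parser 2 function, and the real parser may handle substitution quite differently.

```javascript
var anaphora = ['this', 'that', 'it', 'selection', 'him', 'her', 'them'];

// Hypothetical helper: if the argument is one of the magic words, swap in
// the selected text; otherwise leave the argument alone.
function resolveAnaphor(argument, selection, anaphora) {
  if (anaphora.indexOf(argument) !== -1)
    return selection;
  return argument;
}

console.log(resolveAnaphor('this', 'bonjour', anaphora));  // the selection
console.log(resolveAnaphor('hello', 'bonjour', anaphora)); // unchanged
```

Because every magic word is treated the same way, “him” and “it” both resolve to the same selection in this sketch, which mirrors the lack of distinction described above.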

Special cases

Some special language features can be handled by overriding the default behavior from Parser. Many of these features are still in the works, however, so we’d love to get your comments!

Languages with no spaces

If your language does not delimit arguments (or words, more generally) with spaces, you will need to write a custom wordBreaker() function and set usespaces = false and joindelimiter = ''. For an example, take a look at the Japanese or Chinese language files.
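
To give a feel for what word breaking involves, here is a naive sketch that splits unspaced input around a list of known delimiters. breakWords is a hypothetical stand-alone function: the signature and return value the real wordBreaker() is expected to have may well differ, so check the Japanese file before modeling yours on this.

```javascript
// Hypothetical stand-in for a custom wordBreaker(): split the input wherever
// a known delimiter occurs, keeping the delimiters as separate pieces.
function breakWords(input, delimiters) {
  var escaped = delimiters
    .slice()
    // Try longer delimiters first so a short one cannot pre-empt them.
    .sort(function (a, b) { return b.length - a.length; })
    // Escape regular expression metacharacters in each delimiter.
    .map(function (d) { return d.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'); });
  // The capturing group makes split() keep the delimiters in the result.
  var pattern = new RegExp('(' + escaped.join('|') + ')');
  return input.split(pattern).filter(function (piece) {
    return piece !== '';
  });
}

console.log(breakWords('メアリーからジョンに', ['から', 'に']));
```

This breaks the input into メアリー, から, ジョン and に. It is deliberately simplistic: a delimiter string that also occurs inside an ordinary word would be split off as well, which a real implementation has to guard against.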

Case marking languages

If you have a strongly [[grammatical case|case-marked]] language, you’ll have to write some rules to identify those different cases in wordBreaker() and then add some extra roles for these case markers. For a number of languages, however, the current design does not allow an elegant solution for parsing such arguments. Updates to this issue will be posted to this trac ticket.

In the meantime, however, if you could write a parser even with only the prepositions/postpositions in your language, that would be a great benefit in getting started in your language. UPDATE: a proposal on how to deal with strongly case-marked languages has been written here: In Case of Case….

Stripping articles

Some languages have some delimiters which combine with articles. For example, in French, the preposition “à” combines with the masculine definite article “le” but not “la”:

  1. à + la = à la
  2. à + le = au

You can add both “à” and “au” as delimiters of the goal role, but then you will get feminine arguments back with the determiner (e.g. “la table”) while masculine arguments would be parsed without a determiner (e.g. “chat”).

  1. “à la table” = “to the table”
  2. “au chat” = “to the cat”

One possible solution to this is to write a custom cleanArgument() method. After arguments have been parsed and placed in their appropriate roles, each argument text (say, “la table” or “chat”) is passed to cleanArgument(). You can simply write a cleanArgument() that strips off any “la ” at the beginning of the input and returns the result, so that both example inputs yield normalized arguments: “table” and “chat”, respectively. UPDATE: For more up-to-date information on how to deal with these types of articles, please see Solving a Romance Problem.
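
A minimal sketch of such a function might look like the following. It assumes, as described above, that cleanArgument() receives one argument string and returns the cleaned string; in a real language file it would presumably be assigned on the parser object (for example as fr.cleanArgument), which is an assumption worth checking against the existing language files.

```javascript
// Sketch only: strip a leading feminine article "la" (plus the whitespace
// after it) from an argument, leaving everything else untouched.
function cleanArgument(word) {
  return word.replace(/^la\s+/i, '');
}

console.log(cleanArgument('la table')); // article stripped
console.log(cleanArgument('chat'));     // nothing to strip
console.log(cleanArgument('lampe'));    // starts with "la" but is not an article
```

Requiring whitespace after “la” is what keeps words like “lampe” intact. The same pattern could be extended to other articles, but which ones to strip is a decision for each language.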

Test your parser

Now you can go into about:config and change extensions.ubiquity.language to your language code and restart. All the verbs and nountypes at this point will remain the same as in the English version, but the parser should obey the argument structure (the word order and delimiters) of your language. If you run into any trouble, feel free to ask for help on the Ubiquity i18n listhost or find me on the Ubiquity IRC channel (mitcho). Of course, once you’re at a good stopping point, please contribute your language file to Ubiquity!

More to come…

At this point, you’ve only localized the [[argument structure]] of your language. Additional work will be required to localize the nountypes and verb names, which is the subject of ongoing discussion. Join the Google Group to get in on the discussion!

It’s also currently possible to test your parser at chrome://parser-demo/content/index.html if you make a couple of other changes to your code. For more information, watch the “Foxkeh demos Ubiquity Parser TNG” video. This option gives you more debug info as well.

[1]: [2]: [3]: [4]: [5]: [6]: [7]: [8]: ( [9]: [10]: [11]: [12]: [13]: [14]: [15]: