Adding Your Language to Ubiquity Parser 2
NOTE: This blog post has now been added to the Ubiquity wiki and is updated there. Please disregard this article and instead follow [these instructions][1].
You’ve [seen the video][2]. You speak another language. And you’re wondering, “how hard is it to add my language to Ubiquity with Parser 2?” The answer: not that hard. With a little bit of JavaScript and some knowledge of, and interest in, your own language, you’ll be able to get at least rudimentary Ubiquity functionality in your language. Follow along in this step-by-step guide and please [submit your (even incomplete) language files][3]!
As Ubiquity Parser 2 evolves, there is a chance that this specification will change in the future. Keep abreast of such changes on the [Ubiquity Planet][4] and/or [this blog][5] ([RSS][6]).
Set up your environment
If you’re new to Ubiquity core development, you’ll want to first read the [Ubiquity 0.1 Development Tutorial][7] to learn how to get a live copy of the Ubiquity repository using [[Mercurial]]. Once you’ve set up your Firefox profile to use this development version, try changing the extensions.ubiquity.parserVersion value to 2 in about:config (as seen in [this demo video][8]) to verify that Parser 2 is working for you.
As you read, you may find it helpful to follow along in the language files currently included in Parser 2: [English][9], [Japanese][10], [Portuguese][11], and [Swedish][12] (and the incomplete [Chinese][13] and [French][14]).
The structure of the language file
Each language in Parser 2 gets its own file, which acts as a [JavaScript module][15]. You’ll need to look up the [[List of ISO 639-1 codes | ISO 639-1 code for your language]]. Here we’ll use English (code en) as an example; the JavaScript language file would then be called en.js and go in the /ubiquity/modules/parser/new/ directory of the repository.
Here is the basic template for a Ubiquity Parser 2 language file:
var EXPORTED_SYMBOLS = ["makeEnParser"];

if ((typeof window) == 'undefined') // kick it chrome style
  Components.utils.import("resource://ubiquity/modules/parser/new/parser.js");

function makeEnParser() {
  var en = new Parser('en');
  ...
  return en;
}
After the opening lines, which set up the JavaScript module, everything else is wrapped in a factory function called makeLaParser (for Latin) or makeEnParser (for English, en) or makeFrParser (for French, fr), etc. This function initializes the new Parser object with the appropriate language code, sets a bunch of parameters (elided above) and returns it. That’s it!
Now let’s walk through some of the parameters you must set to get your language working. For reference, the properties the language parser object is required to have are: branching, anaphora, and roles.
Identifying your branching parameter
en.branching = 'right'; // or 'left'
One of the first things you’ll have to set for your parser is the branching parameter. Ubiquity Parser 2 uses the branching parameter to decide which direction to look for an argument after finding a delimiter or “role marker” (most often, these are [[adposition|prepositions or postpositions]]). For example, in English “from” is a delimiter for the source role and its argument is on its right.
to | Mary | from | John |
So “John” is a possible argument for the source role, but “Mary” should not be. Ubiquity can figure this out because English has the property en.branching = 'right'.
In Japanese, on the other hand, the argument of a delimiter like から (“from”) is found on the left of that delimiter, so ja.branching = 'left'.
メアリー | -から | ジョン | -に |
Mary | from | John | to |
In general, if your language has prepositions, you should use .branching = 'right' and if your language has postpositions, you can use .branching = 'left'.
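To make the effect of the branching parameter concrete, here is a minimal sketch. It is not Ubiquity's actual parsing code: the function name findArguments is hypothetical, and it only looks at the single word next to each delimiter, on the side given by branching.

```javascript
// Illustrative only: NOT Ubiquity's actual parser code.
// findArguments is a hypothetical helper. For each delimiter it takes the
// single neighbouring word on the side indicated by `branching`.
function findArguments(words, delimiters, branching) {
  var step = (branching == 'right') ? 1 : -1;
  var found = [];
  for (var i = 0; i < words.length; i++) {
    if (delimiters.indexOf(words[i]) == -1) continue;
    var neighbour = words[i + step];
    // Skip delimiters at the edge of the input or next to another delimiter.
    if (neighbour === undefined || delimiters.indexOf(neighbour) != -1) continue;
    found.push({delimiter: words[i], argument: neighbour});
  }
  return found;
}

// English: the argument sits to the right of its delimiter.
var english = findArguments(['to', 'Mary', 'from', 'John'], ['to', 'from'], 'right');

// Japanese: the argument sits to the left of its delimiter.
var japanese = findArguments(['メアリー', 'から', 'ジョン', 'に'], ['から', 'に'], 'left');

// The English input read with the wrong branching value.
var wrong = findArguments(['to', 'Mary', 'from', 'John'], ['to', 'from'], 'left');
```

With 'right', “from” is paired with “John”. Reading the same English input with 'left' pairs “from” with “Mary” instead, which is exactly the misreading the branching parameter is there to prevent.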
For more info:
- see [[Branching (linguistics)|branching]] on Wikipedia.
Defining your roles
en.roles = [
  {role: 'goal', delimiter: 'to'},
  {role: 'source', delimiter: 'from'},
  {role: 'position', delimiter: 'at'},
  {role: 'position', delimiter: 'on'},
  {role: 'alias', delimiter: 'as'},
  {role: 'instrument', delimiter: 'using'},
  {role: 'instrument', delimiter: 'with'}
];
The second required property is the inventory of semantic roles and their corresponding delimiters. Each entry pairs a role from the inventory of semantic roles with one delimiter. Note that this mapping can be [[many-to-many (data model)|many-to-many]], i.e., each role can have multiple possible delimiters and different roles can share delimiters. Try to cover all of the roles in the inventory of semantic roles.
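One way to see the many-to-many shape of this list is to index it in both directions. The sketch below is only an illustration: indexRoles is a hypothetical helper, not part of Parser 2, and the roles array is copied from the English example.

```javascript
// Illustrative only: indexRoles is a hypothetical helper, not part of Parser 2.
var roles = [
  {role: 'goal', delimiter: 'to'},
  {role: 'source', delimiter: 'from'},
  {role: 'position', delimiter: 'at'},
  {role: 'position', delimiter: 'on'},
  {role: 'alias', delimiter: 'as'},
  {role: 'instrument', delimiter: 'using'},
  {role: 'instrument', delimiter: 'with'}
];

// Build two lookup tables: delimiter -> roles and role -> delimiters.
function indexRoles(roles) {
  // Object.create(null) avoids clashes with inherited keys like "constructor".
  var byDelimiter = Object.create(null);
  var byRole = Object.create(null);
  roles.forEach(function (entry) {
    if (!byDelimiter[entry.delimiter]) byDelimiter[entry.delimiter] = [];
    if (!byRole[entry.role]) byRole[entry.role] = [];
    byDelimiter[entry.delimiter].push(entry.role);
    byRole[entry.role].push(entry.delimiter);
  });
  return {byDelimiter: byDelimiter, byRole: byRole};
}

var index = indexRoles(roles);
// index.byRole['position'] lists both 'at' and 'on';
// index.byDelimiter['with'] lists 'instrument'.
```

Because both tables hold arrays, a delimiter shared by two roles, or a role with several delimiters, needs no special handling.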
For more info:
- Writing commands with semantic roles
- the proposed inventory of semantic roles
- Wikipedia entry on [[thematic relations]]
Entering your anaphora (“magic words”)
en.anaphora = ["this", "that", "it", "selection", "him", "her", "them"];
The final required property is the anaphora property, which takes a list of “magic words”. Currently no distinction is made between the different [[deixis|deictic]] [[anaphora (linguistics)|anaphora]], even though they might refer to different things.
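As a rough illustration of what a flat list of magic words allows, the sketch below swaps any magic word for the current selection. Both the helper name resolveAnaphora and the idea that a magic word simply stands in for the selected text are assumptions made for this example; this is not Parser 2's actual substitution code.

```javascript
// Illustrative only: resolveAnaphora is a hypothetical helper, and the
// "magic word stands for the selection" behaviour is an assumption.
var anaphora = ["this", "that", "it", "selection", "him", "her", "them"];

function resolveAnaphora(argument, anaphora, selection) {
  var isMagic = anaphora.indexOf(argument.toLowerCase()) != -1;
  // Only substitute when there actually is a selection to refer to.
  return (isMagic && selection) ? selection : argument;
}

var resolved = resolveAnaphora('this', anaphora, 'hello world');
var untouched = resolveAnaphora('Mary', anaphora, 'hello world');
var noSelection = resolveAnaphora('it', anaphora, '');
```

Since every word in the list is treated the same way, “him” and “it” resolve to the same thing here, which mirrors the lack of distinction described above.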
Special cases
Some special language features can be handled by overriding the default behavior from Parser. Many of these features are still in the works, however, so we’d love to get your comments!
Languages with no spaces
If your language does not delimit arguments (or words, more generally) with spaces, you will need to write a custom wordBreaker() function and set usespaces = false and joindelimiter = ''. For an example, please take a look at the Japanese or Chinese language files.
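The exact contract of wordBreaker() is not described in this post, so the sketch below is deliberately a standalone helper with a hypothetical name, splitOnDelimiters. It shows one naive idea a word breaker could build on: cutting an unspaced string around known delimiters. It will split wrongly whenever a delimiter string happens to occur inside an ordinary word.

```javascript
// Illustrative only: splitOnDelimiters is a hypothetical standalone helper,
// not the wordBreaker() interface that Parser 2 expects.
function splitOnDelimiters(input, delimiters) {
  // Try longer delimiters first so a short one cannot shadow a long one.
  var sorted = delimiters.slice().sort(function (a, b) {
    return b.length - a.length;
  });
  var pieces = [];
  var current = '';
  var i = 0;
  while (i < input.length) {
    var match = null;
    for (var j = 0; j < sorted.length; j++) {
      // Ignore empty delimiters, which would never advance the scan.
      if (sorted[j] && input.slice(i, i + sorted[j].length) == sorted[j]) {
        match = sorted[j];
        break;
      }
    }
    if (match) {
      if (current) pieces.push(current);
      pieces.push(match);
      current = '';
      i += match.length;
    } else {
      current += input.charAt(i);
      i++;
    }
  }
  if (current) pieces.push(current);
  return pieces;
}

var pieces = splitOnDelimiters('メアリーからジョンに', ['から', 'に']);
```

For the Japanese example used earlier, this separates the two names from their postpositions.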
Case marking languages
If you have a strongly [[grammatical case|case-marked]] language, you’ll have to write some rules to identify those different cases in wordBreaker() and then add some extra roles for these case markers, but for a number of languages the current design does not allow an elegant solution for parsing such arguments. Updates to this issue will be posted to this trac ticket.
In the meantime, however, if you could write a parser even with only the prepositions/postpositions in your language, that would be a great benefit in getting started in your language. UPDATE: a proposal on how to deal with strongly case-marked languages has been written here: In Case of Case….
Stripping articles
Some languages have delimiters which combine with articles. For example, in French, the preposition “à” combines with the masculine definite article “le” but not with the feminine “la”:
- à + la = à la
- à + le = au
You can add both “à” and “au” as delimiters of the goal role, but then you will get feminine arguments back with the determiner (e.g. “la table”) while masculine arguments would be parsed without a determiner (e.g. “chat”).
- “à la table” = “to the table”
- “au chat” = “to the cat”
One possible solution to this is to write a custom cleanArgument() method. After arguments have been parsed and placed in their appropriate roles, each argument text (say, “la table” or “chat”) is passed to cleanArgument(). You can simply write a cleanArgument() that strips off any “la ” at the beginning of the input and returns the rest, so that both example inputs yield normalized arguments: “table” and “chat”, respectively.
UPDATE: For more up-to-date information on how to deal with these types of articles, please see Solving a Romance Problem.
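A minimal sketch of such a function might look like the following. It assumes that cleanArgument() receives the argument text as a string and returns the cleaned string; how it is attached to the parser object (for example as fr.cleanArgument) is also an assumption, so it is shown here as a plain function.

```javascript
// Illustrative only: assumes cleanArgument takes and returns a string.
// In a real language file it would presumably be set on the parser object
// (e.g. fr.cleanArgument = ...), which is an assumption, not confirmed here.
function cleanArgument(word) {
  // Strip a leading "la" only when it is followed by whitespace, so that
  // words which merely start with "la" (e.g. "lampe") are left alone.
  return word.replace(/^la\s+/i, '');
}

var feminine = cleanArgument('la table'); // determiner stripped
var masculine = cleanArgument('chat');    // nothing to strip
```

Anchoring the pattern to the start of the string and requiring the following whitespace keeps the rule from eating into the noun itself.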
Test your parser
Now you can go into about:config and change extensions.ubiquity.language to be your language code and restart. All the verbs and nountypes at this point will remain the same as in the English version, but it should obey the argument structure (the word order and delimiters) of your language.
More to come…
At this point, you’ve only localized the [[argument structure]] of your language… additional work will be required to localize the nountypes and verb names, which is the subject of ongoing discussion… join the Google Group to get in on the discussion!
It’s also currently possible to test your parser at chrome://parser-demo/content/index.html if you make a couple of other changes to your code… for more information, watch the Foxkeh demos Ubiquity Parser TNG video. This option gives you more debug info as well.