Yesterday we [released Ubiquity 0.5], a major update to the already popular Ubiquity platform. Among [numerous other features], Ubiquity 0.5 includes the first fruit of [months of research on building a multilingual parser and natural language interface]. In this blog post I’ll give a quick overview of new internationalization-related features in Ubiquity 0.5 as well as a quick roadmap of future considerations.
Of course, one of the best ways to learn about the new features is to experience them… try Ubiquity 0.5 now!
Preface: What’s What
To give users a completely localized experience, there are many different components that need to be made to work with different languages. In a single Ubiquity input, like
translate hello from English to Spanish
there are actually several components that must all be localized before Ubiquity can comprehend the equivalent sentence in a different language. The diagram below gives you a sense of what they are: the parser, verbs, and nountypes.
|input:||translate||hello||from||English||to||Spanish|
|element type:||verb||free argument||delimiter||structured argument||delimiter||structured argument|
|component to localize:||verb name|| ||parser||nountype||parser||nountype|
Ubiquity 0.5’s improved language support can be thought of as the product of two more or less orthogonal developments: the brand-new parser, Parser 2, and localization support for commands.
Parser 2 (née [Parser: The Next Generation]) is a completely new parser designed to support different languages easily. Taking a serious look at the similarities and differences between languages, we created a universal [parser design] that needs only a minimal set of settings for a particular language to “learn” that language’s grammar.
The key insight behind Parser 2’s design is that, for the limited range of inputs Ubiquity should understand, languages deal with them in remarkably similar ways. The inputs we’re dealing with here are all commands or actions without quantification or negation. Each consists of a single verb and a series of arguments with certain markings to designate their roles in the sentence. Here’s our example Ubiquity input again:
(1) translate hello from English to Spanish
In this example, “translate” is the verb, which we recognize by looking at our bank of known verbs, and the rest of the input can be split up into three different arguments: “hello,” “from English,” and “to Spanish.” Of these, the markers “from” and “to” tell us that “English” is a source of some sort and “Spanish” is a goal, while the unmarked “hello” is simply an object—the target of the action. By identifying arguments by these abstract [semantic roles], we’re able to quickly identify different kinds of arguments in different languages. For example, the following is the exact same example but using the Japanese syntax and markers:
(2) helloをEnglishからSpanishに訳す
Ubiquity knows what the different markers mean in Japanese, like “を” → object, “から” → source, and “に” → goal, and can easily interpret this to mean the exact same command as (1). With just a few lines of code, [you can teach] Ubiquity how to recognize these different semantic roles in your language. This innovation also means that Ubiquity commands can be [written once for one language and automatically used with another language’s parser], bringing us half-way to the goal of command localization.
Note also that Japanese (as in example (2)) is verb-final and uses no spaces between words. We’ve tried to make Parser 2 itself agnostic to these kinds of differences between languages.
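To make the idea of role markers concrete, here is a minimal sketch in Python. It is not Ubiquity's actual parser (which is written in JavaScript and does far more, including scoring and suggestions). The `LANGUAGES` table, its setting names (`head_final`, `word_boundaries`), and the `parse` function are all hypothetical names made up for this illustration. The point is only that one parsing routine, given a small per-language table of markers, can map an English and a Japanese input onto the same verb and semantic roles.

```python
import re

# Hypothetical per-language settings (not Ubiquity's real configuration).
LANGUAGES = {
    "en": {
        "verbs": {"translate": "translate"},
        "markers": {"from": "source", "to": "goal"},
        "head_final": False,      # verb first, markers precede their argument
        "word_boundaries": True,  # words are separated by spaces
    },
    "ja": {
        "verbs": {"訳す": "translate"},
        "markers": {"を": "object", "から": "source", "に": "goal"},
        "head_final": True,       # verb last, markers follow their argument
        "word_boundaries": False, # no spaces between words
    },
}


def parse(text, lang):
    """Split an input into a canonical verb and a dict of semantic roles."""
    settings = LANGUAGES[lang]
    head_final = settings["head_final"]
    text = text.strip()

    # Look for a known verb at the end (verb-final) or the start (verb-initial).
    verb = None
    for name, canonical in settings["verbs"].items():
        if head_final and text.endswith(name):
            verb, text = canonical, text[: -len(name)]
            break
        if not head_final and text.startswith(name):
            verb, text = canonical, text[len(name):]
            break

    # Split the rest on the role markers, trying longer markers first.
    longest_first = sorted(settings["markers"], key=len, reverse=True)
    pattern = "|".join(re.escape(marker) for marker in longest_first)
    if settings["word_boundaries"]:
        pattern = r"\b(?:" + pattern + r")\b"
    pieces = re.split("(" + pattern + ")", text)

    # Odd-indexed pieces are markers; each marks the piece before or after it.
    roles = {}
    for i in range(1, len(pieces), 2):
        role = settings["markers"][pieces[i]]
        argument = pieces[i - 1] if head_final else pieces[i + 1]
        roles[role] = argument.strip()

    # Whatever is left unmarked is treated as the object.
    unmarked = (pieces[-1] if head_final else pieces[0]).strip()
    if unmarked:
        roles.setdefault("object", unmarked)
    return {"verb": verb, "roles": roles}
```

With these settings, `parse("translate hello from English to Spanish", "en")` and `parse("helloをEnglishからSpanishに訳す", "ja")` return the same structure: the verb `translate` with an object of `hello`, a source of `English`, and a goal of `Spanish`. A naive split like this would of course break on arguments that happen to contain a marker, which is one reason a real parser has to consider multiple candidate parses.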
Parser 2 also adds [better argument-first suggestions], inspired by some [earlier thoughts on Ubiquity in Japanese]. Ubiquity will now start to parse arguments in the input even if a verb isn’t found, and suggest verbs based on that input. For example, if you enter “hello to Spanish,” it’ll recognize that you have an object of “hello” and a goal of “Spanish” which can be understood as a language name, so it’ll suggest the verb “translate.” This is the way it should be.
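As a rough illustration of argument-first suggestion, here is a small Python sketch. It assumes the arguments have already been split into semantic roles, and that each verb declares which roles it takes and what kind of value each role accepts. The verb table, the toy nountype checks (`is_language`, `is_contact`, `is_anything`), and `suggest_verbs` are hypothetical and are not Ubiquity's API.

```python
# Toy nountypes: each one just says whether it accepts a piece of text.
LANGUAGE_NAMES = {"english", "spanish", "japanese", "french"}


def is_anything(text):
    return bool(text)


def is_language(text):
    return text.lower() in LANGUAGE_NAMES


def is_contact(text):
    # Very crude stand-in for an email-address nountype.
    return "@" in text


# Hypothetical verbs, each declaring the roles it takes and their nountypes.
VERBS = {
    "translate": {"object": is_anything, "source": is_language, "goal": is_language},
    "email": {"object": is_anything, "goal": is_contact},
}


def suggest_verbs(roles):
    """Return the verbs whose declared roles accept every parsed argument."""
    suggestions = []
    for verb, accepted in VERBS.items():
        if all(role in accepted and accepted[role](argument)
               for role, argument in roles.items()):
            suggestions.append(verb)
    return suggestions
```

For the input “hello to Spanish,” the roles would be an object of `hello` and a goal of `Spanish`. Calling `suggest_verbs({"object": "hello", "goal": "Spanish"})` keeps `translate`, because `Spanish` is a language name, and drops `email`, because `Spanish` does not look like a contact.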
For more information and background, feel free to check out some of my previous blog posts on the new parser and on the different linguistic considerations. I also have a four-page academic paper giving an overview of some innovations in the parser; email me if you’d like to get a copy.
Internationalization of bundled commands
The move to use [semantic roles] in the [new command API], described above, means that the same Ubiquity command code can be used with inputs in different languages. Two things are left, then, to make a completely localized input work: (1) translation (localization) of different strings in the commands and (2) localization of the nountypes.
In Ubiquity 0.5, we built a localization infrastructure for commands (1, above) but have not yet tackled the nountypes (2). Ubiquity 0.5 uses the [gettext] po (portable object) file format for localizations, which many localizers in the UNIX world are very familiar with. This [choice of file format] potentially opens Ubiquity localization up to many who are new to localization or are unfamiliar with other Mozilla localization. Ubiquity is able to produce localization templates by itself and we also have [a great tool] to check the completeness of different localizations.
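For readers who haven't seen a po file before, here is a minimal Python sketch of the idea. A po file pairs each original string (`msgid`) with its translation (`msgstr`), and an empty `msgstr` means the string is not translated yet. The entries below are made up for illustration and are not taken from Ubiquity's real templates. The tiny loader only handles single-line entries without escaped quotes, so treat it as a sketch of the format rather than a real po parser.

```python
import re

# Hypothetical po entries; the msgids are made up for illustration.
PO_TEXT = '''
# A comment line, ignored by this sketch.
msgid "translate"
msgstr "訳す"

msgid "Translates text from one language to another."
msgstr ""
'''

# Matches a single-line msgid immediately followed by a single-line msgstr.
ENTRY = re.compile(r'^msgid "(.*)"\nmsgstr "(.*)"$', re.MULTILINE)


def load_catalog(po_text):
    """Collect translated entries, skipping those with an empty msgstr."""
    return {msgid: msgstr for msgid, msgstr in ENTRY.findall(po_text) if msgstr}


def localize(catalog, string):
    """Return the translation if there is one, else the original string."""
    return catalog.get(string, string)
```

Here `localize(load_catalog(PO_TEXT), "translate")` gives back `訳す`, while the untranslated description falls back to the original English string. That fallback is also what makes a completeness check straightforward: any entry with an empty `msgstr` still needs a translator.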
A huge caveat, however, is that this localization support currently only works with the commands bundled with Ubiquity itself.
We’re going to continue working to make Ubiquity [more natural] for more users. The tasks we have ahead of us are the localization of nountypes and community commands.
With the new semantic role argument specifications, command localization simply became a question of translating some strings, which many localizers are used to. After all, we want localizations to affect the presentation of commands, not the logic of the commands. When it comes to nountypes, however, it is quite possible that we would actually want the nountype localization to affect its behavior.
Consider, for example, an imaginary day_of_the_week nountype. In English, this nountype might accept or suggest strings like “Monday” or “Tuesday,” while a French localization would accept “lundi” or “mardi.” More complicated still, consider a date nountype. In English this nountype may have custom logic to parse strings like “June 1st” while another language may have to parse very different kinds of strings. These nountype localizations thus involve not just string translations, but actual changes in their logic, making the po format approach we took to command localization a poor fit.
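The difference between the two cases can be sketched in Python. This is purely illustrative: the function names and data layout are hypothetical, and Ubiquity's nountypes are JavaScript objects with their own interface. The day-of-the-week case is still essentially a string table, but the date case needs different parsing code per language, shown here with English “June 1st” and Japanese “6月1日”.

```python
import re

# Case 1: localization is just a different list of strings.
DAYS = {
    "en": ["Monday", "Tuesday", "Wednesday", "Thursday",
           "Friday", "Saturday", "Sunday"],
    "fr": ["lundi", "mardi", "mercredi", "jeudi",
           "vendredi", "samedi", "dimanche"],
}


def day_of_the_week(text, lang):
    """Suggest day names in the given language that start with the input."""
    prefix = text.strip().lower()
    if not prefix:
        return []
    return [day for day in DAYS[lang] if day.lower().startswith(prefix)]


# Case 2: localization changes the parsing logic itself.
MONTHS_EN = ["january", "february", "march", "april", "may", "june", "july",
             "august", "september", "october", "november", "december"]


def parse_date_en(text):
    """Parse strings like 'June 1st' into a (month, day) pair, else None."""
    match = re.fullmatch(r"([A-Za-z]+) (\d{1,2})(?:st|nd|rd|th)?", text.strip())
    if not match or match.group(1).lower() not in MONTHS_EN:
        return None
    return (MONTHS_EN.index(match.group(1).lower()) + 1, int(match.group(2)))


def parse_date_ja(text):
    """Parse strings like '6月1日' into a (month, day) pair, else None."""
    match = re.fullmatch(r"(\d{1,2})月(\d{1,2})日", text.strip())
    if not match:
        return None
    return (int(match.group(1)), int(match.group(2)))


# Each language plugs in its own function, not just its own strings.
DATE_PARSERS = {"en": parse_date_en, "ja": parse_date_ja}
```

Both `DATE_PARSERS["en"]("June 1st")` and `DATE_PARSERS["ja"]("6月1日")` produce the pair `(6, 1)`, but they get there with different code, which is exactly what a table of translated strings cannot express.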
Making nountypes localizable, however, will make Ubiquity significantly more “natural” for many users. In the coming weeks and months we’ll be discussing and debating different options to accomplish this.
Community command localization
Even though the file format and infrastructure for command localization itself have been fleshed out with Ubiquity 0.5, the distributed nature of community commands adds an additional complication. Do we want community command localizations to be completely distributed, or should they be centralized? If they’re distributed, how do you find them? These are the types of questions we’ll need to ask and answer. The ease of creating a new Ubiquity command and sharing it with the world is a huge asset of the platform, so we’ll definitely be thinking about how best to localize these community commands as well. In the next day or two I’ll be writing up a more detailed blog post on what we need from a good community command localization solution.
For the more visually inclined (including myself), here’s a handy diagram to summarize what components are localizable now, what will be in the future, and what this means for Ubiquity users of different languages.
|localized components||Japanese input that Ubiquity will understand||support for bundled commands||support for community commands|
|no localization||translate hello from English to Spanish||Ubiquity 0.5!||Ubiquity 0.5!|
|parser + verbs||helloをEnglishからSpanishに訳す||Ubiquity 0.5!||the future|
|parser + verbs + nountypes||helloを英語からスペイン語に訳す||the future||the future|