localization of standard feed commands for a few languages
Parser 2 language files for those same languages
Nongoals for 0.5
distribution/sharing of localizations
localization of nountypes
The overall goal for this release of Ubiquity is to come up with a format and standard for localization. Localizations in Ubiquity 0.5 will only apply to commands bundled with Ubiquity, and the localization files themselves will be distributed with Ubiquity. In a future release we will tackle the problem of localizations for commands in the wild and truly crowd-source1 this process.
This past Monday I presented at Tokyo 2.0, Japan’s largest bilingual web/tech community. I presented as part of a session on The Web and Language, which I also helped organize. Other presenters included Junji Tomita from goo Labs, Shinjyou Sunao of Knowledge Creation, developers of the Voice Delivery System API, and Chris Salzberg of Global Voices Online on community translation.
I just put together a video of my Ubiquity presentation, mixing the audio recorded live at the presentation together with a screencast of my slides for better visibility. The presentation is 10 minutes long and is bilingual, English and Japanese.
Here’s a quick screencast highlighting some of the changes to Parser 2 and the updated Parser 2 Playpen. This video should be particularly useful to people hoping to add their language to Parser 2. It’s also a good reference for Ubiquity core developers.
The Hit List is a to-do list app for Mac OS X with a beautiful interface and some nice features. Creator Andy Kim’s latest blog post (Better Software Through Less UI) is excellent reading for the Ubiquity community. He describes the thought process behind the design of a new clean and “frictionless” interface for specifying how tasks are repeated. After throwing out the regular combinations and templates of different input widgets, his solution was to implement a partial natural language input interface:
There is no myriad of buttons and fields to choose from. All the user has to do is directly type in what he wants.
Here are a couple other choice quotes which will ring true for the Ubiquity users and internationalization folks in the audience:
For this to work without driving the user mad, the natural language parser has to be near perfect. The last thing I want is for this to come out smelling like AppleScript.
Problems
This design isn’t perfect as it has two glaring problems. One is that the user has no easy way of discovering how complex the recurrence rules can be. This isn’t such a huge problem, but a way to solve this is to include a help button to show example rules or to include an accompanying iCal style UI to let the user setup the recurrence rule in a more typical fashion. I didn’t include these in the initial implementation though because I wanted to see how users would react to this kind of UI.
Another problem is localization. Even if I write parsers for a few more popular languages, it won’t accommodate the rest of the users in the world. Again, the solution is an accompanying traditional UI, but for now, I’m leaving it the way it is until I get some feedback.
There’s a trend in the wind, my friends: the incorporation of near-natural language for more humane interfaces.
A few weeks ago I made some visual mockups of how Ubiquity could look and act in Japanese. Part of this proposal was what I called “particle identification”: that is, immediate in-line identification of delimiters of arguments, which can be overridden by the user:
The inspiration for this idea came from Aza’s blog post “Solving the ‘it’ problem,” which advocates this type of quick feedback to the user in cases of ambiguity. Such a method would both help the user better understand what the system is interpreting and give the user an opportunity to correct improper parses. I just tried mocking up such an input box using jQuery.
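To make the idea concrete, here is a minimal sketch of particle identification in Python. It is not the jQuery mock-up itself: the particle list, the function names, and the override format are all assumptions made up for illustration. It simply scans the input for known delimiters and lets the user reject any match.

```python
# Hypothetical particle list for illustration only; a real parser would
# need a fuller, language-specific set of delimiters.
PARTICLES = ("から", "まで", "を", "に", "で", "へ", "と")


def identify_particles(text, particles=PARTICLES):
    """Return (start, end, particle) spans for each particle found in text.

    Matching is naive: longer particles are tried first at each position,
    and no attempt is made to check whether a match sits inside a word.
    """
    ordered = sorted(particles, key=len, reverse=True)
    spans = []
    i = 0
    while i < len(text):
        for p in ordered:
            if text.startswith(p, i):
                spans.append((i, i + len(p), p))
                i += len(p)
                break
        else:
            # No particle starts here; move on one character.
            i += 1
    return spans


def apply_overrides(spans, rejected):
    """Drop the spans the user has marked as not being delimiters.

    `rejected` is a set of (start, end) pairs (an assumed format).
    """
    return [s for s in spans if (s[0], s[1]) not in rejected]
```

Because the matching is this naive, it will sometimes flag a character that is really part of a word, which is exactly the case where the user override earns its keep.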
Earlier today I blogged on three different strategies languages use to mark the roles of different arguments: word order, marking on the arguments, and marking on the verbs.
I gathered some data from the fantastic World Atlas of Language Structures to put together a survey of many of the languages on the Internet. For each of the languages, I got the canonical word order and whether the language marks the role of its argument on the verb and/or the arguments themselves.
As you can see, there are a number of data points that are still missing. Please contribute information on the languages you speak! You can edit the spreadsheet on Google Docs. Thanks!
I’m very happy to announce that, starting today, I will be working full-time on Ubiquity, a Mozilla Labs experiment to connect the web with language. I’ll be heading up research on different linguistic issues of import to a linguistic user interface and blogging about these topics here. If you’re interested, please subscribe to my blog’s RSS feed or the RSS feed for only Ubiquity-related items. Commenting is encouraged! ^^
Every day, more users are trying out Ubiquity, the Mozilla Labs experiment that lets users accomplish common Internet tasks faster through a natural language interface. As we live more and more of our lives on the web, there is a huge appeal to, and need for, a faster way to access and mash up our information.
But what exactly do we mean by a “natural language interface”? Is it just another programming language with lots of English keywords? Should the final goal be a computer that understands everything we tell it?
As we think about the future directions and possibilities of Ubiquity, we need to go back to our roots and understand the project’s motivations. With that in mind, here are some initial thoughts on the advantages of a natural language interface. The ultimate goal here is to refine the notion of natural language interface and to come up with a set of principles that we can follow in pushing Ubiquity further, into other languages and beyond.
Bailey just asked me what the difference is between 回収 (kaishū) and 収集 (shūshū), two words that would both map to the English verb “collect.” I intuitively came up with a hypothesis to explain the distinction:
回収 may take things away from others when collecting while 収集 does not have that implication.
Things that you 回収 may have been previously distributed by the actor themself while 収集 does not have that implication.1
Not content with armchair theorizing, however, I decided to take advantage of one of the largest corpora in the world: Google.2 To test my hypothesis, I chose two “objects of collection”: one you can take away (and which is often distributed first) and one you can’t take away, namely アンケート (ankēto “survey,” from the French enquête) and 意見 (iken “opinion”). I then searched Google for the four resulting collocations3 in quotes (“•”) and recorded how many hits each returned.
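The two-by-two design above can be sketched in a few lines of Python. This only builds the four quoted query strings. Joining object and verb with the object marker を is an assumption, as are all the names used here; the hit counts themselves still have to be read off Google by hand.

```python
from itertools import product

VERBS = ("回収", "収集")          # the two "collect" verbs being compared
OBJECTS = ("アンケート", "意見")   # "survey" and "opinion"
OBJECT_MARKER = "を"              # assumption: object and verb joined by を


def build_queries(objects=OBJECTS, verbs=VERBS, marker=OBJECT_MARKER):
    """Map each (object, verb) pair to an exact-phrase query string."""
    return {
        (obj, verb): f'"{obj}{marker}{verb}"'
        for obj, verb in product(objects, verbs)
    }


if __name__ == "__main__":
    for pair, query in build_queries().items():
        print(pair, query)
```

Keeping the results keyed by (object, verb) makes it easy to lay the recorded hit counts out as a two-by-two table afterwards.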
This second point could also be hypothesized based on the component meaning of 回, which in the verb 回る (mawa=ru) can mean “circle back.” ↩
Google is of course a huge corpus, but its search is very limited and can easily be misused and misunderstood, which makes Google an unreliable (unprofessional) source for statistical data. One alternative from Google for some different statistics is the n-gram data they offer for research. ↩
“Collocation” on Wikipedia says: “Within the area of corpus linguistics, collocation is defined as a sequence of words or terms which co-occur more often than would be expected by chance.” ↩
As Google adds ten more languages to its machine translation service, it seems to be on its way to becoming the most convenient universal translator of the world’s popular languages. Google’s handling of languages of course isn’t perfect, however—in particular, I’ve been complaining to friends for a while about the weaknesses of Google’s handling of queries in Chinese character (漢字/汉字) scripts. In this post, I run some tests using Google’s Language Detection service to try to better understand its handling of Chinese character queries.
Background
Chinese characters have been used all across East Asia, most notably in Chinese, Japanese, Korean, and Vietnamese (the “CJKV”). Prescriptivist writing reforms in Communist China and Japan have simplified many characters, though. Some characters were simplified in the same way, some in different ways, and some in only one country but not the other. For more information, there’s Wikipedia or Ken Lunde’s CJKV Information Processing.
The problem
The issue comes up when you search for a word in Chinese characters that clearly comes from one Chinese character-using language. In my experience, Google doesn’t infer from the query which language you are using, and returns many results in other Chinese character-using languages as well.[^1]
I don’t care much for the game, but always love checking out the Super Bowl ads every year… this year there was something really cool… a sign language ad by a deaf group at PepsiCo.1 Very cool.
The crew has their own website at Pepsi too: Bob’s House.
I just noticed something on the latest Tekzilla Daily: Patrick Norton, host of Tekzilla and former host of the Screen Savers, says “there’s a lots to learn here” (1:28) and then later “the site you’re having troubles with” (1:39). While “having troubles with…” is fine, I believe “having trouble with…” is much more common. As for “a lots to learn,” however, that’s definitely out. Is it hyperarticulation? I don’t know.
Wikipedia notes: “Norton grew up in the Midwest, but considers the Jersey Shore his home… He currently lives in San Francisco, California.” So, is this a Jersey Shore or California thing? I have no idea.
‘Setting Language Research to Music’ is a Newcastle University project whose aim
is to compose orchestra and choral music to demonstrate infant perception and
production. The first piece of music to emerge from the project, ‘Swing Cycle’,
mimics babies’ experience of discovering word boundaries, taking work by Peter
Jusczyk and colleagues as a starting point.
It’s the craziest thing I’ve seen in a long while… it reminds me of the Music: Materials and Design course I took a couple years ago. My final project was an electronic composition building a rhythm with political speech samples and echoes and cracking noises, representing the hollowness of political rhetoric. It was one of my academic low points at Chicago, for sure.
Maybe it’s because I’m an artist, but I’ve never understood the drive for modern art, including compositions like these. I would much rather listen to some music and read about language acquisition separately… the motivation to combine the two eludes me.
The funny thing about all of these is that we don’t speak commas. They are used to graphically represent pauses in speech, but are often placed according to certain artificial rules which, when applied systematically, aim to help the reader parse the sentence or disambiguate between different readings.1
I’m surprised Language Log hasn’t picked up this new piece yet. UPDATE: Yup, they got to it. Great coverage, as always.
We use pauses in spoken language to do this too, but not necessarily in the same places that we place commas in “good” written language. ↩