Everyone I know in linguistics uses the LINGUIST List website to a greater or lesser degree. Linguist List began as a mailing list in the 90’s, with book, job, and dissertation announcements, call-for-papers, and general academic discussions.
Nowadays many people follow the various announcements on Linguist List using an RSS feed reader such as Google Reader or my personal favorite NetNewsWire.
Unfortunately, the Linguist List RSS feeds (at least recently) don’t include the full text of the articles and have a few other quirks as well. It’s often hard to judge based on the title whether it’s really something I’m interested or not, so I’ve spent a lot of time frustratedly opening any possibly interesting-looking entry in a separate NetNewsWire tab. Today I decided enough was enough: I just wrote a script which parses each of the Linguist List RSS feeds, finds the actual descriptions and interleaves them.1 It’s working remarkably well so far:
We’re now three weeks into the semester at MIT where I just started a PhD program in linguistics. The Linguistics and Philosophy department is housed in The Ray and Maria Stata Center, also known as building 32. It’s a Frank Gehry building and thus crazy looking.1
Yesterday I participated in and presented at a workshop on Information Access in a Multilingual World at ACM SIGIR in Boston. The focus of the workshop was on cross-language information retrieval (CLIR). Cross-language information retrieval systems enable users to retrieve relevant information across different languages for a certain task or query. Even if you have a budget to translate some documents from a foreign language to your language, how do you find the relevant documents to translate in the first place if you don’t speak (or read) that source language? This is the type of problem that CLIR aims to solve.
This past Monday I presented at Tokyo 2.0, Japan’s largest bilingual web/tech community. I presented as part of a session on The Web and Language, which I also helped organize. Other presenters included Junji Tomita from goo Labs, Shinjyou Sunao of Knowledge Creation, developers of the Voice Delivery System API, and Chris Salzberg of Global Voices Online on community translation.
I just put together a video of my Ubiquity presentation, mixing the audio recorded live at the presentation together with a screencast of my slides for better visibility. The presentation is 10 minutes long and is bilingual, English and Japanese.
In many romance languages, prepositions and articles often form portmanteau morphs, combining to form a single word.1 Some examples include (French) à + le > au, de + le > du, (Catalan) a + el > al, de + les > dels, per + el > pel. Italian has a particularly productive system of portmanteau’ed prepositions and articles… I refer you to the contraction article on Wikipedia.
Thanks to Jeremy O’Brien for helping me figure out how to refer to this phenomenon. ↩
This also relates to the issue of parsing multi-word delimiters, though the argument normalization strategy covered here should reduce the necessity of multi-word delimiters. ↩
Grammatical case is a marking on nouns that express grammatical function. Not all languages exhibit case. In many of the Indo-European languages we hope to bring Ubiquity to, case is realized as a suffix.1
Here’s a classic example of case from Latin. (Line 2 is the gloss of 1, line 4 of 3.)
Example (1) is “the man bit the dog,” while example (3) is “the dog bit the man.” The only difference, as you see in the gloss, is that the nouns canis and vir are marked with different case endings in the two sentences. By marking the nouns with different cases (here, nominative and accusative), their semantic roles in the sentence—which is the the biter and which is the bitee—can be identified unambiguously. (Their positions are also switched in these examples but in reality Latin has a very free word order—the same sentences with other word orders including OSV or VSO are also common.)
At first glance, strongly case-marked languages may look like a godsend for identifying the semantic roles of arguments.2 If we can easily and unambiguously recognize arguments’ cases to put them in their appropriate semantic roles, this could simplify processing as well as make Ubiquity input follow a natural syntax for such languages. Unfortunately, there are some significant challenges which must be overcome in order to make the processing of case-markers worthwhile.
Note that when linguists talk about “case,” they could be referring to two different (though related) concepts: case (lowercase) is the observed pattern of affixes on nouns which indicate grammatical function, while Case (uppercase) refers to a theoretical (formal) feature of syntactic objects—certain lexical items “assign Case” or “receive Case” and its mismatches were ruled out in GB syntax by the Case Filter. You’ll find GB linguistics papers referring to “case” when discussing Mandarin Chinese, for example, a language that doesn’t have any overt case (lowercase) and you’ll know immediately that this usage is an uppercase Case case. In this blog post I’ll be dealing primarily with the former descriptive notion. ↩
When I refer to “strongly case-marking languages,” I am referring to languages with a non-trivial inventory of cases (not just nominative, accusative, and genitive) and where a noun phrase’s case is not reflected on determiners. For example, German is excluded by this definition as case is realized exclusively on articles and there is no need to find and parse the noun head itself to identify its case—more information on German is in the section “finding the edges.” ↩
NOTE: This blog post has now been added to the Ubiquity wiki and is updated there. Please disregard this article and instead follow these instructions.
You’ve seen the video. You speak another language. And you’re wondering, “how hard is it to add my language to Ubiquity with Parser 2?” The answer: not that hard. With a little bit of JavaScript and knowledge of and interest in your own language, you’ll be able to get at least rudimentary Ubiquity functionality in your language. Follow along in this step by step guide and please submit your (even incomplete) language files!
As Ubiquity Parser 2 evolves, there is a chance that this specification will change in the future. Keep abreast of such changes on the Ubiquity Planet and/or this blog (RSS).
Every day on the way to work I walk by a fine establishment known as Yoshinoya (吉野家), Japan’s largest gyudon (牛丼) chain restaurant. For those of you whose lives have yet to be graced by gyudon, it’s a bowl of rice topped with beef and onions stewed in a sweet-savory soy-based sauce. Loving gyudon and being a cheapskate, I naturally noticed the recent 50 yen off gyudon promotion at Yoshinoya. The above photo is a photo of part of that sign.
Part of this sign, though, made me think about our new Ubiquity parser. In particular, it was the attachment ambiguity in the end date of the promotion. The text in the photo above literally is “April 15th (Wed.) 8PM until”. (Note that Japanese is a strongly head-final language, and that the “until” is a postposition.) There are two possible readings for this expression, as illustrated by the two composition trees below.
I just spent some time reviewing how Ubiquity currently ranks its suggestions in relation to to Parser The Next Generation so I thought I’d put some of these thoughts down in writing.
The issue of ranking Ubiquity suggestions can be restated as predicting an optimal output given a certain input and various conflicting considerations. Ubiquity (1.8, as of this writing) computes four “scores” for each suggestion:
Ubiquity’s proposed new parser design is based on a principles and parameters philosophy: we can build an underlying universal parser and, for each individual language, we simply set some “parameters” to tell the parser how to act. As we consider the design’s pros and cons, it’s important to reflect back on the linguistic data and see if this architecture can adequately handle the range of linguistic data attested in our languages.
Today I’ll examine highlight some disparate typological data to help us understand these questions: where’s the verb? and what does the verb look like?(more…)
At the end of my blog post yesterday I hinted at an exciting possible approach to Ubiquity’s localization:
In the future we ideally could build a web-based system to collect these “utterances.” We could … generate parser parameters based on those sentences. That would essentially reduce the parser-construction process to a more run-of-the-mill string translation process.
If we build this type of “command-bank” of common Ubiquity input translated into various languages, we could build a tool to learn various features of each language and generate each parser, essentially learning the language based on data. Today I’ll elaborate on how I believe this could be possible, by analogy to another language learning device: the human.
I recently have traveled a fair deal and have met many people excited about the Ubiquity project and its localization efforts. “I want to help,” say the people, but many are unsure where to start.
As a linguist, studying a language involves looking at instances of that language as data. To this end, we as Ubiquity internationalizers need to get at some examples of target utterances. Here’s an example survey which could be a good starting point for native speakers who want to contribute information on their language, based on Blair’s list of common Ubiquity verbs.