mitcho Michael 芳貴 Erlewine

Linguist. Fifth year PhD student at MIT.

Posts Tagged ‘linguistics’

Better Linguist List RSS Feeds

Monday, April 26th, 2010

Everyone I know in linguistics uses the LINGUIST List website to a greater or lesser degree. Linguist List began as a mailing list in the ’90s, carrying book, job, and dissertation announcements, calls for papers, and general academic discussions.

Nowadays many people follow the various announcements on Linguist List using an RSS feed reader such as Google Reader or my personal favorite, NetNewsWire.

Unfortunately, the Linguist List RSS feeds (at least recently) don’t include the full text of the articles and have a few other quirks as well. It’s often hard to judge from the title alone whether an entry is something I’m interested in, so I’ve spent a lot of time frustratedly opening every possibly interesting entry in a separate NetNewsWire tab. Today I decided enough was enough: I just wrote a script which parses each of the Linguist List RSS feeds, finds the actual descriptions and interleaves them.1 It’s working remarkably well so far:

(more…)


  1. Veteran Linguist List RSS subscribers will also note that I’m adding the full title to the entry title for the Conferences and Calls lists as well. 
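The script itself is not shown in this excerpt, so here is only a minimal Python sketch of the general idea: walk the items of an RSS 2.0 feed and replace each description with the full text found behind the item's link. The names `add_full_text`, `fake_fetch`, and `SAMPLE_FEED` are made up for illustration, and the page fetching and scraping step is deliberately left as a pluggable function because the post does not say how it is done.

```python
import xml.etree.ElementTree as ET
from typing import Callable


def add_full_text(rss_xml: str, fetch_full_text: Callable[[str], str]) -> str:
    """Return a copy of an RSS 2.0 feed whose <description> holds the full text."""
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        link = item.findtext("link", default="").strip()
        if not link:
            continue  # nothing to look up for this entry
        description = item.find("description")
        if description is None:
            description = ET.SubElement(item, "description")
        # Hypothetical step: the caller decides how the full text is obtained.
        description.text = fetch_full_text(link)
    return ET.tostring(root, encoding="unicode")


def fake_fetch(link: str) -> str:
    """Stand-in for the real fetch-and-scrape step, which is an assumption here."""
    return f"Full announcement text for {link}"


SAMPLE_FEED = """<rss version="2.0"><channel><title>Example feed</title>
<item><title>Calls: Example workshop</title><link>http://example.org/issue/1</link></item>
</channel></rss>"""

enriched = add_full_text(SAMPLE_FEED, fake_fetch)
print(enriched)
```

Keeping the fetch step as a parameter means the feed rewriting can be tried out without touching the network.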

Spring is for Speaking: JSConf, WordCamp SF, IACL

Saturday, March 20th, 2010

I recently confirmed three different very exciting speaking gigs which I’ll be doing this spring:

(more…)

Living in the Stata Center

Monday, September 21st, 2009

We’re now three weeks into the semester at MIT, where I just started a PhD program in linguistics. The Linguistics and Philosophy department is housed in The Ray and Maria [[Stata Center]], also known as building 32. It’s a [[Frank Gehry]] building and thus crazy looking.1

(more…)


  1. It also apparently has some structural problems, most notably leaks.

Report from SIGIR Workshop on Information Access in a Multilingual World

Friday, July 24th, 2009

Yesterday I participated in and presented at a workshop on Information Access in a Multilingual World at ACM SIGIR in Boston. The focus of the workshop was on [[cross-language information retrieval]] (CLIR). Cross-language information retrieval systems enable users to retrieve relevant information across different languages for a certain task or query. Even if you have a budget to translate some documents from a foreign language to your language, how do you find the relevant documents to translate in the first place if you don’t speak (or read) that source language? This is the type of problem that CLIR aims to solve.

(more…)

Ubiquity Localization: What’s New, What’s Next

Thursday, July 9th, 2009

Yesterday we released Ubiquity 0.5, a major update to the already popular Ubiquity platform. Among numerous other features, Ubiquity 0.5 includes the first fruit of months of research on building a multilingual parser and natural language interface. In this blog post I’ll give a quick overview of new internationalization-related features in Ubiquity 0.5 as well as a quick roadmap of future considerations.

Of course, one of the best ways to learn about the new features is to experience them… try Ubiquity 0.5 now!

(more…)

Ubiquity presentation at Tokyo 2.0

Wednesday, June 10th, 2009

This past Monday I presented at Tokyo 2.0, Japan’s largest bilingual web/tech community. I presented as part of a session on The Web and Language, which I also helped organize. Other presenters included Junji Tomita from goo Labs, Shinjyou Sunao of Knowledge Creation, developers of the Voice Delivery System API, and Chris Salzberg of Global Voices Online on community translation.

I just put together a video of my Ubiquity presentation, mixing the audio recorded live at the presentation together with a screencast of my slides for better visibility. The presentation is 10 minutes long and is bilingual, English and Japanese.


Ubiquity: Command the Web with Language 言葉で操作する Web from mitcho on Vimeo.

(more…)

Solving a Romantic Problem: Portmanteau’ed Prepositions

Monday, May 11th, 2009

The problem:

In many [[Romance languages]], prepositions and articles often form [[portmanteau|portmanteau morphs]], combining to form a single word.1 Some examples include (French) à + le > au, de + le > du, (Catalan) a + el > al, de + els > dels, per + el > pel. Italian has a particularly productive system of portmanteau’ed prepositions and articles… I refer you to the [[Contraction (grammar)#Italian|contraction]] article on Wikipedia.

As I noted a couple weeks ago, however, some combinations do not form portmanteaus.2

(more…)


  1. Thanks to Jeremy O’Brien for helping me figure out how to refer to this phenomenon. 

  2. This also relates to the issue of parsing multi-word delimiters, though the argument normalization strategy covered here should reduce the necessity of multi-word delimiters. 
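The normalization strategy itself sits behind the cut, so the following Python sketch is only one plausible reading of it: before parsing, expand each portmanteau back into its preposition and article so that the parser only ever has to look for the plain preposition. The table and the function name `expand_contractions` are hypothetical; the entries are limited to examples listed above.

```python
# Hypothetical lookup table: contracted form -> (preposition, article).
CONTRACTIONS = {
    "fr": {"au": ("à", "le"), "du": ("de", "le")},
    "ca": {"al": ("a", "el"), "pel": ("per", "el")},
}


def expand_contractions(text: str, language: str) -> str:
    """Replace each known portmanteau with its preposition + article pair."""
    table = CONTRACTIONS.get(language, {})
    words: list[str] = []
    for word in text.split():
        # Unknown words pass through unchanged as a one-element tuple.
        words.extend(table.get(word.lower(), (word,)))
    return " ".join(words)


print(expand_contractions("envoyer au chef", "fr"))
```

A whitespace split is a simplification; elided forms and punctuation would need real tokenization.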

In Case of Case…

Wednesday, May 6th, 2009

A recent hot topic of discussion in the Ubiquity i18n realm has been how to deal with strongly case-marking languages. As we continue to make steady progress, this is one of the remaining open questions which we must decide as a community how to tackle in Parser 2.

Introduction

[[Grammatical case]] is a marking on nouns that expresses grammatical function. Not all languages exhibit case. In many of the Indo-European languages we hope to bring Ubiquity to, case is realized as a suffix.1

Here’s a classic example of case from [[Latin]]. (Line 2 is the gloss of 1, line 4 of 3.)

  1. canis      virum      momordit
    
  2. dog=sg.NOM man=sg.ACC bite=3sg.perfect
    
  3. vir        canem      momordit
    
  4. man=sg.NOM dog=sg.ACC bite=3sg.perfect

Example (1) is “the dog bit the man,” while example (3) is “the man bit the dog.” The only difference, as you see in the gloss, is that the nouns canis and vir are marked with different case endings in the two sentences. By marking the nouns with different cases (here, [[nominative]] and [[accusative]]), their semantic roles in the sentence (which is the biter and which is the bitee) can be identified unambiguously. (Their positions are also switched in these examples, but in reality Latin has a very free word order: the same sentences with other word orders, including OSV or VSO, are also common.)

At first glance, strongly case-marked languages may look like a godsend for identifying the semantic roles of arguments.2 If we can easily and unambiguously recognize arguments’ cases to put them in their appropriate semantic roles, this could simplify processing as well as make Ubiquity input follow a natural syntax for such languages. Unfortunately, there are some significant challenges which must be overcome in order to make the processing of case-markers worthwhile.

(more…)


  1. Note that when linguists talk about “case,” they could be referring to two different (though related) concepts: case (lowercase) is the observed pattern of affixes on nouns which indicate grammatical function, while Case (uppercase) refers to a theoretical (formal) feature of syntactic objects—certain lexical items “assign Case” or “receive Case” and its mismatches were ruled out in [[Government and binding theory|GB]] syntax by the Case Filter. You’ll find GB linguistics papers referring to “case” when discussing Mandarin Chinese, for example, a language that doesn’t have any overt case (lowercase) and you’ll know immediately that this usage is an uppercase Case case. In this blog post I’ll be dealing primarily with the former descriptive notion. 

  2. When I refer to “strongly case-marking languages,” I am referring to languages with a non-trivial inventory of cases (not just nominative, accusative, and genitive) and where a noun phrase’s case is not reflected on [[determiner (class)|determiners]]. For example, [[German language|German]] is excluded by this definition, as case is realized primarily on articles and there is generally no need to find and parse the noun head itself to identify its case; more information on German is in the section “finding the edges.”
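To make the appeal of case marking concrete, here is a toy Python sketch that assigns roles from case alone, ignoring word order. It is not Ubiquity code: the lexicon covers only the four noun forms in the Latin example, the names `CASE_OF`, `ROLE_OF_CASE`, and `assign_roles` are invented for illustration, and a plain lookup sidesteps the hard part, which is recognizing case endings on arbitrary input.

```python
# Toy lexicon: only the four noun forms that appear in the Latin example.
CASE_OF = {
    "canis": "NOM", "canem": "ACC",
    "vir": "NOM", "virum": "ACC",
}

# Nominative marks the biter and accusative the bitee in these sentences.
ROLE_OF_CASE = {"NOM": "biter", "ACC": "bitee"}


def assign_roles(sentence: str) -> dict[str, str]:
    """Map each word to a role using its case, not its position."""
    roles: dict[str, str] = {}
    for word in sentence.split():
        case = CASE_OF.get(word)
        if case is None:
            roles["verb"] = word  # toy assumption: any unknown word is the verb
        else:
            roles[ROLE_OF_CASE[case]] = word
    return roles


# The same roles come out under any word order.
print(assign_roles("canis virum momordit"))
print(assign_roles("virum momordit canis"))
print(assign_roles("vir canem momordit"))
```

The first two calls give the same answer because only the case endings are consulted, which is exactly the property that makes strongly case-marked input attractive.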

Dates in the Month of May that Are of Interest to Linguists

Friday, May 1st, 2009

Happy May! May, as you surely know, is an important season of celebration for linguists. Some of my favorite items are below.

From Dates in the Month of May that Are of Interest to Linguists by the late [[James D. McCawley]]:

May 6, 1939. The University of Chicago trades Leonard Bloomfield to Yale University for two janitors and an undisclosed number of concrete gargoyles.

May 23, 38,471 B.C. God creates language.

May 29, 1962. Angular brackets are discovered. Classes at M.I.T. are dismissed and much Latvian plum brandy is consumed.

May 31, 1951. Chomsky discovers Affix-hopping and is reprimanded by his father for discovering rules on shabas.

Unfortunately May 31, 1951 was a Thursday…

Adding Your Language to Ubiquity Parser 2

Wednesday, April 29th, 2009

NOTE: This blog post has now been added to the Ubiquity wiki and is updated there. Please disregard this article and instead follow these instructions.

You’ve seen the video. You speak another language. And you’re wondering, “how hard is it to add my language to Ubiquity with Parser 2?” The answer: not that hard. With a little bit of JavaScript and knowledge of and interest in your own language, you’ll be able to get at least rudimentary Ubiquity functionality in your language. Follow along in this step-by-step guide and please submit your (even incomplete) language files!

As Ubiquity Parser 2 evolves, there is a chance that this specification will change in the future. Keep abreast of such changes on the Ubiquity Planet and/or this blog (RSS).

(more…)

Attachment Ambiguity—or—when is the gyudon cheap?

Wednesday, April 15th, 2009

yoshinoya.jpg

Every day on the way to work I walk by a fine establishment known as [[Yoshinoya]] (吉野家), Japan’s largest gyudon (牛丼) chain restaurant. For those of you whose lives have yet to be graced by [[gyudon]], it’s a bowl of rice topped with beef and onions stewed in a sweet-savory soy-based sauce. Loving gyudon and being a cheapskate, I naturally noticed the recent 50 yen off gyudon promotion at Yoshinoya. The photo above shows part of that sign.

Part of this sign, though, made me think about our new Ubiquity parser. In particular, it was the attachment ambiguity in the end date of the promotion. The text in the photo above literally is “April 15th (Wed.) 8PM until”. (Note that Japanese is a strongly head-final language, and that the “until” is a postposition.) There are two possible readings for this expression, as illustrated by the two [[principle of compositionality|composition]] trees below.
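As a rough illustration of why there are exactly two readings, the Python sketch below enumerates every way of grouping a sequence of chunks pairwise. Treating the sign as the three chunks “April 15th (Wed.)”, “8PM”, and “until” is my own simplifying assumption, and the nested tuples are a stand-in for trees, not a reproduction of the ones in the full post.

```python
def bracketings(tokens):
    """Yield every binary bracketing of a token sequence as nested tuples."""
    if len(tokens) == 1:
        yield tokens[0]
        return
    # Try every split point, then combine each left grouping with each right one.
    for split in range(1, len(tokens)):
        for left in bracketings(tokens[:split]):
            for right in bracketings(tokens[split:]):
                yield (left, right)


readings = list(bracketings(["April 15th (Wed.)", "8PM", "until"]))
for reading in readings:
    print(reading)
```

With three chunks the generator yields two groupings: one where the date and the time combine first and “until” attaches to the result, and one where “until” attaches to “8PM” before the date comes in.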

(more…)

Scoring and Ranking Suggestions

Tuesday, April 7th, 2009

I just spent some time reviewing how Ubiquity currently ranks its suggestions in relation to Parser The Next Generation, so I thought I’d put some of these thoughts down in writing.

The issue of ranking Ubiquity suggestions can be restated as predicting an optimal output given a certain input and various conflicting considerations. Ubiquity (0.1.8, as of this writing) computes four “scores” for each suggestion:

(more…)

Where’s The Verb?

Wednesday, March 25th, 2009

Ubiquity’s proposed new parser design is based on a [[principles and parameters]] philosophy: we can build an underlying universal parser and, for each individual language, we simply set some “parameters” to tell the parser how to act. As we consider the design’s pros and cons, it’s important to reflect back on the linguistic data and see if this architecture can adequately handle the range of linguistic data attested in our languages.

Today I’ll highlight some disparate typological data to help us understand these questions: where’s the verb? and what does the verb look like?

(more…)
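To picture the “universal parser plus parameters” idea, here is a deliberately tiny Python sketch. Everything in it is hypothetical: the parser discussed in these posts is written in JavaScript, and the single `verb_final` switch, the `LanguageParameters` class, and `split_verb` are invented here to show the shape of the design, not its actual parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageParameters:
    """Hypothetical per-language settings read by one shared parser."""
    verb_final: bool  # True if the verb is the last word of a command


def split_verb(command: str, params: LanguageParameters) -> tuple[str, list[str]]:
    """Shared logic: locate the verb using only the language's parameters."""
    words = command.split()
    if not words:
        raise ValueError("empty command")
    if params.verb_final:
        return words[-1], words[:-1]
    return words[0], words[1:]


VERB_INITIAL = LanguageParameters(verb_final=False)  # e.g. English commands
VERB_FINAL = LanguageParameters(verb_final=True)     # a head-final language

print(split_verb("translate hello world", VERB_INITIAL))
# "hello world translate" is a stand-in for verb-final word order.
print(split_verb("hello world translate", VERB_FINAL))
```

The parsing function never mentions a particular language; adding one means supplying a new parameter object. Whether a single switch like this survives contact with real typological data is exactly the question the post raises.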

Automating the Linguist’s Job

Tuesday, March 24th, 2009

At the end of my blog post yesterday I hinted at an exciting possible approach to Ubiquity’s localization:

In the future we ideally could build a web-based system to collect these “utterances.” We could … generate parser parameters based on those sentences. That would essentially reduce the parser-construction process to a more run-of-the-mill string translation process.

If we build this type of “command-bank” of common Ubiquity input translated into various languages, we could build a tool to learn various features of each language and generate each parser, essentially learning the language based on data. Today I’ll elaborate on how I believe this could be possible, by analogy to another language learning device: the human.

(more…)

Ubiquity i18n: questions to ask

Monday, March 23rd, 2009

I have recently traveled a fair amount and have met many people excited about the Ubiquity project and its localization efforts. “I want to help,” say the people, but many are unsure where to start.

For a linguist, studying a language involves looking at instances of that language as data. To this end, we as Ubiquity internationalizers need to get at some examples of target utterances. Here’s an example survey which could be a good starting point for native speakers who want to contribute information on their language, based on Blair’s list of common Ubiquity verbs.

(more…)