mitcho Michael 芳貴 Erlewine

Linguist. Fifth year PhD student at MIT.

blog

Archive for the ‘observation’ Category

Stanley Kubrick on linguistic fieldwork

Monday, July 29th, 2013

I have found that when you finally come down to the day [the scene is going to be shot → of the elicitation] and you arrive on the location with the [actors → speakers], having had the experience of already seeing some [scenes shot → data], somehow it’s always different. You find out that you have not really explored the [scene → language] to its fullest extent. You may have been thinking about it incorrectly, or you may simply not have discovered one of the variations which now in context with everything else that you have [shot → elicited] is simply better than anything you had previously thought of. The reality of the final moment, just before shooting, is so powerful that all previous analysis must yield before the impressions you receive under these circumstances, and unless you use this feedback to your positive advantage, unless you adjust to it, adapt to it and accept the sometimes terrifying weaknesses it can expose, you can never realize the most out of your [film → fieldwork].

Stanley Kubrick

The new Apple campus and the Pentagon compared

Wednesday, June 8th, 2011

+ + =

That is all.

Voicemail from Jesse

Saturday, July 3rd, 2010

My friend Jesse left me a voicemail on my Google Voice number. Here’s a demo of the fantastic transcription feature.

Voicemail from Jesse from mitcho on Vimeo.

Disgusting Word-formatted HTML and how to fix it

Wednesday, December 30th, 2009

In working on a new website for the MIT Working Papers in Linguistics, I recently inherited a collection of HTML files with all of our books’ abstracts. To my dismay (but not surprise), the markup in these files was horrendous. Here are some of the cardinal sins of markup that I saw committed in these files:

  1. Confusing ids and classes. An id should be unique on the page, but here the same id is reused on multiple elements in order to format them together.
<div id="indent">
  <div id="number">4.2.1</div>
  <div id="page">161</div>
  <div id="section">Old French (Adams 1987)</div>
</div>
<div id="indent">
  <div id="number">4.2.2</div>
  <div id="page">164</div>
  <div id="section">The evolution of the dialects of northern Italy</div>
</div>
  2. Putting a class on every instance of something. Every paragraph should be formatted equivalently. We get the point.
<p class=MsoNormal><b>The English Noun Phrase in Its Sentential Aspect</b></p>
<p class=MsoNormal>Steven Paul Abney</p>
<p class=MsoNormal>May 1987</p>
  3. Using blank space for formatting.
<p class=MsoNormal><o:p>&nbsp;</o:p></p>
  4. CSS styles that don’t exist. Browsers just ignore these anyway…
<p class=MsoNormal>One factor in determining which worlds a modal quantifies
over is the temporal argument of the modal’s accessibility relation.<span
style='mso-spacerun:yes'>  </span>It is well-known that a higher tense affects
the accessibility relation of modals.<span style='mso-spacerun:yes'> 
</span>What is not well-known is that there are aspectual operators high enough
to affect the accessibility relation of modals.<span style='mso-spacerun:yes'> 
</span>

The solution

My solution was to write a Perl script that takes care of a number of these issues. It’s not foolproof and doesn’t involve any voodoo (for example, it can’t re-typeset things that were formatted using whitespace), but it does a good job as a first pass.

You can run the script by making it executable (chmod +x cleanwordhtml.pl) then specifying a target filename as an argument. For example,

./cleanwordhtml.pl source.html > clean.html

I used this with a simple bash for loop to run over all my files:

for f in */*.html; do ./cleanwordhtml.pl "$f" > "${f%.html}-clean.html"; done
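
The script itself is not reproduced in this post. As a rough illustration of the kind of first-pass cleanup described above, here is a minimal Python sketch. The function name `clean_word_html` and the specific regular expressions are assumptions made for illustration; they are not the rules the Perl script actually applies, and regex-based cleanup like this will miss plenty of real-world cases.

```python
import re


def clean_word_html(html: str) -> str:
    """First-pass cleanup of Word-exported HTML (illustrative sketch only)."""
    # Drop Word's <o:p> and </o:p> tags but keep whatever they wrap.
    html = re.sub(r"</?o:p>", "", html)
    # Unwrap spans whose style is a non-existent mso-* property, keeping their text.
    html = re.sub(
        r"<span\s+style='mso-[^']*'>(.*?)</span>", r"\1", html, flags=re.DOTALL
    )
    # Remove class=MsoNormal (quoted or unquoted) from any tag.
    html = re.sub(r'\s+class=("?)MsoNormal\1', "", html)
    # Delete paragraphs that contain nothing but whitespace or &nbsp;.
    html = re.sub(r"<p>(?:\s|&nbsp;)*</p>\n?", "", html)
    return html


sample = (
    "<p class=MsoNormal><b>Title</b></p>\n"
    "<p class=MsoNormal><o:p>&nbsp;</o:p></p>\n"
)
print(clean_word_html(sample))
```

Like the script, this cannot recover structure that was expressed purely through whitespace; it only strips the noise.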

Hopefully someone else can benefit from my experience.

My friend Evan bought an iPhone

Sunday, December 13th, 2009

[Image: tweeting-3.png]

Mozilla By The Numbers

Sunday, September 6th, 2009

About six months ago I started working for Mozilla Labs full-time, focusing on Ubiquity, the multilingual natural language interface for the browser. This week marked my last week on contract as I go back to grad school next week. While the work will go on and I hope to continue to stay involved as time allows, here’s a quick bird’s eye view of my activities in my Mozilla tenure:


Time working for Mozilla: 6.5 months

Mozilla-related blog posts written: 69

Academic papers written on Ubiquity: 1

Ubiquity presentations given: 5

Screencasts made: 8

Most popular video on Vimeo: Ubiquity 0.5 日本語紹介ビデオ, the Japanese Ubiquity 0.5 introduction video: 2252 views

Languages Ubiquity commands and parser now support: 6

Commits to the Ubiquity repository: 492

Other web projects started during this period: 2+ (Ten Grand Is Buried There, HookPress)

TechCrunch references: 2 (1, 2)

Countries worked in: 2

Mythical Kiwis worked with: 1

References to bugs I introduced as “glitcho”s: 1

Extremely disturbing homages to me and Django: 1

Friends made; experience gained; lessons on Open-ness learned; personal growth: priceless enumerable


Thanks to all who made this experience amazing, beginning with Aza, Jono, Atul, Blair and the rest of the Labs team; intern extraordinaire Brandon; the always thoughtful and friendly Mozilla Japan team; and of course the fantastic Ubiquity community! Please visit me in Boston—I should be around for a while. ;)

Scoring for Optimization

Friday, April 24th, 2009

Suppose you have a number of competing candidates, each of which can be ranked with a score, but it takes a little time to calculate each candidate’s score. You’re only interested in the top $n$ candidates. You want to come up with a scoring scheme where you can throw the extra candidates out of consideration earlier without sacrificing quality. Such is the problem of scoring and ranking suggestions in Ubiquity. What properties must such a scoring system have?

This blog post includes a lot of complex CSS-formatted graphs which may be best viewed in — what else? — Firefox. You may also want to access this blog post directly rather than through a planet.

[Graph: candidates ranked by score; the bars were not preserved]
candidate 8
candidate 2
candidate 9
candidate 3
candidate 10 CUTOFF
candidate 5
candidate 1
candidate 7

One portion of the problem description above merits clarification: I define “without sacrificing quality” to mean that, if we did not throw out any candidates early and waited until all the scores were computed fully and accurately, we would still get the same top $n$ winners. This already gives us the key insight towards an appropriate solution: we can only throw out a candidate when we know that it has no further chance of making it into the top $n$.
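
To make that insight concrete, here is a minimal Python sketch of one scoring scheme with this property. It rests on an assumption that is not stated in the text above: each candidate's score is a product of factors in [0, 1], so the partial score can only shrink as more factors are computed and is therefore an upper bound on the final score. The names `top_n` and `Factor` are hypothetical, and this is not Ubiquity's actual implementation.

```python
import heapq
from typing import Callable, Iterable, Sequence

# Assumption: every factor returns a value in [0, 1].
Factor = Callable[[str], float]


def top_n(
    candidates: Iterable[str], factors: Sequence[Factor], n: int
) -> list[tuple[float, str]]:
    """Return the n best (score, candidate) pairs, abandoning hopeless ones early."""
    if n <= 0:
        return []
    best: list[tuple[float, str]] = []  # min-heap holding the current top n
    for candidate in candidates:
        score = 1.0
        for factor in factors:
            score *= factor(candidate)
            # The partial score is an upper bound on the final score, so a
            # candidate at or below the current n-th best can never catch up.
            if len(best) == n and score <= best[0][0]:
                break  # stop computing the remaining factors
        else:
            # Every factor was computed without the candidate being pruned.
            if len(best) < n:
                heapq.heappush(best, (score, candidate))
            else:
                heapq.heappushpop(best, (score, candidate))
    return sorted(best, reverse=True)
```

Because a candidate is dropped only once its upper bound falls to or below the current n-th best score, the winners match what a full computation would produce (up to how exact ties are broken).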

(more…)

Attachment Ambiguity—or—when is the gyudon cheap?

Wednesday, April 15th, 2009

[Photo: yoshinoya.jpg]

Every day on the way to work I walk by a fine establishment known as Yoshinoya (吉野家), Japan’s largest gyudon (牛丼) chain restaurant. For those of you whose lives have yet to be graced by gyudon, it’s a bowl of rice topped with beef and onions stewed in a sweet-savory soy-based sauce. Loving gyudon and being a cheapskate, I naturally noticed the recent 50 yen off gyudon promotion at Yoshinoya. The photo above shows part of that sign.

Part of this sign, though, made me think about our new Ubiquity parser. In particular, it was the attachment ambiguity in the end date of the promotion. The text in the photo above literally reads “April 15th (Wed.) 8PM until”. (Note that Japanese is a strongly head-final language, and that the “until” is a postposition.) There are two possible readings for this expression, as illustrated by the two composition trees below.

(more…)

Scoring and Ranking Suggestions

Tuesday, April 7th, 2009

I just spent some time reviewing how Ubiquity currently ranks its suggestions in relation to Parser The Next Generation, so I thought I’d put some of these thoughts down in writing.

The issue of ranking Ubiquity suggestions can be restated as predicting an optimal output given a certain input and various conflicting considerations. Ubiquity (1.8, as of this writing) computes four “scores” for each suggestion:

(more…)

Where’s The Verb?

Wednesday, March 25th, 2009

Ubiquity’s proposed new parser design is based on a principles and parameters philosophy: we can build an underlying universal parser and, for each individual language, we simply set some “parameters” to tell the parser how to act. As we consider the design’s pros and cons, it’s important to look back at the linguistic data and see whether this architecture can adequately handle the range of phenomena attested in our languages.

Today I’ll highlight some disparate typological data to help us understand these questions: where’s the verb? and what does the verb look like? (more…)

Unnatural by design

Sunday, March 1st, 2009

I’m flying over the Pacific Ocean right now, but a little bit of language caught my eye. Here’s a picture of the menu for this flight, in three languages: English, Japanese, Chinese.

[Photo: menu.jpg]

What caught my eye is the line “served with ご一緒に 配,” meant to be read as part of “Beef in BBQ sauce… served with Pepsi…”. The Chinese 配 (pèi) is fine here, meaning “with,” but the Japanese “ご一緒に” (goissho-ni) seemed awkward to me.

(more…)

Gaba, Shame On You

Monday, January 12th, 2009

[Photo: A Gaba ad on a train]

Here’s a picture I recently snapped on a train of an ad for Gaba, a big English conversation school in Japan. I felt the English sentence about Gaba’s satisfaction was extremely awkward, so I put it up on Twitter to check with some other native speakers. My friends concurred. What do you think?

I personally think the sentence would be improved by removing the “the” in “the satisfaction.” Others offered “continues to rise” as possibly preferable to “continually rise.” English articles, especially with respect to the definiteness of abstract nouns, are very difficult for many non-native speakers. That being said, it’s sad for a sentence of such questionable acceptability to come from a company which, in theory, prides itself on its English ability and surely hires many native speakers. Gaba, shame on you.

This is what a release looks like

Wednesday, December 10th, 2008

This is what the latest release (2.1.6) of my Yet Another Related Posts Plugin looked like under Mint, using my WordPress plugin downloads pepper, which in turn gets its data from wordpress.org:

It’s always interesting to see these release spikes in download traffic. Note that this release went out on Wednesday, but during the day, so Wednesday’s traffic is higher than the normal ~300/day level while the big peak (by day) falls on Thursday. Too bad wordpress.org doesn’t give me hourly stats, though I guess that would be a little ridiculous.

YARPP is just about at that 35k download mark. I’m looking forward to the next release. ^^

Bald Moves

Friday, October 24th, 2008

On September 19th, Treasury Secretary Henry Paulson made a speech regarding the Troubled Asset Relief Program (TARP) to allay the fears of investors:

I am convinced that this bald approach will cost American families far less than the alternative—a continuing series of financial institution failures and frozen credit markets unable to fund economic expansion.

Unfortunately, the key phrase in this passage was widely mistranscribed in the media as a “bold approach.” But now that more details of the new Troubled Asset Relief Program have been released, Secretary Paulson’s true intentions are clear.

Chris Carey of Bailout Sleuth writes:

The Treasury Department tapped James H. Lambright [above center], head of the Export-Import Bank, as the interim chief investment officer for the $700 billion Troubled Asset Relief Program… The bailout program is being directed by Neel Kashkari [above left], who had been senior advisor to Treasury Secretary Henry M. Paulson Jr [above right].

Will this new program stem the global credit crisis? Maybe. But at least we can all agree… it’s a bald move.

回収 vs. 収集 and Better Word Meanings Through Usage

Thursday, September 18th, 2008

Bailey just asked me what the difference is between 回収 (kaishū) and 収集 (shūshū), two words that would both map to the English verb “collect.” I intuitively came up with a hypothesis to explain the distinction:

  • 回収 may take things away from others when collecting while 収集 does not have that implication.
  • Things that you 回収 may have been previously distributed by the actor themself while 収集 does not have that implication.[1]

Not content with armchair theorizing, however, I decided to take advantage of one of the largest corpora in the world: Google.[2] To test my hypothesis, I chose two “objects of collection”, one you can take away (and which is often distributed first) and one you can’t take away: アンケート (ankēto “survey,” from the French enquête) and 意見 (iken “opinion”). I then searched Google for each of the four resulting collocations[3] in quotes (“•”) and recorded how many hits there were.
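
The four exact-phrase queries can be generated mechanically by crossing the two objects with the two verbs. The sketch below is only illustrative: joining each object to the verb with the object particle を is an assumption, since the exact strings that were searched are not given here, and the helper name `quoted_query` is hypothetical.

```python
from itertools import product

verbs = ["回収", "収集"]
objects = ["アンケート", "意見"]


def quoted_query(obj: str, verb: str, particle: str = "を") -> str:
    """Build an exact-phrase search query, wrapped in double quotes."""
    # Assumption: object + particle + verb; the original queries may differ.
    return f'"{obj}{particle}{verb}"'


# One query per (object, verb) pair: 2 objects x 2 verbs = 4 collocations.
queries = [quoted_query(obj, verb) for obj, verb in product(objects, verbs)]

for query in queries:
    print(query)
```

Comparing the hit counts for the two verbs within each object then shows which verb each object prefers.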

(more…)


  1. This second point could also be hypothesized based on the component meaning of 回, which in the verb 回る (mawa=ru) can mean “circle back.” 

  2. Google is of course a huge corpus but it has very limited search and can easily be misused and misunderstood, thus making Google an unreliable (unprofessional) source for statistical data. One Google alternative for some different statistics is the n-gram data they offer for research. 

  3. “Collocation” on Wikipedia says: “Within the area of corpus linguistics, collocation is defined as a sequence of words or terms which co-occur more often than would be expected by chance.”