blog
Archive for the ‘observation’ Category
Voicemail from Jesse
Saturday, July 3rd, 2010
My friend Jesse left me a voicemail on my Google Voice number. Here’s a demo of the fantastic transcription feature.
Disgusting Word-formatted HTML and how to fix it
Wednesday, December 30th, 2009
In working on a new website for the MIT Working Papers in Linguistics, I recently inherited a collection of HTML files with all of our books’ abstracts. To my dismay (but not surprise) the markup in these files was horrendous. Here are some of the cardinal sins of markup that I saw committed in these files:
- Confusing ids and classes. ids should be unique on the page… but here’s an instance of using multiple instances of the same id in order to format them together.
<div id="indent"> <div id="number">4.2.1</div> <div id="page">161</div> <div id="section">Old French (Adams 1987)</div> </div> <div id="indent"> <div id="number">4.2.2</div> <div id="page">164</div> <div id="section">The evolution of the dialects of northern Italy</div>
- Putting a class on every instance of something. Every paragraph should be formatted equivalently. We get the point.
<p class=MsoNormal><b>The English Noun Phrase in Its Sentential Aspect</b></p> <p class=MsoNormal>Steven Paul Abney</p> <p class=MsoNormal>May 1987</p>
- Using blank space for formatting.
<p class=MsoNormal><o:p>&nbsp;</o:p></p>
- CSS styles that don’t exist. Browsers just ignore these anyway…
<p class=MsoNormal>One factor in determining which worlds a modal quantifies over is the temporal argument of the modal’s accessibility relation.<span style='mso-spacerun:yes'> </span>It is well-known that a higher tense affects the accessibility relation of modals.<span style='mso-spacerun:yes'> </span>What is not well-known is that there are aspectual operators high enough to affect the accessibility relation of modals.<span style='mso-spacerun:yes'> </span>
The solution
My solution was to write a Perl script which takes care of a number of these issues. It’s not foolproof and doesn’t involve any voodoo (for example, it can’t retypeset things which were formatted using whitespace), but it does a good job as a first pass.
You can run the script by making it executable (chmod +x cleanwordhtml.pl) then specifying a target filename as an argument. For example,
./cleanwordhtml.pl source.html > clean.html
I used this with a simple bash for loop to run over all my files:
for f in */*.html; do ./cleanwordhtml.pl "$f" > "${f%.html}-clean.html"; done
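For readers who want a feel for what such a first pass involves, here is a minimal sketch in Python. It is not the Perl script described above: the function name `clean_word_html` and the exact patterns are assumptions for illustration, covering only the sins listed earlier (Mso classes, `<o:p>` tags, `mso-` styled spans, and blank-space paragraphs). It uses regular expressions rather than a real HTML parser, so it will not handle nested spans or double-quoted `mso-` style attributes.

```python
import re


def clean_word_html(html: str) -> str:
    """Strip a few common Word export artifacts from an HTML string."""
    # Remove Word's namespaced <o:p> and </o:p> tags, keeping their contents.
    html = re.sub(r"</?o:p>", "", html)
    # Unwrap spans whose single-quoted style starts with a Word-only mso-* property.
    html = re.sub(r"<span\s+style='mso-[^']*'>(.*?)</span>", r"\1", html, flags=re.DOTALL)
    # Drop the Mso* class that Word stamps on every element (quoted or not).
    html = re.sub(r"""\s+class=(["']?)Mso\w+\1""", "", html)
    # Delete paragraphs left holding only whitespace or &nbsp;.
    html = re.sub(r"<p>(?:\s|&nbsp;)*</p>\s*", "", html)
    return html


if __name__ == "__main__":
    sample = (
        "<p class=MsoNormal>Steven Paul Abney</p>"
        "<p class=MsoNormal><o:p>&nbsp;</o:p></p>"
    )
    print(clean_word_html(sample))
```

The order of the substitutions matters: the empty-paragraph pattern only matches a bare `<p>`, so it has to run after the class attributes and `<o:p>` tags are gone.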
Hopefully someone else can benefit from my experience.
My friend Evan bought an iPhone
Sunday, December 13th, 2009
Mozilla By The Numbers
Sunday, September 6th, 2009
About six months ago I started working for Mozilla Labs full-time, focusing on Ubiquity, the multilingual natural language interface for the browser. This week marked my last week on contract as I go back to grad school next week. While the work will go on and I hope to continue to stay involved as time allows, here’s a quick bird’s eye view of my activities in my Mozilla tenure:
Time working for Mozilla: 6.5 months
Mozilla-related blog posts written: 69
Academic papers written on Ubiquity: 1
Ubiquity presentations given: 5
Screencasts made: 8
Most popular video on Vimeo: Ubiquity 0.5 日本語紹介ビデオ, the Japanese Ubiquity 0.5 introduction video: 2252 views
Languages Ubiquity commands and parser now support: 6
Commits to the Ubiquity repository: 492
Other web projects started during this period: 2+ (Ten Grand Is Buried There, HookPress)
TechCrunch references: 2 (1, 2)
Countries worked in: 2
Mythical Kiwis worked with: 1
References to bugs I introduced as “glitcho”s: 1
Extremely disturbing homages to me and Django: 1
Friends made; experience gained; lessons on Open-ness learned; personal growth: priceless enumerable
Thanks to all who made this experience amazing, beginning with Aza, Jono, Atul, Blair and the rest of the Labs team; intern extraordinaire Brandon; the always thoughtful and friendly Mozilla Japan team; and of course the fantastic Ubiquity community! Please visit me in Boston—I should be around for a while.
Scoring for Optimization
Friday, April 24th, 2009
Suppose you have a number of competing candidates, each of which can be ranked with a score, but it takes a little time to calculate each candidate’s score. You’re only interested in the top candidates. You want to come up with a scoring scheme where you can throw the extra candidates out of consideration earlier without sacrificing quality. Such is the problem of scoring and ranking suggestions in Ubiquity. What properties must such a scoring system have?
This blog post includes a lot of complex CSS-formatted graphs which may be best viewed in — what else? — Firefox. You may also want to access this blog post directly rather than through a planet.
| Candidate | Note |
|---|---|
| candidate 8 | |
| candidate 2 | |
| candidate 9 | |
| candidate 3 | |
| candidate 10 | CUTOFF |
| candidate 5 | |
| candidate 1 | |
| candidate 7 | |
| … | |
One portion of the problem description above merits clarification: I define “without sacrificing quality” to mean that, if we did not throw out any candidates early and waited until all the scores were computed fully and accurately, we would still get the same top winners. This already gives us the key insight towards an appropriate solution: we can only throw out a candidate when we know that it has no further chance of making it up into the top candidates.
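The insight above can be sketched in code. The following Python example is an illustration under stated assumptions, not Ubiquity's actual scoring code: it assumes each candidate has a cheap `upper_bound` that its expensive `full_score` can never exceed, and all names here (`top_k_with_pruning`, `upper_bound`, `full_score`) are hypothetical. A candidate is skipped only when its upper bound cannot beat the current k-th best full score, so the winners match what full evaluation would produce (ties aside).

```python
import heapq
from typing import Callable, Hashable, List, Sequence, Tuple


def top_k_with_pruning(
    candidates: Sequence[Hashable],
    upper_bound: Callable[[Hashable], float],
    full_score: Callable[[Hashable], float],
    k: int,
) -> Tuple[List[Tuple[Hashable, float]], int]:
    """Return the top-k (candidate, score) pairs and how many were fully scored."""
    # Visit candidates from the most to the least promising upper bound.
    ordered = sorted(candidates, key=upper_bound, reverse=True)
    best: list = []  # min-heap of (score, visit index, candidate)
    evaluated = 0
    for i, cand in enumerate(ordered):
        # Once k winners exist, a bound at or below the weakest winner cannot win;
        # because bounds are sorted, neither can anything after it.
        if len(best) == k and upper_bound(cand) <= best[0][0]:
            break
        score = full_score(cand)  # the expensive step
        evaluated += 1
        if len(best) < k:
            heapq.heappush(best, (score, i, cand))
        elif score > best[0][0]:
            heapq.heapreplace(best, (score, i, cand))
    ranked = sorted(best, key=lambda t: (-t[0], t[1]))
    return [(cand, score) for score, _, cand in ranked], evaluated


if __name__ == "__main__":
    bounds = {"a": 0.9, "b": 0.8, "c": 0.5, "d": 0.3}  # made-up numbers
    scores = {"a": 0.7, "b": 0.75, "c": 0.4, "d": 0.2}
    print(top_k_with_pruning(list(bounds), bounds.get, scores.get, k=2))
```

In the made-up data, only "a" and "b" are ever fully scored: by the time "c" is reached, its bound of 0.5 is already below the weakest winner's score of 0.7.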
Attachment Ambiguity—or—when is the gyudon cheap?
Wednesday, April 15th, 2009
Every day on the way to work I walk by a fine establishment known as Yoshinoya (吉野家), Japan’s largest gyudon (牛丼) chain restaurant. For those of you whose lives have yet to be graced by gyudon, it’s a bowl of rice topped with beef and onions stewed in a sweet-savory soy-based sauce. Loving gyudon and being a cheapskate, I naturally noticed the recent 50 yen off gyudon promotion at Yoshinoya. The above photo is a photo of part of that sign.
Part of this sign, though, made me think about our new Ubiquity parser. In particular, it was the attachment ambiguity in the end date of the promotion. The text in the photo above literally is “April 15th (Wed.) 8PM until”. (Note that Japanese is a strongly head-final language, and that the “until” is a postposition.) There are two possible readings for this expression, as illustrated by the two composition trees below.
Scoring and Ranking Suggestions
Tuesday, April 7th, 2009
I just spent some time reviewing how Ubiquity currently ranks its suggestions in relation to Parser The Next Generation so I thought I’d put some of these thoughts down in writing.
The issue of ranking Ubiquity suggestions can be restated as predicting an optimal output given a certain input and various conflicting considerations. Ubiquity (1.8, as of this writing) computes four “scores” for each suggestion:
Where’s The Verb?
Wednesday, March 25th, 2009
Ubiquity’s proposed new parser design is based on a principles and parameters philosophy: we can build an underlying universal parser and, for each individual language, we simply set some “parameters” to tell the parser how to act. As we consider the design’s pros and cons, it’s important to reflect back on the linguistic data and see if this architecture can adequately handle the range of linguistic data attested in our languages.
Today I’ll highlight some disparate typological data to help us understand these questions: where’s the verb? and what does the verb look like?
Unnatural by design
Sunday, March 1st, 2009
I’m flying over the Pacific Ocean right now but a little bit of language caught my eye. Here’s a picture of the menu for this flight, in three languages: English, Japanese, Chinese.

What caught my eye is the line “served with ご一緒に 配,” meant to be read as part of “Beef in BBQ sauce… served with Pepsi…”. The Chinese 配 (pèi) is fine here, meaning “with,” but the Japanese “ご一緒に” (goissho-ni) seemed awkward to me.
Gaba, Shame On You
Monday, January 12th, 2009
Here’s a picture of an ad for Gaba, a big English conversation school in Japan, I snapped on a train recently. I felt the English sentence about Gaba’s satisfaction was extremely awkward, so I put it up on twitter to check with some other native speakers. My friends concurred. What do you think?
I personally think the sentence would be improved by removing the “the” in “the satisfaction.” Others offered “continues to rise” as possibly preferable to “continually rise.” English articles, especially with the definiteness of abstract nouns, are very difficult for many non-native speakers. That being said, it’s sad for a sentence of such questionable acceptability to come from a company which, in theory, prides itself on its English ability and surely hires many native speakers. Gaba, shame on you.
This is what a release looks like
Wednesday, December 10th, 2008
This is what the latest release (2.1.6) of my Yet Another Related Posts Plugin looked like under Mint, using my WordPress plugin downloads pepper, which in turn gets its data from wordpress.org:

It’s always interesting to see these release spikes in download traffic. Note that this release was on the Wednesday but that was during the day, so Wednesday’s traffic is still higher than the normal ~300/day level, while the big peak (by day) is on Thursday. Too bad wordpress.org doesn’t give me hourly stats, though I guess that would be a little ridiculous.
YARPP is just about at that 35k download mark. I’m looking forward to the next release. ^^
Bald Moves
Friday, October 24th, 2008
On September 19th, Treasury Secretary Henry Paulson made a speech regarding the Troubled Assets Relief Program (TARP) to allay the fears of investors:
I am convinced that this bald approach will cost American families far less than the alternative—a continuing series of financial institution failures and frozen credit markets unable to fund economic expansion.
Unfortunately, the key phrase in this passage was widely mistranscribed in the media as a “bold approach.” But now that more details of the new Troubled Asset Relief Program have been released, Secretary Paulson’s true intentions are clear.
Chris Carey of Bailout Sleuth writes:
The Treasury Department tapped James H. Lambright [above center], head of the Export-Import Bank, as the interim chief investment officer for the $700 billion Troubled Asset Relief Program… The bailout program is being directed by Neel Kashkari [above left], who had been senior advisor to Treasury Secretary Henry M. Paulson Jr [above right].
Will this new program stem the global credit crisis? Maybe. But at least we can all agree… it’s a bald move.
回収 vs. 収集 and Better Word Meanings Through Usage
Thursday, September 18th, 2008
Bailey just asked me what the difference is between 回収 (kaishū) and 収集 (shūshū), two words that would both map to the English verb “collect.” I intuitively came up with a hypothesis to explain the distinction:
- 回収 may take things away from others when collecting while 収集 does not have that implication.
- Things that you 回収 may have been previously distributed by the actor themself while 収集 does not have that implication.1
Not content with armchair theorizing, however, I decided to take advantage of one of the largest corpora in the world: Google.2 To test my hypothesis, I chose two “objects of collection”, one you can take away (and often is distributed first) and one you can’t take away: アンケート (ankēto “survey,” from the French enquête) and 意見 (iken “opinion”). I then searched Google for the four resulting collocations3 in quotes (“•”) and recorded how many hits there were.
1. This second point could also be hypothesized based on the component meaning of 回, which in the verb 回る (mawa=ru) can mean “circle back.”
2. Google is of course a huge corpus but it has very limited search and can easily be misused and misunderstood, thus making Google an unreliable (unprofessional) source for statistical data. One Google alternative for some different statistics is the n-gram data they offer for research.
3. “Collocation” on Wikipedia says: “Within the area of corpus linguistics, collocation is defined as a sequence of words or terms which co-occur more often than would be expected by chance.”





