# blog

## Archive for the ‘observation’ Category

+ + =

That is all.

### Voicemail from Jesse

My friend Jesse left me a voicemail on my Google Voice number. Here’s a demo of the fantastic transcription feature.

### Disgusting Word-formatted HTML and how to fix it

In working on a new website for the MIT Working Papers in Linguistics, I recently inherited a collection of HTML files with all of our books’ abstracts. To my dismay (but not surprise), the markup in these files was horrendous. Here are some of the cardinal sins of markup that I saw committed in these files:

1. Confusing `id`s and `class`es. `id`s should be unique on the page… but here the same `id` is used multiple times in order to format the elements together.
```
<div id="indent"> <div id="number">4.2.1</div> <div id="page">161</div> <div id="section">Old French (Adams 1987)</div>
</div> <div id="indent"> <div id="number">4.2.2</div> <div id="page">164</div> <div id="section">The evolution of the dialects of northern Italy</div>
```
1. Putting a class on every instance of something. Every paragraph should be formatted equivalently. We get the point.
```
<p class=MsoNormal><b>The English Noun Phrase in Its Sentential Aspect</b></p>
<p class=MsoNormal>Steven Paul Abney</p>
<p class=MsoNormal>May 1987</p>
```
1. Using blank space for formatting.
`<p class=MsoNormal><o:p>&nbsp;</o:p></p>`
1. CSS styles that don’t exist. Browsers just ignore these anyway…
```
<p class=MsoNormal>One factor in determining which worlds a modal quantifies
over is the temporal argument of the modal’s accessibility relation.<span
style='mso-spacerun:yes'>  </span>It is well-known that a higher tense affects
the accessibility relation of modals.<span style='mso-spacerun:yes'>
</span>What is not well-known is that there are aspectual operators high enough
to affect the accessibility relation of modals.<span style='mso-spacerun:yes'>
</span>
```

### The solution

My solution was to write a Perl script that takes care of a number of these issues. It’s not foolproof and doesn’t involve any voodoo (for example, it can’t re-typeset things that were formatted using whitespace), but it does a good job as a first pass.

You can run the script by making it executable (`chmod +x cleanwordhtml.pl`) then specifying a target filename as an argument. For example,

`./cleanwordhtml.pl source.html > clean.html`

I used this with a simple bash for loop to run over all my files:

`for f in */*.html; do ./cleanwordhtml.pl "$f" > "${f%.html}-clean.html"; done`

Hopefully someone else can benefit from my experience.
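For readers who want to experiment with the same idea, here is a minimal Python sketch of this kind of first pass. It is not the Perl script described above: the function name `clean_word_html` and the particular set of rules are my own assumptions, chosen only to cover the examples listed earlier. Regular expressions are a blunt tool for HTML, so treat this as a starting point rather than a robust cleaner.

```python
import re


def clean_word_html(html: str) -> str:
    """Rough first-pass cleanup of Word-exported HTML (illustrative only)."""
    # Drop Office-namespace tags such as <o:p> and </o:p>, keeping their contents.
    html = re.sub(r"</?o:p\b[^>]*>", "", html)
    # Unwrap spans whose style is an mso-* declaration, keeping their contents.
    html = re.sub(
        r"<span\s+style=(['\"])mso-[^'\"]*\1\s*>(.*?)</span>",
        r"\2",
        html,
        flags=re.S,
    )
    # Strip Mso* class attributes, quoted or unquoted.
    html = re.sub(r"\s+class=(['\"]?)Mso\w+\1", "", html)
    # Delete paragraphs that contain nothing but whitespace or &nbsp;.
    html = re.sub(r"<p>(?:\s|&nbsp;)*</p>\s*", "", html)
    # Collapse runs of spaces left behind by Word's spacer spans.
    html = re.sub(r"[ \t]{2,}", " ", html)
    return html


if __name__ == "__main__":
    sample = "<p class=MsoNormal>relation.<span style='mso-spacerun:yes'>  </span>It is</p>"
    print(clean_word_html(sample))
```

Like the script, this cannot recover layout that was done with whitespace; it only removes the markup that carries no meaning.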

### Mozilla By The Numbers

About six months ago I started working for Mozilla Labs full-time, focusing on Ubiquity, the multilingual natural language interface for the browser. This week marked my last week on contract as I go back to grad school next week. While the work will go on and I hope to continue to stay involved as time allows, here’s a quick bird’s eye view of my activities in my Mozilla tenure:

Time working for Mozilla: 6.5 months

Mozilla-related blog posts written: 69

Academic papers written on Ubiquity: 1

Ubiquity presentations given: 5

Most popular video on Vimeo: Ubiquity 0.5 日本語紹介ビデオ, the Japanese Ubiquity 0.5 introduction video: 2252 views

Languages Ubiquity commands and parser now support: 6

Commits to the Ubiquity repository: 492

Other web projects started during this period: 2+ (Ten Grand Is Buried There, HookPress)

TechCrunch references: 2 (1, 2)

Countries worked in: 2

Mythical Kiwis worked with: 1

References to bugs I introduced as “glitcho”s: 1

Extremely disturbing homages to me and Django: 1

Friends made; experience gained; lessons on Open-ness learned; personal growth: priceless enumerable

Thanks to all who made this experience amazing, beginning with Aza, Jono, Atul, Blair and the rest of the Labs team; intern extraordinaire Brandon; the always thoughtful and friendly Mozilla Japan team; and of course the fantastic Ubiquity community! Please visit me in Boston—I should be around for a while.

### Scoring for Optimization

Suppose you have a number of competing candidates, each of which can be ranked with a score, but it takes a little time to calculate each candidate’s score. You’re only interested in the top $n$ candidates. You want to come up with a scoring scheme where you can throw the extra candidates out of consideration earlier without sacrificing quality. Such is the problem of scoring and ranking suggestions in Ubiquity. What properties must such a scoring system have?

This blog post includes a lot of complex CSS-formatted graphs which may be best viewed in — what else? — Firefox. You may also want to access this blog post directly rather than through a planet.

[Graph: candidate 8 and the CUTOFF line]

One portion of the problem description above merits clarification: I define “without sacrificing quality” to mean that, if we did not throw out any candidates early and waited until all the scores were computed fully and accurately, we would still get the same top $n$ winners. This already gives us the key insight towards an appropriate solution: we can only throw out a candidate when we know that it has no further chance of making it into the top $n$.
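To make that insight concrete, here is a minimal Python sketch. It assumes something the description above does not spell out: that for each candidate we can cheaply compute an upper bound on its final score before paying for the full score. The names `top_n_with_pruning`, `upper_bound` and `full_score` are hypothetical, and this is not Ubiquity's actual code.

```python
import heapq
from typing import Callable, Iterable, List, Tuple


def top_n_with_pruning(
    candidates: Iterable[str],
    n: int,
    upper_bound: Callable[[str], float],
    full_score: Callable[[str], float],
) -> List[Tuple[float, str]]:
    """Return the top-n (score, candidate) pairs, best first.

    A candidate is skipped only when its upper bound cannot beat the
    current n-th best score, so the winners match a full evaluation.
    """
    if n <= 0:
        return []
    best: list = []  # min-heap of (score, index, candidate); best[0] is the cutoff
    for index, cand in enumerate(candidates):
        if len(best) == n and upper_bound(cand) <= best[0][0]:
            continue  # no chance of entering the top n, so skip the expensive score
        score = full_score(cand)
        entry = (score, index, cand)
        if len(best) < n:
            heapq.heappush(best, entry)
        elif score > best[0][0]:
            heapq.heapreplace(best, entry)
    return [(score, cand) for score, _, cand in sorted(best, reverse=True)]
```

The cutoff is simply the lowest score currently in the top $n$. The tighter the upper bounds, the more candidates get discarded without ever being fully scored.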

(Continue reading…)

### Attachment Ambiguity—or—when is the gyudon cheap?

Every day on the way to work I walk by a fine establishment known as Yoshinoya (吉野家), Japan’s largest gyudon (牛丼) chain restaurant. For those of you whose lives have yet to be graced by gyudon, it’s a bowl of rice topped with beef and onions stewed in a sweet-savory soy-based sauce. Loving gyudon and being a cheapskate, I naturally noticed the recent 50 yen off gyudon promotion at Yoshinoya. The photo above shows part of that sign.

Part of this sign, though, made me think about our new Ubiquity parser. In particular, it was the attachment ambiguity in the end date of the promotion. The text in the photo above literally is “April 15th (Wed.) 8PM until”. (Note that Japanese is a strongly head-final language, and that the “until” is a postposition.) There are two possible readings for this expression, as illustrated by the two composition trees below.

(Continue reading…)

### Scoring and Ranking Suggestions

I just spent some time reviewing how Ubiquity currently ranks its suggestions in relation to Parser The Next Generation, so I thought I’d put some of these thoughts down in writing.

The issue of ranking Ubiquity suggestions can be restated as predicting an optimal output given a certain input and various conflicting considerations. Ubiquity (1.8, as of this writing) computes four “scores” for each suggestion:

(Continue reading…)

### Where’s The Verb?

Ubiquity’s proposed new parser design is based on a principles and parameters philosophy: we can build an underlying universal parser and, for each individual language, we simply set some “parameters” to tell the parser how to act. As we consider the design’s pros and cons, it’s important to reflect back on the linguistic data and see if this architecture can adequately handle the range of linguistic data attested in our languages.
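As a rough illustration of the idea, and not of Ubiquity's actual parser, here is a small Python sketch in which a single "universal" routine is steered by per-language settings. The names `LanguageParams`, `verb_final` and `find_verb`, the tiny verb lists, and the assumption that the input has already been split into tokens are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class LanguageParams:
    """Hypothetical per-language settings; not Ubiquity's real parameter set."""

    verb_final: bool       # look for the verb at the end of the input?
    verbs: FrozenSet[str]  # known verb forms in this language


def find_verb(tokens: List[str], params: LanguageParams) -> Optional[str]:
    """Universal routine: the same code runs for every language."""
    if not tokens:
        return None
    candidate = tokens[-1] if params.verb_final else tokens[0]
    return candidate if candidate in params.verbs else None


ENGLISH = LanguageParams(verb_final=False, verbs=frozenset({"translate", "email"}))
JAPANESE = LanguageParams(verb_final=True, verbs=frozenset({"翻訳する", "送る"}))

if __name__ == "__main__":
    print(find_verb(["translate", "hello"], ENGLISH))
    print(find_verb(["これ", "を", "翻訳する"], JAPANESE))
```

Whether two or three switches like `verb_final` are really enough for every language is exactly the kind of question the typological data is meant to test.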

Today I’ll highlight some disparate typological data to help us understand these questions: where’s the verb, and what does the verb look like? (Continue reading…)

### Unnatural by design

I’m flying over the Pacific Ocean right now, but a little bit of language caught my eye. Here’s a picture of the menu for this flight, in three languages: English, Japanese, Chinese.

What caught my eye is the line “served with ご一緒に 配,” meant to be read as part of “Beef in BBQ sauce… served with Pepsi…”. The Chinese 配 (pèi) is fine here, meaning “with,” but the Japanese “ご一緒に” (goissho-ni) seemed awkward to me.

(Continue reading…)

### Gaba, Shame On You

Here’s a picture I snapped on a train recently of an ad for Gaba, a big English conversation school in Japan. I felt the English sentence about Gaba’s satisfaction was extremely awkward, so I put it up on Twitter to check with some other native speakers. My friends concurred. What do you think?

I personally think the sentence would be improved by removing the “the” in “the satisfaction.” Others offered “continues to rise” as possibly preferable to “continually rise.” English articles, especially the definiteness of abstract nouns, are very difficult for many non-native speakers. That being said, it’s sad for a sentence of such questionable acceptability to come from a company which, in theory, prides itself on its English ability and surely hires many native speakers. Gaba, shame on you.

### This is what a release looks like

This is what the latest release (2.1.6) of my Yet Another Related Posts Plugin looked like under Mint, using my WordPress plugin downloads pepper, which in turn gets its data from wordpress.org:

It’s always interesting to see these release spikes in download traffic. Note that this release went out on Wednesday, but during the day, so Wednesday’s traffic is still higher than the normal ~300/day level, while the big peak (by day) is on Thursday. Too bad wordpress.org doesn’t give me hourly stats, though I guess that would be a little ridiculous.

YARPP is just about at that 35k download mark. I’m looking forward to the next release. ^^

### Bald Moves

On September 19th, Treasury Secretary Henry Paulson made a speech regarding the Troubled Asset Relief Program (TARP) to allay the fears of investors:

I am convinced that this bald approach will cost American families far less than the alternative—a continuing series of financial institution failures and frozen credit markets unable to fund economic expansion.

Unfortunately, the key phrase in this passage was widely mistranscribed in the media as a “bold approach.” But now that more details of the new Troubled Asset Relief Program have been released, Secretary Paulson’s true intentions are clear.

Chris Carey of Bailout Sleuth writes:

The Treasury Department tapped James H. Lambright [above center], head of the Export-Import Bank, as the interim chief investment officer for the $700 billion Troubled Asset Relief Program… The bailout program is being directed by Neel Kashkari [above left], who had been senior advisor to Treasury Secretary Henry M. Paulson Jr [above right].

Will this new program stem the global credit crisis? Maybe. But at least we can all agree… it’s a bald move.

### 回収 vs. 収集 and Better Word Meanings Through Usage

Bailey just asked me what the difference is between 回収 (kaishū) and 収集 (shūshū), two words that would both map to the English verb “collect.” I intuitively came up with a hypothesis to explain the distinction:

• 回収 may take things away from others when collecting while 収集 does not have that implication.
• Things that you 回収 may have been previously distributed by the actor themself while 収集 does not have that implication.[1]

Not content with armchair theorizing, however, I decided to take advantage of one of the largest corpora in the world: Google.[2] To test my hypothesis, I chose two “objects of collection”, one you can take away (and which is often distributed first) and one you can’t take away: アンケート (ankēto “survey,” from the French enquête) and 意見 (iken “opinion”). I then searched Google for the four resulting collocations[3] in quotes (“•”) and recorded how many hits there were.
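The two objects crossed with the two verbs give a 2×2 grid of exact-phrase queries, which can be sketched in Python as below. The object marker を joining each noun to its verb is my assumption about how the collocations were formed, the function name `collocation_queries` is hypothetical, and the hit counts themselves still have to be looked up by hand.

```python
from itertools import product
from typing import List

OBJECTS = ["アンケート", "意見"]  # ankēto "survey", iken "opinion"
VERBS = ["回収", "収集"]          # kaishū, shūshū


def collocation_queries(
    objects: List[str], verbs: List[str], particle: str = "を"
) -> List[str]:
    """Build one quoted exact-phrase query per object/verb pair."""
    return [f'"{obj}{particle}{verb}"' for obj, verb in product(objects, verbs)]


if __name__ == "__main__":
    for query in collocation_queries(OBJECTS, VERBS):
        print(query)
```

If the hypothesis holds, the survey row should lean towards 回収 and the opinion row towards 収集.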

(Continue reading…)

1. This second point could also be hypothesized based on the component meaning of 回, which in the verb 回る (mawa=ru) can mean “circle back.”

2. Google is of course a huge corpus but it has very limited search and can easily be misused and misunderstood, thus making Google an unreliable (unprofessional) source for statistical data. One Google alternative for some different statistics is the n-gram data they offer for research.

3. “Collocation” on Wikipedia says: “Within the area of corpus linguistics, collocation is defined as a sequence of words or terms which co-occur more often than would be expected by chance.”

### Oh Twitter, you’re so funny

I don’t think those two were related.

© 2006–2013 mitcho (Michael 芳貴 Erlewine).