Nountype Quirks: Day 3: Geo Day

Aug 1, 2009

It’s time for one more installment of Nountype Quirks, where I review and tweak Ubiquity’s built-in nountypes. For an introduction to this effort, please read Judging Noun Types and my updates from Day 1 and Day 2.

Today I spent most of the day trying to implement major improvements to the geolocation-related nountypes. The work isn't finished yet, but I lay out the plans here.

Note: this blog post includes a number of graphs using HTML/CSS formatting. If you are reading this article through a feed reader or planet, I invite you to read it on my site.

`noun_type_geolocation`

noun_type_geolocation is the nountype used by the weather command for its location argument in input like “weather near Chicago”. The neat feature of noun_type_geolocation is its smart default value, which uses Firefox’s geolocation system to supply your current location, so I can enter “weather” and get the suggestion “weather near Broomfield, Colorado” (not completely correct, but close enough for the weather). Otherwise, however, noun_type_geolocation does not do so well: it accepts any input you give it with a score of 0.3, much like noun_arb_text. We could do better.

One issue with this noun_type_geolocation is a conceptual one. Is this nountype supposed to accept only municipalities? Countries? Or should it accept landmarks or addresses as well? Part of the issue is that it’s only used by one built-in command in Ubiquity now, weather. But to be called a general “geolocation” nountype, its output should not be specific to weather’s usage, which is to throw the result at the Weather Underground API.

I propose that we change this to be something like noun_type_geo_town and also make similar nountypes like noun_type_geo_country, noun_type_geo_region, going all the way down to noun_type_address (which already exists—see below). All of the nountypes in this family could use a geocoding API such as Google’s or Yahoo’s. Their data properties could include all of this geocoded geographic data (in English) and also the latitude/longitude coordinate data.

The weather command could then accept noun_type_geo_town. Some municipalities are not in Weather Underground, though, and for some countries its data is only as granular as administrative districts. To work around this, we could display the results of the geocoding API to the user but give Weather Underground the geocoded latitude/longitude data.
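To make the idea concrete, here is a minimal sketch of what a suggestion from one of these geo nountypes might carry, and how a weather command could show the display text while querying by coordinates. The function names, the field names, and the "lat,lng" query format are all assumptions for illustration; they are not Ubiquity's or Weather Underground's actual API.

```javascript
// Hypothetical suggestion shape for a geo nountype. None of these field
// names come from Ubiquity itself; they only illustrate the proposal.
function makeGeoSuggestion(displayText, geocoded, score) {
  return {
    text: displayText, // what the user sees in the suggestion list
    score: score,
    data: {
      town: geocoded.town,
      region: geocoded.region,
      country: geocoded.country,
      lat: geocoded.lat,
      lng: geocoded.lng,
    },
  };
}

// The command displays suggestion.text but queries by coordinates, so it
// does not matter whether the weather service knows the town by name.
// The "lat,lng" string format is an assumption.
function weatherQueryFor(suggestion) {
  return suggestion.data.lat.toFixed(4) + "," + suggestion.data.lng.toFixed(4);
}

const chicago = makeGeoSuggestion(
  "Chicago, Illinois",
  { town: "Chicago", region: "Illinois", country: "United States",
    lat: 41.8781, lng: -87.6298 },
  0.9
);
console.log(weatherQueryFor(chicago));
```

The point of the split is that the display text and the query key are decoupled: the geocoder decides what the place is called, and the weather service only ever sees coordinates.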

`noun_type_async_address`

noun_type_async_address attempts to do exactly what I’ve laid out above for the most granular level: that of geolocations with data all the way down to the street level. This is the nountype which is used for the built-in map command and uses the Yahoo geocoding service to accomplish this. Let’s see what kinds of results it returns:

input	suggestion	score
mitcho	mitcho	0.5
grenada	grenada	0.9
jono	jono	0.9
mountain view	mountain view	0.9

Let’s lay out some immediate quirks:

All scores are either 0.5 or 0.9. In general, if the Yahoo API returns some geocoded interpretation, it gets 0.9, but otherwise it accepts everything with 0.5.
The results that come back from the Yahoo service don’t add any useful information like the country or administrative region. Even the case stays lowercase.
Since when is Jono a location!? I’ll get back to this later.

For starters, the Yahoo! Maps API terms of service dictate that we can’t use its geocoding service if we’re not also displaying Yahoo maps, so I rewrote it using the Google API, which also has the advantage of offering JSON output.

One quirk of the Google Geocoding API, though, is that all of the resulting municipality names are only in English. Try for example queries for Wien or 東京 (Tokyo). Since we want our suggestions to only add information to our input, not replace the input entirely (and especially not in another language), we’ll then only take results which have the input as an initial substring. On the other hand, if none of the results have the input as a proper prefix of the return value, we will take the geocoding information from the first result but with the original input as the display text. Such results will have a markedly lower score.¹
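A minimal sketch of this selection rule follows. It assumes the geocoder results have already been reduced to objects with a `name` field, and the two score constants are made-up values chosen only to show that the fallback scores markedly lower; neither the function name nor the numbers come from Ubiquity.

```javascript
// Illustrative scores only; the real values would need tuning.
const PREFIX_MATCH_SCORE = 0.9;
const FALLBACK_SCORE = 0.4;

// results: array of { name: string, ... } as returned by some geocoder
// wrapper (hypothetical shape).
function suggestFromGeocoder(input, results) {
  const needle = input.trim().toLowerCase();
  if (needle === "") return [];

  // Case 1: keep only results that extend the input, i.e. the input is
  // a case-insensitive prefix of the result's name.
  const prefixMatches = results.filter(
    (r) => r.name.toLowerCase().startsWith(needle)
  );
  if (prefixMatches.length > 0) {
    return prefixMatches.map((r) => ({
      text: r.name,
      score: PREFIX_MATCH_SCORE,
      data: r,
    }));
  }

  // Case 2: nothing extends the input (e.g. "Wien" geocodes to "Vienna").
  // Keep the user's own text for display, attach the first result's
  // geocoding data, and score it lower.
  if (results.length > 0) {
    return [{ text: input, score: FALLBACK_SCORE, data: results[0] }];
  }

  return [];
}
```

With this rule, “mountain view” against a result named “Mountain View, CA, USA” yields the fuller name as the suggestion, while “Wien” against “Vienna, Austria” keeps “Wien” as the display text but still carries Vienna’s geocoding data.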

As this is the address nountype, we’ll penalize results which do not have detailed information such as street address or town-level information. All of this is very easy to judge as every result from the API has a geocoding accuracy value.
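As a rough sketch of such a penalty, assume the accuracy value is an integer where larger means finer-grained and town level sits at 4. Both that scale and the penalty factors below are assumptions made for illustration, not values taken from the API documentation or from Ubiquity.

```javascript
// Assumed scale: 0 = unknown, 1 = country, 2 = region, 3 = sub-region,
// 4 = town, and higher values down to street address level.
const TOWN_LEVEL = 4;

// Results at town level or finer keep their base score; coarser results
// are scaled down, more so the coarser they are.
function addressScore(baseScore, accuracy) {
  if (accuracy >= TOWN_LEVEL) return baseScore;
  const level = Math.max(0, accuracy);
  // Levels 0..3 keep 25%, 45%, 65%, 85% of the base score (made-up factors).
  const factor = 0.25 + 0.2 * level;
  return baseScore * factor;
}
```

Under these assumptions a street-level hit keeps its full score, while a country-only match such as “grenada” would drop well below it.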

The best laid plans of mice and men…

I spent a good few hours this afternoon and evening attempting to implement this new family of nountypes, including this new nountype_geo_address, but also nountype_geo_subregion, nountype_geo_region, and nountype_geo_country. Some of the quirks of the weather and map commands, however, have prevented me from completely replacing the legacy noun_type_address and noun_type_geolocation described above. I hope to continue this work again soon and actually make this transition, ideally before 0.5.2.

Look forward to one (or maybe two?) more episode(s) of Nountype Quirks where I hope to definitively explain, analyze, and tweak matchScore, the scoring algorithm which underlies the majority of the nountypes in Ubiquity. As always, I look forward to your comments and feedback.

Bonus: Where’s Jono?

It turns out that noun_type_async_address was recognizing “Jono” as an address because Jono is actually a location after all! Not only that, but Jono is in Japan!!

[Image: Picture 3.png]

You clearly can’t take Japan out of Jono, but it turns out you can’t take Jono out of Japan either.

¹ If this crazy algorithm raises a red flag for anyone, you’re not alone… if you think of a more elegant solution, please let me know. This will no doubt be an issue when it comes to localizing the address nountype as well. I wish we could specify an output language for the Google Geocoding API… :(