Nountype Quirks: Day 2

Jul 30, 2009

Today I’m continuing the process of reviewing and tweaking all of the nountypes built into Ubiquity. For a more respectable introduction to this endeavor, please read my blog post from a couple of days ago, Judging Noun Types, and my status update from yesterday, Nountype Quirks: Day 1.

Note: this blog post includes a number of graphs using HTML/CSS formatting. If you are reading this article through a feed reader or planet, I invite you to read it on my site.

`noun_type_twitter_user`

Let’s begin again by considering the suggestions and scores that a variety of different inputs to this nountype return and see what quirks we find.

To test this nountype, I made sure I had logged into Twitter once with the login mitchoyoshitaka.

input	suggestion	score
mitcho	mitchoyoshitaka	0.85
mitcho	mitcho	0.5
mitchoyoshi	mitchoyoshitaka	0.94
mitchoyoshi	mitcho	0.5
test	test	0.5
テスト	none
hello world	none
@test	none

As nountypes go, this is looking pretty good. For usernames which look like logins we’ve saved before, we’re using matchScore to get decent differential scores.¹ It’s even ruling out impossible Twitter username strings, according to Twitter’s own restrictions on usernames.

One possible improvement we could make is to let @ strings be accepted. I went ahead and made this improvement. The initial @ is stripped off and the remainder is checked as normal, but the final score receives a slight boost using an nth root formula. The twitter command was also updated to deal with inputs with and without the initial @.

input	suggestion	score
mitcho	mitchoyoshitaka	0.85
mitcho	mitcho	0.5
@mitcho	@mitchoyoshitaka	0.88
@mitcho	@mitcho	0.57
test	test	0.5
@test	@test	0.57
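
As a rough illustration of the @ handling described above, here is a minimal sketch. The post does not say which root is used, so the exponent of 0.8 (the 1.25th root) is an assumption; it merely happens to reproduce the scores in the table. The function name `twitterUserScore`, the `baseScore` callback (standing in for the matchScore-based scoring) and the `\w+` validity check are likewise hypothetical.

```javascript
// Assumption: the post only says "nth root"; 0.8 (= 1/1.25) is a guess
// that happens to match the table (0.5 -> 0.57, 0.85 -> 0.88).
const AT_BOOST_EXPONENT = 0.8;

// Assumed validity check (ASCII letters, digits, underscore). It stands in
// for Twitter's actual username restriction, which the post does not spell out.
const VALID_USERNAME = /^\w+$/;

// baseScore is a hypothetical callback standing in for the normal scoring.
function twitterUserScore(input, baseScore) {
  const hasAt = input.startsWith("@");
  const name = hasAt ? input.slice(1) : input; // strip the initial @
  if (!VALID_USERNAME.test(name)) return null; // impossible username
  const score = baseScore(name);
  // For 0 < score < 1, an exponent below 1 raises the score but keeps it under 1.
  return hasAt ? Math.pow(score, AT_BOOST_EXPONENT) : score;
}

console.log(twitterUserScore("@test", () => 0.5).toFixed(2));    // 0.57
console.log(twitterUserScore("@mitcho", () => 0.85).toFixed(2)); // 0.88
console.log(twitterUserScore("hello world", () => 0.5));         // null
```

A root is a convenient way to express a "slight boost" because it can never push a score above 1.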

Although the noun_type_twitter_user nountype is currently most used by the built-in twitter command to specify the user’s username, in theory it could also be used for example in a command which pulls up another user’s tweets. With that in mind, perhaps in the future we could check the browser history and/or bookmarks for entries of the form http://twitter.com/... and suggest those as well (trac #846).

`noun_type_number`

input	suggestion	score
text	none
0.5	0.5	1
0.5.1	none

This nountype has an incredibly simple job and does it with ease. I’m going to leave it alone.
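
For reference, the behavior in the table can be approximated with a single pattern. This is only a sketch: the pattern and the name `suggestNumber` are hypothetical and are not taken from the Ubiquity source.

```javascript
// Hypothetical pattern: optional sign, digits, at most one decimal point.
const NUMBER_PATTERN = /^[+-]?(\d+\.?\d*|\.\d+)$/;

// Returns a suggestion with score 1 for number-like input, otherwise null.
function suggestNumber(input) {
  return NUMBER_PATTERN.test(input) ? { text: input, score: 1 } : null;
}

console.log(suggestNumber("0.5"));   // { text: '0.5', score: 1 }
console.log(suggestNumber("0.5.1")); // null
console.log(suggestNumber("text"));  // null
```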

`noun_type_date` and `noun_type_time`

noun_type_date and noun_type_time both use the magical Date.parse method to parse date- and time-like strings. Let’s first take a look at some of its suggestions:

input	`date` suggestion	`time` suggestion	score
June 8th 5pm	2009-06-08	05:00 PM	1
5pm	2009-07-30	05:00 PM	1
5	2009-07-05	12:00 AM	1
June 8th	2009-06-08	12:00 AM	1
today	2009-07-30	12:00 AM	1
now	2009-07-30	02:40 PM	1
5pm is a good time	none	none

The quirks in these outputs can be summed up into these two factors:

There is no differential scoring at all.
Both nountypes parse the input with Date.parse and then just spit out the date or time components of the result. Thus time-only inputs get the default date and date-only inputs get the default time with equal scores.

I just rewrote both nountypes and also added a new noun_type_date_time. Here are some of the features of the new implementation:

If the input only contains digits and spaces, it is marked down.
With the exception of the inputs ‘today’ and ‘now’, if the resulting Date object’s date is today, its date suggestion is scored lower; likewise, its time suggestion is scored lower if the time is the default value, “12:00 AM”.
Inputs (with the exception of ‘today’ and ‘now’) which are shorter than the output string get a slight penalty. This factor reflects the intuition that a longer output than input means some generic information was added and thus there is less confidence in the output.

Here’s what some of the inputs give now:

input	suggestion	score
June 8th 5pm	`date`: 2009-06-08	0.7
	`time`: 05:00 PM	0.7
	`date_time`: 2009-06-08 05:00 PM	0.86
5pm	`date`: 2009-07-30	0.27
	`time`: 05:00 PM	0.81
	`date_time`: 2009-07-30 05:00 PM	0.49
5	`date`: 2009-07-05	0.53
	`time`: 12:00 AM	0.19
	`date_time`: 2009-07-05 12:00 AM	0.34
June 8th	`date`: 2009-06-08	0.95
	`time`: 12:00 AM	0.35
	`date_time`: 2009-06-08 12:00 AM	0.58
today	`date`: 2009-07-30	1
	`time`: 12:00 AM	0.45
	`date_time`: 2009-06-08 12:00 AM	0.7
now	`date`: 2009-07-30	0.7
	`time`: 12:00 AM	1
	`date_time`: 2009-06-08 04:34 PM	1
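
To make the three heuristics listed earlier more concrete, here is a minimal sketch of how such penalties could be combined. Everything in it is an assumption: the function name, the multiplicative combination, and the penalty factors are made up for illustration, and it does not attempt to reproduce the exact scores in the table.

```javascript
// All penalty factors are illustrative values, not the ones used in Ubiquity.
const DIGITS_ONLY_PENALTY = 0.5;   // input contains only digits and spaces
const DEFAULT_VALUE_PENALTY = 0.4; // output is just today's date / "12:00 AM"
const LENGTH_PENALTY = 0.9;        // input is shorter than the output

// isDefaultValue: true when the suggestion is the fallback value
// (today's date for the date nountype, "12:00 AM" for the time nountype).
function scoreDateSuggestion(input, output, isDefaultValue) {
  const trimmed = input.trim().toLowerCase();
  // In this simplified sketch, 'today' and 'now' skip the last two penalties.
  const isKeyword = trimmed === "today" || trimmed === "now";
  let score = 1;
  if (/^[\d\s]+$/.test(trimmed)) score *= DIGITS_ONLY_PENALTY;
  if (!isKeyword && isDefaultValue) score *= DEFAULT_VALUE_PENALTY;
  if (!isKeyword && input.length < output.length) score *= LENGTH_PENALTY;
  return score;
}

// "June 8th" names a date, so its date suggestion outranks the one for "5pm",
// where the date is only the default.
console.log(scoreDateSuggestion("June 8th", "2009-06-08", false));
console.log(scoreDateSuggestion("5pm", "2009-07-30", true));
```

Multiplying independent factors keeps every score between 0 and 1 and lets several weak signals add up to a strong penalty.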

In addition, looking to the future we’d like to make nountypes localizable as well, and these two nountypes in particular will surely require some good thinking and planning to make localizable.

`noun_type_email` and `noun_type_contact`

noun_type_email and noun_type_contact are two closely related nountypes. noun_type_email simply validates email address-looking strings, while noun_type_contact will return the noun_type_email suggestions and additionally return contacts from GMail if available.

The first thing to note is that I’ve often found the GMail contact lookup to be finicky in my own use. Reading through the code, I discovered the solution: GMail must either be open in a tab or you must use the “stay signed in” option and close the GMail tab.² With this mystery solved, and some code cleanup done to this contact fetching, let’s take a look at some example suggestions: (suggestions overlapping with noun_type_email are not listed here)

input	suggestion	score
aza@m	aza@mozilla.com	0.42
jono	jdicarlo@mozilla.com	0.28
jdicarlo	jdicarlo@mozilla.com	0.19

In general, these scores all look pretty poor. In particular, note that the “jono” input yielded a higher score for the same suggestion than “jdicarlo”, even though “jdicarlo” is longer and thus, intuitively, carries more information and should arguably do better. Digging into the code, I realized why: it was computing the scores by comparing “jono” and “jdicarlo” not simply to “Jono DiCarlo” and “jdicarlo@mozilla.com” respectively, but to the combined string “Jono DiCarlo jdicarlo@mozilla.com”. I changed this so that the email address and the name are each analyzed individually and, due to the way nountype detection works in Parser 2, no duplicates are returned. Here are the updated results:

input	suggestion	score
jono	jdicarlo@mozilla.com	0.83
jdicarlo	jdicarlo@mozilla.com	0.85

That’s much better!
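
The effect of scoring against the combined string can be seen with a toy example. `toyScore` below is a hypothetical stand-in for matchScore (it is simply the fraction of the target covered by the input), and keeping the better of the two field scores is my assumption about how the duplicate suggestions collapse; neither is the actual Ubiquity code.

```javascript
// Toy stand-in for matchScore (hypothetical): the fraction of the target
// covered by the input when the input occurs in it, otherwise 0.
function toyScore(input, target) {
  const found = target.toLowerCase().includes(input.toLowerCase());
  return found ? input.length / target.length : 0;
}

const contact = { name: "Jono DiCarlo", email: "jdicarlo@mozilla.com" };

// Old behavior: one score against "name email" glued together, so the
// unrelated field always dilutes the match.
function combinedScore(input, c) {
  return toyScore(input, c.name + " " + c.email);
}

// New behavior: score each field on its own and keep the better result.
function separateScore(input, c) {
  return Math.max(toyScore(input, c.name), toyScore(input, c.email));
}

for (const input of ["jono", "jdicarlo"]) {
  console.log(input, combinedScore(input, contact), separateScore(input, contact));
}
```

With any score that depends on how much of the target is matched, the combined string penalizes every contact by the length of the field that did not match.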

Now let’s consider the suggestions from noun_type_email. Here are what they originally looked like:

input	suggestion	score
bpung	none
bpung@m	bpung@m	1
bpung@mozilla.com	bpung@mozilla.com	1

noun_type_email is based on a very robust regular expression for RFC 2822. Unfortunately this means that it completely rules out strings such as “bpung” which could be a proper prefix of an email address—something that I’ve advocated for avoiding before (see footnote 2 of Judging Noun Types). Moreover, due to a quirk of how nountypes based on regular expressions are scored, all results are given the score of 1.

I just committed a change so that this behavior is improved. The new version accepts strings which match the username part of the email address spec sans @ and domain, but with a large score penalty.³ Moreover, domains which do not have a final label (the top-level domain) with more than one letter (unless it’s an IP address) or do not have any periods (.) in the domain will be penalized as well. Here’s what the same inputs produce now:

input	suggestion	score
bpung	bpung	0.3
bpung@m	bpung@m	0.8
bpung@mozilla.com	bpung@mozilla.com	1
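
The tiers in this table can be sketched as follows. This is a simplified illustration, not the committed code: the patterns are far looser than the RFC 2822 regular expression, the function name `scoreEmail` is hypothetical, and the fixed values 0.3, 0.8 and 1 are simply read off the table (the real implementation may compute them differently).

```javascript
// Simplified patterns; the real nountype uses a much stricter RFC 2822 regex.
const LOCAL_PART = /^[\w.!#$%&'*+\/=?^{|}~-]+$/;
const DOMAIN_CHARS = /^[A-Za-z0-9.-]*$/; // may still be empty while typing
const IPV4 = /^\d{1,3}(\.\d{1,3}){3}$/;

function scoreEmail(input) {
  const at = input.indexOf("@");
  if (at === -1) {
    // Username part only: accepted, but heavily penalized.
    return LOCAL_PART.test(input) ? 0.3 : 0;
  }
  const local = input.slice(0, at);
  const domain = input.slice(at + 1);
  if (!LOCAL_PART.test(local) || !DOMAIN_CHARS.test(domain)) return 0;
  if (IPV4.test(domain)) return 1; // IP addresses are exempt from the penalty
  const labels = domain.split(".");
  const tld = labels[labels.length - 1];
  // Penalize domains with no period or a final label of one letter or less.
  const looksComplete = labels.length > 1 && tld.length > 1;
  return looksComplete ? 1 : 0.8;
}

console.log(scoreEmail("bpung"));             // 0.3
console.log(scoreEmail("bpung@m"));           // 0.8
console.log(scoreEmail("bpung@mozilla.com")); // 1
```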

Same time, same channel

I hope this post sheds light on the many changes I made together as well as the underlying thought process. If you don’t agree with any particular fix or analysis, please comment! I’ll be back again tomorrow with another installment of Nountype Quirks. Stay tuned!

¹ Again, matchScore will be the subject of another blog post in the near future.
² Moreover, due to the way noun_type_contact caches the contact list internally, as long as GMail’s contacts are available once, you should be able to continue accessing those contacts’ suggestions after logging out of GMail. There are also great performance benefits to this caching. The downside is that we currently have no way to know when to clear the cache, so even if you update your contacts in GMail, those new contacts won’t appear in Ubiquity until you restart Firefox.
³ Perhaps this is a horrible idea, because if executed or previewed, any verb which uses these nountypes would have to deal with arguments which are not valid email addresses. In my mind, though, as long as it doesn’t actually cause any error, this should be okay. Keep in mind that, given the very low scores given to these suggestions, parses using it would most likely only show up if the verb which requires these nountypes was explicitly given and there are other arguments as well, for example in input like “email hello to bpung”. In such a situation, we would rather this suggestion not disappear until we type “@m”. If executed, the built-in email verb, for instance, will deal with this gracefully by simply putting the incomplete email address in the To field.
