Today I’m continuing the process of reviewing and tweaking all of the nountypes built into Ubiquity. For a more respectable introduction to this endeavor, please read my blog post from a couple of days ago, Judging Noun Types, and my status update from yesterday, Nountype Quirks: Day 1.
Note: this blog post includes a number of graphs using HTML/CSS formatting. If you are reading this article through a feed reader or planet, I invite you to read it on my site.
Let’s begin again by considering the suggestions and scores that a variety of different inputs to this nountype return and see what quirks we find.
As nountypes go, this is looking pretty good. For usernames which look like logins we’ve saved before, we’re using matchScore to get decent differential scores.1 It’s even ruling out impossible Twitter username strings, according to Twitter’s own restriction:
One possible improvement we could make is to let strings beginning with @ be accepted. I went ahead and made this improvement. The initial @ is stripped off and the rest is checked as normal, but the final score receives a slight boost using an nth root formula. The
noun_type_twitter_user nountype is currently most used by the built-in
http://twitter.com/... and suggest those as well (trac #846).
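The @ handling described above can be sketched roughly as follows. This is a minimal illustration, not the code committed to Ubiquity: the function names, the stand-in base score, the root degree `n`, and the username rule (1 to 15 letters, digits, or underscores) are all assumptions made for the example.

```javascript
// Hypothetical sketch; names and constants are illustrative, not Ubiquity's.
// Assumed username rule: 1 to 15 letters, digits, or underscores.
const USERNAME_RE = /^\w{1,15}$/;

// Stand-in base score: 1 for a known login, 0.5 for any other valid name,
// 0 for strings that cannot be a username.
function baseScore(name, knownLogins) {
  if (!USERNAME_RE.test(name)) return 0;
  return knownLogins.includes(name) ? 1 : 0.5;
}

// Strip one leading "@", score the rest as usual, then boost with an nth root.
// For a score strictly between 0 and 1, score^(1/n) with n > 1 is larger than
// the score itself, while 0 and 1 are left unchanged.
function twitterUserScore(input, knownLogins, n = 1.5) {
  const hadAt = input.startsWith("@");
  const name = hadAt ? input.slice(1) : input;
  const score = baseScore(name, knownLogins);
  return hadAt ? Math.pow(score, 1 / n) : score;
}
```

A root is a convenient boost because the result can never exceed 1 and an impossible username stays at 0, so no clamping is needed.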
This nountype has an incredibly simple job and does it with ease. I’m going to leave it alone.
noun_type_date and noun_type_time both use the magical Date.parse method to parse date- and time-like strings. Let’s first take a look at some of their suggestions:
| Input | Date | Time | Score |
|---|---|---|---|
| June 8th 5pm | 2009-06-08 | 05:00 PM | 1 |
| June 8th | 2009-06-08 | 12:00 AM | 1 |
| 5pm is a good time | none | none | |
The quirks in these outputs can be summed up in two points:
- There is no differential scoring at all.
- Both nountypes parse the input with Date.parse and then just spit out the date or time components of the result. Thus time-only inputs get the default date and date-only inputs get the default time with equal scores.
I just rewrote both nountypes and also added a new noun_type_date_time. Here are some of the features of the new implementation:
- If the input only contains digits and spaces, it is marked down.
- With the exception of the outputs ‘today’ and ‘now’, if the resulting Date object’s date is today, its date suggestion is scored lower; likewise, a time suggestion is scored lower when its time is the default value, “12:00 AM”.
- With the exception of ‘today’ and ‘now’, inputs which are shorter than the output string get a slight penalty. This factor reflects the intuition that a longer output than input means some generic information was added and thus there is less confidence in the output.
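To make the list above concrete, here is a minimal sketch of how the date-side heuristics could combine into one score. The function names and the penalty factors are invented for illustration and are not the values in the actual implementation; the time-side check against “12:00 AM” would follow the same pattern and is left out.

```javascript
// Hypothetical sketch of the heuristics listed above; the penalty factors
// are made-up values, not the ones committed to Ubiquity.
const KEYWORDS = ["today", "now"];

function isSameDay(a, b) {
  return a.getFullYear() === b.getFullYear() &&
         a.getMonth() === b.getMonth() &&
         a.getDate() === b.getDate();
}

// input:  what the user typed
// output: the suggestion text shown to the user
// parsed: the Date object produced by the parser
// now:    the current time, passed in so the function is easy to test
function scoreDateSuggestion(input, output, parsed, now) {
  let score = 1;
  const text = input.trim();

  // Digits and spaces only ("8", "2009 6"): weak evidence of a date.
  if (/^[\d\s]+$/.test(text)) score *= 0.5;

  if (!KEYWORDS.includes(output)) {
    // The date is today: likely filled in by default rather than typed.
    if (isSameDay(parsed, now)) score *= 0.7;
    // Input shorter than output: generic information was added.
    if (text.length < output.length) score *= 0.9;
  }
  return score;
}
```

Multiplying the factors means an input that trips several heuristics at once, such as a bare “8”, ends up well below one that trips only a single heuristic.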
Here’s what some of the inputs give now:
|June 8th 5pm||
In addition, looking to the future, we’d like to make nountypes localizable as well, and these two nountypes in particular will surely require careful thinking and planning.
noun_type_email and noun_type_contact are two closely related nountypes. noun_type_email simply validates strings which look like email addresses, while noun_type_contact returns the noun_type_email suggestions and additionally returns contacts from GMail if available.
The first thing to note is that I’ve often found the GMail contact lookup to be finicky in my own use. Reading through the code, I discovered the solution: GMail must either be open in a tab or you must use the “stay signed in” option and close the GMail tab.2 With this mystery solved, and some code cleanup done to this contact fetching, let’s take a look at some example suggestions (suggestions overlapping with noun_type_email are not listed here):
In general, we see that these scores all look pretty poor. In particular, though, note that the “jono” input yielded a higher score for the same suggestion than “jdicarlo”, even though “jdicarlo” is longer and thus, intuitively, has more informational content and should maybe do better. Digging into the code, I realized why this is. It was computing the scores by comparing “jono” and “jdicarlo” not simply to “Jono DiCarlo” and “firstname.lastname@example.org” respectively, but to the combined string “Jono DiCarlo email@example.com”. The fix is to score the name and the email address individually. With this change in place, both are analyzed on their own and, due to the way nountype detection works in Parser 2, no duplicates are returned. Here are the updated results:
That’s much better!
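The change to noun_type_contact scoring described above can be sketched as follows. The `simpleScore` function is a hypothetical stand-in for matchScore (it is not how matchScore actually works), and the contact is made up; the point is only the difference between scoring against one glued-together string and scoring each field on its own.

```javascript
// Hypothetical stand-in for matchScore: the share of the candidate text
// covered by the input, or 0 when the input does not occur in it.
function simpleScore(input, text) {
  if (!text) return 0;
  const found = text.toLowerCase().includes(input.toLowerCase());
  return found ? input.length / text.length : 0;
}

// Old approach: score against the name and the address glued together.
function combinedScore(input, contact) {
  return simpleScore(input, contact.name + " " + contact.email);
}

// New approach: score each field on its own and keep the better result.
function separateScore(input, contact) {
  return Math.max(simpleScore(input, contact.name),
                  simpleScore(input, contact.email));
}

// Made-up contact for illustration.
const contact = { name: "Ada Example", email: "aexample@example.org" };
```

With any score that depends on how much of the candidate the input covers, the combined string drags every score down, because the field that did not match still counts toward the candidate’s length.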
Now let’s consider the suggestions from noun_type_email. Here is what they originally looked like:
noun_type_email is based on a very robust regular expression for RFC 2822. Unfortunately this means that it completely rules out strings such as “bpung” which could be a proper prefix of an email address—something that I’ve advocated for avoiding before (see footnote 2 of Judging Noun Types). Moreover, due to a quirk of how nountypes based on regular expressions are scored, all results are given the score of 1.
I just committed a change to improve this behavior. The new version accepts strings which match the username part of the email address spec sans @ and domain, but with a large score penalty.3 Moreover, domains whose final label (the top-level domain) is not longer than one letter (unless the domain is an IP address), or which contain no periods (.), are penalized as well. Here’s what the same inputs produce now:
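As a rough sketch of the rules just described (not the actual results, and not the committed code): the patterns below are deliberately simplified and are not the RFC 2822 expression the nountype uses, and the penalty factors and the name `emailScore` are invented for the example.

```javascript
// Hypothetical sketch: these simplified patterns are NOT the RFC 2822
// expression the nountype uses, and the penalty factors are invented.
const LOCAL_RE = /^[A-Za-z0-9._%+-]+$/;
const DOMAIN_RE = /^[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*$/;
const IPV4_RE = /^\d{1,3}(\.\d{1,3}){3}$/;

function emailScore(input) {
  const at = input.indexOf("@");
  // No "@": accept a plausible username prefix, heavily penalized.
  if (at === -1) return LOCAL_RE.test(input) ? 0.1 : 0;

  const local = input.slice(0, at);
  const domain = input.slice(at + 1);
  if (!LOCAL_RE.test(local) || !DOMAIN_RE.test(domain)) return 0;

  let score = 1;
  if (!IPV4_RE.test(domain)) {
    const labels = domain.split(".");
    // No period in the domain, e.g. "bpung@m".
    if (labels.length < 2) score *= 0.5;
    // Final label of a single letter, e.g. "bpung@mozilla.c".
    if (labels[labels.length - 1].length < 2) score *= 0.5;
  }
  return score;
}
```

In this sketch the score climbs as the address is typed out: “bpung” is accepted with a low score, “bpung@m” and “bpung@mozilla.c” sit in between, and “bpung@mozilla.com” reaches 1.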
Same time, same channel
I hope this post sheds light on the many changes I made today as well as the thought process behind them. If you don’t agree with any particular fix or analysis, please comment! I’ll be back again tomorrow with another installment of Nountype Quirks. Stay tuned!
matchScore will be the subject of another blog post in the near future. ↩
Moreover, due to the way noun_type_contact caches the contact list internally, as long as GMail’s contacts are available once, you should be able to continue accessing those contacts’ suggestions after logging out of GMail. There are also great performance benefits to this caching. The downside is that we currently have no way to know when to clear the cache, so even if you update your contacts in GMail, those new contacts won’t appear in Ubiquity until you restart Firefox. ↩
Perhaps this is a horrible idea, because if executed or previewed, any verb which uses these nountypes would have to deal with arguments which are not valid email addresses. In my mind, though, as long as it doesn’t actually cause any error, this should be okay. Keep in mind that, given the very low scores given to these suggestions, parses using it would most likely only show up if the verb which requires these nountypes was explicitly given and there are other arguments as well, for example in input like “email hello to bpung”. In such a situation, we would rather this suggestion not disappear until we type “@m”. If executed, the built-in email verb, for instance, will deal with this gracefully by simply putting the incomplete email address in the To field. ↩