Nountype Quirks: Day 1

Jul 29, 2009

Today I began going through all of the nountypes built into Ubiquity, using the principles and criteria I laid out yesterday. It’s a task I’ve been planning for a while now. As I explained yesterday, improved suggestions and scoring from the built-in nountypes could directly translate to better and smarter suggestions, resulting in a better experience for all users. Here I’ll document some of the nountype quirks I’ve discovered so far and what remedy has been implemented or is planned.

Note: this blog post includes a number of graphs using HTML/CSS formatting. If you are reading this article through a feed reader or planet, I invite you to read it on my site.

`noun_type_percentage`

Here’s what a few different inputs originally returned:

input	suggestion	score
20	20%	1
20%	20%	1
0.2	20%	1
0.2%	20%	1
20.0	2000%	1
2 hens in the garden	2%	1

Let me highlight a couple of obvious quirks:

In certain cases, where the numerical expression includes a decimal and is less than one, it is interpreted as a proportion rather than a percentage, e.g. “0.2” → “20%”; “0.2%” is not even offered. This happens even when the input explicitly includes a % sign.
All suggestions, including those where the numeral was extracted from a long string of text (e.g. “2 hens in the garden”), get the same score of 1.

I just committed a fix so noun_type_percentage now…

Counts the number of characters in the input which match [\d.%] and caps the score by (number of acceptable characters)/(length of input).
Strings which do not include “%” get a 10% penalty.
In the case of decimals less than 1 without a % sign, the proportion interpretation is also suggested (e.g. “0.2” → “20%”) in addition to the original suggestion (“0.2%”), but with a slight penalty.

Here is what they now return:

input	suggestion	score
20	20%	0.9
20%	20%	1
0.2	0.2%	0.9
0.2	20%	0.81
0.2%	0.2%	1
20.0	20%	0.9
2 hens in the garden	2%	0.05
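
As a rough illustration of the three rules above, here is a minimal sketch in plain JavaScript. It is not the committed Ubiquity code: the function name `suggestPercentage`, the number-extraction regex, and the use of 0.9 as the “slight penalty” for the proportion reading are all assumptions, the last one chosen only because it reproduces the 0.81 in the table.

```javascript
// Sketch only: suggestPercentage is a hypothetical name, not Ubiquity's API.
function suggestPercentage(input) {
  const text = input.trim();
  // Pull out the first number-like run (assumed extraction rule).
  const match = text.match(/\d*\.?\d+/);
  if (!match) return [];
  const number = parseFloat(match[0]);

  // Cap: (characters matching [\d.%]) / (length of input).
  const acceptable = (text.match(/[\d.%]/g) || []).length;
  let score = acceptable / text.length;

  // Inputs without "%" get a 10% penalty.
  const hasPercentSign = text.includes("%");
  if (!hasPercentSign) score *= 0.9;

  const suggestions = [{ text: number + "%", score: score }];

  // Decimals below 1 without "%": also offer the proportion reading,
  // with a further penalty (0.9 here is an assumption).
  if (!hasPercentSign && match[0].includes(".") && number < 1) {
    // toPrecision trims floating-point noise from the multiplication.
    const asPercent = Number((number * 100).toPrecision(12));
    suggestions.push({ text: asPercent + "%", score: score * 0.9 });
  }
  return suggestions;
}
```

Under these assumptions “0.2” yields both “0.2%” and “20%”, while “0.2%” yields only “0.2%”. A long input such as “2 hens in the garden” ends up with a very small score because only one of its twenty characters is acceptable.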

`noun_type_tag`

Here’s what a few different inputs originally returned. Keep in mind that currently in this test profile, the preexisting tags are “animal”, “help”, “test”, and “ubiquity”.

input	suggestion	score
animal	animal	0.3
mineral	mineral	0.3
anim	animal	0.7
anim	anim	0.3
help, test, ubiq	help,test,ubiquity	0.7
help, test, ubiq	help,test,ubiq	0.3
google, yahoo, ubiq	google,yahoo,ubiquity	0.7
google, yahoo, ubiq	google,yahoo,ubiq	0.3
google, , yahoo	google,yahoo	0.3

Here are a few of noun_type_tag’s quirks:

There are only two scores ever given out: 0.3 and 0.7.
Only the last tag in the list and whether it exists or not is taken into account.
When the last tag is incomplete, the completion is suggested with a higher score, but if the last tag is exactly equal to an existing tag, it gets the lower score.

Ideally, we want noun_type_tag to look at each of the tags given to it, with higher scores for when there are more preexisting tags and fewer new ones. Keep in mind, though, that we only have to suggest the completion of the very last tag as that may be one where the user hasn’t completed typing yet… for earlier tags, we can assume (safely or not) that the user placed the comma where they meant to. We can’t teach Ubiquity to read minds, after all.¹

With this in mind, I just made a change to noun_type_tag which aims to follow these principles. The basic idea is that we start with a base score of 0.3 but then raise it by taking an nth root for every tag in the sequence which is preexisting. Here’s what the same inputs return now. Recall that the preexisting tags are “animal”, “help”, “test”, and “ubiquity”.

input	suggestion	score
animal	animal	0.55
mineral	mineral	0.3
anim	animal	0.55
anim	anim	0.3
help, test, ubiq	help,test,ubiquity	0.86
help, test, ubiq	help,test,ubiq	0.74
google, yahoo, ubiq	google,yahoo,ubiquity	0.55
google, yahoo, ubiq	google,yahoo,ubiq	0.3
google, , yahoo	google,yahoo	0.3
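
To make the root-per-preexisting-tag idea concrete, here is a small sketch. It assumes the root is a square root, which is an inference from the table (√0.3 ≈ 0.55, and applying it two or three times gives ≈ 0.74 and ≈ 0.86), not something stated in the post. The names `scoreTags` and `suggestTags` are made up for illustration.

```javascript
// Sketch only: names and the square-root choice are assumptions.
const BASE_SCORE = 0.3;

// Start at the base score and take a square root once per preexisting tag.
function scoreTags(tags, existingTags) {
  let score = BASE_SCORE;
  for (const tag of tags) {
    if (existingTags.includes(tag)) score = Math.sqrt(score);
  }
  return score;
}

function suggestTags(input, existingTags) {
  // Split on commas and drop empty entries such as the one in "google, , yahoo".
  const tags = input.split(",").map((t) => t.trim()).filter((t) => t !== "");
  if (tags.length === 0) return [];

  const head = tags.slice(0, -1);
  const last = tags[tags.length - 1];

  // Only the last tag is completed; earlier tags are taken as typed.
  const candidates = existingTags.filter((t) => t.startsWith(last));
  if (!candidates.includes(last)) candidates.push(last);

  return candidates
    .map((tag) => {
      const list = head.concat(tag);
      return { text: list.join(","), score: scoreTags(list, existingTags) };
    })
    .sort((a, b) => b.score - a.score);
}
```

With the four preexisting tags from the test profile, `suggestTags("help, test, ubiq", ...)` returns “help,test,ubiquity” ahead of “help,test,ubiq”, and an exact match such as “animal” no longer scores lower than a completion.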

`noun_type_awesomebar`

input	suggestion	score
moz	http://www.mozilla.com/	0.8
	https://wiki.mozilla.org/Labs/Ubiquity/ Parser_2_API_Conversion_Tutorial	0.8
	http://en-us.start3.mozilla.com/ firefox?client=firefox-a&rls= org.mozilla:en-US:official	0.8
	http://en-us.www.mozilla.com/en-US/firefox/about/	0.8

There are a couple of quirks here:

All suggestions are returned with the same score.
The nountype returns the URL of the entry as the HTML-formatted result and the title as the text-formatted result, which clearly does not make sense. However, it’s not clear to me whether the title, URL, or some combination of both is what we should be returning as the suggestion text presented to the user.²

I just rewrote noun_type_awesomebar to actually do some differential scoring. This new version also presents the URL or title depending on whichever had a better match using the matchScore function.³

input	suggestion	score
moz	www.mozilla.com	0.7
	https://wiki.mozilla.org/Labs/Ubiquity/ Parser_2_API_Conversion_Tutorial	0.63
	http://en-us.start3.mozilla.com/ firefox?client=firefox-a&rls= org.mozilla:en-US:official	0.61
	http://en-us.www.mozilla.com/en-US/firefox/about/	0.6
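
The “present whichever of the URL or title matched better” idea can be sketched as follows. The real matchScore function is not described in this post, so `toyMatchScore` below is a deliberately crude, made-up stand-in (fraction of the candidate covered by the query); it will not reproduce the scores in the table. `suggestFromHistory` and the shape of the history entries are likewise assumptions.

```javascript
// Hypothetical stand-in for matchScore: NOT the real function.
// Returns 0 if the query is absent, else query length / candidate length.
function toyMatchScore(query, candidate) {
  const q = query.toLowerCase();
  const c = candidate.toLowerCase();
  if (q.length === 0 || !c.includes(q)) return 0;
  return q.length / c.length;
}

// entries: array of { url, title } objects (assumed shape).
function suggestFromHistory(query, entries) {
  return entries
    .map(({ url, title }) => {
      const urlScore = toyMatchScore(query, url);
      const titleScore = toyMatchScore(query, title);
      // Show whichever field matched better, but always keep the URL as data.
      return titleScore > urlScore
        ? { text: title, url: url, score: titleScore }
        : { text: url, url: url, score: urlScore };
    })
    .filter((s) => s.score > 0)
    .sort((a, b) => b.score - a.score);
}
```

The point of the design is that the displayed text and the score both come from the same field, so the user sees the string that actually earned the ranking.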

`noun_type_url`

The purpose of noun_type_url’s suggest function is twofold: first, to accept strings which may look like a URL and, second, to suggest URLs from the history just like noun_type_awesomebar, but based only on URL matches and not title matches.⁴ Here are a few sample inputs:

input	suggestion	score
moz	http://www.mozilla.com/	0.9
	http://moz	0.5
	https://wiki.mozilla.org/Labs/Ubiquity/ Parser_2_API_Conversion_Tutorial	0.9
	http://en-us.start3.mozilla.com/ firefox?client=firefox-a&rls= org.mozilla:en-US:official	0.9
	http://en-us.www.mozilla.com/en-US/firefox/about/	0.9
test	http://test	0.5
http://	http://	0.5
http:	http:	0.5
http	http	0.5
_test	http://_test	0.5
hello world!	http://hello world!	0.5

Oh, where to begin!? Here are some initial quirks… it’s possible that you could think of more!

There is no differential scoring… only 0.9 for suggestions from history and 0.5 for URL-like strings.
A number of invalid domain names are being accepted and turned into suggestions (“hello world!”, “_test”, etc.).
It’s trying to be smart by suggesting “http://” as a default URI scheme, but it does so even for prefixes (initial substrings) of the word “http” itself.

With these thoughts in mind, I just took a first stab at improving this situation. Here are some features of the new implementation:

History entries are scored in the same way as in noun_type_awesomebar, using matchScore.
URLs without an explicit URI scheme (like “http://”) get a 10% penalty.
“http://” is only suggested if none of a long list of common URI schemes is detected.
It repairs schemes which are missing a slash or two, suggesting for example “http:hello.com” → “http://hello.com”.
It actually uses Firefox’s own IDNService to check if the domain name is a valid internationalized domain name. If it’s an IDN as opposed to LDH (“letters, digits, and hyphens”), it gets a 10% penalty. If it’s not even a valid IDN, it is ruled out (see last two example inputs below).
There are also penalties for only being a domain name with no path and for the domain not having any periods (.) in it.

Here is what our suggestions now look like:

input	suggestion	score
moz	http://www.mozilla.com/	0.6
	http://moz	0.65
	https://wiki.mozilla.org/Labs/Ubiquity/ Parser_2_API_Conversion_Tutorial	0.63
	http://en-us.start3.mozilla.com/ firefox?client=firefox-a&rls= org.mozilla:en-US:official	0.61
	http://en-us.www.mozilla.com/en-US/firefox/about/	0.6
test	http://test	0.65
http://	http://	1
http://	shttp://	0.75
http:	http://	0.9
http:	shttp://	0.7
http	http://	0.72
	https://	0.71
	shttp://	0.68
	http://http	0.65
_test	none
hello world!	none
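
The scheme handling described above (default “http://”, slash repair, and ruling out invalid domains) can be sketched roughly like this. Several things here are assumptions: the short `KNOWN_SCHEMES` list stands in for the “long list” mentioned above, the function name `normalizeUrl` is made up, and a plain LDH regex replaces Firefox’s IDNService, so this sketch rejects internationalized names instead of accepting them with a penalty. The path and no-period penalties, and partial inputs like “http” or “http:”, are not modeled.

```javascript
// Sketch only: a stand-in for the real implementation, with assumed names.
const KNOWN_SCHEMES = ["http", "https", "ftp"]; // hypothetical short list
const NO_SCHEME_PENALTY = 0.9; // the 10% penalty for a missing scheme

// LDH host: period-separated labels of letters, digits, and hyphens.
const LDH_HOST =
  /^[a-z0-9]([a-z0-9-]*[a-z0-9])?(\.[a-z0-9]([a-z0-9-]*[a-z0-9])?)*$/i;

// Returns { text, factor } or null if the host is ruled out.
function normalizeUrl(input) {
  const text = input.trim();
  // Scheme, colon, then zero to two slashes (so "http:hello.com" is repairable).
  const m = text.match(/^([a-z][a-z0-9+.-]*):\/{0,2}(.*)$/i);

  let scheme, rest, factor;
  if (m && KNOWN_SCHEMES.includes(m[1].toLowerCase())) {
    scheme = m[1].toLowerCase();
    rest = m[2];
    factor = 1;
  } else {
    // No recognized scheme: default to http with a penalty.
    scheme = "http";
    rest = text;
    factor = NO_SCHEME_PENALTY;
  }

  // Host is everything up to the first path, query, fragment, or port marker.
  const host = rest.split(/[\/?#:]/)[0];
  if (!LDH_HOST.test(host)) return null;

  return { text: scheme + "://" + rest, factor: factor };
}
```

For example, “http:hello.com” comes back as “http://hello.com” with no scheme penalty, “test” becomes “http://test” with the 10% penalty, and “_test” and “hello world!” are rejected.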

See you tomorrow~

Alright, enough nountype wrangling for one day. I’ll be back again tomorrow for another installment.

¹ If we could make assumptions about what tags look like (for example, that they are always pretty short or use certain character classes), we could use such factors to judge non-preexisting tags for “tagginess”. Unfortunately, it’s possible (though unlikely) that a user would prefer really long tag strings, and of course Firefox allows tags in any Unicode range. The only strings we can immediately rule out as impossible are ones which are purely whitespace.
² It’s unclear whether the method we’re using (nsIAutoCompleteSearch) is actually searching titles or not… it currently looks like it’s only looking at the URLs. Perhaps the title query is what we’re supposed to enter in the mystery parameter.
³ I hope to discuss the matchScore function in a separate blog post later.
⁴ While writing up this section I ran into a bug whereby, when both noun_type_awesomebar and noun_type_url are active, only one of their async callbacks from Utils.history.search is returned. Thus, if lucky, only one of the nountypes will return the history results and, if unlucky, the parse query will not complete. Filed as trac #845.