blog

Scoring for Optimization

Suppose you have a number of competing candidates, each of which can be ranked with a score, but it takes a little time to calculate each candidate’s score. You’re only interested in the top n candidates. You want to come up with a scoring scheme where you can throw the extra candidates out of consideration earlier without sacrificing quality. Such is the problem of scoring and ranking suggestions in Ubiquity. What properties must such a scoring system have?

This blog post includes a lot of complex CSS-formatted graphs which may be best viewed in — what else? — Firefox. You may also want to access this blog post directly rather than through a planet.

candidate 8  
candidate 2  
candidate 9  
candidate 3  
candidate 10 CUTOFF
candidate 5 
candidate 1 
candidate 7 
  

One portion of the problem description above merits clarification: I define “without sacrificing quality” to mean that, if we did not throw out any candidates early and waited until all the scores are computed fully and accurately, we would still yield the same top n winners. This already gives us the key insight towards an appropriate solution: we can only throw out candidates when we know that it has no further chance of making it up into top n candidates.

Let’s get formal

Let’s call S_{i}(t) the score of candidate C_{i} at time t in the derivation and we’ll assume that the score derivations are done in parallel with a unique origin (t=0).1 We’ll use the notation S_{i}(\infty) to represent the equilibrium or final score, equal to S_{i}(t) for all t > a certain t^{\prime} which exists for each candidate. This function S_{i} thus defines a time series for each candidate.

Given a set of candidates \left\{C_1,C_2,\ldots,C_k\right\}, we want to find the best subset of n candidates; that is, \left\{C_{i_1},C_{i_2},\ldots,C_{i_n}\right\} such that

\forall_{ i\in \{i_1,\dots,i_n\}, j\in \{1,\dots,k\}\setminus\{i_1,\ldots,i_n\}} S_{i}(\infty) \geq S_{j}(\infty).

Approach 1: A Threshold Model

The key insight above would naturally give us what I call the threshold model. Here, we require the score sequences to be non-increasing: \forall_{t < t^{\prime}} S_{i}(t) < S_{i}(t^{\prime}). This way, we can naturally throw out candidates which have reached below a certain threshold M (or attained a certain level of badness, you might say) which we can then be sure will never recover.

For example, suppose the following diagram represents the scores of five different candidates after the first four time steps of the derivation. (The full gray bar marks the initial score (S_i(0)) and the arrows indicate the successive score differentials.) The vertical line marks the threshold, M.

candidate 1     
candidate 2     
candidate 3     
candidate 4     
candidate 5     
  

We can tell after four steps that candidates 2 and 4, given that the score sequences are non-increasing, have no chance to finish their derivation with a score > M. What is important to note, however, is that candidate 4 already had no chance of beating the threshold after three steps. There was no need to calculate the fourth derivation of the score of candidate 4 (S_{4}(4)). In other words, after three steps, we could completely take candidate 4 out of the running and after another step, take candidate 2 out of the running.

t=2t=3t=4
C1   
C2   
C3   
C4   
C5   
  
C1    
C2    
C3    
C4    
C5    
  
C1     
C2     
C3     
C4  
C5     
  

This non-decreasing score approach was used in Ubiquity Parser 2 until just recently, and you can in fact still play with it on the online Ubiquity Parser TNG demo. In that version, every parse started with an initial score of 1 and every score factor would be a value between 0 and 1. Every score factor was multiplied onto the previous score throughout the derivation, making it trivially non-increasing.

The problem with this approach is how to choose a smart threshold and that, given a constant threshold, you may get a different number of results for every different candidate set (i.e. parser query). If your score indicates a meaningful value with an a priori specified target of acceptable values, having a threshold makes sense. In the case of Ubiquity, however, the interface expects a certain number of suggestions to be returned.2 If we plan to display five suggestions but the parser only returns four, even though there were other candidates, there must be a very good reason and justification for that threshold value.

Approach 2: Raising the Bar

The problem with Approach 1 was that there was no way of guaranteeing that we would yield our predefined n winning candidates. Even if at some point in the derivation we are left with n candidates still above the threshold, as the only restriction we have is that our score series are non-increasing, there is still a possibility that those remaining n candidates’ scores will drop below M later in the derivation.

We must instead at some point in the derivation identify (a) a set of at least n candidates which will not get “worse” in the derivation and (b) candidates which have no chance of overtaking the (a) candidates. In this situation we can safely throw out the (b) candidates.

One way to do this is to require that all the scores S_{i}(t) are bounded and non-decreasing. By virtue of being non-decreasing, our top candidates at any point in our derivation will never get “worse” afterwards, satisfying condition (a). If relatively early in the computation we can compute a bound B_i, we can identify candidates which will never surpass the top candidates in group (a) above, satisfying condition (b).

In the example below, n=2 and the thin bars mark the upper bounds B_i. At t=1 we can identify candidate 2 and 4 as being our top two candidates. Note that there is one candidate, candidate 5, whose upper bound B_5 is less than both S_2(1) and S_4(1). By definition S_5(\infty) \leq B_5 and because the scores are non-decreasing S_2(1) \leq S_2(\infty) and S_4(1) \leq S_4(\infty). Therefore

S_5(\infty) < S_2(\infty) and S_5(\infty) < S_4(\infty)

and we can thus throw out candidate 5 at this point. By the same logic, after t=2 we can throw candidate 2 out of the running.

t=1t=2t=3
C1   
C2   
C3   
C4   
C5   
  
C1    
C2    
C3    
C4    
C5  
  
C1     
C2  
C3     
C4     
C5  
  

Calling this the “raising the bar” method refers to the fact that, at any particular time t, the “bar” is min\left(\left\{\mbox{the }n\mbox{ greatest }S_{i}(t)\mbox{ values}\right\}\right) and every other candidate must have an upper bound B_j greater than the bar in order to not be thrown out of consideration. This “bar” itself is, together with the component scores, non-decreasing, decreasing the number of surviving candidates over time.

In the case of the Ubiquity parser we could build such a non-decreasing and bounded scoring model by using an additive model. As the main component of parser scoring is how well the parsed arguments match the verbs’ specified nountypes, we could simply add up all the confidence scores of each nountype suggestion, each of which are a value between 0 and 1. This would trivially be non-decreasing. As each parse has a finite and known number of parsed arguments, we could easily determine a bound as well. For example, say a parse S_0 has two arguments. Before we check each of the nountypes’ match scores, we already know that S_0(\infty) \leq 2 = B_0.

Unfortunately, there are also other factors which we would like to consider in our parses which may not fit into this non-decreasing model so easily…

Approach 2’: The Rising Sun Model3

One problem with both of the previous approaches is that it requires that the scoring schemes be either non-increasing or non-decreasing across the derivation. There are many situations, however, where you would want different factors to affect the score both positively and negatively. In the case of the Ubiquity parser, here are some different factors which could be good positive and negative score factors in computing the score of each parse.

positive factorsnegative factors
the verb’s specified nountype matching the argument noun wellhaving to suggest the verb
the verb in the input matching the verb wellmultiple arguments parsed for a single semantic role
the verb being used oftenthe verb missing some arguments

As we see, there are both positive and negative factors which we hope to consider in scoring our possible Ubiquity parses. They key to making this work is by noting that Approach 2 only requires that the scoring series be bounded and non-decreasing after a certain known time in the derivation. For example, even if a parse involves a number of decreases early in the parse derivation, if after a certain point we can be certain that it is non-decreasing and bounded, we can simply use that bound and start eliminating poor candidates at that time (in this example, after t=2).

              
0    5    10 

This is very much possible in the Ubiquity parser as, given the Ubiquity Parser 2 design, the negative factors such as whether the parse has a verb from the input or not (step 2), whether multiple arguments are identified with the same semantic role (step 4), and how many of the verb’s arguments are in the input (step 4) can be identified early on in the derivation, all before the very computationally intensive step of nountype detection (step 7) and argument suggestion (step 8). In this way, we can front-load all the negative factors in scoring and continue to use a version of Approach 2 to optimize our parsing.

We can moreover make the effect of the negative factors be felt across the entire derivation by figuring the negative factors into a factor between 0 and 1 and multiplying it onto each of the positive factors being added. In other words, we can compute all the negative factors into a single score multiplier \mu_i \in [0,1] earlier in the derivation and then afterwards when adding up each of the positive factors simply applying that score multiplier to the score derivation:

\mu_{i}(\mbox{positive factor 0}) + \mu_{i}(\mbox{positive factor 1}) + \ldots \mu_{i}(\mbox{positive factor }m).

This model is what is going on under the hood in Ubiquity Parser 2. The Parser.Parse class has a property called .scoreMultiplier which contains the score multiplier \mu_i as described above. A method called .getMaxScore() is implemented in addition to .getScore() so that, even before all of the nountype suggestion scores have been computed (e.g., in the case of asynchronous suggestions) .getMaxScore() can be used as an upper bound B_i and compared to the in-progress scores of other candidates and lower candidates can thus be taken out of consideration earlier in the parse process.

Conclusion

In this blog post I’ve laid out a few different iterations of approaches I’ve thought of on the problem of scoring and ranking Ubiquity suggestions in a smart way. While some of the basic mechanisms of front-loading the negative factors into a scoreMultiplier and the computation of the maxScore (or upper bound) have been implemented, the actual optimization algorithm described here of removing parses from consideration earlier in the parser query has yet to be implemented in Ubiquity Parser 2 and I look forward to seeing it in action. In addition, there are surely factors I haven’t considered in the scoring or further tricks to improve the optimized scoring algorithm. I’d love to get your feedback and ideas on this topic. Thanks!


  1. In the case of Ubiquity Parser 2, we’ll let the “time” values t refer to the “steps” in the derivation, as laid out in the Ubiquity Parser 2 design. Note that these “steps” are currently done in parallel across all candidates in the current architecture, making the “time” analogy legitimate. I will thus use integer time values here, making this a discrete-time model. 

  2. Every Ubiquity parser query takes as a parameter the maximum number of suggestions to be returned. See the latest parser query interface proposal for details on this interface. 

  3. This naming is an homage to the rising sun lemma of Frigyes Riesz which uses a similar logic. The apparent connection to the fact that I am Japanese is purely coincidental. 

Related posts:

  1. Scoring and Ranking Suggestions
  2. Ubiquity Parser: The Next Generation Demo
  3. This week on Ubiquity Parser: The Next Generation
  4. Talking Ubiquity in Japan: 拡張機能勉強会にて発表
  5. A Demonstration of Ubiquity Parser 2

Related posts brought to you by Yet Another Related Posts Plugin.

Tags: , , , , , , , , , , ,

If you enjoyed this post, make sure you subscribe to my RSS feed (optionally with tweets from my Twitter)!

4 Responses to “Scoring for Optimization”

  1. Gordon P. Hemsley Says:

    Tsk tsk, Mitcho.

    http://www.mouserunner.com/blog/?p=38

  2. mitcho Says:

    I stand corrected. :)

  3. Judging Noun Types Says:

    […] exact suggestion and 0.1 or so is a very very improbable suggestion.3 These scores are used in the scoring of parses. Because verbs specify certain nountypes for each of their arguments, the scores that individual […]

  4. neil Says:

    If I understand correctly, the third model sort of takes both option 1 and option 2 into account. So you can start sorting by bottom up and testing for a threshold. If you look at it that way, you make sure you don't set a threshold too low or too high, because the raise the bar method accounts for that right?

Leave a Reply

mitcho.com uses Markdown with Smarty Pants for simple, lightweight markup.


© 2006–2011 mitcho (Michael 芳貴 Erlewine).
Proudly powered by WordPress on Media Temple.
The views expressed on these pages are mine alone and do not
reflect those of my employers and clients, past and present.