20100216

Producing candidates for a spelling corrector

The next step to implement the Bloom Filter-based spelling corrector is to create and check hash values for candidates against the filter without actually creating candidate instances. If the filter step succeeds the candidate String instances are created in the next step.

All candidate iterators (for insertion, deletion, transposition and replacement) are based on the CandidateIterator trait. The CandidateIterator can return a hash value for a candidate with the hashValue method and a candidate string with the candidate method. The next method scrolls to the next value.
trait CandidateIterator{
  def next : Boolean
  def candidate : String
  def hashValue : Int
}
An instructive implementation is the replacement CandidateIterator. It replaces each character of the base string with a character from the alphabet to create a candidate.

The candidate method does exactly this for candidate strings. It takes the base string and replaces the character at the idx position with the character from the alphabet at the alphabetIdx position (the alphabet is defined as val alphabet = 'a' to 'z' toArray).
def candidate = (base take idx) + Dictionary.alphabet(alphabetIdx) + (base drop idx + 1)
The hashValue method is not as easily readable as the candidate method. The implementation is based on the insight that the hash function for Strings is the defined in a way that we can reused a base hash value for the a substring base.substring(0, idx). With that in place we can save about half of the hash computations. This has huge performance benefits over an implementation that works directly with Strings where hash values have to be computed from the scratch for every candidate. Analogous to the String copy costs the hashing costs increase nearly linearly with the String length. The performance benefits will be especially relevant for long Strings that will then produce a lot of candidate strings.

The next method takes care of the idx, alphabetIdx to point to valid positions. The baseHash value is calculated iteratively in the next method from the baseHash of the last round if the idx value changes.

The hashValue implementation is now based on the baseHash value. The hash is computed for the character taken from the alphabet and the remaining string base.substring(idx + 1, base.length).
class Replacement(base : String) extends BaseCandidateIterator(base){
  private var idx = 0
  private var alphabetIdx = -1
  private var baseHash = 0
  def next = {
    alphabetIdx += 1
    if(alphabetIdx == Dictionary.alphabet.length){
      idx += 1
      alphabetIdx = 0
      baseHash = hashAt(baseHash, idx - 1)
    }
    idx < base.length
  }
  def candidate = (base take idx) + Dictionary.alphabet(alphabetIdx) + (base drop idx + 1)
  def hashValue = {
    val pivot = hashFor(baseHash, Dictionary.alphabet(alphabetIdx))
    val right = hashRange(pivot, idx + 1, base.length)
    right
  }
}
The CandidateIterator let's us iterate over candidate hash values without creating objects. Externally to the CandidateIterators the filtering takes places calling the candidate to produce an instance method only when the hash value was in the filter. This work is done by the CandidateFilter.
class CandidateFilter(base : CandidateIterator, filter : Int => Boolean) extends Iterator[String]{
  private var more : Boolean = false
  def hasNext : Boolean = {
    def hasMore : Boolean = {
      while(base.next) if(filter(base.hashValue)) return true
      false
    }
    more ||= hasMore
    more
  }

  def next : String = {
    more = false
    base.candidate
  }
}
The class takes a CandidateIterator and a filter. It implements the Iterator trait. Values returned by the CandidateFilter are candidate strings that passed the Bloom Filter test.

Users of the CandidateFilter and CandidateIterator can now use the Iterator convenience methods and for expressions to work with candidate strings as usual.

With the CandidateIterator implementation and the CandidateFilter in place the spelling corrector can be implementation in the final step.

No comments:

Post a Comment