Sunday, April 17, 2011

How to Write a Spelling Corrector? From Python to C# with Dynamic Sugar

There is a famous Python code sample named How to Write a Spelling Corrector?, written by Peter Norvig in 21 line of code (No counting the white line).

And different people have written their own version in different languages, see at the end of the Norvig's page, including 3 different versions written in C# by Lorenzo Stoakes.

I thought it would interresting to see what kind of result we could get using C# and Dynamic Sugar.net.

My C# version is 69 lines of code including white line, the Norvig's version is 30 lines. A ratio of 2.3 which is actually pretty good, it tells me that using DynamicSugar.net allowed me to reduce the number of line.
Counting line is definitely not a very relevant indicator except that it give you a ball park.

So here it is

First the python code

import re, collections

def words(text): return re.findall('[a-z]+', text.lower()) 

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(file('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
   splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
   deletes    = [a + b[1:] for a, b in splits if b]
   transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
   replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
   inserts    = [a + c + b     for a, b in splits for c in alphabet]
   return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

My C# code

class SpellCorrector {

    Dictionary<string, int> DWORDS;
    const string            ALPHABET = "abcdefghijklmnopqrstuvwxyz";

    public SpellCorrector() {

        DWORDS = this.Train(this.Words(System.IO.File.ReadAllText("big.txt")));
    }
    List<string> Words(string text) {

        return Regex.Matches(text.ToLower(), "[a-z]+").ToList();
    }
    Dictionary<string, int> Train(List<string> features) {

        var model = new Dictionary<string, int>();
        foreach (var f in features)
            model[f] = model.Get(f, 0) + 1;
        return model;
    }
    class StringPair {

        public string a, b;
    }
    List<string> Edits1(string word) {
            
        var splits     = DS.Range(word.Length+1).Map(i => new StringPair { a = word.Slice(0,i), b = word.Slice(i) } );
        var deletes    = splits.Map(s => s.a + s.b.Slice(1));
        var transposes = splits.Map(s => s.b.Length > 1 ? s.a + s.b[1] + s.b[0] + s.b.Slice(2) : "**");
                    
        var replaces = new List<string>();
        foreach (var s in splits)
            foreach (var c in ALPHABET)
                if(s.b.Length>0)
                    replaces.Add(s.a + c + s.b.Slice(1, -1));

        var inserts = new List<string>();
        foreach (var s in splits)
            foreach (var c in ALPHABET)
                inserts.Add(s.a + c + s.b);
            
        return deletes.Add(transposes).Add(replaces).Add(inserts);
    }
    List<string> KnownEdits2(string word) {

        var l = new List<string>();
        foreach(var e1 in this.Edits1(word))
            foreach(var e2 in Edits1(e1))
                if(this.DWORDS.ContainsKey(e2))
                    l.Add(e2);
        return l;
    }
    List<string> Known(List<string> words){

        return words.Filter(w => this.DWORDS.ContainsKey(w));
    }
    public string Correct(string word){

        var candidateWords = this.Known(DS.List(word));
        if(candidateWords.Count==0)
            candidateWords = this.Known(this.Edits1(word));
        if(candidateWords.Count==0)
            candidateWords = this.KnownEdits2(word);
        if(candidateWords.Count==0)
            candidateWords = DS.List(word);

        return this.DWORDS.Max(candidateWords);         
    }
}

8 comments:

  1. wow.. Great code and very readable. Thank You!

    ReplyDelete
  2. Emad, Yes the all point of http://www.DynamicSugar.net is to write shorter and very readable C# code.

    ReplyDelete
  3. Hi

    Not able to compile above C# code in .NET 4.5
    Is there anything I'm missing here?

    ReplyDelete
  4. Install DynamicSugar via nuget
    (1) Menu Tools -> Library Package Manager -> Library Package Console
    (2) Install-package DynamicSugar
    (3) Paste the code in the file program.cs above the program class
    (4) Add this 2 using
    using System.Threading.Tasks;
    using DynamicSugar;

    it will compile.

    To run it you need the file big.txt, from http://norvig.com/big.txt.

    ReplyDelete
  5. Hello,

    First, thanks a lot for sharing your work, It's helping me a lot.

    I'm trying to correct the word ab-1234, while the right word would be ab-12345.

    I tried ALPHABET = "1234567890abcdefghijklmnopqrstuvwxyz-" and o f course, I put the word ab-12345 in the big.txt file, but it's not working.

    Coul you give me a help? I'm a newbie.

    ReplyDelete
  6. Hello! Do you have any licensing restrictions I should know about if I want to include a derivative of this code in my academic project?

    ReplyDelete
  7. This is a smart blog. I mean it. You have so much knowledge about this, and so much passion. You also know how to make people rally behind it, obviously from the responses.Well, I recommend a blog is my sentence correct for all students who wants to get best sentence corrector.

    ReplyDelete