And different people have written their own version in different languages, see at the end of the Norvig's page, including 3 different versions written in C# by Lorenzo Stoakes.
I thought it would interresting to see what kind of result we could get using C# and Dynamic Sugar.net.
My C# version is 69 lines of code including white line, the Norvig's version is 30 lines. A ratio of 2.3 which is actually pretty good, it tells me that using DynamicSugar.net allowed me to reduce the number of line.
Counting line is definitely not a very relevant indicator except that it give you a ball park.
So here it is
First the python code
import re, collections
def words(text): return re.findall('[a-z]+', text.lower())
def train(features):
model = collections.defaultdict(lambda: 1)
for f in features:
model[f] += 1
return model
NWORDS = train(words(file('big.txt').read()))
alphabet = 'abcdefghijklmnopqrstuvwxyz'
def edits1(word):
splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
deletes = [a + b[1:] for a, b in splits if b]
transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
inserts = [a + c + b for a, b in splits for c in alphabet]
return set(deletes + transposes + replaces + inserts)
def known_edits2(word):
return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)
def known(words): return set(w for w in words if w in NWORDS)
def correct(word):
candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
return max(candidates, key=NWORDS.get)
My C# code
class SpellCorrector {
Dictionary<string, int> DWORDS;
const string ALPHABET = "abcdefghijklmnopqrstuvwxyz";
public SpellCorrector() {
DWORDS = this.Train(this.Words(System.IO.File.ReadAllText("big.txt")));
}
List<string> Words(string text) {
return Regex.Matches(text.ToLower(), "[a-z]+").ToList();
}
Dictionary<string, int> Train(List<string> features) {
var model = new Dictionary<string, int>();
foreach (var f in features)
model[f] = model.Get(f, 0) + 1;
return model;
}
class StringPair {
public string a, b;
}
List<string> Edits1(string word) {
var splits = DS.Range(word.Length+1).Map(i => new StringPair { a = word.Slice(0,i), b = word.Slice(i) } );
var deletes = splits.Map(s => s.a + s.b.Slice(1));
var transposes = splits.Map(s => s.b.Length > 1 ? s.a + s.b[1] + s.b[0] + s.b.Slice(2) : "**");
var replaces = new List<string>();
foreach (var s in splits)
foreach (var c in ALPHABET)
if(s.b.Length>0)
replaces.Add(s.a + c + s.b.Slice(1, -1));
var inserts = new List<string>();
foreach (var s in splits)
foreach (var c in ALPHABET)
inserts.Add(s.a + c + s.b);
return deletes.Add(transposes).Add(replaces).Add(inserts);
}
List<string> KnownEdits2(string word) {
var l = new List<string>();
foreach(var e1 in this.Edits1(word))
foreach(var e2 in Edits1(e1))
if(this.DWORDS.ContainsKey(e2))
l.Add(e2);
return l;
}
List<string> Known(List<string> words){
return words.Filter(w => this.DWORDS.ContainsKey(w));
}
public string Correct(string word){
var candidateWords = this.Known(DS.List(word));
if(candidateWords.Count==0)
candidateWords = this.Known(this.Edits1(word));
if(candidateWords.Count==0)
candidateWords = this.KnownEdits2(word);
if(candidateWords.Count==0)
candidateWords = DS.List(word);
return this.DWORDS.Max(candidateWords);
}
}
wow.. Great code and very readable. Thank You!
ReplyDeleteEmad, Yes the all point of http://www.DynamicSugar.net is to write shorter and very readable C# code.
ReplyDeleteHi
ReplyDeleteNot able to compile above C# code in .NET 4.5
Is there anything I'm missing here?
Install DynamicSugar via nuget
ReplyDelete(1) Menu Tools -> Library Package Manager -> Library Package Console
(2) Install-package DynamicSugar
(3) Paste the code in the file program.cs above the program class
(4) Add this 2 using
using System.Threading.Tasks;
using DynamicSugar;
it will compile.
To run it you need the file big.txt, from http://norvig.com/big.txt.
Hello,
ReplyDeleteFirst, thanks a lot for sharing your work, It's helping me a lot.
I'm trying to correct the word ab-1234, while the right word would be ab-12345.
I tried ALPHABET = "1234567890abcdefghijklmnopqrstuvwxyz-" and o f course, I put the word ab-12345 in the big.txt file, but it's not working.
Coul you give me a help? I'm a newbie.