After listening to the last Stack Overflow podcast, Peter Norvig, the compact Pellon spell-checker intrigued me, so I decided to implement it in Scala if I could express it well in the functional Scala idiom, and also see how many lines of code it will take .
That is the whole problem. (Do not compare lines of code yet.)
(Two notes: you can run this in the Scala interpreter if you want. If you need a copy of the big.txt file or the entire project, on GitHub .)
import scala.io.Source val alphabet = "abcdefghijklmnopqrstuvwxyz" def train(text:String) = { "[az]+".r.findAllIn(text).foldLeft(Map[String, Int]() withDefaultValue 1) {(a, b) => a(b) = a(b) + 1} } val NWORDS = train(Source.fromFile("big.txt").getLines.mkString.toLowerCase) def known(words:Set[String]) = {Set.empty ++ (for(w <- words if NWORDS contains w) yield w)} def edits1(word:String) = { Set.empty ++ (for (i <- 0 until word.length) // Deletes yield (word take i) + (word drop (i + 1))) ++ (for (i <- 0 until word.length - 1) // Transposes yield (word take i) + word(i + 1) + word(i) + (word drop (i + 2))) ++ (for (i <- 0 until word.length; j <- alphabet) // Replaces yield (word take i) + j + (word drop (i+1))) ++ (for (i <- 0 until word.length; j <- alphabet) // Inserts yield (word take i) + j + (word drop i)) } def known_edits2(word:String) = {Set.empty ++ (for (e1 <- edits1(word); e2 <- edits1(e1) if NWORDS contains e2) yield e2)} def correct(word:String) = { val options = Seq(() => known(Set(word)), () => known(edits1(word)), () => known_edits2(word), () => Set(word)) val candidates = options.foldLeft(Set[String]()) {(a, b) => if (a.isEmpty) b() else a} candidates.foldLeft("") {(a, b) => if (NWORDS(a) > NWORDS(b)) a else b} }
In particular, I am wondering if it is possible to do something cleaner using the correct function. In the original Python implementation, the implementation is a bit clean:
def correct(word): candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word] return max(candidates, key=NWORDS.get)
Apparently, in Python, an empty set will be evaluated as Boolean False , so only the first candidate will be evaluated, returning a non-empty set, preserving the potentially costly calls edits1 and known_edits2 .
The only solution I came up with is the version you see here, where anonymous Seq functions are called until a non-empty Set is returned, which is guaranteed to be the last.
So experienced Scala-heads, is there a more syntactically concise or better way to do this? Thanks in advance!
python scala anonymous-function
Steven merrill
source share