One possible hash function can be (assuming only English words) a sorted count of the number of occurrences of each letter. Therefore, for an βanagram,β you must generate [('a', 3), ('g', 1), ('n', 1), ('m', 1), ('r', 1)].
Alternatively, you can get an inaccurate grouping by creating a bitmask from your word, where for bits 0-25 each bit represented the presence or absence of this letter (bit 0, representing βaβ to bit 25, representing βzβ). But then you will have to do a little more processing to separate each hashed group in order to distinguish, for example. "to" from "too".
Does any of these ideas help? Any specific implementation language in mind (could I do C ++, python or Scala)?
Edit: added example Scala code and output:
OK: I'm in Scala mode now, so I did something to do what you ask for, but (um) this may not be entirely clear if you are not familiar with Scala or functional programming.
Using a large list of English words from here: http://scrapmaker.com/data/wordlists/twelve-dicts/2of12.txt
I run this Scala code on them (it takes about 5 seconds using Scala 2.9 in script mode, including compilation time, with a dictionary of about 40,000 words. Not the most efficient code, but the first thing that came to mind).
// Hashing function to go from a word to a sorted list of letter counts def toHash(b:String) = b.groupBy(x=>x).map(v => (v._1, v._2.size) ).toList.sortWith(_._1 < _._1) // Read all words from file, one word per line val lines = scala.io.Source.fromFile("2of12.txt").getLines // Go from list of words to list of (hashed word, word) val hashed = lines.map( l => (toHash(l), l) ).toList // Group all the words by hash (hence group all anagrams together) val grouped = hashed.groupBy( x => x._1 ).map( els => (els._1, els._2.map(_._2)) ) // Sort the resultant anagram sets so the largest come first val sorted = grouped.toList.sortWith( _._2.size > _._2.size ) for ( set <- sorted.slice(0, 10) ) { println( set._2 ) }
This unloads the first 10 sets of anagrams (the set with the most members):
List(caret, cater, crate, react, trace) List(reins, resin, rinse, risen, siren) List(luster, result, rustle, sutler, ulster) List(astir, sitar, stair, stria, tarsi) List(latrine, ratline, reliant, retinal) List(caper, crape, pacer, recap) List(merit, miter, remit, timer) List(notes, onset, steno, stone) List(lair, liar, lira, rail) List(drawer, redraw, reward, warder)
Note that this uses the first sentence (list of letter counts) a no more complex bitmask method.
Edit 2: you can replace the hash function with simple sorting on the characters of each word (as suggested by JAB) and get the same result with a clearer / faster code:
def toHash(b:String) = b.toList.sortWith(_<_)