Get list of anagrams from dictionary

Question

Basically, anagrams are rearrangements of a string's letters. E.g. stack, sackt and stakc are all anagrams of stack (even though those words don't make sense). In any case, you get the idea.

Now I want the list of anagrams given a million words, or, say, all the words in a dictionary.

My main question is: how do I find the total number of unique anagrams in a dictionary?

Sorting and comparing will not work as the time complexity is pretty bad.

I thought of using a hash table, with a string as the key.

But the problem is: what should the hash function be? It would be helpful if some pseudocode were provided. Other approaches that are better than the ones mentioned would also be useful.

Thanks.

+5

data-structures hash anagram

vijay Jun 19 '12 at 20:05

5 answers

One possible hash function (assuming only English words) is a list of the number of occurrences of each letter, sorted by letter. So for the word "anagram" you would generate [('a', 3), ('g', 1), ('m', 1), ('n', 1), ('r', 1)].
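As a rough illustration of the letter-count key, here is a minimal Python sketch. The function name `letter_count_key` is hypothetical, and the code assumes the words are already lowercase.

```python
from collections import Counter

def letter_count_key(word):
    """Return a hashable key: (letter, count) pairs sorted by letter."""
    return tuple(sorted(Counter(word).items()))

print(letter_count_key("anagram"))
# (('a', 3), ('g', 1), ('m', 1), ('n', 1), ('r', 1))
print(letter_count_key("stack") == letter_count_key("sackt"))  # True
```

The key is returned as a tuple rather than a list so that it can be used directly as a dictionary key.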

Alternatively, you can get an approximate grouping by creating a bitmask from your word, where each of bits 0-25 represents the presence or absence of one letter (bit 0 representing "a" through bit 25 representing "z"). But then you will have to do a little more processing to split each hashed group further, in order to distinguish, for example, "to" from "too".

Do any of these ideas help? Do you have a specific implementation language in mind (I could do C++, Python or Scala)?
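A minimal Python sketch of the bitmask idea follows. The names `letter_mask` and `group_anagrams_by_mask` are hypothetical, and the second pass (splitting each bucket by sorted letters) is only one assumed way of doing the extra processing mentioned above.

```python
from collections import defaultdict

def letter_mask(word):
    """Set bit 0 for 'a' through bit 25 for 'z'; repeated letters are lost."""
    mask = 0
    for ch in word.lower():
        if 'a' <= ch <= 'z':
            mask |= 1 << (ord(ch) - ord('a'))
    return mask

def group_anagrams_by_mask(words):
    """Bucket words by mask, then split each bucket by sorted letters."""
    buckets = defaultdict(list)
    for w in words:
        buckets[letter_mask(w)].append(w)
    groups = defaultdict(list)
    for mask, bucket in buckets.items():
        for w in bucket:
            # Assumed second pass: the sorted letters tell "to" from "too".
            groups[(mask, "".join(sorted(w)))].append(w)
    return list(groups.values())

print(letter_mask("to") == letter_mask("too"))      # True: the mask alone collides
print(group_anagrams_by_mask(["to", "too", "ot"]))  # [['to', 'ot'], ['too']]
```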

Edit: added example Scala code and output:

OK: I'm in Scala mode now, so I put something together that does what you ask, but (um) it may not be entirely clear if you are not familiar with Scala or functional programming.

Using a large list of English words from here: http://scrapmaker.com/data/wordlists/twelve-dicts/2of12.txt

I ran this Scala code on it (it takes about 5 seconds using Scala 2.9 in script mode, including compilation time, with a dictionary of about 40,000 words; it is not the most efficient code, but it was the first thing that came to mind).

 // Hashing function to go from a word to a sorted list of letter counts
 def toHash(b: String) = b.groupBy(x => x).map(v => (v._1, v._2.size)).toList.sortWith(_._1 < _._1)
 // Read all words from file, one word per line
 val lines = scala.io.Source.fromFile("2of12.txt").getLines
 // Go from list of words to list of (hashed word, word)
 val hashed = lines.map(l => (toHash(l), l)).toList
 // Group all the words by hash (hence group all anagrams together)
 val grouped = hashed.groupBy(x => x._1).map(els => (els._1, els._2.map(_._2)))
 // Sort the resultant anagram sets so the largest come first
 val sorted = grouped.toList.sortWith(_._2.size > _._2.size)
 for (set <- sorted.slice(0, 10)) {
   println(set._2)
 }

This prints the first 10 sets of anagrams (the sets with the most members):

 List(caret, cater, crate, react, trace)
 List(reins, resin, rinse, risen, siren)
 List(luster, result, rustle, sutler, ulster)
 List(astir, sitar, stair, stria, tarsi)
 List(latrine, ratline, reliant, retinal)
 List(caper, crape, pacer, recap)
 List(merit, miter, remit, timer)
 List(notes, onset, steno, stone)
 List(lair, liar, lira, rail)
 List(drawer, redraw, reward, warder)

Note that this uses the first suggestion (the list of letter counts), not the more complex bitmask method.

Edit 2: you can replace the hash function with a simple sort of the characters of each word (as suggested by JAB) and get the same result with clearer, faster code:

 def toHash(b:String) = b.toList.sortWith(_<_)

+2

Alex wilson Jun 19 '12 at 20:17

If you XOR the hash code values of all the characters, and then XOR the result with the input length, you will get the same value regardless of the order of the letters, which means that all anagrams will produce the same hash. (XORing in the length prevents "boss" and "bo" from returning the same value, because the hash of "s" XORed with itself is always 0.)

Example:

 int AnagramHash(string input)
 {
     int output = 0;
     // XOR is commutative, so the order of the characters does not matter
     foreach (char c in input)
         output ^= c.GetHashCode();
     return output ^ input.Length;
 }

The hash is not unique to one set of anagrams, so you still have to look through all the words with the same AnagramHash. I would add a hash field to the dictionary table (whichever algorithm you use) to reduce the overall computation.

EDIT: Also, as an extra note, XOR is one of the simplest operations an ALU performs, so if you end up using it, you can generate your hashes quickly.
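The XOR hash in this answer can be sketched in Python as follows. This is only an illustration: the name `anagram_hash` is hypothetical, and it uses the character code point from `ord` in place of C#'s `GetHashCode`, which is an assumption rather than an exact port.

```python
def anagram_hash(word):
    """XOR the code points of all characters, then XOR in the length."""
    result = 0
    for ch in word:
        result ^= ord(ch)
    return result ^ len(word)

print(anagram_hash("stack") == anagram_hash("sackt"))  # True: order does not matter
print(anagram_hash("boss") == anagram_hash("bo"))      # False: the length differs
print(anagram_hash("aa"), anagram_hash("bb"))          # 2 2
```

The last line shows why words that share a hash still need to be compared: "aa" and "bb" are not anagrams, yet both hash to 2, because each pair of equal letters cancels out.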

+1

Steve konves Jun 19 '12 at 20:33

Sorting and comparing will not work as the time complexity is pretty bad.

To trade time complexity for additional memory, just store the letter counts of the word in a 26-char array (or the equivalent in whatever language you use, assuming the Latin alphabet and only alphabetic characters) and hash the array. You are stuck with O(n) time relative to word length, but most English words are not that long.

E.g. stack, sackt and stakc will all have an array with the positions for s, t, a, c and k set to 1, and all the rest set to 0.

Based on your comment, which implies that you are actually fine with sorting the characters of a word, as long as you are not sorting the words themselves, you could do something even simpler than Alex's answer: just sort the characters in the word string and hash the result. (larsmans said this first, but didn't post it as an answer, so ...)
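A minimal Python sketch of the 26-slot count array, under the same assumptions as the answer (lowercase Latin letters only). The name `count_key` is hypothetical.

```python
def count_key(word):
    """Return a 26-element tuple of letter counts for a lowercase a-z word."""
    counts = [0] * 26
    for ch in word:
        counts[ord(ch) - ord('a')] += 1
    return tuple(counts)  # tuples are hashable, lists are not

print(count_key("stack") == count_key("sackt") == count_key("stakc"))  # True
print(count_key("to") == count_key("too"))                             # False
```

Building the key takes one pass over the word, so the cost is linear in the word length, with no sorting involved.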

0

Jab Jun 19 '12 at 20:18

Use a hashmap with a string as the key and a list of strings as the value, where the list contains all the anagrams of the key string.

The question is similar to "find all anagrams of a word in a file".

View the algorithm and code here: http://justprogrammng.blogspot.com/2012/06/determine-anagrams-of-word-in-file.html
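One way to read this suggestion is sketched below in Python. The function names are hypothetical, and using the word's sorted letters as the key is an assumption borrowed from the other answers. The sketch also counts the keys that hold at least two words, which is one possible interpretation of "number of unique anagrams".

```python
from collections import defaultdict

def group_anagrams(words):
    """Map each sorted-letter key to the list of words sharing it."""
    groups = defaultdict(list)
    for word in words:
        groups["".join(sorted(word))].append(word)
    return groups

def count_anagram_sets(words):
    """Count the keys that have at least two words."""
    return sum(1 for members in group_anagrams(words).values() if len(members) > 1)

words = ["caret", "cater", "crate", "stone", "notes", "zebra"]
print(group_anagrams(words)["acert"])  # ['caret', 'cater', 'crate']
print(count_anagram_sets(words))       # 2
```

Only the letters inside each word are sorted here, never the dictionary itself, so the whole pass stays linear in the number of words.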

0

sachin Jun 22 '12 at 15:52

wildplasser · Accepted Answer · 2012-06-20T10:07:19+0000

The obvious solution is to map each character to a prime and multiply primes. So, if "a" → 2 and "b" → 3, then

'ab' → 6
'ba' → 6
'bab' → 18
'abba' → 36
'baba' → 36

To minimize the chance of overflow, the smallest primes can be assigned to the most frequent letters (e, t, i, a, n). Note: the 26th prime is 101.
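The prime-product key can be sketched in Python as below. The names `PRIMES` and `prime_key` are hypothetical, and the default alphabetical assignment (a → 2, b → 3, ...) simply follows the example above; a frequency-ordered alphabet could be passed instead. Python integers do not overflow, so the overflow concern applies to fixed-width integer types in other languages.

```python
# The first 26 primes; the 26th is 101.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

def prime_key(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Multiply the primes assigned to each letter of a lowercase word."""
    prime_of = dict(zip(alphabet, PRIMES))
    product = 1
    for ch in word:
        product *= prime_of[ch]
    return product

print(prime_key("ab"), prime_key("ba"))      # 6 6
print(prime_key("bab"))                      # 18
print(prime_key("abba"), prime_key("baba"))  # 36 36
```

Because every integer has a unique prime factorization, two words get the same product exactly when they contain the same letters the same number of times, so no second comparison pass is needed.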

UPDATE: here you can find the implementation
