I thought I'd share this slightly interesting trick. It needs more code than the other solutions and isn't exactly "pythonic", but judging by the timings others have posted it should be pretty fast.
We do a little preprocessing to speed up the lookups. The basic approach: assign each letter of the alphabet a prime number, e.g. A = 2, B = 3, and so on. Then compute a hash for each word in the dictionary, which is simply the product of the prime representations of the characters in the word, and store each word in a dictionary indexed by its hash.
Now if we want to know which words are equivalent to, say, "textbook", we only have to compute the same hash for that word and look it up in our dictionary. Usually (say, in C++) we would have to worry about overflow, but in Python it is even simpler: every word stored under the same hash contains exactly the same characters.
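A minimal sketch of the prime-hash idea described above (the names here are my own for illustration): map each letter to a prime and multiply. Since multiplication is commutative, any two words built from the same multiset of letters get the same hash.

```python
# First 26 primes, one per letter A..Z.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

def prime_hash(word):
    # Product of the primes for each character; order of letters is irrelevant.
    product = 1
    for char in word.upper():
        product *= PRIMES[ord(char) - ord('A')]
    return product

print(prime_hash("LISTEN") == prime_hash("SILENT"))  # True: same letters
print(prime_hash("LISTEN") == prime_hash("ENLIST"))  # True
print(prime_hash("CAT") == prime_hash("DOG"))        # False
```

Uniqueness follows from the fundamental theorem of arithmetic: equal products imply equal multisets of prime factors, hence equal multisets of letters.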
Here is the code, with one small optimization: in our case we only have to worry about characters that can actually appear in a word, which means we can get away with a far smaller prime table than otherwise (the obvious further optimization would be to assign primes only to the characters that appear in the given word at all; it was fast enough anyhow, so I didn't bother, and this way the preprocessing is done once and can be reused for several words). The prime-hash trick is useful surprisingly often, so it is worth knowing ;)
```python
from collections import defaultdict

def gen_primes(count):
    # Yield the first `count` primes by trial division (the original relied
    # on "some arbitrary prime generator"; this is a simple stand-in).
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
            yield candidate
        candidate += 1

PRIMES = list(gen_primes(256))

def get_dict(path):
    res = defaultdict(list)
    with open(path, "r") as file:
        for line in file:
            word = line.strip().upper()
            res[compute_hash(word)].append(word)
    return res

def compute_hash(word):
    word_hash = 1
    for char in word:
        try:
            word_hash *= PRIMES[ord(char) - ord(' ')]
        except IndexError:
            # contains some character out of range - always 0 for our purposes
            return 0
    return word_hash

def get_result(path, given_word):
    words = get_dict(path)
    given_word = given_word.upper()
    result = set()
    # all subsequences of the given word (order doesn't matter for the hash)
    powerset = lambda x: powerset(x[1:]) + [x[:1] + y for y in powerset(x[1:])] if x else [x]
    for word in (word for word in powerset(given_word) if len(word) >= 4):
        for equiv in words[compute_hash(word)]:
            result.add(equiv)
    return result

if __name__ == '__main__':
    path = "dict.txt"
    given_word = "textbook"
    result = get_result(path, given_word)
    print(result)
```
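For completeness, here is a hypothetical sketch of the "smaller prime table" optimization mentioned above (not part of the original code; `make_hasher` and `SMALL_PRIMES` are names I made up): assign primes only to the characters of the given word, so the table stays tiny and any candidate containing a foreign character hashes to 0 immediately.

```python
SMALL_PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]

def make_hasher(given_word):
    # One prime per distinct character of the given word.
    table = {}
    for char in set(given_word.upper()):
        table[char] = SMALL_PRIMES[len(table)]

    def small_hash(word):
        product = 1
        for char in word.upper():
            prime = table.get(char)
            if prime is None:
                # character never occurs in the given word: can't match
                return 0
            product *= prime
        return product

    return small_hash

hasher = make_hasher("textbook")
print(hasher("boot"))   # nonzero: every letter occurs in "textbook"
print(hasher("zebra"))  # 0: 'Z' and 'R' and 'A' don't occur in "textbook"
```

Note this still only checks which characters occur, not how often; the dictionary lookup by exact hash is what guarantees the full multiset matches.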
Running against my ubuntu wordlist (98k words) it is pretty fast, though not what I'd call pythonic, since it is basically a port of the C++ algorithm. Useful if you want to compare several words this way.
Voo