Algorithm for the largest word formed from the elements of the conversion table

I want to write an algorithm for the following problem scenario

Accepting the names of the periodic table names, find the largest word that can be formed? Symbols, such as Na , Ne , etc., should be considered as separate elements.

This was asked in a famous interview with the company. Can anyone help me solve this.

+7
algorithm word
source share
6 answers

How to express this word as a chemical compound? A dynamic programming solution is proposed here. The important line is "Let progress [i] be the solution to the subtask for the word [: i]." The rest follows.

 elements = "H He Li Be BCNOF Ne Na Mg Al Si PS Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe Cs Ba La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt Au Hg Tl Pb Bi Po At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr Rf Db Sg Bh Hs Mt Ds Rg Cn Uut Fl Uup Lv Uus Uuo".split() def decompose(word): """Express given word as chemical compound. If there are multiple solutions, return one of minimal weight.""" progress = [False for x in range(len(word)+1)] # solution for word[:i] progress[0] = [] for i in range(1, len(word)+1): possibles = list() for j in range(max(i-3,0), i): if progress[j] == False: continue alchemical = word[j:i].title() if alchemical in elements: possibles.append(progress[j] + [alchemical]) if possibles: # choose minimal solution progress[i] = min(possibles, key=len) if progress[-1] == False: return False return "".join(progress[-1]) assert decompose("sine") == "SiNe" # rather than SI-Ne assert decompose("bismuth") == "BiSmUTh" assert decompose("jam") == False 

https://gist.github.com/hickford/750135db633832913703


Ran is the whole dictionary, the longest connections were:

  • non-representativeness of NoNRePReSeNTaTiONaLiSm
  • thermophosphorescence ThErMoPHOsPHoReSCeNCe
  • bronchoesophagoscopy BrONCHoEsOPHAgOsCoPY
  • hyperphosphorescence of HYPErPHOsPHoReSCeNCe
  • hypsibrachycephalism HYPSiBrAcHYCePHAlISm
+1
source share

I think the best way is to check every word in the dictionary and see if this can be done from the names of the elements. Checking every permutation of elements would be more difficult to program and would be less efficient.

While I agree, it is easier to create combinations, there are too many of them, and, as you said, tend to infinity if you do not give a limit. Producing a word with characters would be a little more complicated and clever, but I don't think it would be too complicated.

eg. when you get a word, you can search for elements that look for elements that can make up your word, and then using this set of elements, try and fill in the letters from beginning to end. Obviously, it gets harder with elements that are not two letters and words that are odd in length.

In fact, you can use a kind of graph. Say you have silicon. You can start with the letter "S" or "SI", which are in the table. From there, select "SI" because it is closer to your solution. If "SI" does not lead to your decision, you will have to go back and see if "S" will work.

Thus, it works as a kind of search for depth at the beginning.

+3
source share

Creating all the words and checking their existence is impractical. There are 118 ^ L words of length L, a function that is growing too fast. 1,643,032 three characters of the word!

Another way, as Makoto suggests, is much more efficient. Try to recover each word from the dictionary. There are about 250,000 English words.

Checking the word is simple: see if any character of the element matches the beginning of the word, and continue with the remaining characters.

You should try all possible matches, since the match may hide something else. (I mean that the word ABC can be matched by A and then stuck because BC is not available, but it may be that AB and C are.) Thus, the matching function will be recursive. I do not expect an exponential explosion in this matching process.

For maximum efficiency, you will save the symbols in the trie structure http://en.wikipedia.org/wiki/Trie .

Final hint: since you just need to find the longest match, try the words, reduce the length and stop in the first match.

Here is a simple solution in Python without optimization. Matching is performed from right to left to allow printing of a sequence of characters in a post-order:

 Words= ['K', 'NaH', 'NaNaNaNa', 'HaKuNa'] Symbols= ['Na','K','H'] def Match(Word): # Match the end of the word with every symbol for S in Symbols: # In case of a partial match, recurse on the prefix if S == Word or (S == Word[-len(S):] and Match(Word[:-len(S)])): print S, return True # No match, report failure return False for W in Words: if Match(W): print >>> K Na H Na Na Na Na 
+1
source share

To create all possible words is a naive approach.

Based on Generate all permutations of all lengths :

 import itertools.. symbols = ['Na','K','H'] for i in range(len(symbols)): for word in itertools.permutations(symbols,i+1): print( ''.join(word) ) 

You can create any possible combination and check it for a dictionary, if it is a keyword. But it is inefficient and only because you do not allow repetition of characters.

Check if a word can be built from characters - still not perfect.

If you allow repetitions, you need to check the word for a list of characters. I suggest the following:

 import itertools.. words = ['K', 'NaH', 'NaNaNaNa', 'HaKuNa'] symbols = ['Na','K','H'] for i in range(len(symbols)): for word in itertools.permutations(symbols,i+1): print( ''.join(word) ) def can_be_built(word): pos = 0 ret = True while(pos < len(word)): #following loop checks if word starting form `pos` #can be prefixed by a symbol symbol_match = False for symbol in symbols: if word[pos:pos+len(symbol)] == symbol: symbol_match = True break if symbol_match == False: print('Word `{}` has no match for symbol from: {}'.format(word, word[pos:])) ret = False break #if it does move forward pos += len(symbol) return ret for word in words: print("{} can be built? {}".format(word, can_be_built(word))) 

Iteratively checks if the word prefix is ​​a character, then moves forward until the end of the word is reached.

Output:

 K can be built? True NaH can be built? True NaNaNaNa can be built? True Word `HaKuNa` has no match for symbol from: aKuNa HaKuNa can be built? False 

This is not perfect yet.

The right approach

According to Makoto, the prefix check should return a list of all possible matches. The algorithm should make a queue of these matches and check all possible paths. This would be like creating a schedule of prefixes matching the word. And if you are building a whole word, you are at home.

I think that it is still quite easy to fix my second example, but I miss the real code. I think this is a good place to start.

0
source share

[EDIT: This post is just a complete explanation of the DP algorithm mentioned by Niklas B. in the comments on other posts.]

Testing a specific word in linear time

In a word that can be formed from the names of chemical elements, at least one of the following must be true:

  • The last letter is the one-letter name of the chemical element. And the prefix consisting of all letters, with the exception of the last one, is in itself a word that can be formed from the names of chemical elements.
  • The last two letters are the two-letter name of the chemical element. The prefix consisting of all letters, with the exception of the last two, is itself a word that can be formed from the names of chemical elements.

This indicates a good DP algorithm, where for a given word w of length k, we calculate for all 1 <= i <= k the largest number of letters (or chemical symbols, the question is ambiguous regarding exactly what we are trying to maximize!) From which we can construct a length prefix -i from w. Let me call it f (i) and say that f (i) = -1 if the prefix of length-iw cannot be created from the names of chemical elements at all.

Let X = 1 to maximize the names of the elements or 2 for the maximum number of letters.

Recursion for DP can be written as:

  • one (i) = if (w [i] is the 1-letter name of the element) {f (i-1) + 1} else {-1}
  • two (i) = if ("w [i-1] w [i]" is the 2-letter name of the element) {f (i-2) + X} else {-1}
  • f (i) = -1 if i <0
  • f (0) = 0
  • f (i) = max (one (i), two (i)) if i> 0

Then f (k) will be what we want to know: the maximum number (letters / elements) from which w can be formed, or -1 if this is not possible.

The maximum value () means that if only one of the methods for processing the end of the prefix works, this method will be selected (since the other method will have a value of -1, which will always be less compared), and if none of the methods works, we will correctly get f (i) = -1.

Assuming constant testing of whether single characters or pairs of characters are real names of chemical elements (using, for example, arrays or hash tables), we can calculate whether a given word can be represented as a sequence of names of chemical elements in time and space in proportion to its length .

Given all the words

If we maximize the number of letters, then it probably makes sense to process the dictionary in decreasing order of length, since for the first time we come across a word for which f (k) is not -1, we know this should be the longest, and we can stop.

This can be adapted to maximize the number of element names. Although we cannot stop right away because it may be a shorter word, nevertheless it can be formed from more element names (in particular, using more single-character), we still get useful information whenever we find a new one a word with more elements than the previous one: we can update the cutoff threshold with this number of elements and stop as soon as we see a word that has fewer letters than this threshold, since a word with a length of -m can never be composed of more than m element names.

There is another approach that may turn out to be faster: first sorting the dictionary in alphabetical order, we can avoid rearranging f (i) in the prefix that is shared with the previous word. For example. if carton immediately follows cart in the dictionary, then we only need to calculate f (i) in the on part.

0
source share

Sorry, I am very confused by these answers. Of course, the obvious answer is to sort the dictionary in alphabetical order using nested bucket sortings to a certain depth, for example, 8 letters, this should give you an order of 1 search for words starting with a given sequence of letters up to 8.

Then go through the period tables that make DFS, like this game tree. At each step, you add an item and then check the list to see if there are any words starting with this combination.

This will be relatively intense with memory, since you will probably need at least 6 layers of buckets (in alphabetical order by the first six letters), but usually just 1,000,000 words. Since almost every branch in the game tree ends after four letters, you will quickly search for DFS in the entire game tree. If you can form each word in a dictionary, it should still be the case that there are no more than the same number of branches than there are words, and in practice you can only get a part of the word int it dictionary, so this approach should be effective.

-one
source share

All Articles