Finding the best estimation method for a genetic algorithm

I am currently trying to solve hard Challenge # 151 on reddit using a non-invasive method, a genetic algorithm.

In short, by dividing the string into consonants and vowels and deleting spaces , I need to assemble it without knowing which character will be first.

hello world is split into hllwrld and eoo and needs to be reassembled. One solution, for example, would be hlelworlod , but that doesn't make much sense. A comprehensive approach that uses all possible solutions works, but is not possible for more complex sets of tasks.

What i already have

  • English word frequency database
  • An algorithm that creates a relative cost database using the Zipf law and can sequentially separate words from sentences without spaces (borrowed from this question / answer
  • A method that pushes consonants and vowels on a stack and randomly accepts a single character and encodes it in a string consisting of 1 and 2 , effectively encoding the construct in gene . The correct gene for an example would be 1211212111
  • A method that mutates such a string by randomly changing characters around

What i tried

Generating 500 random sequences using the infer_spaces() method and evaluating suitability with the cost of all words, taking the best 25% and changing 4 new ones, works for small lines, but very often falls into local minima, especially for longer sequences. Hello World was found in the first generation, thisisnotworkingverygood (which is correctly divided and costs 41.223 ) converges to th iss n ti wo or king v rye good (270 cost) in the second generation.

What I need

Obviously, estimated cost as a valuation method only works to separate sentences that are grammatically correct, and not for this genetic algorithm. Do you have any better ideas that I could try? Or is this another part of the solution, such as introducing gene , the problem?

+7
python algorithm genetic-algorithm
source share
3 answers

I would simplify the task in two parts,

  • Search for candidate words for breaking a string into (so hllwrld => hll wrld)
  • How then to expand these words by adding vowels.

First I will take a dictionary of vocabulary frequencies and process it to create a second list of words without vowels, as well as a list of possible vowel lists for each collapsed word (and corresponding frequency). Technically, you don’t need a GA to solve this problem (and I think it would be easier to solve without it), but as you asked, I will give 2 answers:

Without GA: you must solve the first problem using the first depth search, matching the substrings of the word with this dictionary and doing this with the remaining parts of the word, only taking word-for-word sections (without vowels) where all the words are in the second dictionary. Then you should replace the vowels. Given that the second dictionary and section that you already have, this should be easy. You can also use the vowel list to further limit the separation, since valid words in sections can only be made whole using vowels from the list of vowels that are entered into the algorithm. Starting from the left side of the line and iterating over all valid sections in depth in the first order, you should solve this problem relatively quickly.

With GA . To solve this problem with GA, I would create a dictionary of words without vowels. Then, using GA, create binary strings (like your chromosomes) of the same length as the input consonant string, where 1 = split the word at this position and 0 = leave it unchanged. These lines will have the same length. Then create a fitness function that returns the proportion of words obtained after performing the separation using the chromosome, which are real words without vowels, according to this dictionary. Create a second fitness function that accepts real words without vowels and calculates the proportion of overlap between the vowels that are absent in all these virtual words without vowels, and the original list of vowels. Combine both fitness functions into one, multiplying the value from the first by ten (provided that the second returns a value from 0 to 1). This will force the algorithm to first focus on the problem of segmentation and the second problem of inserting vowels, and will also facilitate segmentation, which are of the same quality, but prefer those that have a closer set of missing vowels in the original list. I would also include a cross in the solution. Since all of your chromosomes are the same length, this should be trivial. When you have a solution that perfectly evaluates the fitness function, then it should be trivial to recreate the original sentence given by this dictionary of non-vowel words (provided that you support a second dictionary that lists possible missing vowels for each unspoken word - for each be multiple, as some vowels will be the same as deleted vowels.

+5
source share

Let's say you have several generations, and you are building value for the best sample in each generation (we are considering a long offer). Does the graph go down or converge after 2-3 generations to a certain value (let the algorithm work, for example, for 10 generations)? Can you run your algorithm several times with different initial conditions (random sequences) and see if you sometimes get good results or not?

Depending on the results, you might try the following (this graph is a really good tool for increasing productivity):

1) If you have a chart that grows up and down too much - you have too many mutations (average number of swaps per gene, for example), try reducing it.

2) If you are stuck in a local minimum (the cost of the best sample will not change significantly after a while), try to increase the mutation or run several isolated populations (3-4) of, say, 100 species at the beginning of your algorithm for several generations. Then select the best population (which is closer to the global minimum) and try to improve it as much as possible with the help of a mutation

PS: By the way, an interesting problem, I tried to figure out how to use a crossover to improve the algorithm, but did not understand it

+1
source share

Fitness functionality is key to the success of the GA algorithm (for which I agree here).

I agree with @Simon that splitting vowels is not that important. just open your text body to remove the vowels.

what is important in fitness:

  • matching word frequency (more common words)
  • Grammar - sentence structure (for which you may need to use NLTK to get related information)

and don't forget to update the end result ^^

+1
source share

All Articles