I would simplify the task in two parts,
- Search for candidate words for breaking a string into (so hllwrld => hll wrld)
- How then to expand these words by adding vowels.
First I will take a dictionary of vocabulary frequencies and process it to create a second list of words without vowels, as well as a list of possible vowel lists for each collapsed word (and corresponding frequency). Technically, you donβt need a GA to solve this problem (and I think it would be easier to solve without it), but as you asked, I will give 2 answers:
Without GA: you must solve the first problem using the first depth search, matching the substrings of the word with this dictionary and doing this with the remaining parts of the word, only taking word-for-word sections (without vowels) where all the words are in the second dictionary. Then you should replace the vowels. Given that the second dictionary and section that you already have, this should be easy. You can also use the vowel list to further limit the separation, since valid words in sections can only be made whole using vowels from the list of vowels that are entered into the algorithm. Starting from the left side of the line and iterating over all valid sections in depth in the first order, you should solve this problem relatively quickly.
With GA . To solve this problem with GA, I would create a dictionary of words without vowels. Then, using GA, create binary strings (like your chromosomes) of the same length as the input consonant string, where 1 = split the word at this position and 0 = leave it unchanged. These lines will have the same length. Then create a fitness function that returns the proportion of words obtained after performing the separation using the chromosome, which are real words without vowels, according to this dictionary. Create a second fitness function that accepts real words without vowels and calculates the proportion of overlap between the vowels that are absent in all these virtual words without vowels, and the original list of vowels. Combine both fitness functions into one, multiplying the value from the first by ten (provided that the second returns a value from 0 to 1). This will force the algorithm to first focus on the problem of segmentation and the second problem of inserting vowels, and will also facilitate segmentation, which are of the same quality, but prefer those that have a closer set of missing vowels in the original list. I would also include a cross in the solution. Since all of your chromosomes are the same length, this should be trivial. When you have a solution that perfectly evaluates the fitness function, then it should be trivial to recreate the original sentence given by this dictionary of non-vowel words (provided that you support a second dictionary that lists possible missing vowels for each unspoken word - for each be multiple, as some vowels will be the same as deleted vowels.
Simon
source share