Recursively split strings containing a specific set of prefixes - Python

If I have a list of prefix that can be attached to a string, how can I split a string, such as a prefix, and other characters in the next substring. For instance:

prefixes = ['over','under','re','un','co'] str1 = "overachieve" output: ["over","achieve"] str2 = "reundo" output = ["re","un","do"] 

Is there a better way to accomplish the above task, perhaps with regex or some string functions besides:

 str1 = "reundo" output = [] for x in [p for p in prefixes if p in str1]: output.append(x) str1 = str1.replace(x,"",1) output.append(str1) 
+6
source share
5 answers

Regular expressions are an effective way to find many alternative prefixes:

 import re def split_prefixes(word, prefixes): regex = re.compile('|'.join(sorted(prefixes, key=len, reverse=True))) result = [] i = 0 while True: mo = regex.match(word, i) if mo is None: result.append(word[i:]) return result result.append(mo.group()) i = mo.end() >>> prefixes = ['over', 'under', 're', 'un', 'co'] >>> for word in ['overachieve', 'reundo', 'empire', 'coprocessor']: print word, '-->', split_prefixes(word, prefixes) overachieve --> ['over', 'achieve'] reundo --> ['re', 'un', 'do'] empire --> ['empire'] coprocessor --> ['co', 'processor'] 
+5
source

I would use the str.startswith method

 for p in prefixes: if str1.startswith(p): output.append(p) str1 = str1.replace(p, '', 1) output.append(str1) 

The biggest flaw in your code is that strings such as 'found' will ['un', 'fod'] .

However, if you have a hypothetical string 'reuncoundo' , you will need to 'reuncoundo' over the list more than once.

 while True: if not any(str1.startswith(i) for i in prefixes): output.append(str1) break for p in prefixes: if str1.startswith(p): output.append(p) str1 = str1.replace(p, '', 1) 

It outputs ['re', 'un', 'co', 'un', 'do']

+1
source
 prefixes = ['over','under','re','un','co'] def test(string, prefixes, existing=None): prefixes.sort(key = lambda s: len(s)) prefixes.reverse() # This and the previous line ensure that longer prefixes are searched first regardless of initial sorting. if existing is None: existing = [] # deals with the fact that placing [] as a default parameter and modifying it modifies it for the entire session for prefix in prefixes: if string.startswith(prefix): existing.append(prefix) return test(string[len(prefix):], prefixes, existing) existing.append(string) return existing 

This code goes through the line recursively, removing known prefixes, until it ends, and then returns the entire list. On longer lines, the generator is probably the best route, but on shorter lines the absence of the need for additional generator overhead can make this a better solution.

+1
source

Bearing in mind the proverb “two problems”, I would still say that this is work for regular expression. Regular commands are compiled into final machines, which check all possible options in parallel, and not one by one.

Here's an implementation that uses the following:

 import re def split_string(string, prefixes): regex = re.compile('|'.join(map(re.escape, prefixes))) # (1) while True: match = regex.match(string) if not match: break end = match.end() yield string[:end] string = string[end:] if string: yield string # (2) prefixes = ['over','under','re','un','co'] assert (list(split_string('recouncoundo',prefixes)) == ['re','co','un','co','un','do']) 

Notice how the regular expression in (1) is built:

  • prefixes are escaped using re.escape , so special characters do not interfere
  • escaped prefixes are combined using the | (or) regex
  • all information is compiled.

Line (2) gives the final word if any of those remaining after the separation prefixes. You may need to remove the if string check if string you want the function to return an empty string if there is nothing left after removing the prefix.

Also note that re.match (unlike re.search ) only searches for the pattern at the beginning of the input line, so there is no need to add ^ to the regular expression.

+1
source

If you are dealing with prefixes, you do not need a regex, you only need startswith() . Of course, you can use a regex, but it's harder to read and maintain, even for simple. startswith() easier in my opinion.

And the other answers seem too complicated for such a simple problem. I would suggest a recursive function like this:

 def split_prefixes (word, prefixes): split = [p for p in prefixes if word.startswith(p)] if split: return split + split_prefixes (word[len(split[0]):], prefixes) else: return [word] 

This is the result:

 "overachieve" -> ['over', 'achieve'] "reundo" -> ['re', 'un', 'do'] "reuncoundo" -> ['re', 'un', 'co', 'un', 'do'] "empire" -> ['empire'] 
+1
source

All Articles