Is there a generator version of string.split() in Python?

str.split() returns a list instance. Is there a version that returns a generator instead? Are there any reasons against having a generator version?

+92
python generator string
Oct 05 '10
15 answers

It is very likely that re.finditer uses fairly minimal memory overhead.

    import re

    def split_iter(string):
        return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

Demo:

    >>> list(split_iter("A programmer's RegEx test."))
    ['A', "programmer's", 'RegEx', 'test']

edit: I just confirmed that this takes constant memory in Python 3.2.1, assuming my testing methodology was correct. I created a very large string (1 GB or so), then walked through the iterable with a for loop (NOT a list comprehension, which would have generated extra memory). This did not produce a noticeable growth of memory (that is, if there was any growth, it was far, far less than the 1 GB string).
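A minimal sketch of that kind of memory check (my illustration, not the author's actual harness; it assumes a POSIX system, where resource.getrusage reports the peak resident set size):

    import re
    import resource  # POSIX-only assumption; the answer does not name a measuring tool

    def split_iter(string):
        return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

    big = "blah " * (200 * 1024 * 1024)  # roughly a 1 GB string
    before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    count = 0
    for word in split_iter(big):  # a for loop, NOT list(), so nothing accumulates
        count += 1
    after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(count, after - before)  # the peak-RSS growth should stay far below 1 GB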

+54
Mar 19 '12 at 12:41

The most efficient way I can think of is to write one using the offset parameter of the str.find() method. This avoids lots of memory use, and avoids the overhead of a regex when it is not needed.

[edit 2016-8-2: updated this to optionally support regex separators]

    import re

    def isplit(source, sep=None, regex=False):
        """
        generator version of str.split()

        :param source: source string (unicode or bytes)
        :param sep: separator to split on.
        :param regex: if True, will treat sep as regular expression.
        :returns: generator yielding elements of string.
        """
        if sep is None:
            # mimic default python behavior
            source = source.strip()
            sep = "\\s+"
            if isinstance(source, bytes):
                sep = sep.encode("ascii")
            regex = True
        if regex:
            # version using re.finditer()
            if not hasattr(sep, "finditer"):
                sep = re.compile(sep)
            start = 0
            for m in sep.finditer(source):
                idx = m.start()
                assert idx >= start
                yield source[start:idx]
                start = m.end()
            yield source[start:]
        else:
            # version using str.find(), less overhead than re.finditer()
            sepsize = len(sep)
            start = 0
            while True:
                idx = source.find(sep, start)
                if idx == -1:
                    yield source[start:]
                    return
                yield source[start:idx]
                start = idx + sepsize

It can be used however you like ...

    >>> print list(isplit("abcb", "b"))
    ['a', 'c', '']

While a little searching within the string is done each time find() or slicing is performed, this cost should be minimal, since strings are represented as contiguous arrays in memory.
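To illustrate the point (my example, not from the answer): passing a start offset to str.find scans the original string in place, whereas slicing first would build a large temporary string on every step.

    s = "a," * 1_000_000

    s.find(",", 500_000)   # scans in place from the offset; no new string is created
    s[500_000:].find(",")  # builds a ~1.5 MB temporary slice before searching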

+11
Mar 19 '12 at 15:38

This is a generator version of split() implemented via re.search() that does not have the problem of allocating too many substrings.

    import re

    def itersplit(s, sep=None):
        exp = re.compile(r'\s+' if sep is None else re.escape(sep))
        pos = 0
        while True:
            m = exp.search(s, pos)
            if not m:
                if pos < len(s) or sep is not None:
                    yield s[pos:]
                break
            if pos < m.start() or sep is not None:
                yield s[pos:m.start()]
            pos = m.end()

    sample1 = "Good evening, world!"
    sample2 = " Good evening, world! "
    sample3 = "brackets][all][][over][here"
    sample4 = "][brackets][all][][over][here]["

    assert list(itersplit(sample1)) == sample1.split()
    assert list(itersplit(sample2)) == sample2.split()
    assert list(itersplit(sample3, '][')) == sample3.split('][')
    assert list(itersplit(sample4, '][')) == sample4.split('][')

EDIT: Fixed handling of surrounding whitespace when no separator is specified.

+8
Oct 05 '10

Here is my implementation, which is much, much faster and more complete than the other answers here. It has 4 separate subfunctions for different cases.

I'll just copy the docstring of the main str_split function here:




 str_split(s, *delims, empty=None) 

Split the string s by the rest of the arguments, possibly omitting empty parts (the empty keyword argument is responsible for this). This is a generator function.

When only one delimiter is supplied, the string is simply split by it. empty is then True by default.

    str_split('[]aaa[][]bb[c', '[]') -> '', 'aaa', '', 'bb[c'
    str_split('[]aaa[][]bb[c', '[]', empty=False) -> 'aaa', 'bb[c'

When multiple delimiters are supplied, the string is split by the longest possible sequences of those delimiters by default, or, if empty is set to True, empty strings between the delimiters are also included. Note that the delimiters in this case may only be single characters.

    str_split('aaa, bb : c;', ' ', ',', ':', ';') -> 'aaa', 'bb', 'c'
    str_split('aaa, bb : c;', *' ,:;', empty=True) -> 'aaa', '', 'bb', '', '', 'c', ''

When no delimiters are supplied, string.whitespace is used, so the effect is the same as str.split(), except that this function is a generator.

 str_split('aaa\\t bb c \\n') -> 'aaa', 'bb', 'c' 



    import string

    def _str_split_chars(s, delims):
        "Split the string `s` by characters contained in `delims`, including the \
    empty parts between two consecutive delimiters"
        start = 0
        for i, c in enumerate(s):
            if c in delims:
                yield s[start:i]
                start = i + 1
        yield s[start:]

    def _str_split_chars_ne(s, delims):
        "Split the string `s` by longest possible sequences of characters \
    contained in `delims`"
        start = 0
        in_s = False
        for i, c in enumerate(s):
            if c in delims:
                if in_s:
                    yield s[start:i]
                    in_s = False
            else:
                if not in_s:
                    in_s = True
                    start = i
        if in_s:
            yield s[start:]

    def _str_split_word(s, delim):
        "Split the string `s` by the string `delim`"
        dlen = len(delim)
        start = 0
        try:
            while True:
                i = s.index(delim, start)
                yield s[start:i]
                start = i + dlen
        except ValueError:
            pass
        yield s[start:]

    def _str_split_word_ne(s, delim):
        "Split the string `s` by the string `delim`, not including empty parts \
    between two consecutive delimiters"
        dlen = len(delim)
        start = 0
        try:
            while True:
                i = s.index(delim, start)
                if start != i:
                    yield s[start:i]
                start = i + dlen
        except ValueError:
            pass
        if start < len(s):
            yield s[start:]

    def str_split(s, *delims, empty=None):
        """\
        Split the string `s` by the rest of the arguments, possibly omitting
        empty parts (`empty` keyword argument is responsible for that).
        This is a generator function.

        When only one delimiter is supplied, the string is simply split by it.
        `empty` is then `True` by default.
            str_split('[]aaa[][]bb[c', '[]')
                -> '', 'aaa', '', 'bb[c'
            str_split('[]aaa[][]bb[c', '[]', empty=False)
                -> 'aaa', 'bb[c'

        When multiple delimiters are supplied, the string is split by longest
        possible sequences of those delimiters by default, or, if `empty` is set to
        `True`, empty strings between the delimiters are also included. Note that
        the delimiters in this case may only be single characters.
            str_split('aaa, bb : c;', ' ', ',', ':', ';')
                -> 'aaa', 'bb', 'c'
            str_split('aaa, bb : c;', *' ,:;', empty=True)
                -> 'aaa', '', 'bb', '', '', 'c', ''

        When no delimiters are supplied, `string.whitespace` is used, so the effect
        is the same as `str.split()`, except this function is a generator.
            str_split('aaa\\t  bb c \\n')
                -> 'aaa', 'bb', 'c'
        """
        if len(delims) == 1:
            f = _str_split_word if empty is None or empty else _str_split_word_ne
            return f(s, delims[0])
        if len(delims) == 0:
            delims = string.whitespace
        delims = set(delims) if len(delims) >= 4 else ''.join(delims)
        if any(len(d) > 1 for d in delims):
            raise ValueError("Only 1-character multiple delimiters are supported")
        f = _str_split_chars if empty else _str_split_chars_ne
        return f(s, delims)

This function works in Python 3, and a simple, though rather ugly, fix can be applied to make it work in both versions 2 and 3. The first lines of the function should be changed to:

    def str_split(s, *delims, **kwargs):
        """...docstring..."""
        empty = kwargs.get('empty')
+5
Oct 06

I did some performance testing of the various methods proposed (I won't repeat it here). Some results:

  • str.split (default) = 0.3461570239996945
  • manual search (per character) (one of Dave Webb's answers) = 0.8260340550004912
  • re.finditer (ninjagecko's answer) = 0.698872097000276
  • str.find (one of Eli Collins's answers) = 0.7230395330007013
  • itertools.takewhile (Ignacio Vazquez-Abrams's answer) = 2.023023967998597
  • str.split(..., maxsplit=1) recursion = N/A †

† The recursion answers (str.split with maxsplit=1) fail to complete in a reasonable time. Given str.split's speed, they may work better on shorter strings, but then I can't see the use case for short strings where memory isn't an issue anyway.

Tested with timeit on:

 the_text = "100 " * 9999 + "100" def test_function( method ): def fn( ): total = 0 for x in method( the_text ): total += int( x ) return total return fn 

This raises the separate question of why str.split is so much faster despite its memory usage.

+5
Feb 21 '17 at 16:51

No, but it should be simple enough to write one using itertools.takewhile().

EDIT:

Very simple, half-baked implementation:

    import itertools
    import string

    def isplitwords(s):
        i = iter(s)
        while True:
            r = []
            for c in itertools.takewhile(lambda x: x not in string.whitespace, i):
                r.append(c)
            else:
                if r:
                    yield ''.join(r)
                    continue
                else:
                    # PEP 479: raising StopIteration inside a generator is an
                    # error since Python 3.7, so return to end the generator
                    return
+3
Oct 05 '10

I don't see any obvious benefit to a generator version of split(). The generator object would have to contain the whole string to iterate over, so you are not going to save any memory by having a generator.

If you want to write one, it would be pretty easy:

    import string

    def gsplit(s, sep=string.whitespace):
        word = []
        for c in s:
            if c in sep:
                if word:
                    yield "".join(word)
                    word = []
            else:
                word.append(c)
        if word:
            yield "".join(word)
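For instance (my demo, not the author's):

    >>> list(gsplit("Good evening, world!"))
    ['Good', 'evening,', 'world!']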
+3
Oct 05 '10 at 8:53

I wrote a version of @ninjagecko's answer that behaves more like str.split (i.e. whitespace-delimited by default, and you can specify a delimiter).

    import re

    def isplit(string, delimiter=None):
        """Like string.split but returns an iterator (lazy)

        Multiple character delimiters are not handled.
        """
        if delimiter is None:
            # Whitespace delimited by default
            delim = r"\s"
        elif len(delimiter) != 1:
            raise ValueError("Can only handle single character delimiters", delimiter)
        else:
            # Escape, in case it is "\", "*" etc.
            delim = re.escape(delimiter)
        return (x.group(0) for x in re.finditer(r"[^{}]+".format(delim), string))

Here are the tests I used (in both Python 3 and Python 2):

    # Wrapper to make it a list
    def helper(*args, **kwargs):
        return list(isplit(*args, **kwargs))

    # Normal delimiters
    assert helper("1,2,3", ",") == ["1", "2", "3"]
    assert helper("1;2;3,", ";") == ["1", "2", "3,"]
    assert helper("1;2 ;3, ", ";") == ["1", "2 ", "3, "]

    # Whitespace
    assert helper("1 2 3") == ["1", "2", "3"]
    assert helper("1\t2\t3") == ["1", "2", "3"]
    assert helper("1\t2 \t3") == ["1", "2", "3"]
    assert helper("1\n2\n3") == ["1", "2", "3"]

    # Surrounding whitespace dropped
    assert helper(" 1 2 3 ") == ["1", "2", "3"]

    # Regex special characters
    assert helper(r"1\2\3", "\\") == ["1", "2", "3"]
    assert helper(r"1*2*3", "*") == ["1", "2", "3"]

    # No multi-char delimiters allowed
    try:
        helper(r"1,.2,.3", ",.")
        assert False
    except ValueError:
        pass

Python's regex module says it does "the right thing" for Unicode whitespace, but I haven't actually tested it.
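A quick check one could run (my example; in Python 3, \s in a str pattern matches Unicode whitespace such as the no-break space):

    >>> list(isplit("1\u00a02\u00a03"))  # U+00A0, NO-BREAK SPACE
    ['1', '2', '3']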

Also available as a gist.

+3
Apr 17 '15 at 11:43

If you would also like to be able to read an iterator (as well as return one), try this:

    import itertools as it

    def iter_split(string, sep=None):
        sep = sep or ' '
        groups = it.groupby(string, lambda s: s != sep)
        return (''.join(g) for k, g in groups if k)

Usage:

    >>> list(iter_split(iter("Good evening, world!")))
    ['Good', 'evening,', 'world!']
+3
Jan 08 '16 at 12:54

I wanted to show how to use the finditer solution to return a generator for given delimiters, and then use the pairwise recipe from itertools to build a previous/next iteration that extracts the actual words, as the original split method does.




    import re
    from more_itertools import pairwise

    string = "dasdha hasud hasuid hsuia dhsuai dhasiu dhaui d"
    delimiter = " "

    # split according to the given delimiter, including segments at the
    # very beginning and the very end of the string
    for prev, curr in pairwise(re.finditer("^|[{0}]+|$".format(delimiter), string)):
        print(string[prev.end(): curr.start()])



Note:

  • I use prev and curr instead of prev and next because shadowing the built-in next in Python is a very bad idea.
  • It is quite efficient.
+2
Dec 18 '17 at 2:54

more_itertools.split_at offers an analogue of str.split for iterators.

    >>> import more_itertools as mit
    >>> list(mit.split_at("abcdcba", lambda x: x == "b"))
    [['a'], ['c', 'd', 'c'], ['a']]
    >>> "abcdcba".split("b")
    ['a', 'cdc', 'a']

more_itertools is a third-party package.
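Note that when the input is a string, split_at yields lists of characters rather than strings; joining each chunk recovers the str.split result:

    >>> ["".join(chunk) for chunk in mit.split_at("abcdcba", lambda x: x == "b")]
    ['a', 'cdc', 'a']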

+2
Jan 22 '18 at 6:21
    def split_generator(f, s):
        """
        f is a string, s is the substring we split on.
        This produces a generator rather than a possibly
        memory intensive list.
        """
        i = 0
        j = 0
        while j < len(f):
            if i >= len(f):
                yield f[j:]
                j = i
            elif f[i] != s:
                i = i + 1
            else:
                # yield the part itself (the original wrapped this one yield
                # in a list, which was inconsistent with the other yields)
                yield f[j:i]
                j = i + 1
                i = i + 1
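A quick demo (mine, not the author's). Note that, despite the docstring saying "substring", the comparison f[i] != s only behaves as intended for single-character separators:

    >>> list(split_generator("a,b,,c", ","))
    ['a', 'b', '', 'c']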
+1
Mar 11 '13 at 19:36

For me, at least, the need arose from processing files, which are themselves used as generators.

This is the version I made in preparation for huge files whose blocks of text are separated by empty lines (it would need to be thoroughly tested for corner cases before use in a production system):

    from __future__ import print_function

    def isplit(iterable, sep=None):
        r = ''
        for c in iterable:
            r += c
            if sep is None:
                if not c.strip():
                    r = r[:-1]
                    if r:
                        yield r
                        r = ''
            elif r.endswith(sep):
                r = r[:-len(sep)]
                yield r
                r = ''
        if r:
            yield r

    def read_blocks(filename):
        """read a file as a sequence of blocks separated by empty line"""
        with open(filename) as ifh:
            for block in isplit(ifh, '\n\n'):
                yield block.splitlines()

    if __name__ == "__main__":
        for lineno, block in enumerate(read_blocks("logfile.txt"), 1):
            print(lineno, ':')
            print('\n'.join(block))
            print('-' * 40)

        print('Testing skip with None.')
        for word in isplit('\tTony \t Jarkko \n Veijalainen\n'):
            print(word)
0
Oct 26 '11 at 11:58

Here is a simple answer:

    def gen_str(some_string):
        j = 0
        guard = len(some_string) - 1
        for i, s in enumerate(some_string):
            if s == '\n':
                yield some_string[j:i]
                j = i + 1
            elif i != guard:
                continue
            else:
                yield some_string[j:]
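This splits on newlines only; for example (my demo):

    >>> list(gen_str("line one\nline two\nline three"))
    ['line one', 'line two', 'line three']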
0
Feb 06 '19 at 16:54

You can easily build one using str.split with a limit:

    def isplit(s, sep=None):
        while s:
            parts = s.split(sep, 1)
            if len(parts) == 2:
                s = parts[1]
            else:
                s = ''
            yield parts[0]

This way you don't need to replicate strip()'s functionality and behaviour (e.g. when sep=None), and it relies on split's probably fast native implementation. I assume that str.split stops scanning the string for separators once it has enough "parts".

As Glenn Maynard points out, this does not scale well for large strings (O(n^2)). I confirmed it with timeit tests.
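A minimal sketch of the kind of timing run that exposes the quadratic growth (my illustration: each s.split(sep, 1) copies the entire remainder, so the time should roughly quadruple as the input doubles):

    import timeit

    for n in (5000, 10000, 20000):
        text = "a " * n
        t = timeit.timeit(lambda: sum(1 for _ in isplit(text, " ")), number=1)
        print(n, t)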

-1
Oct 05 '10