Break an iterator into a predicate

I need a method that can split Iterator[Char] into strings (separated by \n and \r )

For this, I wrote a general method that receives an iterator and a predicate, and will split the iterator each time the predicate is true. This is similar to a span , but will break every time the predicate is true, not just the first time

this is my implementation:

 def iterativeSplit[T](iterO: Iterator[T])(breakOn: T => Boolean): Iterator[List[T]] = new Iterator[List[T]] { private var iter = iterO def hasNext = iter.hasNext def next = { val (i1,i2) = iter.span(el => !breakOn(el)) val cur = i1.toList iter = i2.dropWhile(breakOn) cur } }.withFilter(l => l.nonEmpty) 

and it works well on small inputs, but on large inputs it works very slowly, and sometimes I get an exception

here is the code that recreates the problem:

 val iter = ("aaaaaaaaabbbbbbbbbbbccccccccccccc\r\n" * 10000).iterator iterativeSplit(iter)(c => c == '\r' || c == '\n').length 

stack trace during run:

 ... at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847) at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:615) at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847) at scala.collection.Iterator$$anon$18.hasNext(Iterator.scala:591) at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847) at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:615) at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847) at scala.collection.Iterator$$anon$18.hasNext(Iterator.scala:591) at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847) at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:615) at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847) at scala.collection.Iterator$$anon$18.hasNext(Iterator.scala:591) at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847) ... 

looking at the source code (I use scala 2.10.4) line 591 is hasNext second iterator from span , and line 651 is hasNext in the iterator of dropWhile

I think I'm misusing these 2 iterators, but I don't understand why

+5
source share
1 answer

You can simplify your code as follows, which seems to solve the problem:

  def iterativeSplit2[T](iter: Iterator[T])(breakOn: T => Boolean): Iterator[List[T]] = new Iterator[List[T]] { def hasNext = iter.hasNext def next = { val cur = iter.takeWhile(!breakOn(_)).toList iter.dropWhile(breakOn) cur } }.withFilter(l => l.nonEmpty) 

Instead of using span (so you need to replace iter with every next call), just use takeWhile and dropWhile in the original iter . Then there is no need for var .

I think the reason for your overflow of the source file is that repeatedly calling span creates a long chain of Iterator s, each of the hasNext methods calls the hasNext its Iterator parent. If you look at the source code for Iterator , you will see that each span creates new iterators that redirect calls to hasNext to the original iterator (via BufferedIterator , which further increases the call stack).

Update by contacting the documentation , it seems that although my solution above seems to work, it is not recommended - see in particular:

It is especially important to note that, unless otherwise specified, you can never use an iterator after calling a method on it. [...] Using the old undefined iterator can be changed and can also lead to changes in the new iterator.

which applies to takeWhile and dropWhile (and span ), but not next or hasNext .

You can use span as in your original solution, but using streams, not iterators, and recursion:

  def split3[T](s: Stream[T])(breakOn: T => Boolean): Stream[List[T]] = s match { case Stream.Empty => Stream.empty case s => { val (a, b) = s.span(!breakOn(_)) a.toList #:: split3(b.dropWhile(breakOn))(breakOn) } } 

But the performance is pretty terrible. I am sure there must be a better way ...

Update 2:. This is a very important solution that has the best performance:

 import scala.collection.mutable.ListBuffer def iterativeSplit4[T](iter: Iterator[T])(breakOn: T => Boolean): Iterator[List[T]] = new Iterator[List[T]] { val word = new ListBuffer[T] def hasNext() = iter.hasNext def next = { var looking = true while (looking) { val c = iter.next if (breakOn(c)) looking = false else word += c } val w = word.toList word.clear() w } }.withFilter(_.nonEmpty) 
+4
source

All Articles