How to merge an iterator parser

I have Iterator[(A1,B1)] and two functions

  • fA: (Iterator[A1]) => Iterator[A2] and
  • fB: (Iterator[B1]) => Iterator[B2] .

Is it possible to make fAB: (Iterator[(A1,B1)]) => Iterator[(A2,B2)] without converting iterators to Seq?

Edit

Both answers below are good. I chose @Aivean answer because the code is simpler and uses a specialized scala data structure (stream).

The only drawback is the stack limitation, but this should not be a problem for most use cases. If your iterator can be very (very) long, @Alexey is recommended.

+7
iterator scala
source share
2 answers

I came up with a much simpler implementation:

 def iterUnzip[A1, B1, A2, B2](it: Iterator[(A1, B1)], fA: (Iterator[A1]) => Iterator[A2], fB: (Iterator[B1]) => Iterator[B2]) = it.toStream match { case s => fA(s.map(_._1).toIterator).zip(fB(s.map(_._2).toIterator)) } 

The idea is to convert an iterator to a stream. Stream in Scala is lazy, but also provides memoization . This effectively provides the same buffering mechanism as @AlexeyRomanovโ€™s solution, but more compressed. The only drawback is that Stream stores memoized elements on the stack as opposed to an explicit queue, so if fA and fB create elements at an uneven speed, you can get a StackOverflow exception.

Test that the score is lazy:

 val iter = Stream.from(0).map(x => (x, x + 1)) .map(x => {println("fetched: " + x); x}).take(5).toIterator iterUnzip( iter, (_:Iterator[Int]).flatMap(x => List(x, x)), (_:Iterator[Int]).map(_ + 1) ).toList 

Result:

 fetched: (0,1) iter: Iterator[(Int, Int)] = non-empty iterator fetched: (1,2) fetched: (2,3) fetched: (3,4) fetched: (4,5) res0: List[(Int, Int)] = List((0,2), (0,3), (1,4), (1,5), (2,6)) 

I also tried hard enough to get a StackOverflow exception by creating uneven iterators, but couldn't.

 val iter = Stream.from(0).map(x => (x, x + 1)).take(10000000).toIterator iterUnzip( iter, (_:Iterator[Int]).flatMap(x => List.fill(1000000)(x)), (_:Iterator[Int]).map(_ + 1) ).size 

Works great on -Xss5m and produces:

 res10: Int = 10000000 

Thus, in general, this is a reasonably good and concise solution if you do not have any extreme cases.

+2
source share

Not. Your hypothetical function should first call one of fA and fB . Let them say that he calls fA , and he asks for all A1 before doing anything. Then you do not have B1 left to go to fB unless you save them somewhere, potentially leaking memory. If acceptable, you can do:

 def unzip[A, B](iter: Iterator[(A, B)]) = { var qA = Queue.empty[A] var qB = Queue.empty[B] val iterA = new Iterator[A] { override def hasNext = qA.nonEmpty || iter.hasNext override def next() = qA.dequeueOption match { case Some((a, qA1)) => qA = qA1 a case None => val (a, b) = iter.next() qB = qB.enqueue(b) a } } // similar iterB (iterA, iterB) } 

and then

 val (iterA, iterB) = unzip(iterator) fA(iterAfA).zip(fB(iterB)) 

(Well, you can also write iterator => fA(iterator.map(_._1)).zip(fB(iterator.map(_._2)) : it is of the correct type, but probably not the one you if you want, namely, it will use only one field of each tuple created by the original iterator and discard the other.)

+3
source share

All Articles