Scala quirk in this while loop

Yesterday, this piece of code gave me a headache. I worked around it by reading the file line by line. Any ideas why it fails?

The while loop never starts, although the number of lines in the file is greater than 1.

    val lines = Source.fromFile( new File("file.txt") ).getLines;
    println( "total lines:"+lines.size );
    var starti = 1;
    while( starti < lines.size ){
      val nexti = Math.min( starti + 10, lines.size );
      println( "batch ("+starti+", "+nexti+") total:" + lines.size )
      val linesSub = lines.slice(starti, nexti)
      //do something with linesSub
      starti = nexti
    }
4 answers

This is really tricky, and I would even say that it is a bug in Iterator. getLines returns an Iterator, which is evaluated lazily. So when you ask for lines.size, the iterator walks through the whole file to count the lines. After that, it is "exhausted":

    scala> val lines = io.Source.fromFile(new java.io.File("....txt")).getLines
    lines: Iterator[String] = non-empty iterator

    scala> lines.size
    res4: Int = 15

    scala> lines.size
    res5: Int = 0

    scala> lines.hasNext
    res6: Boolean = false

As you can see, the second call to size returns zero.

There are two solutions: either force the iterator into something "stable", for example lines.toSeq, or forget about size and iterate the "normal" way:

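    while (lines.hasNext) {
      // take returns another lazy iterator, so materialize the batch first;
      // calling size on it directly would exhaust it before it can be used
      val linesSub = lines.take(10).toList
      println("batch:" + linesSub.size)
      // do something with linesSub
    }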

None of the above answers hit the nail on the head.

There is a good reason why an Iterator is returned. Being lazy takes pressure off the heap, and the String representing each line can be garbage collected as soon as you have finished with it. For large files, this can be essential to avoid an OutOfMemoryError.

Ideally, you will work directly with the iterator, and not force it into a strict collection type.

Using grouped, then, as suggested by om-nom-nom:

    for (linesSub <- lines grouped 10) {
      //do something with linesSub
    }

And if you want to keep the println counter, zip the batches with their index:

    for ( (linesSub, batchIdx) <- (lines grouped 10).zipWithIndex ) {
      println("batch " + batchIdx)
      //do something with linesSub
    }
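If you also want the (start, end) offsets that the question printed, they can be derived from the batch index without ever asking the iterator for its size. The following is only a sketch: the object name BatchRanges, the method ranges, and the in-memory sample data are illustrative assumptions, standing in for the iterator returned by getLines.

```scala
object BatchRanges {
  // Computes (start, endExclusive) line offsets per batch.
  // Only one batch of lines is held in memory at a time.
  def ranges(lines: Iterator[String], batchSize: Int): List[(Int, Int)] =
    lines.grouped(batchSize).zipWithIndex.map { case (batch, idx) =>
      val start = idx * batchSize
      (start, start + batch.size) // the last batch may be shorter
    }.toList

  def main(args: Array[String]): Unit = {
    // Hypothetical stand-in for Source.fromFile(...).getLines
    val lines = Iterator.fill(25)("some line")
    ranges(lines, 10).foreach { case (from, until) =>
      println("batch (" + from + ", " + until + ")")
    }
  }
}
```

The offsets here start at 0, whereas the question started at 1; adjust the start value if the first line is meant to be skipped.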

If you really need the total count, call getLines twice: once for counting and a second time for the actual line processing.
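A rough sketch of that two-pass approach follows. It assumes the file is opened afresh for each pass, since a line iterator that has already been consumed cannot be rewound. The names TwoPass, withLines and countThenBatch are made up for illustration, and the temporary file only exists to make the example self-contained.

```scala
import java.nio.file.{Files, Path}
import scala.io.Source

object TwoPass {
  // Opens the file, hands a fresh line iterator to f, and always closes the file.
  // f must finish consuming the iterator before it returns.
  def withLines[A](path: Path)(f: Iterator[String] => A): A = {
    val source = Source.fromFile(path.toFile)
    try f(source.getLines()) finally source.close()
  }

  // Returns the total line count and the size of each batch.
  def countThenBatch(path: Path, batchSize: Int): (Int, List[Int]) = {
    // First pass: count only, nothing is retained
    val total = withLines(path)(lines => lines.size)
    // Second pass: the real processing (here it just records batch sizes)
    val batchSizes = withLines(path) { lines =>
      lines.grouped(batchSize).map(batch => batch.size).toList
    }
    (total, batchSizes)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample file so the sketch can run on its own
    val tmp = Files.createTempFile("lines", ".txt")
    Files.write(tmp, (1 to 25).map(i => "line " + i).mkString("\n").getBytes("UTF-8"))
    try println(countThenBatch(tmp, 10))
    finally Files.delete(tmp)
  }
}
```

Reading the file twice costs extra I/O, but memory use stays flat, which is the point of keeping the iterator lazy.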


The second time you call lines.size, it returns 0. This is because lines is an iterator, not an array.
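The difference can be reproduced without a file. In this sketch an in-memory iterator stands in for the result of getLines; the object and method names are illustrative assumptions.

```scala
object IteratorVsSeq {
  // Hypothetical stand-in for Source.fromFile(...).getLines
  def sampleLines: Iterator[String] = Iterator("a", "b", "c")

  // Calls size twice on the same iterator
  def sizeTwiceOnIterator(): (Int, Int) = {
    val it = sampleLines
    val first = it.size  // walks the whole iterator to count
    val second = it.size // nothing is left to count
    (first, second)
  }

  // Calls size twice on a strict copy of the same data
  def sizeTwiceOnVector(): (Int, Int) = {
    val vec = sampleLines.toVector
    (vec.size, vec.size)
  }

  def main(args: Array[String]): Unit = {
    println("iterator: " + sizeTwiceOnIterator())
    println("vector:   " + sizeTwiceOnVector())
  }
}
```

In the question, the first lines.size happens in the "total lines" println, so by the time the while condition is evaluated the iterator is already empty.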


I rewrote your code using Seq, as suggested in @0__'s answer:

    val batchSize = 10
    val lines = Source.fromFile("file.txt").getLines.toSeq
    println( "total lines:"+lines.length )
    var processed = 0
    lines.grouped(batchSize).foreach( batch => {
      println( "batch ("+processed+","+(processed+Math.min(lines.length-processed,batchSize))+") total:"+lines.length )
      processed = processed + batchSize
      //do something with batch
    } )
