Unexpected Scala Collection Memory Behavior

Question

The following Scala code (on 2.9.2):

    var a = ( 0 until 100000 ).toStream

    for ( i <- 0 until 100000 ) {
      val memTot = Runtime.getRuntime().totalMemory().toDouble / ( 1024.0 * 1024.0 )
      println( i, a.size, memTot )
      a = a.map(identity)
    }

uses an increasing amount of memory at each iteration of the loop. If a is defined as ( 0 until 100000 ).toList, then memory usage is stable (give or take GC).

I understand that streams are evaluated lazily but retain their elements once they have been generated. But it seems that in my code above, each new stream (generated by the last line of the loop) somehow keeps a reference to the previous streams. Can someone help explain?

memory-management collections scala stream

Alex wilson Feb 15 '13 at 14:46

1 answer

Tomasz Nurkiewicz · Accepted Answer · 2013-02-15T15:10:14+0000

This is what happens. A Stream is always evaluated lazily, but elements that have already been computed are cached for later. The lazy evaluation is crucial. Take a look at this piece of code:

 a = a.flatMap( v => Some( v ) )

Although it looks like you transformed one Stream into another and discarded the old one, this is not what happens. The new Stream still holds a reference to the old one. This is because the resulting Stream is not supposed to compute all the elements of the underlying stream eagerly, but to do so on demand. Take this as an example:

    io.Source.fromFile("very-large.file").getLines().toStream.
      map(_.trim).
      filter(_.contains("X")).
      map(_.substring(0, 10)).
      map(_.toUpperCase)

You can chain as many operations as you want, but the file is barely touched: only the first line is read. Each subsequent operation simply wraps the previous Stream, holding a reference to the underlying stream. The moment you ask for size or call foreach, the evaluation begins.
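To make the laziness and the caching concrete, here is a minimal sketch. The names `LazyStreamDemo` and `evaluated` are made up for illustration, and the counts mentioned afterwards assume the standard library `Stream`, where mapping computes the head right away and leaves the tail for later.

```scala
object LazyStreamDemo {
  // Returns how many times the mapping function has run:
  // after map, after the first size call, after the second size call.
  def run(): (Int, Int, Int) = {
    var evaluated = 0
    val s = (1 to 5).toStream.map { x => evaluated += 1; x * 2 }
    val afterMap = evaluated          // only the head has been computed so far

    val firstSize = s.size            // forces every element of the stream
    val afterFirstSize = evaluated

    val secondSize = s.size           // elements are cached, nothing is recomputed
    val afterSecondSize = evaluated

    (afterMap, afterFirstSize, afterSecondSize)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Under those assumptions the three counts should come out as 1, 5 and 5: the function runs once during `map`, four more times when `size` walks the stream, and not at all on the second `size`.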

Back to your code. In the second iteration you create a third stream that holds a reference to the second one, which in turn holds a reference to the one you originally defined. Basically you have a growing stack of fairly large objects.
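One way to see that a mapped stream really depends on its source is to count how many cells the source has built. This is only a sketch: `StreamChainDemo`, `from` and `sourceCells` are hypothetical names introduced here, not part of the question's code.

```scala
object StreamChainDemo {
  // Returns (source cells built before forcing, source cells built after forcing,
  // the third element of the mapped stream).
  def run(): (Int, Int, Int) = {
    var sourceCells = 0

    // Hypothetical helper: an endless stream that counts each cell it builds.
    def from(n: Int): Stream[Int] = { sourceCells += 1; n #:: from(n + 1) }

    val source = from(0)
    val mapped = source.map(_ + 1)    // the tail of `mapped` is not evaluated yet
    val before = sourceCells

    val third = mapped(2)             // forcing `mapped` pulls new cells out of `source`
    (before, sourceCells, third)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The source only grows when `mapped` is forced, which is possible only because the unevaluated part of `mapped` still holds on to `source`. Repeating `a = a.map(...)` in a loop stacks one such dependency on top of another.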

But this does not explain why memory runs out so fast. The crucial part is println(), or a.size to be precise. Without printing (and thus forcing evaluation of the whole Stream), the Stream remains unevaluated. An unevaluated stream does not cache any values, so it is very slim. Memory would still leak because of the growing chain of streams referencing one another, but much, much more slowly.

This raises the question: why does it work with toList? It's quite simple. List.map() eagerly creates a new List. Period. The previous one is no longer referenced and is eligible for GC.
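For contrast, a small sketch of the eager behaviour of `List.map`. The names `EagerListDemo` and `evaluated` are again made up for illustration.

```scala
object EagerListDemo {
  // Returns how many times the mapping function ran inside map,
  // together with the resulting list.
  def run(): (Int, List[Int]) = {
    var evaluated = 0
    val xs = (1 to 5).toList
    val ys = xs.map { x => evaluated += 1; x * 2 }   // every element is computed here
    (evaluated, ys)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

By the time `map` returns, the function has already run for all five elements, so the new list is complete and needs nothing further from `xs`.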
