Is Python smart enough to replace function calls with a constant result?

Coming from the beautiful world of C, I am trying to understand this behavior:

    In [1]: dataset = sqlContext.read.parquet('indir')
    In [2]: sizes = dataset.mapPartitions(lambda x: [len(list(x))]).collect()
    In [3]: for item in sizes:
       ...:     if(item == min(sizes)):
       ...:         count = count + 1
       ...:

It wouldn't even finish in 20 minutes, and I know that the sizes list is not that big, less than 205k in length. However, this finishes instantly:

    In [8]: min_item = min(sizes)
    In [9]: for item in sizes:
       ...:     if(item == min_item):
       ...:         count = count + 1
       ...:

So what happened?

My guess: Python could not work out that min(sizes) would always return the same value, and so did not replace the call with its result after the first few iterations, perhaps because Python uses an interpreter.
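That guess can be checked empirically. Because min is just a name that is looked up at call time and may be rebound at any moment, CPython cannot assume the call is pure, so nothing is folded to a constant. A minimal sketch (the spying_min wrapper is a made-up helper, not anything from the question) that counts how often min() actually runs:

```python
import builtins

calls = []
real_min = builtins.min

def spying_min(seq):
    calls.append(len(seq))   # observable side effect: record every invocation
    return real_min(seq)

builtins.min = spying_min    # any code may legally rebind `min` at runtime

sizes = [3, 1, 2]
for item in sizes:
    if item == min(sizes):   # the interpreter must call min() afresh each time
        pass

builtins.min = real_min      # restore the builtin
print(len(calls))            # → 3: one call per iteration, nothing was folded
```

Since rebinding builtins like this is legal, a constant-folding optimization for min() would change observable behavior, which is exactly why CPython does not attempt it.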


The min() documentation does not say anything that would explain this to me. I thought at first that the partitions might be relevant here, but it should not be so, since sizes is a list, not an RDD!


Edit:

Here is the source of my confusion; I wrote a similar program in C:

    for(i = 0; i < SIZE; ++i)
        if(i == mymin(array, SIZE))
            ++count;

and got these timings:

    C02QT2UBFVH6-lm:~ gsamaras$ gcc -Wall main.c
    C02QT2UBFVH6-lm:~ gsamaras$ ./a.out
    That took 98.679177000 seconds wall clock time.
    C02QT2UBFVH6-lm:~ gsamaras$ gcc -O3 -Wall main.c
    C02QT2UBFVH6-lm:~ gsamaras$ ./a.out
    That took 0.000000000 seconds wall clock time.

and for the timings, I used Nominal Animal's approach from Time Dimension.

1 answer

I'm by no means an expert on Python's internal workings, but from my understanding so far, you would like to compare the speed of

    for item in sizes:
        if(item == min(sizes)):
            count = count + 1

and

    min_item = min(sizes)
    for item in sizes:
        if(item == min_item):
            count = count + 1
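For anyone who wants to reproduce the gap without Spark, here is a minimal timeit sketch; the sizes list below is a made-up stand-in for the real partition sizes (the question says the real list was under 205k elements):

```python
import timeit

# Hypothetical stand-in data: the minimum (1) appears exactly once
sizes = list(range(2000, 0, -1))

def count_min_inside():
    count = 0
    for item in sizes:
        if item == min(sizes):   # re-scans the whole list every iteration: O(n^2) overall
            count = count + 1
    return count

def count_min_hoisted():
    count = 0
    min_item = min(sizes)        # the list is scanned once: O(n) overall
    for item in sizes:
        if item == min_item:
            count = count + 1
    return count

slow = timeit.timeit(count_min_inside, number=1)
fast = timeit.timeit(count_min_hoisted, number=1)
print(count_min_inside() == count_min_hoisted())  # → True: same answer, very different cost
```

At 200k elements the quadratic version performs on the order of 4×10^10 comparisons, which is consistent with the "did not finish in 20 minutes" observation.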

Now, someone correct me if I have any of this wrong, but:

Python lists are mutable and have no fixed length, and are treated as such, while in C an array has a fixed size. From this question:

Python lists are very flexible and can hold completely heterogeneous, arbitrary data, and they can be appended to very efficiently, in amortized constant time. If you need to shrink and grow your array efficiently and without hassle, they are the way to go. But they use a lot more space than C arrays.
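The amortized-constant append claim from that quote can be observed directly: CPython over-allocates a list's backing storage in chunks, so the vast majority of appends do not reallocate. A small sketch (sys.getsizeof reflects the currently allocated capacity):

```python
import sys

xs = []
capacities = set()
for i in range(1000):
    capacities.add(sys.getsizeof(xs))  # size in bytes changes only on reallocation
    xs.append(i)

# Far fewer distinct capacities than appends: most appends were O(1)
print(len(capacities) < 100)  # → True
```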

Now take this example

    for item in sizes:
        if(item == min(sizes)):
            new_item = item - 1
            sizes.append(new_item)

Here the value of min(sizes) can differ on the next iteration. Python does not cache the result of min(sizes), as doing so would break the above example, or would require some logic to check whether the list has been changed. Instead, it leaves that to you: by defining min_item = min(sizes), you essentially do the caching of the result yourself.
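A stripped-down illustration of why a cached result would be unsafe (a minimal sketch, not the exact code above):

```python
sizes = [3, 1, 2]
first = min(sizes)    # 1
sizes.append(0)       # the loop body above mutates the very list being scanned
second = min(sizes)   # 0: a cached min(sizes) would now be stale
print(first, second)  # → 1 0
```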

Now, since an array has a fixed size in C, the min value can be found with less overhead than for a Python list, so I think that is why this is not a problem in C (C also being a much lower-level language).

Again, I don't fully understand the underlying code and compilation model of Python, but I am sure that if you profile the looping process in Python, you will see min(sizes) being computed repeatedly, causing an extreme amount of lag. I would love to know more about Python's internal workings (for example, are any method calls cached inside a loop, or is everything computed again on each iteration?), so if anyone has more information and/or corrections, let me know!
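One way to peek at this without reading CPython's source is the standard dis module: disassembling the loop shows that min is looked up and called inside the loop body, once per iteration, with nothing hoisted or cached. A minimal sketch:

```python
import dis

def count_min_calls(sizes):
    count = 0
    for item in sizes:
        if item == min(sizes):   # this call sits inside the loop body
            count = count + 1
    return count

# The bytecode contains a load of the global name `min` inside the loop,
# meaning the lookup and call are repeated on every iteration.
names = [ins.argval for ins in dis.Bytecode(count_min_calls)
         if ins.opname in ("LOAD_GLOBAL", "LOAD_NAME")]
print("min" in names)  # → True
```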
