I am creating a sample dataset from some DataFrame df using
rdd = df.limit(10000).rdd
This operation takes quite some time (why, actually? Can it not short-circuit after 10,000 rows?), so I assume that I now have a new RDD.
However, when I then work on rdd, it consists of different rows each time I access it, as if it resamples all over again. Caching the RDD helps a bit, but surely that is not a reliable solution?
What is the reason for this?
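For reference, the caching workaround I mean looks roughly like this (a sketch; df is the DataFrame from above):

sample = df.limit(10000).cache()
sample.count()  # force materialization so later actions reuse the cached rows
rdd = sample.rdd
# If cached partitions get evicted, Spark recomputes the limit and may pick
# different rows again, which would explain why this only helps "a bit"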
Update: here is a reproduction on Spark 1.5.2
from operator import add
from pyspark.sql import Row
rdd = sc.parallelize([Row(i=i) for i in range(1000000)], 100)
rdd1 = rdd.toDF().limit(1000).rdd
for _ in range(3):
    print(rdd1.map(lambda row: row.i).reduce(add))
Output:
499500
19955500
49651500
I am surprised that .rdd does not pin down the data.
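One way to actually pin the sample seems to be collecting it and re-parallelizing it (a sketch; it pulls all sampled rows to the driver, so it is only viable for small samples):

fixed = sc.parallelize(rdd1.collect())
for _ in range(3):
    print(fixed.map(lambda row: row.i).reduce(add))  # prints the same sum every time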
EDIT:
To show that this still happens, here it is on Spark 2.0.0.2.5.0:
from pyspark.sql import Row
rdd = sc.parallelize([Row(i=i) for i in range(1000000)], 200)
rdd1 = rdd.toDF().limit(12345).rdd
rdd2 = rdd1.map(lambda x: (x, x))
rdd2.join(rdd2).count()
Because limit is re-evaluated non-deterministically, the two sides of the join "see" different rows, and the count changes from run to run instead of always being 12345.
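For comparison, imposing a total order before the limit appears to make the result deterministic (a sketch, at the cost of a full sort):

rdd1 = rdd.toDF().sort("i").limit(12345).rdd
rdd2 = rdd1.map(lambda x: (x, x))
rdd2.join(rdd2).count()  # consistently 12345, since both sides now see the same rows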