Why does df.limit keep changing in Pyspark?

I am creating a sample from some data frame df using

rdd = df.limit(10000).rdd

This operation takes quite some time (why, really? Can it not short-circuit after 10,000 rows?), so I assume I now have a new RDD.

However, when I then work on rdd, it contains different rows each time I access it, as if it were resampled all over again. Caching the RDD helps a bit, but surely that is not the right way to do it?

What is the reason for this?

Update: here is a reproduction on Spark 1.5.2

from operator import add
from pyspark.sql import Row
rdd=sc.parallelize([Row(i=i) for i in range(1000000)],100)
rdd1=rdd.toDF().limit(1000).rdd
for _ in range(3):
    print(rdd1.map(lambda row:row.i).reduce(add))

The output:

499500
19955500
49651500
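
For reference, the first value is exactly the sum 0 + 1 + ... + 999, so that run happened to see rows 0 through 999, while the later runs clearly saw different rows:

sum(range(1000))  # 499500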

I am surprised that .rdd does not pin down the data.

EDIT: here is a second reproduction, this time on Spark 2.0.0.2.5.0

from pyspark.sql import Row
rdd=sc.parallelize([Row(i=i) for i in range(1000000)],200)
rdd1=rdd.toDF().limit(12345).rdd
rdd2=rdd1.map(lambda x:(x,x))
rdd2.join(rdd2).count()
# result is 10240 despite doing a self-join

Apparently limit is evaluated independently on each side of the join, so the two sides do not see the same rows, and the count comes out different from the "expected" value (i.e. 12345).


The rdd you get back from limit is not materialized; it is just a recipe, and every action re-evaluates it. That said, I could not reproduce rows changing between actions with RDDs built this way in pyspark:

>>> d = [{'name': 'Alice', 'age': 1, 'pet': 'cat'}, {'name': 'Bob', 'age': 2, 'pet': 'dog'}]
>>> df = sqlContext.createDataFrame(d)
>>> rdd = df.limit(1).rdd

Then, acting on rdd repeatedly:

>>> def p(x):
...    print x
...

>>> rdd.foreach(p)
Row(age=1, name=u'Alice', pet=u'cat')
>>> rdd.foreach(p)
Row(age=1, name=u'Alice', pet=u'cat')
>>> rdd.foreach(p)
Row(age=1, name=u'Alice', pet=u'cat')

The same row comes back every time.
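
As an aside, rdd.foreach runs the function on the executors, so the printed row only shows up in the shell when running in local mode; a driver-side way to run the same check is, for example:

for row in rdd.collect():   # bring the (single) limited row back to the driver
    print(row)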


Spark is distributed, so in general there is no guarantee about which rows limit() returns. It simply takes the first rows it manages to get, and which rows those are depends on which partitions or files happen to be read first (e.g. if the data is stored in 10 Parquet files, a limit of 1 might pull its row from file 7 on one run and from a different file on the next).
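
If the sample has to contain the same rows on every run, one option (a sketch, not something this answer prescribes) is to impose an explicit order before limiting, assuming there is a column that uniquely orders the rows, here called i as in the reproduction above:

# df: the original DataFrame from the question; "i" is assumed to be a column
# that gives a total order over the rows.
stable_sample = df.orderBy("i").limit(10000)  # "first 10,000" is now well defined
stable_rdd = stable_sample.rdd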


Spark is distributed, and the data in a DataFrame has no intrinsic order. Your snippet asks for the "first" 10,000 rows of the DataFrame, but "first" is not well defined: which rows are reached first depends on partitioning, scheduling and which tasks happen to finish first, none of which Spark guarantees to keep stable between runs. Since the limited RDD is re-evaluated on every action, each action can end up with a different set of rows.

Even when you cache the data, I still wouldn't rely on getting the same data back every time, although I would certainly expect it to be more consistent than reading from disk.
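
To make the caching variant concrete, here is a minimal sketch, with the same caveat that evicted partitions can be recomputed and may then contain different rows:

# Materialize the limited rows once so that later actions reuse them.
sample = df.limit(10000).cache()
sample.count()     # forces one evaluation and populates the cache
rdd = sample.rdd   # should now see the same rows on every action,
                   # unless cached partitions are evicted and recomputed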
