Spark toDebugString not like in python

Question

This is what I get when I use toDebugString in Scala:

    scala> val a = sc.parallelize(Array(1,2,3)).distinct
    a: org.apache.spark.rdd.RDD[Int] = MappedRDD[3] at distinct at <console>:12

    scala> a.toDebugString
    res0: String =
    (4) MappedRDD[3] at distinct at <console>:12
     |  ShuffledRDD[2] at distinct at <console>:12
     +-(4) MappedRDD[1] at distinct at <console>:12
        |  ParallelCollectionRDD[0] at parallelize at <console>:12

This is the equivalent in Python:

    >>> a = sc.parallelize([1,2,3]).distinct()
    >>> a.toDebugString()
    '(4) PythonRDD[6] at RDD at PythonRDD.scala:43\n | MappedRDD[5] at values at NativeMethodAccessorImpl.java:-2\n | ShuffledRDD[4] at partitionBy at NativeMethodAccessorImpl.java:-2\n +-(4) PairwiseRDD[3] at RDD at PythonRDD.scala:261\n | PythonRDD[2] at RDD at PythonRDD.scala:43\n | ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:315'

As you can see, the Python output is much harder to read than the Scala output: the whole lineage comes back as a single quoted string with literal \n sequences. Is there any trick to get better output from this feature?

I am using Spark 1.1.0.

+7

python scala apache-spark

poiuytrez Oct 13 '14 at 14:15


2 answers

It was not executed, just cached. You should use:

    >>> a = sc.parallelize([1,2,3]).distinct()
    >>> a.collect()
    [1, 2, 3]

0

user3409371 Dec 7 '15 at 14:08


Josh Rosen · Accepted Answer · 2014-10-13T14:55:36+0000

Try adding a print statement so that the debug string is actually printed, rather than displayed as its __repr__:

    >>> a = sc.parallelize([1,2,3]).distinct()
    >>> print a.toDebugString()
    (8) PythonRDD[27] at RDD at PythonRDD.scala:44 [Serialized 1x Replicated]
     |  MappedRDD[26] at values at NativeMethodAccessorImpl.java:-2 [Serialized 1x Replicated]
     |  ShuffledRDD[25] at partitionBy at NativeMethodAccessorImpl.java:-2 [Serialized 1x Replicated]
     +-(8) PairwiseRDD[24] at distinct at <stdin>:1 [Serialized 1x Replicated]
        |  PythonRDD[23] at distinct at <stdin>:1 [Serialized 1x Replicated]
        |  ParallelCollectionRDD[21] at parallelize at PythonRDD.scala:358 [Serialized 1x Replicated]
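The difference between the two outputs can be reproduced without a Spark cluster. The sketch below uses the debug string copied from the question and contrasts what the REPL echoes (the repr) with what print shows. It is written with Python 3 syntax, unlike the Python 2 transcript above. The helper name lineage_text is hypothetical, and the bytes branch assumes that PySpark on Python 3 returns bytes rather than str from toDebugString(), which is worth checking against your installed version.

```python
# Debug string copied from the question's Python REPL output (no Spark needed).
debug = (
    "(4) PythonRDD[6] at RDD at PythonRDD.scala:43\n"
    " | MappedRDD[5] at values at NativeMethodAccessorImpl.java:-2\n"
    " | ShuffledRDD[4] at partitionBy at NativeMethodAccessorImpl.java:-2\n"
    " +-(4) PairwiseRDD[3] at RDD at PythonRDD.scala:261\n"
    " | PythonRDD[2] at RDD at PythonRDD.scala:43\n"
    " | ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:315"
)


def lineage_text(debug_string):
    """Hypothetical helper: return a toDebugString() result as printable text.

    Assumption: the value may be bytes (PySpark on Python 3) or str.
    """
    if isinstance(debug_string, bytes):
        return debug_string.decode("utf-8")
    return debug_string


# What the interactive shell echoes: a single line containing literal \n.
print(repr(debug))

# What print shows: one RDD per line, as in the Scala shell.
print(lineage_text(debug))
```

With a live SparkContext, the same idea would be print(lineage_text(a.toDebugString())).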
