After scratching my head for some time over a PySpark job stuck at "No tasks have started", I isolated the problem to the following:
Works:
    ssc = HiveContext(sc)
    sqlRdd = ssc.sql(someSql)
    sqlRdd.collect()
Adding repartition() makes it hang at "No tasks have started":
    ssc = HiveContext(sc)
    sqlRdd = ssc.sql(someSql).repartition(2)
    sqlRdd.collect()
This is on Spark 1.2.0 as bundled with CDH 5.
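For reference, here is a minimal end-to-end script reproducing both cases. The table name and query are placeholders (any Hive table that returns rows should do), and it assumes a Hive-enabled Spark 1.2.0 build:

    # Minimal repro sketch; "some_table" is a placeholder Hive table.
    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="repartition-repro")
    ssc = HiveContext(sc)
    someSql = "SELECT * FROM some_table"

    # Case 1: collecting the SQL result directly works.
    print(ssc.sql(someSql).collect())

    # Case 2: adding repartition() hangs at "No tasks have started".
    print(ssc.sql(someSql).repartition(2).collect())
    sc.stop()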