Error writing repartitioned SchemaRDD to Parquet using Spark SQL

I am trying to write Spark SQL tables to Parquet files. For unrelated reasons I need to reduce the number of partitions before writing. My code is:

data.coalesce(1000, shuffle = true).saveAsParquetFile("s3n://...")

It throws

java.lang.NullPointerException
    at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:927)
    at org.apache.spark.rdd.PartitionCoalescer.currPrefLocs(CoalescedRDD.scala:174)
    at org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:191)
    at org.apache.spark.rdd.PartitionCoalescer$LocationIterator$$anonfun$4$$anonfun$apply$2.apply(CoalescedRDD.scala:190)
    at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
    at org.apache.spark.rdd.PartitionCoalescer$LocationIterator.<init>(CoalescedRDD.scala:185)
    at org.apache.spark.rdd.PartitionCoalescer.setupGroups(CoalescedRDD.scala:236)
    at org.apache.spark.rdd.PartitionCoalescer.run(CoalescedRDD.scala:337)
    at org.apache.spark.rdd.CoalescedRDD.getPartitions(CoalescedRDD.scala:83)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
    at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1128)
    at org.apache.spark.sql.parquet.InsertIntoParquetTable.saveAsHadoopFile(ParquetTableOperations.scala:318)
    at org.apache.spark.sql.parquet.InsertIntoParquetTable.execute(ParquetTableOperations.scala:246)
    at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:409)
    at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:409)
    at org.apache.spark.sql.SchemaRDDLike$class.saveAsParquetFile(SchemaRDDLike.scala:77)
    at org.apache.spark.sql.SchemaRDD.saveAsParquetFile(SchemaRDD.scala:103)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:22)
    at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:27)
    at $iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
    at $iwC$$iwC$$iwC.<init>(<console>:31)
    at $iwC$$iwC.<init>(<console>:33)
    at $iwC.<init>(<console>:35)
    at <init>(<console>:37)
    at .<init>(<console>:41)
    at .<clinit>(<console>)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
    at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
    at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
    at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
    at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
    at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
    at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
    at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
    at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)
    at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)
    at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
    at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
    at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
    at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902)
    at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997)
    at org.apache.spark.repl.Main$.main(Main.scala:31)
    at org.apache.spark.repl.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

The code works fine if I take out the coalesce step. Whether I keep shuffle=true or use repartition instead, it gives the same error. I am using spark-1.1.0.
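For reference, here is a minimal sketch of the variants described above, assuming a spark-shell session where sc is the SparkContext; the table name and S3 paths are placeholders rather than the ones from my actual job:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext._

// Hypothetical source; in my job `data` is a SchemaRDD produced by Spark SQL.
val data = sqlContext.sql("SELECT * FROM some_table")

// Works: save without changing the number of partitions.
data.saveAsParquetFile("s3n://bucket/path/plain")

// Throws the NullPointerException above: coalesce before saving.
data.coalesce(1000, shuffle = true).saveAsParquetFile("s3n://bucket/path/coalesced")

// Also fails with the same error: repartition before saving.
data.repartition(1000).saveAsParquetFile("s3n://bucket/path/repartitioned")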

apache-spark apache-spark-sql parquet

No one has answered this question yet.

See similar questions:

9
Spark SQL unable to complete writing Parquet data with a large number of shards
8
EntityTooLarge error when uploading a 5G file to Amazon S3
7
Saving a >>25T SchemaRDD in Parquet format on S3
