Writing and reading raw byte arrays in Spark using SequenceFile

How do you write an RDD[Array[Byte]] to a file using Apache Spark and read it back again?

Tags: scala, hadoop, hdfs, apache-spark, sequencefile
2 Answers

A common problem is getting a strange cannot-cast exception from BytesWritable to NullWritable. Another common problem is that BytesWritable.getBytes does not do what you expect: it does not return just your bytes. It returns the writable's internal buffer, so your bytes come back with a ton of zeros padded onto the end! You must use copyBytes instead.
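To illustrate that difference before the actual solution below, here is a minimal sketch (my own example, not from the original answer); the exact amount of trailing padding depends on how Hadoop has grown the writable's internal buffer:

 import org.apache.hadoop.io.BytesWritable

 val w = new BytesWritable()
 w.set("foo".getBytes("UTF-8"), 0, 3)   // valid length is 3 bytes

 // getBytes returns the whole backing buffer, which Hadoop over-allocates
 // as the writable grows (e.g. on deserialization), so it can carry
 // trailing zero padding beyond getLength.
 println(w.getLength)           // 3
 println(w.getBytes.length)     // can be > 3, padded with zeros
 println(w.copyBytes().length)  // exactly 3: only the valid bytes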

 import org.apache.hadoop.io.{BytesWritable, NullWritable}
 import org.apache.spark.rdd.RDD

 val rdd: RDD[Array[Byte]] = ???

 // To write
 rdd.map(bytesArray => (NullWritable.get(), new BytesWritable(bytesArray)))
    .saveAsSequenceFile("/output/path", codecOpt)

 // To read
 val rdd: RDD[Array[Byte]] = sc.sequenceFile[NullWritable, BytesWritable]("/input/path")
                               .map(_._2.copyBytes())
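The codecOpt above is not defined in the answer; my assumption is that it stands for the optional compression codec parameter of saveAsSequenceFile, for example:

 import org.apache.hadoop.io.compress.GzipCodec

 // Optional compression codec for saveAsSequenceFile; pass None for no compression.
 val codecOpt = Some(classOf[GzipCodec])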

Here is a snippet with all the necessary imports, which you can run from spark-shell, as @Choix requested:

 import org.apache.hadoop.io.BytesWritable
 import org.apache.hadoop.io.NullWritable

 val path = "/tmp/path"

 val rdd = sc.parallelize(List("foo"))
 val bytesRdd = rdd.map { str => (NullWritable.get, new BytesWritable(str.getBytes)) }
 bytesRdd.saveAsSequenceFile(path)

 val recovered = sc.sequenceFile[NullWritable, BytesWritable]("/tmp/path").map(_._2.copyBytes())
 val recoveredAsString = recovered.map(new String(_))
 recoveredAsString.collect()
 // result is: Array[String] = Array(foo)
