Add a new line to a text file in Spark

I read a text file in Spark using the command

val data = sc.textFile("/path/to/my/file/part-0000[0-4]") 

I would like to add a new line as the header of my file. Is there a way to do this without turning the RDD into an array?

Thanks!

+5
3 answers

Spark automatically treats the part files in a directory as a single dataset, so you can pass the directory path instead of a glob:

 val data = sc.textFile("/path/to/my/file") // Will read all parts. 

Then just prepend a header and write the result out:

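 // A one-element RDD holding the header line.
 val header = sc.parallelize(Seq("...header..."))
 // Prepend it to the data and write the result to a new location.
 val withHeader = header ++ data
 withHeader.saveAsTextFile("/path/to/my/modified-file")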

Note that because this has to read and write all the data, it will be slower than you might intuitively expect. (After all, you are only adding one line!) For this reason, among others, it may be better not to add the header at all and instead store the metadata (the list of columns) separately from the data.
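As a rough illustration of keeping the header apart from the data, here is a minimal sketch. The object name `SeparateHeader`, its helper methods, and the `headerPath` argument are hypothetical, and it assumes a comma-separated list of column names is an acceptable metadata format.

```scala
import org.apache.spark.SparkContext

object SeparateHeader {
  // Write the column names as a one-line text dataset.
  // numSlices = 1 keeps the header in a single part file.
  // saveAsTextFile expects that headerPath does not exist yet.
  def saveHeader(sc: SparkContext, columns: Seq[String], headerPath: String): Unit =
    sc.parallelize(Seq(columns.mkString(",")), numSlices = 1)
      .saveAsTextFile(headerPath)

  // Read the column names back; the data files are never rewritten.
  def loadHeader(sc: SparkContext, headerPath: String): Seq[String] =
    sc.textFile(headerPath).first().split(",").toSeq
}
```

With this approach the large data set is only read when it is actually needed, and changing the column list means rewriting one tiny file.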

+2

You cannot control whether the new line will be the first one (the header) or not, but you can create a new single-element RDD and union it with the existing one:

 val extendedData = data ++ sc.makeRDD(Seq("my precious new line")) 

Then

 extendedData.filter(_ startsWith "my precious").first() 

should confirm that your line was added.
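If you also want to see where the line ended up, one option is to inspect its index rather than only its presence. The following is a minimal sketch, not a definitive recipe: the object `LinePosition` and its method are hypothetical names, and the index reported by `zipWithIndex` follows partition order, which is assumed here to be the order that matters for your output.

```scala
import org.apache.spark.rdd.RDD

object LinePosition {
  // Return the zero-based positions (in partition order) at which `line` occurs.
  def positionsOf(rdd: RDD[String], line: String): Array[Long] =
    rdd.zipWithIndex()
      .filter { case (text, _) => text == line }
      .map { case (_, index) => index }
      .collect()
}
```

For example, `LinePosition.positionsOf(extendedData, "my precious new line")` lets you check the position on your own data instead of assuming it.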

+1

RDDs are immutable: you cannot modify the contents of an RDD after it has been created. You can, however, derive a new RDD from an existing one using RDD transformations.
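To make the point concrete, here is a minimal sketch of a transformation that produces a new RDD while leaving its input untouched. The object name `ImmutableRdd` and the method `upperCased` are hypothetical and exist only for illustration.

```scala
import org.apache.spark.rdd.RDD

object ImmutableRdd {
  // `map` is a transformation: it returns a new RDD and does not alter `lines`.
  def upperCased(lines: RDD[String]): RDD[String] =
    lines.map(_.toUpperCase)
}
```

After calling `ImmutableRdd.upperCased(data)`, `data` still holds the original lines; only the returned RDD contains the upper-cased ones.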

-2
