Specifying a File Name When Saving a DataFrame as a CSV

Say I have a Spark DataFrame that I want to save to a CSV file on disk. In Spark 2.0.0+, you can get a DataFrameWriter from a DataFrame (Dataset[Row]) via .write and use its .csv method to write the file.

The function is defined as:

 def csv(path: String): Unit 

where path is the target directory, not the file name.

Spark writes the CSV data into the specified directory as files named part-*.csv.

Is there a way to save the CSV with a specified file name instead of part-*.csv? Or can you at least specify a prefix other than part-r?

The code:

 df.coalesce(1).write.csv("sample_path") 

Current Output:

 sample_path
 +-- part-r-00000.csv

Desired Result:

 sample_path
 +-- my_file.csv

Note: coalesce(1) is used to produce a single output file, and the executor has enough memory to hold the whole DataFrame without an out-of-memory error.

scala csv apache-spark pyspark
1 answer

This can't be done directly with Spark's save.

Spark uses the Hadoop file format, which requires data to be partitioned - hence the part- files. You can easily rename the file after processing, as in this question.

In Scala, it will look like this:

 import org.apache.hadoop.fs._

 val fs = FileSystem.get(sc.hadoopConfiguration)
 // find the single part file Spark wrote into the output directory
 val file = fs.globStatus(new Path("csvDirectory/data.csv/part*"))(0).getPath.getName
 // move it out under the desired name, then remove the output directory
 fs.rename(new Path("csvDirectory/data.csv/" + file), new Path("csvDirectory/mydata.csv"))
 fs.delete(new Path("csvDirectory/data.csv"), true)

or simply:

 import org.apache.hadoop.fs._

 val fs = FileSystem.get(sc.hadoopConfiguration)
 fs.rename(new Path("csvDirectory/data.csv/part-00000"), new Path("csvDirectory/newData.csv"))
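For PySpark users (the question is also tagged pyspark), the same post-write rename can be sketched with the Python standard library alone, assuming the output directory is on the local filesystem (for HDFS or S3, use the Hadoop FileSystem API as in the Scala snippets above). The helper name rename_part_file is mine, not a Spark API:

```python
import glob
import os


def rename_part_file(csv_dir: str, new_name: str) -> str:
    """Rename the single part-* file Spark wrote into csv_dir.

    Assumes df.coalesce(1).write.csv(csv_dir) already ran, so the
    directory contains exactly one part file. Returns the new path.
    """
    part_files = glob.glob(os.path.join(csv_dir, "part-*"))
    if len(part_files) != 1:
        raise RuntimeError(f"expected exactly one part file, found {len(part_files)}")
    target = os.path.join(csv_dir, new_name)
    os.rename(part_files[0], target)
    return target
```

Spark also leaves a _SUCCESS marker and .crc checksum files in the directory; delete those too if you want only the renamed CSV.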

Edit: as mentioned in the comments, you can also write your own OutputFormat; see the Hadoop docs for information on using that approach to set the file name.
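Another option when coalesce(1) is too expensive is to keep the default partitioning and concatenate the part files afterwards, along the lines of Hadoop's FileUtil.copyMerge. A minimal local-filesystem sketch in Python (merge_part_files is a hypothetical helper, and it assumes the parts were written without a header row, since each part would otherwise repeat it):

```python
import glob
import os
import shutil


def merge_part_files(csv_dir: str, dest_file: str) -> None:
    """Concatenate all part-* files in csv_dir into one CSV file.

    Parts are merged in sorted order, which matches the numeric
    ordering of Spark's part-00000, part-00001, ... file names.
    """
    parts = sorted(glob.glob(os.path.join(csv_dir, "part-*")))
    with open(dest_file, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
```

This avoids pulling the whole DataFrame through a single partition; the merge happens after Spark has finished writing.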

