Specifying a File Name When Saving a DataFrame as a CSV

Say I have a Spark DataFrame that I want to save to a CSV file on disk. In Spark 2.0.0+, you can get a DataFrameWriter from a DataFrame (Dataset[Row]) via .write and use its .csv method to write the file.

The function is defined as:

 def csv(path: String): Unit 

where path is the target directory, not the file name.

Spark writes the CSV data into the specified directory as files named part-*.csv.

Is there a way to save the CSV with a specified file name instead of part-*.csv? Or can you at least specify a prefix other than part-r?

The code:

 df.coalesce(1).write.csv("sample_path") 

Current Output:

 sample_path
 +-- part-r-00000.csv

Desired Result:

 sample_path
 +-- my_file.csv

Note: coalesce(1) is used to produce a single output file, and the executor has enough memory to hold the whole DataFrame without an out-of-memory error.

scala csv apache-spark pyspark
1 answer

This can't be done directly with Spark's save.

Spark uses the Hadoop file format, which requires data to be partitioned - hence the part- files. You can easily rename the file after processing, as in this question.

In Scala, it will look like this:

 import org.apache.hadoop.fs._

 val fs = FileSystem.get(sc.hadoopConfiguration)
 // find the single part file Spark wrote into the output directory
 val file = fs.globStatus(new Path("csvDirectory/data.csv/part*"))(0).getPath.getName
 // move it out under the desired name, then remove the output directory
 fs.rename(new Path("csvDirectory/data.csv/" + file), new Path("csvDirectory/mydata.csv"))
 fs.delete(new Path("csvDirectory/data.csv"), true)

or simply:

 import org.apache.hadoop.fs._

 val fs = FileSystem.get(sc.hadoopConfiguration)
 fs.rename(new Path("csvDirectory/data.csv/part-00000"), new Path("csvDirectory/newData.csv"))
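For PySpark users (the question is also tagged pyspark), the same post-write rename can be sketched with the Python standard library alone, assuming the output directory is on the local filesystem (for HDFS or S3, use the Hadoop FileSystem API as in the Scala snippets above). The helper name rename_part_file is mine, not a Spark API:

```python
import glob
import os


def rename_part_file(csv_dir: str, new_name: str) -> str:
    """Rename the single part-* file Spark wrote into csv_dir.

    Assumes df.coalesce(1).write.csv(csv_dir) already ran, so the
    directory contains exactly one part file. Returns the new path.
    """
    part_files = glob.glob(os.path.join(csv_dir, "part-*"))
    if len(part_files) != 1:
        raise RuntimeError(f"expected exactly one part file, found {len(part_files)}")
    target = os.path.join(csv_dir, new_name)
    os.rename(part_files[0], target)
    return target
```

Spark also leaves a _SUCCESS marker and .crc checksum files in the directory; delete those too if you want only the renamed CSV.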

Edit: as mentioned in the comments, you can also write your own OutputFormat; see the Hadoop docs for information on using that approach to set the file name.
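Another option when coalesce(1) is too expensive is to keep the default partitioning and concatenate the part files afterwards, along the lines of Hadoop's FileUtil.copyMerge. A minimal local-filesystem sketch in Python (merge_part_files is a hypothetical helper, and it assumes the parts were written without a header row, since each part would otherwise repeat it):

```python
import glob
import os
import shutil


def merge_part_files(csv_dir: str, dest_file: str) -> None:
    """Concatenate all part-* files in csv_dir into one CSV file.

    Parts are merged in sorted order, which matches the numeric
    ordering of Spark's part-00000, part-00001, ... file names.
    """
    parts = sorted(glob.glob(os.path.join(csv_dir, "part-*")))
    with open(dest_file, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
```

This avoids pulling the whole DataFrame through a single partition; the merge happens after Spark has finished writing.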

