Separate an RDD in Apache Spark so that a single partition consists of a single file

I am creating one RDD from 2 csv files, like so:

val combineRDD = sc.textFile("D://release//CSVFilesParellel//*.csv")

Then I want to define a custom partitioning on this RDD so that each partition contains exactly one file, and each partition, i.e. one csv file, is processed on one node for faster data processing.

Is it possible to write a custom split based on the size of a file, the number of lines in a file, or an end-of-file character?

How do I achieve this?

The structure of one file looks something like this:

00-00
Time (in seconds) Measure1 Measure2 Measure3 ..... Measuren
0
0.25
0.50
0.75
1
...
3600


1. The first line of data contains hours:mins. Each file contains data for 1 hour, i.e. 3600 seconds.

2. Readings are taken 4 times per second, i.e. one reading every 250 ms. What I want is:

  1. The hours:mins header line (e.g. 00-00) identifies the hour that each file covers.

  2. One file → one partition of the RDD, so that no file is split across partitions.

  3. Each partition, i.e. one file, is then processed on one node, so the files are handled in parallel and processing is faster.

Thanks and regards,

Vinay Joglekar


A few things to keep in mind:

  • In BigData the code is moved to the data, not the data to the code, and parallelism is achieved by splitting the input into blocks/splits.
  • The more splits a file yields, the more parallelism you get.
  • Whether a file can be split is determined by the input format (TextInputFormat) and the compression codec: plain text is splittable, gzip is not splittable, lzo is splittable (lzo only with an index).
  • A gzip file is therefore read entirely by one task, which incidentally gives one file per partition, at the cost of no parallelism within a file.
  • To achieve any kind of custom split, extend FileInputFormat and write your own split logic and recordReader (a sketch follows after the link below).

To get a feel for how this is done, have a look at:

http://bytepadding.com/big-data/spark/combineparquetfileinputformat/
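As a sketch of that last point, assuming plain (uncompressed) text files, one way to get one file per partition is an input format that simply refuses to split. The class name NonSplittableTextInputFormat below is mine for illustration, not a library class:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.JobContext
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Refuses to split any file, so one file = one input split = one partition.
class NonSplittableTextInputFormat extends TextInputFormat {
  override protected def isSplitable(context: JobContext, file: Path): Boolean = false
}

val oneFilePerPartition = sc.newAPIHadoopFile(
    "D://release//CSVFilesParellel//*.csv",
    classOf[NonSplittableTextInputFormat],
    classOf[LongWritable],
    classOf[Text])
  .map { case (_, line) => line.toString }

Every matched csv file then arrives in its own partition, regardless of its size.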


As far as I know, there is no direct way to do this, but here is a workaround that achieves the same effect: tag every record with the name of the file it came from, then repartition by that filename.
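The snippet below assumes a CsvRecord class that carries the filename; it is not shown in the answer, so here is a minimal hypothetical version:

// Hypothetical record type: just tags each csv line with its source file.
case class CsvRecord(filename: String, line: String)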

// create firstRDD, tagging every record with the attribute `filename=first.csv`
val firstRDD = sc.textFile("D://release//CSVFilesParellel//first.csv")
    .map(line => new CsvRecord("first.csv", line))

// create secondRDD, tagging every record with the attribute `filename=second.csv`
val secondRDD = sc.textFile("D://release//CSVFilesParellel//second.csv")
    .map(line => new CsvRecord("second.csv", line))

// now create a pair RDD keyed by filename and re-partition on that key
val partitionRDD = firstRDD.union(secondRDD)
    .map(csvRecord => (csvRecord.filename, csvRecord))
    .partitionBy(customFilenamePartitioner)

Here customFilenamePartitioner is an instance of a class that extends org.apache.spark.Partitioner and implements three methods:

numPartitions: Int, which returns the number of partitions that will be created.

getPartition(key: Any): Int, which returns the partition ID (from 0 to numPartitions - 1) for the given key.

equals(), the standard Java equality method. This matters because Spark compares your Partitioner against other instances when it decides whether two of your RDDs are partitioned in the same way.
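Putting those three methods together, a sketch of such a partitioner could look like this (the class name FilenamePartitioner and its constructor argument are mine, for illustration):

import org.apache.spark.Partitioner

// Maps each known filename to its own partition ID;
// records with an unknown key fall back to partition 0.
class FilenamePartitioner(filenames: Seq[String]) extends Partitioner {
  private val indexByFile = filenames.zipWithIndex.toMap

  override def numPartitions: Int = filenames.size

  override def getPartition(key: Any): Int = key match {
    case filename: String => indexByFile.getOrElse(filename, 0)
    case _ => 0
  }

  override def equals(other: Any): Boolean = other match {
    case p: FilenamePartitioner => p.indexByFile == indexByFile
    case _ => false
  }

  override def hashCode: Int = indexByFile.hashCode
}

// e.g. for the two files above:
val customFilenamePartitioner = new FilenamePartitioner(Seq("first.csv", "second.csv"))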

Keep in mind that partitionBy shuffles the data across the network, which is an expensive operation, so only repartition the RDD when the downstream processing really benefits from it.

