What is the best way to get Spark to work on top of a distributed database: a custom RDD or a FileInputFormat?

We have a custom distributed data warehouse. We know all of its internal components and can access the data directly on disk.

I am exploring the possibility of deploying Apache Spark directly on top of it.

What would be the best / recommended way to do this?

  • A custom RDD (i.e. extending RDD)
  • Or extending FileInputFormat?

Is one easier than the other? Does one give better performance, etc.?
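To make the first option concrete, here is a minimal sketch of what a custom RDD over such a warehouse could look like. Everything warehouse-specific here (`WarehouseClient`, `WarehouseShard`, `readShard`, the byte-array record type) is a hypothetical stand-in, not a real API:

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical client for the warehouse's on-disk format.
object WarehouseClient {
  def readShard(id: Int): Iterator[Array[Byte]] = Iterator.empty // stub
}

// One Spark partition per on-disk shard of the warehouse.
case class WarehouseShard(id: Int, host: String) extends Partition {
  override def index: Int = id
}

class WarehouseRDD(sc: SparkContext, shards: Seq[WarehouseShard])
    extends RDD[Array[Byte]](sc, Nil) {

  // Tell Spark how the dataset is split up.
  override protected def getPartitions: Array[Partition] =
    shards.toArray

  // Tell the scheduler where each shard's data lives, so tasks
  // can be placed for data locality.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    Seq(split.asInstanceOf[WarehouseShard].host)

  // Runs on the executor: read the shard directly from local disk.
  override def compute(split: Partition, context: TaskContext): Iterator[Array[Byte]] = {
    val shard = split.asInstanceOf[WarehouseShard]
    WarehouseClient.readShard(shard.id)
  }
}
```

The three methods shown (`getPartitions`, `getPreferredLocations`, `compute`) are the core of any custom RDD; a FileInputFormat would instead answer the same questions (splits, locations, record reading) through Hadoop's InputFormat/RecordReader interfaces.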

Thanks for the help.


Since Spark 1.2 there is an external data sources API in Spark SQL. It lets you plug your own storage system into Spark SQL as a first-class data source. Unlike a custom RDD written against the core RDD API, a data source integrates with the query planner, and the data it exposes is available to Spark as a SchemaRDD.
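For illustration, a minimal data source under this API looks roughly like the sketch below. Note the package layout shifted between Spark 1.2 and 1.3 (this uses the 1.3+ locations), and the warehouse-specific parts (the `table` parameter, the single-column schema, the placeholder scan) are assumptions:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Entry point Spark SQL looks up when the user names this package
// as a format; it turns options into a relation.
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    WarehouseRelation(parameters("table"))(sqlContext)
}

// A relation describes the schema and how to produce rows.
case class WarehouseRelation(table: String)(@transient val sqlContext: SQLContext)
  extends BaseRelation with TableScan {

  // Assumed schema; in practice derived from the warehouse's metadata.
  override def schema: StructType =
    StructType(StructField("value", StringType) :: Nil)

  // Full-table scan; replace the placeholder with an RDD that reads
  // your storage (e.g. the custom RDD from the question).
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row("example")))
}
```

Mixing in `PrunedScan` or `PrunedFilteredScan` instead of `TableScan` lets Spark SQL push column pruning and filters down into your scan, which is where this approach can beat a plain custom RDD on performance.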

API:

