We have a custom distributed data warehouse. We know all of its internal components and can access the data directly on disk.
I am exploring the possibility of deploying Apache Spark directly on top of it.
What would be the best / recommended way to do this?
- A custom RDD (extending RDD)
- Or extending FileInputFormat?
Is one easier than the other? Does one give better performance, etc.?
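To make the first option concrete, this is roughly what I have in mind. It is only a minimal sketch, not a working connector: `WarehouseRDD`, `WarehousePartition`, the one-file-per-partition layout and the `hostsByPath` lookup are hypothetical names and assumptions, and it reads plain text lines where the real code would decode our own on-disk format. It assumes Spark 2.4 or later (for the typed `addTaskCompletionListener`) and that each file is readable on the node where the task runs.

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

import scala.io.Source

// Hypothetical: one Spark partition per warehouse file.
case class WarehousePartition(index: Int, path: String) extends Partition

// Hypothetical custom RDD that reads warehouse files directly from local disk.
// hostsByPath maps each file to the hosts that hold it (assumed to come from
// the warehouse's own metadata).
class WarehouseRDD(
    sc: SparkContext,
    paths: Seq[String],
    hostsByPath: Map[String, Seq[String]])
  extends RDD[String](sc, Nil) { // Nil: no parent RDDs, this is a source

  // Tell Spark how the data set is split up.
  override protected def getPartitions: Array[Partition] =
    paths.zipWithIndex.map { case (path, i) =>
      WarehousePartition(i, path): Partition
    }.toArray

  // Read one partition; runs on an executor.
  override def compute(split: Partition, context: TaskContext): Iterator[String] = {
    val part = split.asInstanceOf[WarehousePartition]
    val source = Source.fromFile(part.path)
    // Close the file when the task finishes, whether it succeeds or fails.
    context.addTaskCompletionListener[Unit](_ => source.close())
    source.getLines()
  }

  // Locality hint so the scheduler prefers nodes that hold the file.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    hostsByPath.getOrElse(split.asInstanceOf[WarehousePartition].path, Nil)
}
```

It would be used like `new WarehouseRDD(sc, paths, hostsByPath).count()`. As far as I understand, the FileInputFormat route would instead plug into `sc.newAPIHadoopFile` and express the same splitting and locality information through Hadoop input splits and a record reader.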
Thanks for the help.
Ariel