I find it difficult to understand how Spark interacts with storage.
I would like to create a Spark cluster that reads data from a RocksDB database (or any other key-value store). However, at this point the best I can do is load the entire dataset from the database into memory on each node of the cluster (into a map, for example) and build the RDD from that object.
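To make the question concrete, here is roughly what my current approach looks like (a minimal sketch; the database path is a placeholder, and the whole-database scan on the driver is exactly the part I want to avoid):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.rocksdb.{Options, RocksDB}
import scala.collection.mutable.ArrayBuffer

object CollectThenParallelize {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rocksdb-collect"))

    // Problem: read the WHOLE database into local memory first.
    RocksDB.loadLibrary()
    val db = RocksDB.open(new Options().setCreateIfMissing(false), "/path/to/rocksdb")
    val all = ArrayBuffer.empty[(Array[Byte], Array[Byte])]
    val it = db.newIterator()
    it.seekToFirst()
    while (it.isValid) {
      all += ((it.key(), it.value()))
      it.next()
    }
    db.close()

    // Then build the RDD from the in-memory collection.
    val rdd = sc.parallelize(all)
    println(rdd.count())
    sc.stop()
  }
}
```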
What do I need to do so that Spark pulls only the data each task actually needs, the way it does with HDFS? I have read about Hadoop InputFormats and RecordReaders, but I don't quite understand what exactly I should implement.
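For illustration, here is my best guess at the skeleton involved, based on what I have read. `RocksDbInputFormat` is a name I made up (there is no such class in any library I know of), and the `???` bodies are precisely the parts I don't know how to fill in:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.BytesWritable
import org.apache.hadoop.mapreduce.{InputFormat, InputSplit, JobContext, RecordReader, TaskAttemptContext}
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical skeleton of a custom InputFormat for a key-value store.
class RocksDbInputFormat extends InputFormat[BytesWritable, BytesWritable] {
  // Decide how to partition the key space, e.g. one split per key range.
  override def getSplits(context: JobContext): java.util.List[InputSplit] = ???

  // Return a reader that iterates over the keys and values of one split only.
  override def createRecordReader(split: InputSplit,
                                  context: TaskAttemptContext): RecordReader[BytesWritable, BytesWritable] = ???
}

object ReadViaInputFormat {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rocksdb-inputformat"))
    // As I understand it, Spark schedules one task per InputSplit,
    // so each executor would read only its own slice of the data.
    val rdd = sc.newAPIHadoopRDD(
      new Configuration(),
      classOf[RocksDbInputFormat],
      classOf[BytesWritable],
      classOf[BytesWritable])
    println(rdd.count())
    sc.stop()
  }
}
```

Is this the right shape, and if so, what do `getSplits` and `createRecordReader` need to do for a store like RocksDB?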
I know this is a fairly broad question, but I would really appreciate any help getting started. Thanks in advance.
hadoop apache-spark rocksdb
Pablodeacero