How to split Spark RDD between 2 Spark contexts?

I have an RMI cluster. Each RMI server has a Spark context. Is there a way to share an RDD between different Spark contexts?

+8
apache-spark rdd
3 answers

As Daniel Darabos has already stated, this is not possible. Every distributed object in Spark is bound to the specific context that was used to create it (SparkContext in the case of an RDD, SQLContext in the case of a DataFrame). If you want to share objects between applications, you have to use shared contexts (see, for example, spark-jobserver, Livy, or Apache Zeppelin). The RDD or DataFrame itself is just a small local object, so there really is not much to share.
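A minimal sketch of what that binding means in practice (local mode, hypothetical app name): every RDD keeps a handle to the SparkContext that created it and can no longer be computed once that context is stopped.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ContextBindingDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical local-mode context, just for illustration.
    val sc  = new SparkContext(new SparkConf().setAppName("ctx-demo").setMaster("local[*]"))
    val rdd = sc.parallelize(1 to 10)

    // An RDD carries a reference to the SparkContext that created it;
    // its lineage and scheduling exist only inside that context.
    assert(rdd.sparkContext eq sc)

    sc.stop()
    // rdd.count()  // would now fail: the owning SparkContext has been shut down
  }
}
```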

Sharing data is a completely different problem. You can use a specialized in-memory cache (Apache Ignite) or a distributed in-memory file system (for example, Alluxio, the former Tachyon) to minimize the latency when switching between applications, but you cannot really avoid it.

+14

No, an RDD is bound to a single SparkContext. The general idea is that you have a Spark cluster and one driver program that tells the cluster what to do. That driver program holds the SparkContext and runs the operations on the RDDs.

If you just want to move an RDD from one driver program to another, the solution is to write it to disk (S3 / HDFS / ...) in the first driver and load it from disk in the other driver.
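A sketch of that hand-off, assuming both drivers can reach the same shared storage (the path and element type here are made up for illustration):

```scala
import org.apache.spark.SparkContext

// Driver A: materialize the RDD to shared storage.
def exportRdd(sc: SparkContext): Unit = {
  val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))
  rdd.saveAsObjectFile("hdfs:///tmp/shared-rdd")   // hypothetical path
}

// Driver B (separate application, separate SparkContext): reload it.
def importRdd(sc: SparkContext): Unit = {
  val rdd = sc.objectFile[Int]("hdfs:///tmp/shared-rdd")
  println(rdd.count())
}
```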

+3

In my understanding, an RDD is not data as such, but a way to create data via transformations/filters applied to the original data.

Another idea is to share the final data. So you would store the RDD's output in a data store, for example:

- HDFS (Parquet files, etc.; see the sketch below)
- Elasticsearch
- Apache Ignite (in-memory)
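For the HDFS/Parquet option, a sketch of the usual pattern (paths and schema are illustrative): one application persists the final result as a DataFrame, and another application, with its own context, reads it back.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("share-via-parquet").getOrCreate()
import spark.implicits._

// Application A: persist the final result.
val result = Seq(("a", 1), ("b", 2)).toDF("key", "value")
result.write.mode("overwrite").parquet("hdfs:///data/shared-result")  // hypothetical path

// Application B: read the same result with its own SparkSession.
val reloaded = spark.read.parquet("hdfs:///data/shared-result")
reloaded.show()
```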

I think you will like Apache Ignite: https://ignite.apache.org/features/igniterdd.html

Apache Ignite provides an implementation of the Spark RDD abstraction which allows you to easily share state in memory across multiple Spark jobs, either within the same application or between different Spark applications.

IgniteRDD is implemented as a view over a distributed Ignite cache, which may be deployed either within the Spark job's executing process, on a Spark worker, or in its own cluster.
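A rough sketch of that shared-RDD usage, based on the ignite-spark module's IgniteContext/IgniteRDD API (the cache name and config path are assumptions):

```scala
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("ignite-share"))

// IgniteContext wraps the SparkContext and connects to an Ignite cluster
// described by the given Spring XML config (path is hypothetical).
val ic = new IgniteContext(sc, "config/ignite-config.xml")

// IgniteRDD is a live view over the named Ignite cache; another Spark
// application pointing at the same cache sees the same key/value pairs.
val sharedRdd = ic.fromCache[String, Int]("sharedNumbers")

// Application A could write:
sharedRdd.savePairs(sc.parallelize(1 to 100).map(i => (i.toString, i)))

// Application B (with its own SparkContext and IgniteContext) could read:
println(sharedRdd.filter(_._2 > 50).count())
```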

(I'll let you dig through the documentation to find what you are looking for.)

0
