Spark has broadcast variables, which are read-only, and accumulators, which can be updated by the workers but only read on the driver. Is there a way - or a workaround - to define a variable that is both updatable and readable?
One use case for such a read/write global variable would be a cache. Files are loaded and processed as an RDD, and a calculation is performed on them. The results of these calculations, produced on several nodes working in parallel, should be put into a map keyed by some attributes of the object being processed. When subsequent objects in the RDD are processed, the cache is queried.
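To make the intent concrete, here is a minimal single-JVM sketch of the pattern I have in mind; the record type, attribute name, and calculation are purely illustrative. This works inside one process, but it is exactly the part I do not know how to share across a Spark cluster:

```scala
import scala.collection.concurrent.TrieMap

// Illustrative record type and calculation; the real attributes and logic differ.
case class Record(key: String, payload: String)
def expensiveCalculation(r: Record): Int = r.payload.length

// The kind of read/write map I would like all tasks to share.
val sharedCache = TrieMap.empty[String, Int]

// Later records consult the map before recomputing.
def process(r: Record): Int =
  sharedCache.getOrElseUpdate(r.key, expensiveCalculation(r))
```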
Scala has ScalaCache, a facade over cache implementations such as Google Guava. But how would such a cache be set up and made available in a Spark application?
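Roughly, I imagine the cache being defined like the sketch below, using Guava directly (ScalaCache would just be a facade over something like this); the object name and sizing are made up. My understanding is that such an object is instantiated once per JVM, so on a cluster each executor would end up with its own separate cache rather than one shared cache:

```scala
import com.google.common.cache.{Cache, CacheBuilder}

// One instance per JVM; each Spark executor would get its own copy.
object ResultCache {
  lazy val cache: Cache[String, java.lang.Integer] =
    CacheBuilder.newBuilder()
      .maximumSize(10000)
      .build[String, java.lang.Integer]()
}

// Intended usage inside the processing code:
//   ResultCache.cache.put(key, result)
//   Option(ResultCache.cache.getIfPresent(key))
```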
The cache can be defined as a variable in the driver application that creates the SparkContext. But then two problems arise:
- Performance would presumably be poor because of the network overhead between the worker nodes and the driver application.
- As far as I understand, a copy of the variable (the cache, in this case) is shipped to each task when the variable is first referenced inside a function passed to an RDD operation. Each task would then have its own copy rather than access to a shared global variable (see the sketch after this list).
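If my understanding of the second point is right, a small experiment along the following lines should demonstrate it (the local master, object name, and sample data are only for the demo): the driver-side map captured by the closure stays empty, because each task mutates its own deserialized copy.

```scala
import scala.collection.mutable
import org.apache.spark.{SparkConf, SparkContext}

object ClosureCopyDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("closure-copy-demo").setMaster("local[2]"))

    // A mutable "cache" defined on the driver.
    val cache = mutable.Map.empty[String, Int]

    // The closure captures `cache`; Spark serializes it and ships a copy to
    // each task, so the writes below land in task-local copies only.
    sc.parallelize(Seq("alpha", "beta", "gamma")).foreach { key =>
      cache(key) = key.length
    }

    // Expected to still be empty on the driver: the tasks' updates are never
    // sent back, unlike an accumulator's.
    println(s"driver-side cache after the job: $cache")

    sc.stop()
  }
}
```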
What is the best way to implement and store such a cache?
Thanks