How to define read/write global variables in Spark

Spark has broadcast variables that are read-only, and accumulator variables that can be updated by nodes but not read. Is there a way - or a workaround - to define a variable that is updatable and can be read?

One use case for such a read/write global variable would be a cache implementation. As files are downloaded and processed as RDDs, calculations are performed. The results of these calculations, produced on several nodes running in parallel, need to be placed into a map whose key is some attribute of the entity being processed. As subsequent entities in the RDDs are processed, the cache is queried.

Scala has ScalaCache, a facade for cache implementations such as Google Guava. But how would such a cache be set up and made available in a Spark app?

The cache can be defined as a variable in the driver application that creates the SparkContext. But then two problems arise:

  • Performance would apparently be poor due to network overhead between the nodes and the driver application.
  • As far as I understand, each task will get its own copy of the variable (the cache, in this case) when the closure referencing it is shipped to the executors. Each task will then have its own copy rather than access to a shared global variable (see the sketch below).
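
For example, a minimal sketch of the second problem I mean (local mode, illustrative names only): mutating a driver-side map from inside an RDD operation never updates the driver's copy.

```scala
import scala.collection.mutable
import org.apache.spark.{SparkConf, SparkContext}

object DriverCacheDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("driver-cache-demo").setMaster("local[2]"))

    // A mutable map defined in the driver; it is captured by the task closure below.
    val cache = mutable.Map.empty[Int, String]

    // Each task works on its own deserialized copy of `cache`,
    // so these puts never reach the driver's instance.
    sc.parallelize(1 to 4).foreach(i => cache.put(i, s"value-$i"))

    println(cache.size) // prints 0: the driver-side map was never updated
    sc.stop()
  }
}
```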

What is the best way to implement and store such a cache?

thanks

1 answer

Well, the best way to do this is not to do it at all. In general, the Spark processing model makes no guarantees* regarding

  • where,
  • when,
  • in what order (except, of course, for the order of transformations defined by the lineage / DAG),
  • and how many times

a given piece of code is executed. Moreover, any updates that depend directly on the Spark architecture are not fine-grained.

These are the properties that make Spark scalable and flexible, but at the same time they make shared mutable state very hard to implement and, most of the time, completely useless.

If you need a simple cache, you have several options:

  • use one of the approaches described by Tzach Zohar in Caching in Spark
  • use local caching (per JVM or per executor thread) combined with application-specific partitioning to keep things local (a sketch of this option follows the list)
  • for communication with external systems, use a node-local cache independent of Spark (for example, an Nginx proxy for HTTP requests)
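
For the second option, one common pattern is a singleton object holding the cache: because the object is initialized lazily on each executor JVM, every task running in that JVM shares the same instance. A minimal sketch, assuming a Guava cache and a hypothetical `expensiveLookup` standing in for the real computation:

```scala
import com.google.common.cache.{Cache, CacheBuilder}
import org.apache.spark.rdd.RDD

// Initialized lazily on each executor JVM; all tasks in that JVM share it.
object LocalCache {
  lazy val underlying: Cache[String, String] =
    CacheBuilder.newBuilder()
      .maximumSize(10000)
      .build[String, String]()
}

object LocalCacheExample {
  // Hypothetical stand-in for the expensive computation the cache should avoid repeating.
  def expensiveLookup(key: String): String = key.reverse

  def withCache(rdd: RDD[String]): RDD[(String, String)] =
    rdd.mapPartitions { iter =>
      iter.map { key =>
        var value = LocalCache.underlying.getIfPresent(key)
        if (value == null) {
          value = expensiveLookup(key)
          LocalCache.underlying.put(key, value)
        }
        (key, value)
      }
    }
}
```

Nothing here is synchronized across executors: each JVM builds up its own cache, which is exactly why this option is usually paired with application-specific partitioning, so that the same keys keep landing on the same executors.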

If the application requires much more complex communication, you can try different message-passing tools to keep a synchronized state, but in general this requires complex and potentially fragile code.
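
As a very rough illustration of that direction, here is a sketch that keeps the shared state in an external key/value store and opens one connection per partition rather than per record. Everything named `KvClient` and `connect` is hypothetical; a real application would substitute an actual client (Redis, memcached, an HTTP service, and so on):

```scala
import org.apache.spark.rdd.RDD

// Hypothetical client interface for whatever external store holds the shared state.
trait KvClient {
  def get(key: String): Option[String]
  def put(key: String, value: String): Unit
  def close(): Unit
}

object ExternalStateSketch {
  // Hypothetical factory; a real application would build a client from its own configuration.
  def connect(): KvClient = ???

  def process(rdd: RDD[String]): RDD[(String, String)] =
    rdd.mapPartitions { iter =>
      val client = connect()                 // one connection per partition, not per record
      val results = iter.map { key =>
        val value = client.get(key).getOrElse {
          val computed = key.reverse         // placeholder for the real computation
          client.put(key, computed)
          computed
        }
        (key, value)
      }.toList                               // force evaluation before closing the connection
      client.close()
      results.iterator
    }
}
```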


* This has partially changed in Spark 2.4 with the introduction of barrier execution mode (SPARK-24795, SPARK-24822).
