Spark: Broadcast Variables: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation

 class ProdsTransformer:
     def __init__(self):
         self.products_lookup_hmap = {}
         self.broadcast_products_lookup_map = None

     def create_broadcast_variables(self):
         self.broadcast_products_lookup_map = sc.broadcast(self.products_lookup_hmap)

     def create_lookup_maps(self):
         # The code here builds the hashmap that maps Prod_ID to another space.
         ...

 pt = ProdsTransformer()
 pt.create_broadcast_variables()
 pairs = distinct_users_projected.map(
     lambda x: (x.user_id, pt.broadcast_products_lookup_map.value[x.Prod_ID]))

I get the following error:

"Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it runs on workers. For more information, see SPARK-5063."

Any help on how to deal with broadcast variables would be great!

1 answer

By referencing the object containing your broadcast variable inside your map lambda, you make Spark try to serialize the entire object and ship it to the workers. Since that object holds a reference to the SparkContext, you get the error. Instead of this:

 pairs = distinct_users_projected.map(lambda x: (x.user_id, pt.broadcast_products_lookup_map.value[x.Prod_ID])) 

Try the following:

 bcast = pt.broadcast_products_lookup_map
 pairs = distinct_users_projected.map(lambda x: (x.user_id, bcast.value[x.Prod_ID]))

The latter avoids capturing the object reference (pt) in the lambda's closure, so Spark only needs to ship the broadcast variable itself.
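The capture rule can be illustrated in plain Python without a cluster. This is only a sketch: the `Broadcast` class below is a stand-in for Spark's broadcast handle, and the lookup data is made up.

```python
# Minimal stand-in for SparkContext.broadcast: holds a value like Spark's
# Broadcast object does. Not Spark's real API, just an illustration.
class Broadcast:
    def __init__(self, value):
        self.value = value


class ProdsTransformer:
    def __init__(self):
        # Hypothetical lookup data for the sketch.
        self.products_lookup_hmap = {"P1": "catA", "P2": "catB"}
        self.broadcast_products_lookup_map = Broadcast(self.products_lookup_hmap)


pt = ProdsTransformer()

# Pull the broadcast handle into a local variable first. The lambda's closure
# then captures only `bcast`, never `pt` -- in real Spark, capturing `pt`
# would drag the whole object (and its SparkContext reference) into the
# serialized closure.
bcast = pt.broadcast_products_lookup_map
records = [("u1", "P1"), ("u2", "P2")]
pairs = list(map(lambda x: (x[0], bcast.value[x[1]]), records))
```

With real Spark, the same pattern applies verbatim: assign `pt.broadcast_products_lookup_map` to a local name on the driver before using it inside the `map` lambda.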

