I cannot understand why mapWithState needs to be run on a single node
This is not true. Spark uses a HashPartitioner by default to distribute your keys among the different worker nodes in your cluster. If for some reason you see all of your data stored on a single node, check the distribution of your keys; that can happen when something is wrong with how they hash. If you use a custom object as the key, make sure its hashCode method is implemented correctly. To verify this, try using random numbers as your keys and check in the Spark UI whether the behavior changes.
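For illustration, here is a minimal sketch (the BadKey and GoodKey classes are made up for this example) of how HashPartitioner picks a partition purely from the key's hashCode, and why a broken hashCode collapses every key onto one partition:

```scala
import org.apache.spark.HashPartitioner

// A key class with a broken hashCode: every instance hashes to the same value,
// so HashPartitioner sends all of them to the same partition (and hence node).
class BadKey(val id: String) {
  override def hashCode(): Int = 1
  override def equals(o: Any): Boolean = o match {
    case k: BadKey => k.id == id
    case _         => false
  }
}

// A case class gets structural equals/hashCode for free, so keys spread out.
case class GoodKey(id: String)

object KeyDistributionCheck {
  def main(args: Array[String]): Unit = {
    val partitioner = new HashPartitioner(4)

    // Which partitions do eight distinct keys end up in?
    val badSpread  = (1 to 8).map(i => partitioner.getPartition(new BadKey(s"k$i"))).toSet
    val goodSpread = (1 to 8).map(i => partitioner.getPartition(GoodKey(s"k$i"))).toSet

    println(s"partitions used with BadKey:  $badSpread")   // a single partition
    println(s"partitions used with GoodKey: $goodSpread")  // typically several partitions
  }
}
```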
I run mapWithState myself, and the input is partitioned by key, since I also call reduceByKey before holding the state; when I look at the Storage tab in the Spark UI, I see the different RDDs stored on different worker nodes in the cluster.
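As a rough sketch of that setup, assuming a socket source on localhost:9999 and a hypothetical DeviceKey case-class key, the reduceByKey step and the StateSpec can share the same HashPartitioner so the state RDDs stay spread across the executors:

```scala
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

// Case class key: correct equals/hashCode, so HashPartitioner distributes it evenly.
case class DeviceKey(deviceId: String)

object StatefulTotals {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("mapWithState-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/mapWithState-checkpoint") // checkpointing is required for mapWithState

    // Hypothetical input: lines of the form "deviceId count"
    val lines = ssc.socketTextStream("localhost", 9999)
    val pairs = lines.map { line =>
      val Array(id, n) = line.split(" ")
      (DeviceKey(id), n.toLong)
    }

    // Reuse one partitioner for the shuffle and for the state store.
    val partitioner = new HashPartitioner(ssc.sparkContext.defaultParallelism)
    val reduced = pairs.reduceByKey(_ + _, partitioner)

    // Keep a running total per key in the distributed state.
    val updateTotal = (key: DeviceKey, value: Option[Long], state: State[Long]) => {
      val total = state.getOption.getOrElse(0L) + value.getOrElse(0L)
      state.update(total)
      (key, total)
    }

    val totals = reduced.mapWithState(
      StateSpec.function(updateTotal).partitioner(partitioner)
    )
    totals.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Passing the same partitioner to both reduceByKey and the StateSpec keeps the state co-partitioned with its input, so the batch data and the state it updates land on the same executors instead of one node.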