I cannot understand why mapWithState needs to be run on a single node
This is not true. Spark uses a HashPartitioner by default to distribute your keys among the different worker nodes in your cluster. If for some reason you see all of your data stored on a single node, check the distribution of your keys; that can happen when something is wrong with how they hash. If you use a custom object as the key, make sure its hashCode method is implemented correctly. To verify this, try using random numbers as your keys and check in the Spark UI whether the behavior changes.
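For illustration, here is a minimal sketch (the BadKey and GoodKey classes are made up for this example) of how HashPartitioner picks a partition purely from the key's hashCode, and why a broken hashCode collapses every key onto one partition:

```scala
import org.apache.spark.HashPartitioner

// A key class with a broken hashCode: every instance hashes to the same value,
// so HashPartitioner sends all of them to the same partition (and hence node).
class BadKey(val id: String) {
  override def hashCode(): Int = 1
  override def equals(o: Any): Boolean = o match {
    case k: BadKey => k.id == id
    case _         => false
  }
}

// A case class gets structural equals/hashCode for free, so keys spread out.
case class GoodKey(id: String)

object KeyDistributionCheck {
  def main(args: Array[String]): Unit = {
    val partitioner = new HashPartitioner(4)

    // Which partitions do eight distinct keys end up in?
    val badSpread  = (1 to 8).map(i => partitioner.getPartition(new BadKey(s"k$i"))).toSet
    val goodSpread = (1 to 8).map(i => partitioner.getPartition(GoodKey(s"k$i"))).toSet

    println(s"partitions used with BadKey:  $badSpread")   // a single partition
    println(s"partitions used with GoodKey: $goodSpread")  // typically several partitions
  }
}
```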
I run mapWithState myself, and the input is partitioned by key, since I also call reduceByKey before holding the state; when I look at the Storage tab in the Spark UI, I see the different RDDs stored on different worker nodes in the cluster.
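As a rough sketch of that setup, assuming a socket source on localhost:9999 and a hypothetical DeviceKey case-class key, the reduceByKey step and the StateSpec can share the same HashPartitioner so the state RDDs stay spread across the executors:

```scala
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

// Case class key: correct equals/hashCode, so HashPartitioner distributes it evenly.
case class DeviceKey(deviceId: String)

object StatefulTotals {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("mapWithState-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/mapWithState-checkpoint") // checkpointing is required for mapWithState

    // Hypothetical input: lines of the form "deviceId count"
    val lines = ssc.socketTextStream("localhost", 9999)
    val pairs = lines.map { line =>
      val Array(id, n) = line.split(" ")
      (DeviceKey(id), n.toLong)
    }

    // Reuse one partitioner for the shuffle and for the state store.
    val partitioner = new HashPartitioner(ssc.sparkContext.defaultParallelism)
    val reduced = pairs.reduceByKey(_ + _, partitioner)

    // Keep a running total per key in the distributed state.
    val updateTotal = (key: DeviceKey, value: Option[Long], state: State[Long]) => {
      val total = state.getOption.getOrElse(0L) + value.getOrElse(0L)
      state.update(total)
      (key, total)
    }

    val totals = reduced.mapWithState(
      StateSpec.function(updateTotal).partitioner(partitioner)
    )
    totals.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Passing the same partitioner to both reduceByKey and the StateSpec keeps the state co-partitioned with its input, so the batch data and the state it updates land on the same executors instead of one node.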