Let me break this down.
First, why does HWM not work:
HWM is not an exact limit: the internal buffers are filled and emptied by two separate threads, so the count of available space can lag significantly when there is a lot of activity. The 0MQ zmq_setsockopt page says: "0MQ does not guarantee that the socket will accept as many as ZMQ_SNDHWM messages, and the actual limit may be as much as 60-70% lower depending on the flow of messages on the socket."
Second, why you are losing messages:
As you dump 0.5M (x 20) messages into the socket buffers, you hit the HWM at unpredictable points, and the PUB socket's behavior is then to drop any message it cannot queue.
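To make the drop behavior concrete, here is a toy model in plain Python. It is not ZeroMQ code and does not reproduce the real buffer accounting; it only assumes a bounded queue that a sender fills in batches and a slower drain empties. The function name and parameters are made up for illustration.

```python
from collections import deque


def simulate_burst(n_messages, hwm, drain_per_tick, send_per_tick):
    """Toy model of one PUB-to-subscriber pipe (assumes drain_per_tick >= 1).

    Returns (delivered, dropped).
    """
    queue = deque()
    delivered = dropped = sent = 0
    while sent < n_messages or queue:
        # Sender side: a full queue means the message is dropped, not blocked.
        for _ in range(min(send_per_tick, n_messages - sent)):
            if len(queue) >= hwm:
                dropped += 1
            else:
                queue.append(sent)
            sent += 1
        # Drain side: only a limited number of messages leave per tick.
        for _ in range(min(drain_per_tick, len(queue))):
            queue.popleft()
            delivered += 1
    return delivered, dropped


if __name__ == "__main__":
    # Sender is 5x faster than the drain: the queue fills and messages are lost.
    print(simulate_burst(1000, hwm=100, drain_per_tick=10, send_per_tick=50))
    # One big message needs one slot, so nothing is lost.
    print(simulate_burst(1, hwm=100, drain_per_tick=10, send_per_tick=50))
```

The point of the model is only that a burst faster than the drain rate overflows any finite limit, while a single message never comes near it.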
Third, how to solve this:
There is no reason to split the state into separate messages; the only justification for that would be if the state did not fit into memory, which it easily does. Send it as one multipart message (ZMQ_SNDMORE on every frame except the last); this creates one effective message, which occupies one slot in the outgoing buffer.
Then remove the 500K HWM limit and go back to the default value (1000), which will be more than sufficient.
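A minimal sketch of that fix, assuming the pyzmq binding (the question may well use another language). The helper names, the endpoint, and the frame layout are placeholders, not anything from the original code.

```python
import zmq


def make_publisher(ctx, endpoint):
    """Create a PUB socket with the stock send HWM (hypothetical helper)."""
    pub = ctx.socket(zmq.PUB)
    # 1000 is the libzmq default; set it explicitly, and before bind(),
    # so that it applies to every subscriber pipe.
    pub.setsockopt(zmq.SNDHWM, 1000)
    pub.bind(endpoint)
    return pub


def publish_state(sock, frames):
    """Send all frames as ONE multipart message (hypothetical helper).

    Every frame except the last carries ZMQ_SNDMORE, so the whole state is
    queued atomically and takes a single slot in the outgoing buffer.
    """
    *head, last = frames
    for frame in head:
        sock.send(frame, zmq.SNDMORE)
    sock.send(last)
```

On the subscriber side, `recv_multipart()` returns the frames of one message together, so it sees either the whole state or none of it. pyzmq's `send_multipart(frames)` does the same thing as the loop above; the explicit version is shown only to make the ZMQ_SNDMORE flag visible.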
Fourth, how to achieve better performance:
Obviously, profile and improve your publisher and subscriber code as far as you can; these are the usual bottlenecks.
Then consider some form of compression for the messages, if the state is sparse and you can compress it without too much CPU cost. With 20 subscribers, you will usually gain more in network cost than you lose in CPU cost.
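As a rough illustration of the compression idea, here is a sketch using Python's standard zlib module. The choice of zlib, the fast compression level, and the made-up sparse state are all assumptions; measure with your real data before settling on a codec.

```python
import zlib


def pack_state(state: bytes, level: int = 1) -> bytes:
    """Compress the state before publishing (level 1 favors speed over ratio)."""
    return zlib.compress(state, level)


def unpack_state(payload: bytes) -> bytes:
    """Inverse of pack_state, run on the subscriber side."""
    return zlib.decompress(payload)


if __name__ == "__main__":
    # Hypothetical sparse state: 1 MB of zeros with a few non-zero bytes.
    state = bytearray(1_000_000)
    for offset in range(0, len(state), 100_000):
        state[offset] = 0xFF
    packed = pack_state(bytes(state))
    print(len(state), "bytes raw,", len(packed), "bytes compressed")
```

The publisher compresses once, but the saving applies to every copy that goes out, which is why the trade gets better as the subscriber count rises.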
Finally, if the number of subscribers grows and this is a critical system, look at PGM multicast, which puts each message on the network once regardless of how many subscribers there are, and so effectively cancels out the network cost.