Initialization actions are the best way to do this. They are shell scripts that run on every node when a cluster is created, which lets you customize the cluster, for example by installing Python libraries. The scripts must be stored in Google Cloud Storage, and you can specify them when creating a cluster through the Google Cloud SDK or in the Google Developers Console.
Here is an example of an initialization action that installs Python pandas only on the master node when the cluster is created.
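As a rough sketch of the SDK route, the script is first copied to a Cloud Storage bucket and then passed to the cluster creation command with the `--initialization-actions` flag. The bucket, script, cluster, and region names below are hypothetical placeholders, and the commands are only printed rather than executed, so adjust them to your project before running anything.

```shell
#!/bin/bash
# Sketch only: all four names below are hypothetical placeholders.
BUCKET="gs://my-bucket"
SCRIPT="install-pandas.sh"
CLUSTER="my-cluster"
REGION="us-central1"

# Print the commands instead of running them so the sketch is safe to try;
# remove the leading "echo" on each line to execute them for real.
upload_and_create() {
  echo gsutil cp "${SCRIPT}" "${BUCKET}/${SCRIPT}"
  echo gcloud dataproc clusters create "${CLUSTER}" \
    --region "${REGION}" \
    --initialization-actions "${BUCKET}/${SCRIPT}"
}

upload_and_create
```

Printing the commands first makes it easy to check the Cloud Storage path before a cluster is actually created.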
    #!/bin/bash
    # Read this node's role from the instance metadata.
    ROLE=$(/usr/share/google/get_metadata_value attributes/role)
    # Install pandas only on the master node.
    if [[ "${ROLE}" == 'Master' ]]; then
      apt-get install python-pandas -y
    fi
As you can see from this script, you can determine the node's role with /usr/share/google/get_metadata_value attributes/role, and then perform an action specifically on the master (or worker) node.
You can learn more in the Google Dataproc docs.
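To illustrate branching on both roles, here is a minimal sketch. The helper name `action_for_role` is hypothetical, the `Worker` role value is an assumption based on the `Master` value in the script, and the worker package is purely illustrative. The helper only prints the command it would run, so it can be tried outside a cluster.

```shell
#!/bin/bash
# Sketch only: action_for_role is a hypothetical helper, and the 'Worker'
# value is an assumption mirroring the 'Master' value in the original script.
action_for_role() {
  case "$1" in
    Master) echo "apt-get install -y python-pandas" ;;
    Worker) echo "apt-get install -y python-numpy" ;;  # illustrative package
    *)      echo "" ;;                                 # unknown role: do nothing
  esac
}

# On a real node the role would come from the metadata helper:
#   ROLE=$(/usr/share/google/get_metadata_value attributes/role)
# Here it defaults to 'Master' so the sketch runs anywhere.
ROLE="${ROLE:-Master}"
echo "Role ${ROLE}: $(action_for_role "${ROLE}")"
```

Keeping the decision in one function makes it easier to see which nodes get which packages before the script is attached to a cluster.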
James