How to install Python libraries automatically when starting a Dataproc cluster?

How can I automatically install Python libraries on my Dataproc cluster when the cluster starts? This would save me from having to log in to the master and/or worker nodes to install the libraries I need by hand.

It would also be great to know whether this automatic installation can be applied only to the master, and not to the workers.

1 answer

Initialization actions are the best way to do this. They are shell scripts that run on the cluster's nodes when the cluster is created, which lets you customize the cluster, for example by installing Python libraries. The scripts must be stored in Google Cloud Storage, and you can specify them when creating a cluster through the Google Cloud SDK or the Google Developers Console.
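For instance, the upload and cluster-creation steps might look roughly like the sketch below. The bucket, script, cluster, and region names are all placeholders I made up, and the small run helper is only there so the script prints the commands by default instead of executing them; set DRY_RUN=0 to run them for real. Depending on your gcloud version, the --region flag may or may not be required.

```shell
#!/bin/bash
# Hypothetical names; replace with your own bucket, script, cluster, and region.
BUCKET="gs://my-bucket"
SCRIPT="install-pandas.sh"
CLUSTER="my-cluster"
REGION="us-central1"

# Print each command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Upload the initialization action to Cloud Storage.
run gsutil cp "${SCRIPT}" "${BUCKET}/${SCRIPT}"

# Create the cluster and point it at the uploaded script.
run gcloud dataproc clusters create "${CLUSTER}" \
  --region "${REGION}" \
  --initialization-actions "${BUCKET}/${SCRIPT}"
```

The script named in --initialization-actions then runs on the nodes as they come up, so no manual login is needed.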

Here is an example of an initialization action that installs Python pandas when the cluster is created, only on the master node.

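    #!/bin/bash
    # Read this node's role from the cluster metadata.
    ROLE=$(/usr/share/google/get_metadata_value attributes/role)
    # Install pandas only when this node is the master.
    if [[ "${ROLE}" == 'Master' ]]; then
      apt-get install python-pandas -y
    fi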

As you can see from this script, you can determine a node's role with /usr/share/google/get_metadata_value attributes/role and then perform an action only on the master (or worker) node.
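If you want different packages on the master and the workers, one way to sketch it is to map the role to a package list. This assumes the role value is 'Master' on the master node (as in the script above) and treats anything else as a worker; the packages_for_role helper and the package choices are hypothetical, just for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print the apt packages to install for a given role.
packages_for_role() {
  case "$1" in
    Master) echo "python-pandas python-numpy" ;;  # master gets pandas too
    *)      echo "python-numpy" ;;                # any other role is treated as a worker
  esac
}

METADATA=/usr/share/google/get_metadata_value

# Only act on a node where the metadata helper exists (i.e. on the cluster).
if [ -x "${METADATA}" ]; then
  ROLE=$("${METADATA}" attributes/role)
  # Word splitting is intentional: each package becomes its own argument.
  apt-get install -y $(packages_for_role "${ROLE}")
fi
```

Keeping the role-to-packages mapping in one function makes it easy to change what each kind of node gets without touching the install step.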

You can learn more in the Google Cloud Dataproc documentation.
