How do I get Hadoop to find imported Python modules when using a Python UDF in Pig?

I am using Pig (0.9.1) with UDFs written in Python. The Python scripts import modules from the Python standard library. I am able to run Pig scripts that invoke the Python UDFs successfully in local mode, but when I run on the cluster, the Hadoop jobs that Pig launches apparently cannot find the imported modules. What needs to be done?

For example:

  • Do I need to install Python (or Jython) on every task tracker node?
  • Do I need to install the Python (or Jython) modules on every task tracker node?
  • Do the task tracker nodes need to know how to find the modules?
  • If so, how do I specify the path (through an environment variable, and how is that set for the task tracker)?
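For illustration, here is a minimal sketch of the kind of UDF in question (the file name, function, and field names are hypothetical): a Jython UDF that imports a module from the Python standard library.

  # udfs.py -- hypothetical minimal example of the setup described above.
  # Registered from the Pig Latin script with:
  #   register 'udfs.py' using jython as myfuncs;
  import re   # standard-library import that must resolve on the cluster, not just locally

  @outputSchema('cleaned:chararray')   # decorator supplied by Pig's Jython engine at registration
  def strip_punctuation(value):
      # Runs inside the Jython interpreter on a task tracker node, so 're'
      # has to be importable there as well as on the submitting machine.
      return re.sub(r'[^\w\s]', '', value)

Something like this runs fine in local mode but fails on the cluster as described above.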
+7
3 answers

Do I need to install Python (or Jython) on every task tracker node?

Yes, since that is where the UDF is executed.

Do I need to install the Python (or Jython) modules on every task tracker node?

If you use third-party modules (for example, geoip), they also have to be installed on the task trackers.
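For example, a hedged sketch of what such a UDF might look like (the module name and lookup call are hypothetical, purely for illustration):

  # geo_udf.py -- hypothetical sketch of a UDF with a third-party dependency.
  # The import is resolved by the Jython interpreter on the task tracker,
  # not on the machine that submits the Pig script, so the package has to
  # be installed (or be on PYTHONPATH) on every task node.
  import geoip   # hypothetical third-party package

  @outputSchema('country:chararray')   # decorator supplied by Pig's Jython engine
  def country_of(ip_address):
      return geoip.lookup(ip_address)   # hypothetical call, for illustration only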

Do the task tracker nodes need to know how to find the modules? If so, how do I specify the path (through an environment variable, and how is that set for the task tracker)?

To quote from Programming Pig:

register is also used to locate resources for Python UDFs that you use in your Pig Latin scripts. In this case you do not register a jar, but rather a Python script that contains your UDF. The Python script must be in your current directory.

This is also important:

Be aware that Pig does not trace dependencies inside your Python scripts and send the needed Python modules to your Hadoop cluster. You are required to make sure the modules you need reside on the task nodes in your cluster and that the PYTHONPATH environment variable is set on those nodes so that your UDFs can find them for import. This issue has been fixed after 0.9, but at the time of this writing the fix has not yet been released.

And if you use Jython:

Pig does not know where the Jython interpreter is on your system, so you must include jython.jar in your classpath when invoking Pig. This can be done by setting the PIG_CLASSPATH environment variable.

To sum up: if you are using streaming, you can use the SHIP command in Pig, which will ship your executables to the cluster. If you are using a UDF, then as long as it compiles (note the remark about Jython above) and has no third-party dependencies that you have not already put on PYTHONPATH or installed on the cluster, the UDF is shipped to the cluster at execution time. (As a tip, it will make your life much easier to keep your UDF's simple dependencies in the same folder as the Pig script that registers it.)
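One practical way to check whether the task nodes can actually see your modules (my own sketch, not from the book) is a throwaway UDF that just reports the interpreter's module search path:

  # path_debug.py -- hedged debugging sketch: call this UDF on any field and
  # inspect the output to see what sys.path is inside the task tracker's
  # Jython interpreter, i.e. whether PYTHONPATH took effect on that node.
  import sys

  @outputSchema('paths:chararray')   # decorator supplied by Pig's Jython engine
  def module_search_path(ignored):
      return ';'.join(str(p) for p in sys.path)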

Hope this clears things up.

+6

Adding

pig -Dmapred.child.env="JYTHONPATH=job.jar/Lib" script.pig 

works. Note that you can also add the following lines to your Python script:

  import sys
  sys.path.append('./Lib')

Also note that you will still get numerous "module not found" warnings, but the fix works. The fact that these warnings appear even though the modules were actually found is incredibly confusing; I kept killing the Hadoop job before it could complete, believing the warnings were a symptom of the fix not working.
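For completeness, a hedged sketch of how the whole script might be laid out (the module and function names are hypothetical): the sys.path tweak has to run before the import that depends on it.

  # my_udf.py -- hypothetical layout: append to sys.path first, then import.
  import sys
  sys.path.append('./Lib')   # make the Lib directory shipped inside job.jar importable

  import helper              # hypothetical module that now resolves from ./Lib

  @outputSchema('out:chararray')   # decorator supplied by Pig's Jython engine
  def transform(value):
      return helper.normalize(value)   # hypothetical helper function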

+2

I ran into the same problem with Hadoop 1.2.1 and Pig 0.11.1 and found the workaround from PIG-2433, which is to add -Dmapred.child.env="JYTHONPATH=job.jar/Lib" to the Pig arguments. Example:

 pig -Dmapred.child.env="JYTHONPATH=job.jar/Lib" script.pig 
+1
