Can I use functions imported from .py files in Dask / Distributed?

I have a question about serialization and import.

  • Should functions carry their own imports, as I have seen with PySpark?
  • Is the following wrong? Should mod.py be a conda / pip package instead? mod.py was written to the shared file system.

    In [1]: from distributed import Executor

    In [2]: e = Executor('127.0.0.1:8786')

    In [3]: e
    Out[3]: <Executor: scheduler="127.0.0.1:8786" processes=2 cores=2>

    In [4]: import socket

    In [5]: e.run(socket.gethostname)
    Out[5]: {'172.20.12.7:53405': 'n1015', '172.20.12.8:53779': 'n1016'}

    In [6]: %%file mod.py
       ...: def hostname():
       ...:     return 'the hostname'
       ...:
    Overwriting mod.py

    In [7]: import mod

    In [8]: mod.hostname()
    Out[8]: 'the hostname'

    In [9]: e.run(mod.hostname)
    distributed.utils - ERROR - No module named 'mod'
2 answers

Short answer

Upload the mod.py file to all of your workers. You can do this with whatever mechanism you used to set up dask.distributed, or you can use the upload_file method:

 e.upload_file('mod.py') 
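A sketch of the full round trip, assuming the scheduler address, executor e, and mod.py from the question:

    from distributed import Executor

    e = Executor('127.0.0.1:8786')   # the scheduler from the question
    e.upload_file('mod.py')          # ship mod.py to every worker

    import mod
    e.run(mod.hostname)              # 'mod' should now resolve on the workers
    # roughly: {'172.20.12.7:53405': 'the hostname', '172.20.12.8:53779': 'the hostname'}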

Alternatively, if your function is defined in your IPython session and is not part of a module, it will be sent over without any problem.
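For example, a minimal sketch with the same executor e — a function defined directly in the session has no module for the workers to look up, so it is sent by value:

    def greeting():          # defined in IPython / __main__, not importable anywhere
        return 'hello from an interactive function'

    e.run(greeting)          # works without any upload_file step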

Long answer

This all has to do with how functions are serialized in Python. Functions defined in modules are serialized by their module name and function name:

    In [1]: from math import sin

    In [2]: import pickle

    In [3]: pickle.dumps(sin)
    Out[3]: b'\x80\x03cmath\nsin\nq\x00.'

So when the client machine wants to refer to the math.sin function, it sends this bytestring (which, you will notice, has 'math' and 'sin' hidden among the other bytes) to the worker machine. The worker looks at this bytestring and says: "OK, the function I want lives in such-and-such a module, let me go and find it in my local file system." If the module is not present, it raises an error, much like the one you got above.
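A small sketch of that failure mode, reusing the mod.py from the question:

    import pickle
    import mod                            # succeeds on the client, where mod.py exists

    payload = pickle.dumps(mod.hostname)  # only the names 'mod' and 'hostname' are recorded
    print(payload)
    # On a worker that has no mod.py on its path, pickle.loads(payload) raises
    # "No module named 'mod'" -- the error shown in the question.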

For dynamically created functions (functions that you define in IPython), it uses a completely different approach, bundling up all of the function's code. This approach generally works fine.
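A rough illustration, assuming cloudpickle is installed (distributed uses it for exactly this purpose):

    import cloudpickle

    def greet():                         # defined interactively, not importable from any module
        return 'hello from the client'

    payload = cloudpickle.dumps(greet)   # the code object itself is bundled, not just a name
    print(cloudpickle.loads(payload)())  # prints 'hello from the client'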

Generally speaking, Dask assumes that the workers and the client all have the same software environment. Typically this is handled by whoever set up your cluster, using some other tool like Docker. Methods like upload_file are there to fill in the gaps when you have files or scripts that get updated more frequently.
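One quick way to check that assumption, sketched with the same e.run pattern as in the question:

    import sys

    def python_version():    # report each worker's interpreter version
        return sys.version

    e.run(python_version)    # one entry per worker; they should all match the client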


To run an imported function on your cluster when it is not available in the workers' environment, you can also create a local function from the imported function. That local function will then be pickled by cloudpickle. In Python 2 you can achieve this with new.function (see the new module). For Python 3 the same should be possible with the types module, but I have not tried it.

Your example above would look like this:

    In [3]: import mod

    In [4]: import new

    In [5]: def remote(func):
       ...:     return new.function(func.func_code, func.func_globals, closure=func.func_closure)
       ...:

    In [6]: e.run(remote(mod.hostname))
    Out[6]: {'tcp://10.0.2.15:44208': 'the hostname'}
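For Python 3, an untested sketch of the same idea using types.FunctionType (the relevant attributes are now __code__, __globals__, __defaults__ and __closure__):

    import types
    import mod

    def remote(func):
        # rebuild the function locally so that it is no longer importable by name
        # and therefore gets pickled by value (via cloudpickle)
        return types.FunctionType(func.__code__, func.__globals__,
                                  func.__name__, func.__defaults__, func.__closure__)

    e.run(remote(mod.hostname))   # expected to behave like the Python 2 example above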
