How to ensure that a Python function generates its output based only on its input?

Question

To generate output, a function usually uses only the values of its arguments. However, there are also cases where a function, in order to generate its output, reads something from the file system or from a database or from the Internet. I would like to have a simple and reliable way to make sure something like this does not happen.

One way that I see is to create a blacklist of Python libraries that can be used to read from the file system, a database, or the Internet. But if that is the way to go, where can I get this (potentially huge) list? Moreover, I do not want to disable an entire library just because it can be used to read from the file system. For example, I want users to be able to use the pandas library (for storing and processing tabular data). I just don't want them to be able to use this library to read data from the file system.

Is there a solution to this problem?

+7

python database filesystems functional-programming httpwebrequest

Roman Oct 10 '14 at 15:35

3 answers

The restrictions you require can be broken even if you remove all modules and all functions. Code can access files if it can use the attributes of any simple object, for example the number zero.

 (0).__class__.__base__.__subclasses__()[40]('/etc/pas'+'swd')

Index 40 is specific to the build, though very typical for Python 2.7. The index of the <type 'file'> subclass can easily be found anyway:

 [x for x in (1).__class__.__base__.__subclasses__() if 'fi'+'le' in '%s' % x][0]('/etc/pas'+'swd')
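On Python 3 the `file` type no longer exists and `FileIO` is not a direct subclass of `object`, so the one-liner above does not work as written, but the same idea still applies. A minimal sketch of it follows; the helper name `walk_subclasses` is made up for illustration, and the code only locates the class without opening anything:

```python
def walk_subclasses(root):
    """Return every class reachable from root through __subclasses__()."""
    seen = {}
    stack = [root]
    while stack:
        cls = stack.pop()
        # type.__subclasses__(cls) also works when cls is a metaclass such as type.
        for sub in type.__subclasses__(cls):
            if id(sub) not in seen:
                seen[id(sub)] = sub
                stack.append(sub)
    return list(seen.values())


# Start from a plain integer, with no imports and no builtins such as open().
base = (0).__class__.__base__
file_like = [c for c in walk_subclasses(base) if c.__name__ == "FileIO"]
print(file_like)
```

The point is the same as in the Python 2 version: removing names from the namespace does not remove the classes from the object graph.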

Any combination of whitelist and blacklist is insecure and/or too restrictive. The PyPy sandbox is robust without such compromises:

... This subprocess can run arbitrary untrusted Python code, but all its I/O is serialized to a stdin/stdout pipe instead of being performed directly. The outer process reads the pipe and decides which commands are allowed (sandboxing), or even reinterprets them differently ...

A seccomp-based solution can also be quite safe (blog).

I want to be sure that in the future the function will generate the same thing as today.

It is easy to write a function whose results are hard to reproduce, and this cannot easily be prevented:

 class A(object):
     "This can be any very simple class"
     def __init__(self, x):
         self.x = x
     def __repr__(self):
         return repr(self.x)

 def strange_function():
     # You will probably get a different result every time.
     return list(set(A(i) for i in range(20)))

 >>> strange_function()
 [1, 18, 12, 5, 16, 15, 8, 2, 14, 0, 6, 19, 13, 11, 10, 9, 17, 3, 7, 4]
 >>> strange_function()
 [0, 9, 14, 3, 17, 5, 6, 11, 8, 1, 15, 7, 12, 13, 2, 10, 16, 4, 19, 18]

... Even if you remove everything that depends on time, a random number generator, hash-based ordering, etc., it is still easy to write a function that sometimes exceeds the available memory or a timeout limit and sometimes returns a result.

EDIT:
Roman, you recently wrote that you are sure you can trust the user. In that case there is a realistic solution: verify the input and output of the function by writing them to a file and checking them on a virtual machine running a remote IPython Notebook. (There is a nice short tutorial video; remote computing is supported out of the box; and the backend service can be restarted from the notebook's menu in the browser in one second without losing the input/output data in the notebook, an HTML document, because that document is built dynamically, step by step, by JavaScript that calls the remote backend.)

You do not need to care about internal calls, only about the overall inputs and outputs, until you find a difference. The virtual machine must be able to verify the results independently and reproduce them. Configure the firewall so that the machine accepts connections from you but cannot initiate outgoing connections. Configure the file system so that the current user cannot save any data, which means no data is present there apart from software components. Disable database services. Verify the input/output results in random order, or start two IPython Notebook services on different ports and choose a random backend for every command line in the notebook, or restart the backend process frequently before anything important. If you find a difference, debug your code and fix it.

Finally, you can automate this without the notebook, using only remote IPython computation, when you do not need interactivity.
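The comparison step can be sketched in a few lines. This is only an illustration under stated assumptions: `find_irreproducible`, `stable` and `unstable` are hypothetical names, and the check runs inside a single process, so it is weaker than the separate locked-down machine described above.

```python
def find_irreproducible(func, inputs, runs=3):
    """Call func several times per input and return the inputs whose outputs differ."""
    suspicious = []
    for args in inputs:
        first = func(*args)
        # Repeat the call and compare each result with the first one.
        if any(func(*args) != first for _ in range(runs - 1)):
            suspicious.append(args)
    return suspicious


def stable(x):
    # Depends only on its argument.
    return x * 2


_calls = []

def unstable(x):
    # Depends on hidden state: the number of earlier calls.
    _calls.append(x)
    return x + len(_calls)


print(find_irreproducible(stable, [(1,), (2,)]))    # []
print(find_irreproducible(unstable, [(1,), (2,)]))  # [(1,), (2,)]
```

A check like this can only show that a difference exists. Passing it does not prove that the function is pure.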

+4

hynekcer Oct 14 '14 at 3:34

What you want is called sandboxing, or restricted Python.

Both are mostly dead.

The closest thing to this functionality today is http://pypy.readthedocs.org/en/latest/sandbox.html; note that the latest build is actually 3 years old.

+4

Dima tisnek Oct 15 '14 at 6:39

PythonNut · Accepted Answer · 2014-10-14T02:51:34+0000

The answer is no. What you are looking for is a function that tests for functional purity. But, as the following code shows, there is no way to guarantee that no side effects are actually triggered.

 class Foo(object):
     def __init__(self, x):
         self.x = x
     def __add__(self, y):
         print("HAHAHA evil side effects here...")
         # proceed to read a file and do stuff
         return self

 # this looks pure...
 def f(x):
     return x + 1

 # but really...
 >>> f(Foo(1))
 HAHAHA evil side effects here...

Because objects can redefine their behavior so comprehensively (attribute access, calling, operator overloading, etc.), you can always pass an input that makes a pure function impure. Therefore, the only truly pure functions are those that literally do nothing with their arguments ... a class of functions that is usually not very useful.

Of course, if you can specify other restrictions, it will become easier.
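One such restriction might be to accept only arguments whose exact type is a plain built-in. The sketch below is an assumption about what that could look like, not part of the original answer: `SAFE_TYPES`, `_is_plain` and `plain_inputs_only` are hypothetical names. It only closes the operator-overloading hole shown above and does nothing to stop the function body itself from doing I/O.

```python
import functools

# Assumption: only these exact built-in types count as "plain" inputs.
SAFE_TYPES = (int, float, str, bool, type(None))


def _is_plain(value):
    # Compare exact types so that subclasses with overridden operators are rejected.
    if type(value) in SAFE_TYPES:
        return True
    if type(value) is tuple:
        return all(_is_plain(item) for item in value)
    return False


def plain_inputs_only(func):
    """Reject any positional argument that is not a plain built-in value."""
    @functools.wraps(func)
    def wrapper(*args):
        for arg in args:
            if not _is_plain(arg):
                raise TypeError("unsupported argument type: %s" % type(arg).__name__)
        return func(*args)
    return wrapper


@plain_inputs_only
def f(x):
    return x + 1


class Foo(object):
    def __init__(self, x):
        self.x = x

    def __add__(self, y):
        print("side effect")
        return self


print(f(1))  # 2
try:
    f(Foo(1))
except TypeError as exc:
    print(exc)  # unsupported argument type: Foo
```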
