file.read(), multiprocessing and the GIL

I read that some of the Python functions implemented in C, which I suppose includes file.read(), can release the GIL while they run and re-acquire it on completion, and in doing so make use of multiple cores if they are available.

I use multiprocessing to parallelize some code. Currently I have three processes: the parent, one child that reads data from a file, and one child that generates a checksum from the data passed to it by the first child.

Now, if I understand this right, it seems that creating a new process just to read the file, as I am doing now, is not necessary, and I could simply call read in the main process. The question is: do I understand this correctly, and will reading perform better in the main process or in a separate one?

So, this is my function that reads the file and pipes the blocks out to be processed:

    def read(file_path, pipe_out):
        # block_size is assumed to be defined elsewhere (e.g. as a module-level constant)
        with open(file_path, 'rb') as file_:
            while True:
                block = file_.read(block_size)
                if not block:
                    break
                pipe_out.send(block)
        pipe_out.close()

I reckon this setup definitely uses multiple cores, but also introduces some overhead:

    multiprocessing.Process(target=read, args=args).start()

But now I'm wondering whether the following would also use multiple cores, minus the overhead:

    read(*args)
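
For completeness, here is a minimal, self-contained sketch of the full three-process setup described above. The checksum logic (MD5), the block size, the file name and the end-of-data sentinel are my assumptions for illustration, not taken from the original code:

    import hashlib
    import multiprocessing

    block_size = 1024 * 1024  # assumed block size

    def read(file_path, pipe_out):
        with open(file_path, 'rb') as file_:
            while True:
                block = file_.read(block_size)
                if not block:
                    break
                pipe_out.send(block)
        pipe_out.send(None)  # sentinel added for clean shutdown (not in the original)
        pipe_out.close()

    def checksum(pipe_in):
        # Consume blocks from the pipe and fold them into a running hash.
        md5 = hashlib.md5()
        while True:
            block = pipe_in.recv()
            if block is None:  # sentinel from the reader
                break
            md5.update(block)
        print(md5.hexdigest())

    if __name__ == '__main__':
        pipe_in, pipe_out = multiprocessing.Pipe(duplex=False)
        reader = multiprocessing.Process(target=read, args=('data.bin', pipe_out))
        worker = multiprocessing.Process(target=checksum, args=(pipe_in,))
        reader.start()
        worker.start()
        reader.join()
        worker.join()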

Any insight into which one would be faster, and why, would be highly appreciated!

+4
2 answers

Well, as noted in the comments, the actual question is:

Does (C)Python create threads on its own, and if so, how can I make use of that?

Short answer: No

But, the reason these C functions are nonetheless interesting to Python programmers is the following. By default, no two pieces of Python code running in the same interpreter can execute in parallel. This is due to the evil called the Global Interpreter Lock, aka the GIL. The GIL is held whenever the interpreter is executing Python code, which implies the statement above: no two pieces of Python code can run in parallel in the same interpreter.
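
A quick way to see this constraint in action is to time a CPU-bound function run sequentially versus in two threads; under CPython the threaded version is not faster. This is just an illustrative sketch, and the iteration count and timings are machine-dependent:

    import threading
    import time

    def busy(n):
        # Pure-Python CPU-bound loop; the GIL is held the whole time.
        while n:
            n -= 1

    N = 10_000_000

    start = time.perf_counter()
    busy(N)
    busy(N)
    print('sequential: %.2fs' % (time.perf_counter() - start))

    start = time.perf_counter()
    threads = [threading.Thread(target=busy, args=(N,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print('two threads: %.2fs' % (time.perf_counter() - start))

Under CPython both timings come out roughly equal (the threaded run is often slightly slower due to lock contention), whereas a truly parallel runtime would cut the second one roughly in half.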

However, you can still use multithreading in Python profitably, namely when you do a lot of I/O or make heavy use of external libraries like numpy, scipy, lxml and so on, which all know about the problem and release the GIL whenever they can (i.e. whenever they do not need to interact with the Python interpreter).
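
For example, file reads and (for sufficiently large inputs) hashlib.update() both release the GIL, so plain threads can overlap I/O and hashing across several files. A hedged sketch; the file names are placeholders:

    import hashlib
    from concurrent.futures import ThreadPoolExecutor

    def md5_of(path, block_size=1024 * 1024):
        md5 = hashlib.md5()
        with open(path, 'rb') as f:
            while True:
                block = f.read(block_size)
                if not block:
                    break
                # f.read() and md5.update() release the GIL internally,
                # so several of these threads can make progress at once.
                md5.update(block)
        return path, md5.hexdigest()

    paths = ['a.bin', 'b.bin', 'c.bin']  # placeholder file names
    with ThreadPoolExecutor(max_workers=3) as pool:
        for path, digest in pool.map(md5_of, paths):
            print(path, digest)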

I hope this clears the issue up a bit.

+1

I think this is the main part of your question:

The question is: do I understand this correctly, and will reading perform better in the main process or in a separate one?

I assume your goal is to read and process the file as fast as possible. Either way, reading a file is I/O bound, not CPU bound: you cannot process the data faster than you are able to read it. So file I/O clearly limits the performance of your software, and you cannot increase the read rate by using parallel threads or processes for reading.

Neither does "slow" CPython change this: as long as you read the file in a single process or thread (even in the case of CPython with its GIL), you will get as much data per unit of time as the storage device can deliver. It is also perfectly fine to read the file in the main thread, as long as there are no other blocking calls there that would actually slow down the reading.
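
To make that concrete, here is a minimal sketch of the simplified setup, with reading kept in the main process and only the checksum in a child. The block size, file name, use of MD5 and the None sentinel are my assumptions for illustration:

    import hashlib
    import multiprocessing

    block_size = 1024 * 1024  # assumed block size

    def checksum(pipe_in):
        md5 = hashlib.md5()
        while True:
            block = pipe_in.recv()
            if block is None:  # sentinel: the reader is done
                break
            md5.update(block)
        print(md5.hexdigest())

    if __name__ == '__main__':
        pipe_in, pipe_out = multiprocessing.Pipe(duplex=False)
        worker = multiprocessing.Process(target=checksum, args=(pipe_in,))
        worker.start()
        # Reading happens right here in the main process; no reader child needed.
        with open('data.bin', 'rb') as f:
            while True:
                block = f.read(block_size)
                if not block:
                    break
                pipe_out.send(block)
        pipe_out.send(None)
        pipe_out.close()
        worker.join()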

+2
