I am completely new to multiprocessing. I have read the documentation on the multiprocessing module, and I read about pools, threads, queues, etc., but I am completely lost.
What I want to do with multiprocessing is convert my humble HTTP downloader to work with multiple workers. What I am doing at the moment is: download a page, parse it to find the interesting links, and continue until all interesting links have been downloaded. Now I want to implement this with multiprocessing, but I have no idea how to organize the workflow. I had two thoughts about this.

First, I was thinking of two queues: one for links that need to be downloaded, and another for pages that need to be parsed. One worker downloads pages and puts them on the queue of items to be parsed; another worker parses a page and puts the links it finds interesting on the download queue. The problems I expect with this approach: first, why download only one page at a time and parse only one page at a time? And second, how does a worker know that more items will be added to its queue later, once it has exhausted all the items currently in its queue?
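To make that first idea concrete, here is a minimal sketch of the two-queue design. fetch_page, find_links, and the starting URL are placeholders for my real download and parsing code:

    import multiprocessing

    def fetch_page(url):
        return ""    # placeholder for my real HTTP download code

    def find_links(html):
        return []    # placeholder for my real link-extraction code

    def downloader(to_download, to_parse):
        # worker 1: pull URLs from one queue, push fetched pages onto the other
        while True:
            url = to_download.get()
            to_parse.put(fetch_page(url))

    def parser(to_download, to_parse):
        # worker 2: pull pages from one queue, push interesting links back
        while True:
            html = to_parse.get()
            for link in find_links(html):
                to_download.put(link)

    if __name__ == '__main__':
        to_download = multiprocessing.Queue()
        to_parse = multiprocessing.Queue()
        to_download.put('http://example.com')   # hypothetical starting URL
        multiprocessing.Process(target=downloader,
                                args=(to_download, to_parse)).start()
        multiprocessing.Process(target=parser,
                                args=(to_download, to_parse)).start()

As written this never finishes, which is exactly my second worry: once the queues are momentarily empty, both workers just block on get() with no way to know the crawl is done.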
The other approach I was thinking about is this: there is a function that is called with a URL as its argument. It downloads the document and starts scanning it for links, and every time it encounters an interesting link it immediately spawns a new process that runs this same function. The problems I see with this approach are: how do I keep track of all the processes spawned everywhere, how do I find out whether there are still more processes left to run, and how do I limit the maximum number of processes?
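The closest I could get to sketching this second idea uses multiprocessing.Pool instead of raw recursive spawning, since the pool at least caps the number of worker processes for me. Again, fetch_page, find_links, and the starting URL are placeholder assumptions, not real code:

    import multiprocessing

    def fetch_page(url):
        return ""    # placeholder for my real HTTP download code

    def find_links(html):
        return []    # placeholder for my real link-extraction code

    def fetch_and_parse(url):
        # runs in a worker process: download one page, return its links
        return find_links(fetch_page(url))

    if __name__ == '__main__':
        pool = multiprocessing.Pool(4)     # caps me at 4 worker processes
        seen = set()                       # URLs already submitted
        pending = []                       # AsyncResults for in-flight pages

        def submit(url):
            if url not in seen:
                seen.add(url)
                pending.append(pool.apply_async(fetch_and_parse, (url,)))

        submit('http://example.com')       # hypothetical starting URL
        while pending:                     # unfinished results mean more work
            for link in pending.pop(0).get():   # wait for one page to finish
                submit(link)
        pool.close()
        pool.join()

I am not sure this is the right way to do it, though; keeping the pending list in the parent process feels like bookkeeping the library should be doing for me.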
So, I am completely lost. Can anyone suggest a good strategy and maybe show some code examples of how to go about it?