Python urllib3 and proxy

I am trying to figure out how to use proxies and multithreading.

This code works:

    requester = urllib3.PoolManager(maxsize=10, headers=self.headers)
    thread_pool = workerpool.WorkerPool()
    thread_pool.map(grab_wrapper, [item['link'] for item in products])
    thread_pool.shutdown()
    thread_pool.wait()

Then in grab_wrapper

    requested_page = requester.request('GET', url, assert_same_host=False, headers=self.headers)

Headers consist of: Accept, Accept-Charset, Accept-Encoding, Accept-Language and User-Agent
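For reference, a headers dictionary of that shape might look like this (the values below are placeholders, not the ones used in my actual code):

    # Hypothetical header values only; the real values are not shown here
    headers = {
        'Accept': 'text/html,application/xhtml+xml',
        'Accept-Charset': 'utf-8',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'en-US,en;q=0.8',
        'User-Agent': 'Mozilla/5.0 (compatible; example-crawler/1.0)',
    }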

But this does not work in production, because there all requests have to go through a proxy server (no authorization required).

I tried different things (passing the proxy to the request, passing it in the headers, etc.). The only thing that works is this:

    requester = urllib3.proxy_from_url(self._PROXY_URL, maxsize=7, headers=self.headers)
    thread_pool = workerpool.WorkerPool(size=10)
    thread_pool.map(grab_wrapper, [item['link'] for item in products])
    thread_pool.shutdown()
    thread_pool.wait()

Now when I launch the program, it makes 10 requests (10 threads) and then ... stops. No errors, no warnings. This is the only way I can get through the proxy server, but it seems that proxy_from_url and WorkerPool cannot be used together.

Any ideas on how to combine these two into working code? Due to time constraints, I would prefer not to rewrite it with a different library.

2 answers

It seems you are discarding the result of the thread_pool.map() call. Try assigning it to a variable:

    requester = urllib3.proxy_from_url(PROXY, maxsize=7)
    thread_pool = workerpool.WorkerPool(size=10)

    def grab_wrapper(url):
        return requester.request('GET', url)

    results = thread_pool.map(grab_wrapper, LINKS)
    thread_pool.shutdown()
    thread_pool.wait()

Note: if you are using Python 3.2 or later, you can use concurrent.futures.ThreadPoolExecutor. It is similar to workerpool but is included in the standard library.
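For example, a minimal sketch of the same pattern with ThreadPoolExecutor, using a placeholder proxy URL and link list in place of your real values:

    import urllib3
    from concurrent.futures import ThreadPoolExecutor

    PROXY_URL = 'http://proxy.example.com:8080'                # placeholder proxy
    LINKS = ['http://example.com/a', 'http://example.com/b']   # placeholder links

    requester = urllib3.proxy_from_url(PROXY_URL, maxsize=7)

    def grab_wrapper(url):
        # Return the response so the caller can inspect status and body
        return requester.request('GET', url)

    # The executor's map() returns an iterator, so materialize it to collect the results
    with ThreadPoolExecutor(max_workers=10) as executor:
        results = list(executor.map(grab_wrapper, LINKS))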


First of all, I would suggest avoiding urllib like the plague and instead using requests, which makes proxy support really simple: http://docs.python-requests.org/en/latest/user/advanced/#proxies
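For illustration, a minimal sketch of proxy usage with requests, assuming a placeholder proxy address and no authentication (as in your setup):

    import requests

    # Placeholder proxy address; no authorization assumed
    proxies = {
        'http': 'http://proxy.example.com:8080',
        'https': 'http://proxy.example.com:8080',
    }

    response = requests.get('http://example.com/some-page', proxies=proxies)
    print(response.status_code)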
In addition, I use it not with multithreading but with multiprocessing, and it works very well. The only thing you need to decide is whether you want a dynamic queue or a fixed list that you can distribute over the workers. Here is an example of the latter, which evenly distributes a list of URLs over x processes:

    import math
    import multiprocessing

    # url_loop(start_row, end_row, worker_name, job_id) is defined elsewhere

    # *** prepare multiprocessing
    nr_processes = 4
    chunksize = int(math.ceil(total_nr_urls / float(nr_processes)))
    procs = []

    # *** start up processes
    for i in range(nr_processes):
        start_row = chunksize * i
        end_row = min(chunksize * (i + 1), total_nr_urls)
        p = multiprocessing.Process(
            target=url_loop,
            args=(start_row, end_row, str(i), job_id_input))
        procs.append(p)
        p.start()

    # *** wait for all worker processes to finish
    for p in procs:
        p.join()

Each url_loop process writes its own data sets to tables in the database, so I don't have to worry about merging them back together in Python.

Edit: for sharing data between processes, see: http://docs.python.org/2/library/multiprocessing.html?highlight=multiprocessing#multiprocessing

    from multiprocessing import Process, Value, Array

    def f(n, a):
        n.value = 3.1415927
        for i in range(len(a)):
            a[i] = -a[i]

    if __name__ == '__main__':
        num = Value('d', 0.0)
        arr = Array('i', range(10))

        p = Process(target=f, args=(num, arr))
        p.start()
        p.join()

        print num.value
        print arr[:]

As you can see, these special types (Value and Array) allow you to share data between processes. If instead you are looking for a queue to distribute the work round-robin, you can use JoinableQueue, as in the sketch below. Hope this helps!
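A minimal sketch of that JoinableQueue approach, with placeholder URLs and a stubbed-out fetch step:

    import multiprocessing

    def url_worker(queue):
        # Each worker pulls URLs until it receives the None sentinel
        while True:
            url = queue.get()
            if url is None:
                queue.task_done()
                break
            # ... fetch and process the URL here ...
            queue.task_done()

    if __name__ == '__main__':
        queue = multiprocessing.JoinableQueue()
        urls = ['http://example.com/1', 'http://example.com/2']  # placeholder URLs
        nr_processes = 4

        for url in urls:
            queue.put(url)
        for _ in range(nr_processes):
            queue.put(None)  # one sentinel per worker

        workers = [multiprocessing.Process(target=url_worker, args=(queue,))
                   for _ in range(nr_processes)]
        for w in workers:
            w.start()

        queue.join()   # blocks until every queued item has been marked done
        for w in workers:
            w.join()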
