Scrapy gets NoneType Error while using Privoxy Proxy for Tor

Question

I am using Ubuntu 14.04 LTS.

I tried Polipo, but it kept refusing Firefox connections, even after I added myself as allowedClient and spent hours researching without finding a solution. So I installed Privoxy instead and verified that it works with Firefox by visiting the Tor site, which reported that the browser is configured to use Tor. This confirmed that I would then be able to scrape Tor sites.

However, when I use Scrapy, I get a NoneType error...?

    2016-07-14 02:43:34 [scrapy] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'myProject.middlewares.RandomUserAgentMiddleware',
     'myProject.middlewares.ProxyMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2016-07-14 02:43:34 [scrapy] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2016-07-14 02:43:34 [scrapy] INFO: Enabled item pipelines:
    ['myProject.pipelines.MysqlPipeline']
    2016-07-14 02:43:34 [scrapy] INFO: Spider opened
    2016-07-14 02:43:34 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-07-14 02:43:34 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
    2016-07-14 02:43:34 [Tor] DEBUG: User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10 <GET http://thehiddenwiki.org>
    2016-07-14 02:43:34 [scrapy] ERROR: Error downloading <GET http://thehiddenwiki.org>
    Traceback (most recent call last):
      File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1126, in _inlineCallbacks
        result = result.throwExceptionIntoGenerator(g)
      File "/usr/local/lib/python2.7/dist-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
        return g.throw(self.type, self.value, self.tb)
      File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
        defer.returnValue((yield download_func(request=request,spider=spider)))
      File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 45, in mustbe_deferred
        result = f(*args, **kw)
      File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/handlers/__init__.py", line 65, in download_request
        return handler.download_request(request, spider)
      File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/handlers/http11.py", line 60, in download_request
        return agent.download_request(request)
      File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/handlers/http11.py", line 259, in download_request
        agent = self._get_agent(request, timeout)
      File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/handlers/http11.py", line 239, in _get_agent
        _, _, proxyHost, proxyPort, proxyParams = _parse(proxy)
      File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/webclient.py", line 37, in _parse
        return _parsed_url_args(parsed)
      File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/webclient.py", line 20, in _parsed_url_args
        host = b(parsed.hostname)
      File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/webclient.py", line 17, in <lambda>
        b = lambda s: to_bytes(s, encoding='ascii')
      File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/python.py", line 117, in to_bytes
        'object, got %s' % type(text).__name__)
    TypeError: to_bytes must receive a unicode, str or bytes object, got NoneType

I searched for this "to_bytes" error, but the results only lead me to the Scrapy source code.

I know that this code works without a proxy, because it scrapes my localhost site and other websites. It cannot scrape Tor sites, though, because those can only be reached through a proxy.

What's happening?

Middlewares.py

    class RandomUserAgentMiddleware(object):
        def process_request(self, request, spider):
            ua = random.choice(settings.get('USER_AGENT_LIST'))
            if ua:
                request.headers.setdefault('User-Agent', ua)

            #this is just to check which user agent is being used for request
            spider.log(
                u'User-Agent: {} {}'.format(request.headers.get('User-Agent'), request),
                level=log.DEBUG
            )


    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            request.meta['proxy'] = settings.get('HTTP_PROXY')

Settings.py

    USER_AGENT_LIST = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7',
        'Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0) Gecko/16.0 Firefox/16.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10'
    ]

    DOWNLOADER_MIDDLEWARES = {
        'myProject.middlewares.RandomUserAgentMiddleware': 400,
        'myProject.middlewares.ProxyMiddleware': 410,
        #'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
        'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None
        # Disable compression middleware, so the actual HTML pages are cached
    }

    HTTP_PROXY = 'localhost:8118'

python proxy scrapy polipo

Arrow Jul 14 '16 at 15:47

1 answer

paul trmbrth · Accepted Answer · 2016-07-15T14:48:01+0000

Internally, Scrapy uses urllib(2)'s _parse_proxy to work out proxy settings. From the urllib docs:

The urlopen() function works transparently with proxies which do not require authentication. In a Unix or Windows environment, set the http_proxy or ftp_proxy environment variables to a URL that identifies the proxy server before starting the Python interpreter.

    % http_proxy="http://www.someproxy.com:3128"
    % export http_proxy
    % python ...

When you set the proxy key in request.meta, Scrapy expects the same syntax: the value must include a scheme, for example 'http://localhost:8118'.
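To see why the missing scheme ends in a NoneType error, here is a minimal sketch using only the standard library. It is written for Python 3 (`urllib.parse`; on the Python 2.7 shown in the traceback the module is named `urlparse`), and `proxy_host_and_port` is a hypothetical helper, not part of Scrapy. It assumes, as the `host = b(parsed.hostname)` frame in the traceback suggests, that Scrapy reads the `hostname` attribute of the parsed proxy URL.

```python
from urllib.parse import urlparse


def proxy_host_and_port(proxy):
    """Return the (hostname, port) pair that urlparse extracts from a proxy string."""
    parsed = urlparse(proxy)
    return parsed.hostname, parsed.port


# Without a scheme, urlparse finds no network location, so hostname is None.
bare_host, _ = proxy_host_and_port('localhost:8118')

# With a scheme, the host and port are parsed as expected.
full_host, full_port = proxy_host_and_port('http://localhost:8118')

print(bare_host)             # None
print(full_host, full_port)  # localhost 8118
```

Passing that `None` hostname on to `to_bytes` would produce the TypeError shown in the question.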

The Scrapy docs mention this, albeit in a somewhat buried place:

You can also set the proxy meta key per request, to a value like http://some_proxy_server:port.
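Applied to the question's project, the fix might look like the sketch below. The `HTTP_PROXY` setting name and the `ProxyMiddleware` class come from the question; reading the setting through Scrapy's `from_crawler` hook instead of a global `settings` import is one possible design choice here, and the `proxy_url` attribute name is only illustrative.

```python
# settings.py would carry the proxy as a full URL, scheme included:
# HTTP_PROXY = 'http://localhost:8118'


class ProxyMiddleware(object):
    """Route every request through the proxy named in the HTTP_PROXY setting."""

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls from_crawler() when it builds the middleware, which
        # gives access to the project settings without a global import.
        return cls(crawler.settings.get('HTTP_PROXY'))

    def process_request(self, request, spider):
        # The value must be a complete URL such as 'http://localhost:8118'.
        request.meta['proxy'] = self.proxy_url
```

The `DOWNLOADER_MIDDLEWARES` entry from the question can stay as it is; only the value of `HTTP_PROXY` needs the `http://` prefix.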
