How to prevent Scrapy from URL-encoding request URLs

I would like Scrapy not to URL-encode my requests. I see that scrapy.http.Request imports scrapy.utils.url, which imports w3lib.url, which contains the _ALWAYS_SAFE_BYTES variable. I just need to add characters to _ALWAYS_SAFE_BYTES, but I'm not sure how to do this from my spider class.

The corresponding line in scrapy.http.Request:

fp.update(canonicalize_url(request.url)) 

canonicalize_url comes from scrapy.utils.url; the corresponding line in scrapy.utils.url:

 path = safe_url_string(_unquotepath(path)) or '/' 

safe_url_string() comes from w3lib.url; the relevant line in w3lib.url:

 _ALWAYS_SAFE_BYTES = (b'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_.-') 

Inside w3lib.url.safe_url_string():

 _safe_chars = _ALWAYS_SAFE_BYTES + b'%' + _reserved + _unreserved_marks
 return moves.urllib.parse.quote(s, _safe_chars)
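The effect of the safe-character set can be reproduced with plain `urllib.parse.quote`, which `safe_url_string` ultimately delegates to. This is just a sketch of the mechanism; the example path is made up:

```python
from urllib.parse import quote

# quote() leaves any character listed in `safe` untouched; this mirrors
# how w3lib builds _safe_chars from _ALWAYS_SAFE_BYTES plus extras.
url_path = "/items[1]/detail"

encoded = quote(url_path, safe="/")      # brackets get percent-encoded
preserved = quote(url_path, safe="/[]")  # brackets are kept as-is

print(encoded)    # /items%5B1%5D/detail
print(preserved)  # /items[1]/detail
```

So characters absent from the safe set come back as %XX escapes, which is exactly what happens to [ and ] by default.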
1 answer

I wanted to keep [ and ] from being encoded, and this worked for me.

When Scrapy creates a Request object, it applies its URL-encoding methods. To undo them, you can use your own middleware and rewrite the URL to suit your needs.

You can use Downloader Middleware as follows:

 class MyCustomDownloaderMiddleware(object):
     def process_request(self, request, spider):
         request._url = request.url.replace("%5B", "[", 2)
         request._url = request.url.replace("%5D", "]", 2)
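The rewrite itself is plain string manipulation, so it can be checked outside Scrapy. A minimal sketch, using a hypothetical URL:

```python
# Same replacement the middleware applies: turn the percent-encoded
# brackets back into literal [ and ] (at most 2 occurrences of each).
url = "http://example.com/data?ids=%5B1,2%5D"
fixed = url.replace("%5B", "[", 2).replace("%5D", "]", 2)
print(fixed)  # http://example.com/data?ids=[1,2]
```

Note that assigning to request._url bypasses Scrapy's public API; it works here because request.url simply returns that attribute, but it relies on a private field.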

Remember to "activate" the middleware in settings.py as follows:

 DOWNLOADER_MIDDLEWARES = {
     'so.middlewares.MyCustomDownloaderMiddleware': 900,
 }

My project is named so, and the middlewares.py file lives in that folder. Adjust the module path for your own project.

Credit: Frank Martin
