I would like Scrapy not to percent-encode the URLs of my requests. I see that scrapy.http.Request imports scrapy.utils.url, which imports w3lib.url, which defines the _ALWAYS_SAFE_BYTES variable. I just need to add a set of characters to _ALWAYS_SAFE_BYTES, but I'm not sure how to do that from my spider class.
The corresponding line in scrapy.http.Request:
fp.update(canonicalize_url(request.url))
canonicalize_url comes from scrapy.utils.url; the corresponding line there:
path = safe_url_string(_unquotepath(path)) or '/'
safe_url_string() comes from w3lib.url; the corresponding lines there:
_ALWAYS_SAFE_BYTES = (b'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_.-')
and inside w3lib.url.safe_url_string():
_safe_chars = _ALWAYS_SAFE_BYTES + b'%' + _reserved + _unreserved_marks
return moves.urllib.parse.quote(s, _safe_chars)
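To illustrate what I mean, here is a minimal sketch using only the standard library's urllib.parse.quote, not w3lib itself. The names quote_url and extra_safe are hypothetical, and the values I give for RESERVED and UNRESERVED_MARKS are my assumption about what w3lib's _reserved and _unreserved_marks contain. It only shows how widening the safe set stops characters from being percent-encoded:

```python
from urllib.parse import quote

# Same characters as w3lib's _ALWAYS_SAFE_BYTES (quoted above), as a str.
ALWAYS_SAFE = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_.-"
# Assumed stand-ins for w3lib's _reserved and _unreserved_marks.
RESERVED = ";/?:@&=+$|,#"
UNRESERVED_MARKS = "-_.!~*'()"


def quote_url(url, extra_safe=""):
    """Percent-encode url, leaving the safe set plus extra_safe untouched."""
    safe = ALWAYS_SAFE + "%" + RESERVED + UNRESERVED_MARKS + extra_safe
    return quote(url, safe)


print(quote_url("/items/a[1]"))                   # /items/a%5B1%5D
print(quote_url("/items/a[1]", extra_safe="[]"))  # /items/a[1]
```

The second call is the behaviour I want from Scrapy: the brackets stay as they are because they were added to the safe set.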
python url url-encoding web-crawler scrapy
flyingtriangle