Urllib.request: any way to read from it without changing the request object?

Question

For the standard urllib.request object obtained as follows:

 req = urllib.urlopen('http://example.com')

If I read its contents with req.read(), the request object is empty afterwards.

Unlike regular file objects, the request object does not have a seek method, I am sure for good reasons.

However, in my case I have a function that needs to make certain determinations about the request and then return it "unharmed" so that it can be read again.

I understand that one option is to request the URL again, but I would like to avoid multiple HTTP requests for the same URL and content.

The only alternative I can think of is that the function returns a tuple of the extracted content and the request object, with the understanding that everything that calls this function will need to get the content in this way.
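A minimal sketch of that tuple-returning option, for illustration only. The name `fetch_and_inspect` is hypothetical, and an in-memory `io.BytesIO` stands in for the object returned by `urlopen` so the example runs without network access.

```python
import io


def fetch_and_inspect(response):
    """Read the body once and return it together with the response object."""
    content = response.read()  # consumes the stream
    # ... make whatever determinations are needed from `content` here ...
    return content, response


# Stand-in for a real response object (assumption: any file-like object works).
fake_response = io.BytesIO(b"<html>hello</html>")
content, response = fetch_and_inspect(fake_response)

print(content)          # b'<html>hello</html>'
print(response.read())  # b'' (already consumed, so callers must use `content`)
```

The drawback is the one described above: every caller has to know that the body lives in the first element of the tuple, not in the response object.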

Is this my only option?

python urllib

Jordan reiter Apr 17 '13 at 18:36

2 answers

Create a subclass of urllib2.Request that uses a cStringIO.StringIO to store whatever gets read. You can then implement seek and so on. You could also just use a plain string, but that would be more work.

kindall Apr 17 '13 at 18:45

Bakuriu · Accepted Answer · 2013-04-17T18:47:37+0000

Delegate the caching to an in-memory buffer (code not tested, just to give an idea). io.BytesIO is used here rather than io.StringIO, because read() returns a byte string:

 import urllib
 from io import BytesIO

 class CachedRequest(object):
     def __init__(self, url):
         # Python 2 spelling, as in the question; on Python 3 this is
         # urllib.request.urlopen.
         self._request = urllib.urlopen(url)
         self._content = None

     def __getattr__(self, attr):
         # If attr is not defined in CachedRequest, then get it from
         # the request object.
         return getattr(self._request, attr)

     def read(self):
         # The first call downloads the body and caches it; later calls
         # read from the cache, so seek(0) makes the body readable again.
         if self._content is None:
             self._content = BytesIO(self._request.read())
         return self._content.read()

     def seek(self, i):
         # Valid only after the first read() has filled the cache.
         self._content.seek(i)

If the code really expects a real Request object (i.e., it calls isinstance to check the type), then subclass Request instead, and you won't even need to implement __getattr__.

Note that the function might check for the exact class (in which case there is nothing you can do) or, if it is written in C, call the method through the C API (in which case the overridden method will not be called).
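The caching idea in this answer can also be sketched in a self-contained form that runs without network access. This is an illustration under stated assumptions, not the answer's tested code: the class name `ReplayableResponse` is hypothetical, it wraps an already-opened file-like object instead of a URL, and an `io.BytesIO` stands in for the real response. On Python 3 the wrapped object would typically come from `urllib.request.urlopen(url)`.

```python
import io


class ReplayableResponse(object):
    """Wrap a file-like response and cache its body so it can be re-read."""

    def __init__(self, response):
        self._response = response
        self._buffer = None  # filled lazily on first read() or seek()

    def __getattr__(self, attr):
        # Only invoked for names not found on the wrapper itself, so
        # everything else falls through to the wrapped response.
        return getattr(self._response, attr)

    def _ensure_buffer(self):
        # Drain the underlying stream exactly once.
        if self._buffer is None:
            self._buffer = io.BytesIO(self._response.read())
        return self._buffer

    def read(self, size=-1):
        return self._ensure_buffer().read(size)

    def seek(self, offset, whence=io.SEEK_SET):
        return self._ensure_buffer().seek(offset, whence)


# Stand-in for a real HTTP response (assumption for the example).
fake = io.BytesIO(b"hello world")
resp = ReplayableResponse(fake)

first = resp.read()
resp.seek(0)         # rewind the cached copy, no second request needed
second = resp.read()
```

Here `first` and `second` hold the same bytes, while the underlying stream is read only once. As with the code above, this relies on duck typing, so it does not help with callers that check the exact class.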
