Python requests does not give me the same HTML as my browser

I am fetching a Wikia page using Python requests. The problem is that requests does not give me the same HTML as my browser does for the same page.

For comparison, here is the page Firefox gets me, and here is the page requests fetches (download them to view; sorry, there is no easy way to just visually host some HTML from another site).

You will notice a few differences (super unfriendly diff). There are some small things, such as attributes being ordered differently, but there are also some very, very big ones. Most important is the lack of the last six <img>s, as well as the entire navigation and footer sections. Even in the raw HTML, it looks like the page was cut off abruptly.
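
One way to make this kind of truncation easier to see than in a raw diff is to count the tags in each saved copy. The following is only a minimal sketch using the standard library's `html.parser`; the names `TagCounter`, `count_tags` and `missing_tags` are made up for illustration, and the two short strings stand in for the saved browser and requests pages.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how many times each start tag appears in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def count_tags(html_text):
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts


def missing_tags(browser_html, fetched_html):
    """Return {tag: how many fewer times it appears in fetched_html}."""
    browser = count_tags(browser_html)
    fetched = count_tags(fetched_html)
    return {tag: n - fetched[tag] for tag, n in browser.items() if n > fetched[tag]}


# Tiny stand-ins for the two saved pages (hypothetical content).
browser_html = "<div><img src='a.png'><img src='b.png'><footer>x</footer></div>"
fetched_html = "<div><img src='a.png'>"
print(missing_tags(browser_html, fetched_html))  # {'img': 1, 'footer': 1}
```

Run against the two real files, a result that lists `img` together with the navigation and footer elements would confirm that the response is cut short rather than merely reordered.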

Why is this happening, and is there a way to fix this? I already thought about a bunch of things, none of which were fruitful:

  • Request headers interfering? No, I tried copying the headers sent by my browser, User-Agent and all, 1:1 into the request, but nothing changed.
  • JavaScript loading content after the HTML has loaded? Nope. Even with JS disabled, Firefox gives me the "good" page.
  • Well... uh... what else could it be?

It would be great if you knew how this could happen and how to fix it. Thanks!

+5
3 answers

I had a similar problem:

  • Identical headers with Python and through the browser
  • JavaScript is definitely ruled out as a reason

To solve the problem, I ended up replacing the requests library with urllib.request.

Basically, I replaced:

    import requests

    # URL is the address of the page to fetch
    session = requests.Session()
    r = session.get(URL)

with:

    import urllib.request

    # URL is the address of the page to fetch
    r = urllib.request.urlopen(URL)

and then it worked.

Maybe one of these libraries is doing something strange behind the scenes? Not sure if this is an option for you or not.

+4

Many of the differences I see suggest that the content is still there; it is just rendered in a different order, sometimes with different spacing.

You can get different content based on several different things:

  • Your headers
  • Your user agent
  • Time!
  • The order in which the web application decides to render elements on the page. Attribute order in particular can be arbitrary, because the elements may be pulled from an unsorted data source.

If you could include all of your headers at the top of the diff, we could make more sense of this.
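
To collect those headers, it can help to see what a client library really puts on the wire, since libraries add defaults of their own. The sketch below is one possible approach, not something from the question: it starts a throwaway local server that echoes back the request headers it receives, then queries it with `urllib.request`. The names `EchoHeadersHandler` and `headers_sent_by_urllib` are hypothetical.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHeadersHandler(BaseHTTPRequestHandler):
    """Replies to any GET with the request headers it received, as JSON."""

    def do_GET(self):
        received = {name.lower(): value for name, value in self.headers.items()}
        body = json.dumps(received).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet


def headers_sent_by_urllib(extra_headers=None):
    """Return the headers urllib.request actually sends, keyed in lowercase."""
    server = HTTPServer(("127.0.0.1", 0), EchoHeadersHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = "http://127.0.0.1:%d/" % server.server_port
        request = urllib.request.Request(url, headers=extra_headers or {})
        # Empty ProxyHandler: ignore any proxy configured in the environment.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(request, timeout=5) as response:
            return json.loads(response.read().decode("utf-8"))
    finally:
        server.shutdown()
        server.server_close()


print(headers_sent_by_urllib())
print(headers_sent_by_urllib({"User-Agent": "my-test-agent"}))
```

Pointing a requests session or the browser at the same kind of local echo server would give header lists that can be compared side by side.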

I suspect that the application chose not to render certain images because they are not optimized for what it thinks is some kind of robot or mobile device (Python requests).

On closer inspection of the diff, it seems that everything was loaded in both requests, just with different formatting.
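
A claim like this can be checked mechanically by comparing the two documents after removing the differences that do not matter. The following is a rough sketch under that assumption: it sorts each tag's attributes and collapses whitespace before comparing. `Canonicalizer`, `canonical_tokens` and `same_content` are illustrative names, not part of any library.

```python
from html.parser import HTMLParser


class Canonicalizer(HTMLParser):
    """Turns HTML into tokens that ignore attribute order and whitespace."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Sort attributes so that their order in the source does not matter.
        ordered = sorted((name, value or "") for name, value in attrs)
        self.tokens.append(("start", tag, tuple(ordered)))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        # Collapse runs of whitespace and drop whitespace-only text.
        text = " ".join(data.split())
        if text:
            self.tokens.append(("text", text))


def canonical_tokens(html_text):
    parser = Canonicalizer()
    parser.feed(html_text)
    parser.close()
    return parser.tokens


def same_content(html_a, html_b):
    """True if both documents match once formatting differences are ignored."""
    return canonical_tokens(html_a) == canonical_tokens(html_b)


print(same_content('<a href="x" class="y">hi</a>', "<a class='y' href='x'>\n  hi\n</a>"))
print(same_content("<p><img src='a.png'></p>", "<p></p>"))
```

If `same_content` returns False for the two saved pages, the difference goes beyond formatting, and the token lists show where the documents diverge.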

0

I suspect you are not sending the correct headers (or are sending them incorrectly) with your request. That is why you get different content. The following is an example HTTP request with a header:

    import urllib2  # Python 2; on Python 3 use urllib.request instead

    url = 'https://www.google.co.il/search?q=eminem+twitter'
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36'

    # header variable
    headers = {'User-Agent': user_agent}

    # creating request
    req = urllib2.Request(url, None, headers)

    # getting html
    html = urllib2.urlopen(req).read()

If you are sure you are sending the correct headers but still get different HTML, you can try using selenium. It lets you work directly with a browser (or with phantomjs if your machine has no graphical interface). With selenium, you can just grab the HTML directly from the browser.

0
