`open_http': 403 Forbidden (OpenURI::HTTPError) for the string "Steve_Jobs", but not for any other string

I went through the Ruby tutorials presented at http://ruby.bastardsbook.com/ and I came across the following code:

    require "open-uri"

    remote_base_url = "http://en.wikipedia.org/wiki"
    r1 = "Steve_Wozniak"
    r2 = "Steve_Jobs"
    f1 = "my_copy_of-" + r1 + ".html"
    f2 = "my_copy_of-" + r2 + ".html"

    # read the first url
    remote_full_url = remote_base_url + "/" + r1
    rpage = open(remote_full_url).read

    # write the first file to disk
    file = open(f1, "w")
    file.write(rpage)
    file.close

    # read the second url
    remote_full_url = remote_base_url + "/" + r2
    rpage = open(remote_full_url).read

    # write the second file to disk
    file = open(f2, "w")
    file.write(rpage)
    file.close

    # open a new file:
    compiled_file = open("apple-guys.html", "w")

    # reopen the first and second files again
    k1 = open(f1, "r")
    k2 = open(f2, "r")
    compiled_file.write(k1.read)
    compiled_file.write(k2.read)
    k1.close
    k2.close
    compiled_file.close

The code fails with the following trace:

    /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:277:in `open_http': 403 Forbidden (OpenURI::HTTPError)
        from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:616:in `buffer_open'
        from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:164:in `open_loop'
        from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:162:in `catch'
        from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:162:in `open_loop'
        from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:132:in `open_uri'
        from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:518:in `open'
        from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/open-uri.rb:30:in `open'
        from /Users/arkidmitra/tweetfetch/samecode.rb:11

My problem is not that the code crashes, but that whenever I change r2 to anything other than Steve_Jobs, it works. What's going on here?

2 answers

I think this happens for locked (protected) entries such as Steve Jobs, Al Gore, etc. It is mentioned in the same book you are referring to:

For some pages, such as Wikipedia's locked Al Gore entry, Wikipedia will not respond to a web request if a User-Agent isn't specified. The "User-Agent" typically refers to your browser, and you can see this by inspecting the headers sent for any page request in your browser. By providing a "User-Agent" key-value pair (I basically use "Ruby" and it seems to work), we can pass it as a hash (I use the HEADERS_HASH constant in the example) as the second argument to the method call.

This is covered later in the book, at http://ruby.bastardsbook.com/chapters/web-crawling/
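
A minimal sketch of what the quoted passage describes, assuming the HEADERS_HASH constant holds a single "User-Agent" entry set to "Ruby" as the book suggests. The helper names wiki_url and fetch_page are made up for illustration and are not from the book. URI.open is the spelling for Ruby 2.5 and later; on the 1.8/1.9 interpreters from the question, plain open(url, HEADERS_HASH) takes the same second argument.

```ruby
require "open-uri"

# Assumed contents: the book only says it uses "Ruby" as the User-Agent value.
HEADERS_HASH = { "User-Agent" => "Ruby" }
REMOTE_BASE_URL = "http://en.wikipedia.org/wiki"

# Hypothetical helper: builds the article URL the same way the question's code does.
def wiki_url(title)
  REMOTE_BASE_URL + "/" + title
end

# Hypothetical helper: fetches a page, passing the headers hash as the
# second argument so the request carries a User-Agent header.
def fetch_page(title, headers = HEADERS_HASH)
  URI.open(wiki_url(title), headers).read
end

# Usage (performs a real HTTP request):
#   rpage = fetch_page("Steve_Jobs")
```

Whether a value as generic as "Ruby" is still accepted is up to the server, so treat it as a placeholder.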


Your code works fine for me (Ruby MRI 1.9.3) when I request a wiki page that exists.

When I request a wiki page that does NOT exist, I get a MediaWiki 404 error.

  • Steve_Jobs => success
  • Steve_Austin => success
  • Steve_Rogers => success
  • Steve_Foo => error

Wikipedia does a ton of caching, so if you are seeing responses for "Steve_Jobs" that differ from those for other people who do exist, my best guess is that Wikipedia caches the Steve Jobs article because he is famous, and potentially adds extra checks/verifications to protect the article from rapid changes, defacement, etc.

The solution for you: always open the URL with a User-Agent string.

 rpage = open(remote_full_url, "User-Agent" => "Whatever you want here").read 

Information from the MediaWiki docs: "When you make HTTP requests to the MediaWiki web service API, be sure to specify a User-Agent header that properly identifies your client. Don't use the default User-Agent provided by your client library, but make up a custom header that includes the name and the version number of your client: something like "MyCuteBot/0.1".

On Wikimedia wikis, if you don't supply a User-Agent header, or you supply an empty or generic one, your request will fail with an HTTP 403 error. See our User-Agent policy."
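
Putting the two pieces together, here is a rough sketch that sends a descriptive User-Agent and reports an HTTP error instead of crashing. The method name fetch_with_user_agent is hypothetical, and "MyCuteBot/0.1" is just the placeholder from the quote above; substitute your own client name and version. It uses URI.open (Ruby 2.5+); on older Rubies, plain open behaves the same way here.

```ruby
require "open-uri"

# Placeholder taken from the MediaWiki quote; use your own client name/version.
USER_AGENT = "MyCuteBot/0.1"

# Hypothetical helper: returns the page body, or nil if the server
# answers with an HTTP error status such as 403 or 404.
def fetch_with_user_agent(url, user_agent = USER_AGENT)
  URI.open(url, "User-Agent" => user_agent).read
rescue OpenURI::HTTPError => e
  # e.message carries the status line, e.g. "403 Forbidden"
  warn "Request for #{url} failed: #{e.message}"
  nil
end

# Usage (performs a real HTTP request):
#   rpage = fetch_with_user_agent("http://en.wikipedia.org/wiki/Steve_Jobs")
```

Returning nil on failure is only one option; re-raising after logging would keep the original behaviour of stopping the script.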
