Open-uri returns ASCII-8BIT from a web page encoded in ISO-8859-1

I use open-uri to read a web page which is allegedly encoded in ISO-8859-1. When I read the contents of the page, open-uri returns a string encoded in ASCII-8BIT.

open("http://www.nigella.com/recipes/view/DEVILS-FOOD-CAKE-5310") {|f| p f.content_type, f.charset, f.read.encoding }
 => ["text/html", "iso-8859-1", #<Encoding:ASCII-8BIT>] 

I assume this is because there is a byte (or character) \x92 on the web page that is not a valid ISO-8859-1 character (see http://en.wikipedia.org/wiki/ISO/IEC_8859-1).
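(For reference: in Windows-1252, which servers often mislabel as ISO-8859-1, \x92 is the right single quotation mark, so the page was probably authored in Windows-1252. A quick irb check, assuming that is the real encoding:)

"\x92".force_encoding('Windows-1252').encode('UTF-8')
 => "’"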

I need to store web pages as UTF-8 encoded files. Any ideas on how to handle a web page where the declared encoding is incorrect? I could catch the exception and try to guess the correct encoding, but that seems cumbersome and error-prone.
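For concreteness, a minimal sketch of the guess-and-check approach I would like to avoid (the helper name and charset list are made up; note that ISO-8859-1 has to be tried last, because it accepts every byte and so never fails a guess):

require 'open-uri'

def fetch_as_utf8(url)
  body = open(url) { |f| f.read }
  %w[UTF-8 Windows-1252 ISO-8859-1].each do |enc|
    candidate = body.dup.force_encoding(enc)
    next unless candidate.valid_encoding?
    begin
      return candidate.encode('UTF-8')
    rescue Encoding::UndefinedConversionError
      next  # a byte with no mapping in this charset; try the next one
    end
  end
  body  # every guess failed; return the raw bytes
end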

1 answer
  • ASCII-8BIT is an alias for BINARY
  • open-uri does a funny thing: if the file is smaller than about 10 KB it returns a StringIO, and if it is bigger it returns a Tempfile. This can be confusing if you are trying to solve encoding problems. (A quick check of both points follows this list.)
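A quick irb sanity check of both points, assuming a recent Ruby with open-uri loaded:

require 'open-uri'

Encoding::ASCII_8BIT == Encoding::BINARY  # => true; they are the same Encoding object
OpenURI::Buffer::StringMax                # => 10240; responses larger than this
                                          #    are spilled to a Tempfile, smaller
                                          #    ones stay in a StringIO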

If the files are not huge, I would recommend fetching them into strings manually:

require 'uri'
require 'net/http'
require 'net/https'

uri = URI.parse url_to_file  # url_to_file is the URL of the page you want

http = Net::HTTP.new(uri.host, uri.port)
if uri.scheme == 'https'
  http.use_ssl = true
  # possibly useful if you see ssl errors
  # http.verify_mode = ::OpenSSL::SSL::VERIFY_NONE
end
body = http.start { |session| session.get uri.request_uri }.body

Then you can use the ensure-encoding gem (https://rubygems.org/gems/ensure-encoding):

require 'ensure/encoding'
utf8_body = body.ensure_encoding('UTF-8', :external_encoding => :sniff, :invalid_characters => :transcode)

I have been very pleased with ensure-encoding... we use it in production at http://data.brighterplanet.com

Note that you can also say :invalid_characters => :ignore instead of :transcode.

Also, if you somehow know the encoding in advance, you can pass :external_encoding => 'ISO-8859-1' instead of :sniff.
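For example, the two variants just mentioned, using the same `body` string as above:

body.ensure_encoding('UTF-8', :external_encoding => :sniff, :invalid_characters => :ignore)
body.ensure_encoding('UTF-8', :external_encoding => 'ISO-8859-1', :invalid_characters => :transcode)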
