Opening ISO-8859-1 encoded HTML with Nokogiri messes up accents

I am trying to make some changes to an HTML page encoded with charset=iso-8859-1:

    doc = Nokogiri::HTML(open(html_file))

Calling puts doc.to_html shows every accent on the page mangled, so if I save the result it looks broken in the browser too.

I'm still on Rails 3.0.6 ... Any tips on how to fix this problem?

Here is one of the pages suffering from this, for example: http://www.elmundo.es/accesible/elmundo/2012/03/07/solidaridad/1331108705.html

I also asked on GitHub, but I have a feeling I will get an answer faster here. I will update both places if I find a solution to this problem.

UPDATE 1 March 24, 2012

Thanks for the comments. I was able to partially solve this problem. I believe it has nothing to do with Nokogiri: as I mentioned in a comment, simply opening and saving the file is enough to break the accents.

The closest I have come to a fix is the following:

    thefile = File.open(html_file, "r")
    text = thefile.read
    doc = Nokogiri::HTML(text)
    # ... do any stuff with nokogiri
    File.open(html_file, 'w') {|f| f.write(doc.to_html) }

The source file comes in as iso-8859-1 and the saved file comes out as utf-8, which looks fine. The accents are all in place, with the exception of accents on capital letters :-P There I get question marks, as in Econom?a, where the ? should be í (i with an accent).

Getting closer, I think. If someone has a clue about how to handle the capitals, this would be almost done.
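The question marks described above are typical of ISO-8859-1 bytes being treated as UTF-8. The following is a minimal sketch using only Ruby's standard library (no Nokogiri), assuming Ruby 2.0 or later for `String#b`. The helper name `latin1_to_utf8` is hypothetical and the sample bytes are made up for illustration; it is not necessarily what happened in the asker's setup.

```ruby
# Hypothetical helper: label raw bytes as ISO-8859-1, then transcode to UTF-8.
def latin1_to_utf8(bytes)
  bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# "Economía" as stored in an ISO-8859-1 file: í is the single byte 0xED.
raw = "Econom\xEDa".b

# Treating those bytes as UTF-8 gives an invalid string,
# which browsers and terminals usually render as a question mark.
misread = raw.dup.force_encoding(Encoding::UTF_8)
puts misread.valid_encoding?   # prints false

# Labelling the bytes correctly before transcoding gives valid UTF-8.
fixed = latin1_to_utf8(raw)
puts fixed.valid_encoding?     # prints true
```

If `valid_encoding?` returns false for text read from the file, the bytes were probably labelled with the wrong encoding somewhere between reading and writing.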

Greetings.

1 answer

The method you used to download the file may have changed the encoding, breaking the accents in the file. Try this, which works correctly for me:

    require 'rubygems'
    require 'nokogiri'
    require 'open-uri'

    url = 'http://www.elmundo.es/accesible/elmundo/2012/03/07/solidaridad/1331108705.html'
    doc = Nokogiri::HTML(open(url))
    File.open("1331108705.html", "w") {|f| f.write(doc.to_html)}
    system('open', '1331108705.html') # on Mac OS X, this will open the html file in your browser

How did you download the file?
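If the file is already on disk and known to be ISO-8859-1, one option is to state the encoding explicitly when reading it, so that Ruby transcodes to UTF-8 instead of guessing. This is a sketch under that assumption, using only the standard library; the helper name `read_latin1_as_utf8` and the temporary sample file are hypothetical, and the Nokogiri step is left out.

```ruby
require "tempfile"

# Hypothetical helper: read a file known to be ISO-8859-1 and return UTF-8 text.
def read_latin1_as_utf8(path)
  # Mode "r:ISO-8859-1:UTF-8" means: external encoding ISO-8859-1,
  # transcoded to UTF-8 as the data is read.
  File.open(path, "r:ISO-8859-1:UTF-8") { |f| f.read }
end

# Demo with a temporary file that contains ISO-8859-1 bytes.
Tempfile.create(["sample", ".html"]) do |f|
  f.binmode
  f.write("<p>Econom\xEDa</p>".b)  # 0xED is í in ISO-8859-1
  f.flush

  text = read_latin1_as_utf8(f.path)
  puts text.encoding         # prints UTF-8
  puts text.valid_encoding?  # prints true
end
```

The resulting string is valid UTF-8, so it can be modified and written back out without the accents turning into question marks.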
