Does Nokogiri produce different results on Heroku?

I have a very strange problem and would appreciate help tracking it down.

I am using the nokogiri gem to parse some HTML, and one of the files has a weird character in it. I'm not quite sure what the character is; in vim it shows as ^Q.

Everything works fine on my own computer. On Heroku, however, the parser inserts </body></html><html> when it hits the character, and selectors only return elements that come before the strange character.

To illustrate: Nokogiri::HTML(open("http://thoms.net.nz/e2.html")).css("body div").count returns 1 on Heroku and 2 on my computer. A file containing this character can be downloaded from http://thoms.net.nz/e2.html.

Both my computer and Heroku are running Nokogiri 1.5.5 with Ruby 1.9.3.

1 answer

^Q is a software flow-control character (XON, U+0011) that should not appear in HTML. I suspect its unexpected presence is confusing both Nokogiri and Heroku, but in different ways.
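To confirm which character is actually in the file, a small sketch like the one below can list any control characters and their offsets. The helper name `control_chars` and the sample string are made up for illustration, and the sketch assumes the text is validly encoded.

```ruby
# Hypothetical helper: return [offset, codepoint] pairs for every ASCII
# control character in the text, ignoring tab, line feed and carriage return.
def control_chars(text)
  allowed = "\t\n\r"
  text.each_char.with_index
      .select { |ch, _i| (ch.ord < 0x20 || ch.ord == 0x7F) && !allowed.include?(ch) }
      .map { |ch, i| [i, ch.ord] }
end

# Made-up sample with an XON (^Q) character between two divs.
sample = "<div>one</div>\u0011<div>two</div>"

control_chars(sample).each do |offset, codepoint|
  puts format("offset %d: U+%04X", offset, codepoint)
end
```

Codepoint 0x11 (decimal 17) is the one vim displays as ^Q.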

HTML documents from the wild can be corrupted in any number of ways. I've seen all kinds of garbage in them, and when I couldn't fix it with iconv or Unicode transliteration, I'd resort to a quick global search and replace to strip anything outside the normal ASCII range before processing.


In Ruby, global search and replace uses String#gsub.

 doc = Nokogiri::HTML(html.gsub("\u0011", '')) 
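If other stray control characters might turn up as well, the same idea can be widened a little. This is only a sketch: `strip_control_chars` and `CONTROL_CHARS` are hypothetical names, not part of Nokogiri, and it assumes the input string is validly encoded and that tabs and line breaks should be kept.

```ruby
# ASCII control characters except tab (\x09), line feed (\x0A) and
# carriage return (\x0D), plus DEL (\x7F).
CONTROL_CHARS = /[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/

# Hypothetical helper: return a copy of the string without those characters.
def strip_control_chars(html)
  html.gsub(CONTROL_CHARS, '')
end

# Made-up sample with an XON (^Q) character between two divs.
dirty = "<div>one</div>\u0011<div>two</div>\n"
clean = strip_control_chars(dirty)

# The cleaned string can then be handed to the parser,
# e.g. Nokogiri::HTML(clean), as in the one-liner above.
puts clean
```

Stripping before parsing means the parser never sees the character, so it should no longer matter how a given environment reacts to it.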

Source: https://habr.com/ru/post/923976/

