Hebrew characters processed by HTML Tidy turn into nonsense

I use HTML Tidy Online ( http://infohound.net/tidy/ ) to clean up some very old, messy HTML files that contain Hebrew characters. Whenever a page is processed by Tidy, the Hebrew characters in the output turn into gibberish, even after I change the encoding options in the settings. With different settings, the best I managed was to get the Hebrew characters as Unicode entities instead. I searched for a solution but did not find one. I have a couple of ideas, but I don't know exactly how to approach them, or whether they are workable at all (maybe someone has a better solution).

  • After processing the page, I could scan it for Unicode entities and systematically replace them with the corresponding Hebrew characters.
  • I could take the HTML Tidy source code and modify it to output Hebrew characters correctly. The problem is that I doubt I am knowledgeable enough to even start something like that.
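
The first idea can be sketched in a few lines of Python. This is only an illustration, not part of Tidy: the function name `decode_hebrew_entities` is made up for this example, and it assumes Tidy emitted numeric character references (decimal like `&#1513;` or hexadecimal like `&#x5E9;`). Only code points in the Hebrew ranges are decoded, so references such as `&#60;` that protect the markup are left untouched.

```python
import re

# Numeric character references: hexadecimal (&#x5E9;) or decimal (&#1513;).
NUMERIC_ENTITY = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def is_hebrew(code_point):
    # Hebrew block (U+0590..U+05FF) and Hebrew presentation forms (U+FB1D..U+FB4F).
    return 0x0590 <= code_point <= 0x05FF or 0xFB1D <= code_point <= 0xFB4F


def decode_hebrew_entities(html_text):
    """Replace numeric references to Hebrew characters with the characters themselves."""

    def replace(match):
        hex_digits, dec_digits = match.group(1), match.group(2)
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Leave every non-Hebrew reference exactly as it was.
        return chr(code_point) if is_hebrew(code_point) else match.group(0)

    return NUMERIC_ENTITY.sub(replace, html_text)
```

The result has to be saved as UTF-8 (and declared as such in the page) for the decoded characters to display correctly.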
2 answers

I had a similar problem: a UTF-8 document containing Unicode characters, which HTML Tidy turned into HTML entities. Adding this to HTMLTIDY.CFG fixed it:

char-encoding: utf8
input-encoding: utf8
output-encoding: utf8

Hope this helps.


The website you are using, http://infohound.net/tidy/ , has a "Char encoding" option in the lower right corner. You need to choose utf-8 there, but first make sure that the page itself is encoded in UTF-8 in your text editor. For example, in Notepad++ you can go to Encoding > Convert to UTF-8 without BOM.
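
If there are many files, the same conversion can be scripted instead of done by hand in an editor. The following is a minimal sketch: the helper name `convert_to_utf8` is invented for this example, and the default source encoding `windows-1255` is only a guess at what an old Hebrew page might use (ISO-8859-8 is another possibility), so check the actual encoding of your files first.

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="windows-1255"):
    """Re-encode a file as UTF-8 without a BOM.

    source_encoding is an assumption; pass the real encoding of the file.
    """
    # Work on raw bytes so that line endings are preserved exactly.
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
```

Decoding raises `UnicodeDecodeError` if the file contains bytes that are not valid in the assumed encoding, which is a useful hint that the guess was wrong.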

