Hebrew characters processed by HTML Tidy turn into nonsense

Question


I use the online HTML Tidy tool ( http://infohound.net/tidy/ ) to clean up some very old and messy HTML files that contain several Hebrew characters. Whenever a page is processed by Tidy, the Hebrew characters in the output turn into gibberish, even after I change the encoding in the settings. With different settings I managed to get the same output, but with the Hebrew characters written as Unicode entities. I searched for a possible solution but did not find one. I have a couple of ideas, but I don't know exactly how to approach them, or whether they are workable at all (maybe someone has a better solution).

After processing the page, I could scan it for Unicode entities and systematically replace them with the corresponding Hebrew characters.
I could take the HTML Tidy source code and modify it to output Hebrew characters correctly. The problem is that I doubt I know enough to even start on something like that.
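
The first idea can be sketched in a few lines of Python. This is only an illustration, not part of the original question: the function name `decode_hebrew_entities` is made up, and it assumes the entities are numeric (`&#1513;` or `&#x5E9;`) and that only the main Hebrew block (U+0590 to U+05FF) should be decoded. Other entities such as `&amp;` or `&#60;` are deliberately left alone so the markup stays valid.

```python
import re

# Matches numeric character references: decimal (&#1513;) or hex (&#x5E9;).
_NUMERIC_ENTITY = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def decode_hebrew_entities(html_text: str) -> str:
    """Replace numeric entities in the Hebrew block with the real characters.

    Hypothetical helper: entities outside U+0590..U+05FF are returned unchanged.
    """

    def _replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if 0x0590 <= code_point <= 0x05FF:
            return chr(code_point)
        return match.group(0)  # leave everything else as it was

    return _NUMERIC_ENTITY.sub(_replace, html_text)
```

The result should be saved as UTF-8, and the page should declare that encoding, otherwise the decoded characters will again be displayed incorrectly.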

+4

unicode htmltidy tidy hebrew

Charles Jul 28 '11 at 15:36


2 answers

The website you are using, http://infohound.net/tidy/ , has a "Char encoding" setting in the lower right corner. You need to choose utf-8 there, but first make sure that the page itself is saved as UTF-8 in your text editor. For example, in Notepad++ you can go to Encoding > Convert to UTF-8 without BOM.
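
If you would rather script the conversion than do it in an editor, something like the following Python sketch should work. It is an assumption on my part that the old files are in windows-1255 (a common legacy Hebrew encoding, `cp1255` in Python); the function name `convert_to_utf8` is hypothetical, and you should pass whatever encoding your files actually use.

```python
from pathlib import Path


def convert_to_utf8(src: str, dst: str, source_encoding: str = "cp1255") -> None:
    """Re-save a text file as UTF-8 without a BOM.

    source_encoding is an assumption: cp1255 (windows-1255) is only a guess
    at what an old Hebrew page might use. Decoding raises UnicodeDecodeError
    if the file contains bytes that are invalid in that encoding.
    """
    text = Path(src).read_text(encoding=source_encoding)
    # Python's "utf-8" codec does not write a byte order mark.
    Path(dst).write_text(text, encoding="utf-8")
```

For example, `convert_to_utf8("old.html", "old-utf8.html")` would write a UTF-8 copy that can then be pasted into or uploaded to Tidy with utf-8 selected.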

0

Ynhockey Oct 29 '13 at 15:22


Jake · Accepted Answer · 2012-03-30T08:48:18+0000

I had a similar problem: a UTF-8 document containing Unicode characters, which HTML Tidy turned into HTML entities. Adding these lines to HTMLTIDY.CFG fixed it:

char-encoding: utf8
input-encoding: utf8
output-encoding: utf8

Hope this helps.
