HTML DOM parsing and character encoding in XMLHttpRequest in a Firefox extension

I am writing a bootstrapped (restartless) extension for Firefox 4.


Here is my story:

When I use @mozilla.org/xmlextras/xmlhttprequest;1 (nsIXMLHttpRequest), the contents of the target URL load successfully into req.responseText.

I then parse the responseText into a DOM by creating a BODY element with createElement and assigning responseText to its innerHTML property.
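A minimal sketch of what I am doing so far (chrome-privileged code is assumed, targetURL is a placeholder, and the request is synchronous only for brevity):

    // Create the XPCOM XMLHttpRequest and fetch the target document.
    var req = Components.classes["@mozilla.org/xmlextras/xmlhttprequest;1"]
                        .createInstance(Components.interfaces.nsIXMLHttpRequest);
    req.open("GET", targetURL, false);
    req.send(null);

    // Parse the response into a detached BODY element.
    var body = document.createElement("body");
    body.innerHTML = req.responseText; // this is where the charset problem shows up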

Everything seems to work.

However, there is a problem with character encoding (charset).

Since I need the extension to detect the encoding of the target documents itself, overriding the MIME type of the request with text/html; charset=blahblah doesn't seem to fit my need.
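That is, a static override like the following only helps when the charset is already known, which is exactly what I don't have (the UTF-8 value here is just an example):

    // Only works if the charset can be hard-coded in advance:
    req.overrideMimeType("text/html; charset=UTF-8");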

I tried @mozilla.org/intl/utf8converterservice;1 (nsIUTF8ConverterService), but it seems that XMLHttpRequest does not expose a ScriptableInputStream, or indeed any InputStream or readable stream.

I have no idea how to read the contents of the target document with a properly, automatically detected encoding, whether via the browser's built-in charset auto-detection (the one exposed in the GUI) or by reading the encoding declared in the document's head meta tag.


EDIT: Would it be practical to parse the entire document, including the HTML, HEAD, and BODY tags, into a DOM object, but without loading external resources such as js, css, and ico files?

EDIT: The method from the MDC article "HTML to DOM", which uses @mozilla.org/feed-unescapehtml;1 (nsIScriptableUnescapeHTML), is unacceptable: it parses with many errors, and the baseURI cannot be set for content of type text/html, so all HREF attributes of A elements are dropped when they contain a relative path.
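For reference, a sketch of that rejected approach (body is the detached BODY element from above; the interface's signature is parseFragment(fragment, isXML, baseURI, contextElement)):

    var unescapeHTML = Components.classes["@mozilla.org/feed-unescapehtml;1"]
                                 .getService(Components.interfaces.nsIScriptableUnescapeHTML);
    // The baseURI argument is only honored for XML, not for text/html,
    // which is why relative HREFs in A elements get dropped here.
    var fragment = unescapeHTML.parseFragment(req.responseText, false, null, body);
    body.appendChild(fragment);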

EDIT #2: It would be nice if there were any method that could convert the incoming response text into readable UTF-8 strings. :-)


Any ideas or workarounds for the encoding problem are welcome. :-)

PS. The target documents are arbitrary, so there is no fixed (or known in advance) charset, and of course they are not all UTF-8, which is what gets assumed by default.


SUPPLEMENT:

So far, I have two rough basic ideas for solving this problem.

Could someone help me out with the XPCOM component and method names?


Idea 1: Specify the encoding when parsing the content into the DOM.

We first need to find out the encoding of the document, for example by extracting it from the head meta tag (a sketch of such a sniffer follows this list). Then we need to either:

  • find a method that can be told the encoding explicitly when parsing the body contents, or
  • find a method that can parse both the head and the body.
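A hypothetical sniffer along those lines (the name sniffCharset and the regex are mine, not an existing API; it can run on the still-undecoded text because the relevant markup is plain ASCII):

    // Hypothetical helper: extract the declared charset from a <meta> tag.
    function sniffCharset(text) {
      var match = /<meta[^>]+charset\s*=\s*["']?([\w-]+)/i.exec(text);
      return match ? match[1] : null; // null when no charset is declared
    }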

Idea 2: Convert the incoming response text to UTF-8 (or ensure it will be), so that parsing it into DOM elements with the default UTF-8 handling still works.

At first this seems impractical: overriding the MIME type with a charset would be one implementation of this idea, but we cannot predict the encoding before starting the request.

1 answer

Since no other answers seem to be coming, here is my own.

After a day of testing, I found a way (albeit an awkward one) to solve my problem:

    xhr.overrideMimeType('text/plain; charset=x-user-defined');

where xhr is the XMLHttpRequest instance.

This makes Firefox treat the response as plain text in a user-defined character set. It tells Firefox not to decode the content, and to let the bytes pass through raw.

See the MDC document: Using_XMLHttpRequest#Receiving_binary_data
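In context, the fetch step looks roughly like this (url is a placeholder, and the request is synchronous only for brevity):

    var xhr = Components.classes["@mozilla.org/xmlextras/xmlhttprequest;1"]
                        .createInstance(Components.interfaces.nsIXMLHttpRequest);
    xhr.open("GET", url, false);
    xhr.overrideMimeType("text/plain; charset=x-user-defined"); // bytes arrive undecoded
    xhr.send(null);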

Then use the scriptable Unicode converter: @mozilla.org/intl/scriptableunicodeconverter (nsIScriptableUnicodeConverter).

The charset variable can be retrieved from the head meta tag, for example with a regular expression over req.responseText (this works even though the encoding is still unknown, because the meta tag itself is plain ASCII), or by any other method.

    var unicodeConverter = Components.classes["@mozilla.org/intl/scriptableunicodeconverter"]
                                     .createInstance(Components.interfaces.nsIScriptableUnicodeConverter);
    unicodeConverter.charset = charset; // e.g. "Big5", as sniffed from the meta tag
    str = unicodeConverter.ConvertToUnicode(str);

Finally, str is a proper Unicode string, just as if the content had been UTF-8 all along. :-)

Then I just parse it into the BODY element as before, which fulfills my need.
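Putting the pieces together (using the hypothetical sniffCharset helper sketched in the question above):

    var charset = sniffCharset(xhr.responseText) || "UTF-8"; // fall back when nothing is declared

    var converter = Components.classes["@mozilla.org/intl/scriptableunicodeconverter"]
                              .createInstance(Components.interfaces.nsIScriptableUnicodeConverter);
    converter.charset = charset;
    var text = converter.ConvertToUnicode(xhr.responseText);

    var body = document.createElement("body");
    body.innerHTML = text; // parses correctly now that the text is real Unicode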

Other, brighter ideas are still welcome. Feel free to challenge my answer if you have a good reason. :-)

