How to remove accent characters from InputStream

I am trying to parse an RSS 2.0 feed on Android using the pull parser.

XmlPullParser parser = Xml.newPullParser();
parser.setInput(url.open(), null);

The prolog of the XML feed declares the encoding as "utf-8". When I open the remote stream and pass it to my pull parser, I get "invalid token" and "document not well-formed" exceptions.

When I save the XML file and open it in a browser (Firefox), the browser reports the Unicode character 0x12 (grave accent?) in the file and cannot display the XML.

What is the best way to handle such cases, assuming that I do not control the returned XML?

Thanks.

+5
5 answers

Are you sure it is 0x12? UTF-8 encodes 0x00-0x7F as single bytes, the same as ASCII, and ASCII 0x12 is a control character: DC2, or CTRL+R.

It looks like an encoding problem of some kind. A few things to check:

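  • Whether there is a byte order mark (BOM) at the start of the XML.
  • Whether the XML, although declared as UTF-8, is really encoded as UTF-8; it may have been written in a different encoding.
  • Whether every character is valid in XML, which is probably why Firefox refuses to display the file. The XML specification allows only 0x9, 0xA and 0xD below 0x20, so 0x12 is an invalid character.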
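The first and last checks can be automated on a saved copy of the feed. The following is a minimal sketch, not part of the original answer; the class and method names are hypothetical, and it assumes the file is in UTF-8 or a single-byte encoding, where a byte below 0x20 always stands for the control character itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for inspecting the raw bytes of a saved feed.
final class XmlByteCheck {
    // True if the data starts with the UTF-8 byte order mark EF BB BF.
    static boolean startsWithUtf8Bom(byte[] data) {
        return data.length >= 3
            && (data[0] & 0xFF) == 0xEF
            && (data[1] & 0xFF) == 0xBB
            && (data[2] & 0xFF) == 0xBF;
    }

    // Offsets of bytes below 0x20 that XML 1.0 does not allow
    // (everything except tab 0x9, line feed 0xA and carriage return 0xD).
    static List<Integer> illegalControlOffsets(byte[] data) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < data.length; i++) {
            int b = data[i] & 0xFF;
            if (b < 0x20 && b != 0x9 && b != 0xA && b != 0xD) {
                offsets.add(i);
            }
        }
        return offsets;
    }
}
```

An empty list from illegalControlOffsets means the problem lies elsewhere, most likely in the encoding itself.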

If you can post the file on pastebin, I can take a closer look.

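EDIT: The XML is being corrupted somewhere before it reaches you. Ideally, whoever publishes the feed should find out where and fix it.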

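Out of interest, do you know how the content gets into the feed? If it passes through SMS, it may have been reduced to 7 bits. That would turn 0x92 (the forward tick/apostrophe in Windows-1252, perhaps the accent you are seeing?) into 0x12.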
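The arithmetic behind the 0x92 to 0x12 suggestion is easy to confirm. A small sketch, with a hypothetical class name, that clears the most significant bit the way a 7-bit channel would:

```java
// Hypothetical demo class; not part of the original answer.
final class SevenBitDemo {
    // Clears the most significant bit, as a 7-bit channel would.
    static int stripHighBit(int b) {
        return b & 0x7F;
    }

    public static void main(String[] args) {
        // 0x92 is 1001 0010 in binary; without the top bit it is 0001 0010.
        System.out.printf("0x%02X%n", stripHighBit(0x92)); // prints 0x12
    }
}
```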

If the feed cannot be fixed at the source, here are some things to try:

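  • Pass "UTF-8" explicitly to setInput instead of null.

  • If that fails, try other encodings in place of "UTF-8", such as "iso-8859-1" or "UTF-16". The Java documentation on the Sun site lists the supported encodings. (Not all of them may be available on Android.)

  • As a last resort, strip the invalid characters: everything below 0x20 except 0x9, 0xA and 0xD, which are whitespace. A filtering stream can replace them with spaces: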

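import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Replaces control bytes that XML 1.0 forbids with spaces.
// Note: only single-byte read() calls are filtered here; bulk
// read(byte[], int, int) calls go straight to the wrapped stream.
class ReplacingInputStream extends FilterInputStream
{
   ReplacingInputStream(InputStream in) { super(in); }

   @Override
   public int read() throws IOException
   {
      int read = super.read();
      // Below 0x20, only tab (0x9), LF (0xA) and CR (0xD) are allowed.
      if (read!=-1 && read<0x20 && !(read==0x9 || read==0xA || read==0xD))
         read = 0x20;
      return read;
   }
}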
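A parser usually reads through a buffer, which means it calls the bulk read(byte[], int, int) method rather than the single-byte read(). FilterInputStream forwards bulk reads directly to the wrapped stream, so a filter that overrides only read() may never see the data. The sketch below covers both paths. The class name is hypothetical, and it assumes UTF-8 or a single-byte encoding, where a byte below 0x20 is never part of a longer character.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical name; replaces bytes that XML 1.0 forbids with a space.
class ControlCharFilterInputStream extends FilterInputStream {
    ControlCharFilterInputStream(InputStream in) {
        super(in);
    }

    // Maps a disallowed control byte to 0x20 and leaves everything else alone.
    private static int clean(int b) {
        boolean allowed = b >= 0x20 || b == 0x9 || b == 0xA || b == 0xD;
        return allowed ? b : 0x20;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        return b == -1 ? -1 : clean(b);
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // When n is -1 (end of stream) the loop body never runs.
        for (int i = off; i < off + n; i++) {
            buf[i] = (byte) clean(buf[i] & 0xFF);
        }
        return n;
    }
}
```

With the parser from the question, this would be used as parser.setInput(new ControlCharFilterInputStream(url.open()), "UTF-8").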

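This really is a last resort. What you are receiving is not well-formed XML, so the proper fix belongs with whoever produces the feed.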

+6

Wrapping the element content in CDATA sections may help. For example, instead of this:

<title>My title</title>
<link>http://mylink.com</link>
<description>My description</description>

use this:

<title><![CDATA[My title]]></title>
<link><![CDATA[http://mylink.com]]></link>
<description><![CDATA[My description]]></description>


+2

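UTF-8 is a variable-length encoding. ASCII characters take one byte, while all other characters (accented letters, symbols, non-Latin scripts, ...) take two or more. Every byte in a sequence follows a fixed bit pattern:

  • One byte: the MSB is 0 (that is, 7-bit ASCII): 0xxxxxxx
  • Two bytes: 110xxxxx 10xxxxxx
  • Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
  • Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx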
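The bit patterns listed above translate directly into mask tests. This is a sketch with hypothetical names that checks only the patterns; it does not reject the few lead bytes (such as 0xC0) that strict UTF-8 forbids for other reasons.

```java
// Hypothetical helper illustrating the UTF-8 byte patterns.
final class Utf8LeadByte {
    // Returns the sequence length announced by a lead byte, or -1 if the
    // byte is a continuation byte (10xxxxxx) or matches no lead pattern.
    static int sequenceLength(int b) {
        b &= 0xFF;
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        return -1;
    }

    // True for bytes of the form 10xxxxxx.
    static boolean isContinuation(int b) {
        return (b & 0xC0) == 0x80;
    }
}
```

For instance, "é" is encoded in UTF-8 as C3 A9: 0xC3 announces a two-byte sequence and 0xA9 is a continuation byte. A lone 0x92, as Windows-1252 would produce for an apostrophe, is a continuation byte with no lead byte before it, which a UTF-8 decoder treats as malformed.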

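So if the data is declared as UTF-8 (in the XML prolog) but is not actually UTF-8 (for example, it is really Cp1252), the XML parser decodes it as UTF-8 anyway and runs into byte sequences that break these rules. A two-byte character must look like 110xxxxx 10xxxxxx (a second byte of the form 01xxxxxx, 11xxxxxx or 00xxxxxx is invalid).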

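This is a common problem. An XML file saved as Windows-1252 or another ANSI code page but declared as UTF-8 breaks as soon as it contains a non-ASCII character (code > 127).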


If the feed is mostly ASCII, you can tell the parser to treat the XML as one of the ASCII-compatible 8-bit encodings (ANSI, Windows-XXXX, Mac-Roman, etc.). For example:

XmlPullParser parser = Xml.newPullParser();
parser.setInput(url.open(), "ISO-8859-1");

+2

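Calling setInput(istream, null) leaves the encoding detection to the pull parser. If the feed declares one encoding and actually uses another, that detection goes wrong and parsing fails.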

One way to cope is to wrap the parsing in a try/catch: parse normally first and, if that fails, parse again with ISO-8859-1 as the encoding.
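A variation on the same fallback, sketched here with hypothetical names, decides on the encoding before the parser runs instead of catching the parser's exception afterwards. It assumes the whole feed has already been read into a byte array, and it uses java.nio.charset.StandardCharsets, which requires Java 7 or Android API level 19.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; picks the encoding name to hand to setInput.
final class EncodingFallback {
    // Returns "UTF-8" if the bytes decode cleanly as UTF-8, else "ISO-8859-1".
    static String pickEncoding(byte[] data) {
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // With REPORT, any malformed sequence raises an exception.
            strict.decode(ByteBuffer.wrap(data));
            return "UTF-8";
        } catch (CharacterCodingException e) {
            return "ISO-8859-1";
        }
    }
}
```

The result can then be passed along with the buffered data, as in parser.setInput(new ByteArrayInputStream(data), EncodingFallback.pickEncoding(data)). Note that this only addresses a wrong encoding declaration: a control byte such as 0x12 is valid UTF-8 and still has to be filtered out separately.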

+1

Before parsing the XML, you can preprocess it and strip the accents manually. This may not be the best solution, but it will do the job.

0
