Debug Encoding Problems (R XML)

Is there a way to locate an encoding problem in an XML file? I am trying to parse such a file (call it doc) with the XML package in R, but there seems to be an encoding problem.

 xmlInternalTreeParse(doc, asText=TRUE)
 Error: Document labelled UTF-16 but has UTF-8 content.
 Error: Input is not proper UTF-8, indicate encoding!
 Error: Premature end of data in tag ...

followed by a list of tags in which the data supposedly ends prematurely. However, I am sure that no tag in this document ends prematurely.

So I tried the following:

 doc <- iconv(doc, to="UTF-8")
 doc <- sub("utf-16", "utf-8", doc)
 xmlInternalTreeParse(doc, asText=TRUE)
 Error: Premature end of data in tag...

and again the list of tags follows, along with line numbers. I checked those lines and cannot find any errors.

Another suspicion: the “μ” character that appears in the document may be causing the error. So I made the following attempt:

 doc <- iconv(doc, to="UTF-8")
 doc <- gsub("µ", "micro", doc)
 doc <- sub("utf-16", "utf-8", doc)
 xmlInternalTreeParse(doc, asText=TRUE)
 Error: Premature end of data in tag...

Any other debugging suggestions?

EDIT: after two days of trying to fix the error, I still have not found a solution. However, I think I have narrowed down the possible causes. Here is what I found:

  • copying the XML string from the source database into a file and saving it as a separate XML file in Notepad++ → Document labelled UTF-16 but has UTF-8 content.

  • changing <?xml version="1.0" encoding="utf-16"?> to <?xml version="1.0" encoding="utf-8"?> (or encoding="latin1") in this file → no errors

  • reading the XML string from the database using doc <- sqlQuery(myconn, query.text, stringsAsFactors = FALSE); doc <- doc[1,1], manipulating it with str_sub(doc, 35, 36) <- "8" or str_sub(doc, 31, 36) <- "latin1", and then parsing it with xmlInternalTreeParse(doc) → Premature end of data in tag...

  • reading the XML string from the database as described above, and then parsing it unchanged with xmlInternalTreeParse(doc) → Document labelled UTF-16 but has UTF-8 content. Input is not proper UTF-8, indicate encoding ! Bytes: 0xE4 0x64 0x2E 0x20 Premature end of data in tag... (list of tags follows).

  • reading the XML string from the database as described above and parsing it with xmlInternalTreeParse(doc, encoding="latin1") → Premature end of data in tag...

  • using doc <- iconv(doc[1,1], to="UTF-8") (or to="latin1") before parsing does not change anything
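
One way to follow up on the byte dump in the error message is to look at those bytes directly. The sketch below uses only base R. The helper name mark_invalid_bytes is made up for illustration, and reading the bytes as Latin-1 is an assumption: 0xE4 is "ä" in Latin-1, which makes it plausible but does not prove it.

```r
# The four bytes reported in the parser error, turned back into a string.
bad <- rawToChar(as.raw(c(0xE4, 0x64, 0x2E, 0x20)))

# validUTF8() inspects the bytes themselves and ignores any declared encoding.
is_utf8 <- validUTF8(bad)

# Assumption: the bytes are Latin-1, where 0xE4 is the letter a-umlaut.
as_latin1 <- iconv(bad, from = "latin1", to = "UTF-8")
is_utf8_after <- validUTF8(as_latin1)

# Hypothetical helper: replace every byte that is not valid UTF-8 with <xx>,
# so that offending positions can be found with a plain text search.
mark_invalid_bytes <- function(x) {
  iconv(x, from = "UTF-8", to = "UTF-8", sub = "byte")
}
marked <- mark_invalid_bytes(bad)
```

Applied to the full doc string instead of the four sample bytes, the same calls show whether the content is valid UTF-8 at all and roughly where the offending bytes sit.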

I would be very grateful for any suggestions.

1 answer

The encoding problem arose because the encoding of the source XML file did not match the encoding used by the SQL database, where the XML content was stored as longtext. Converting the string to UTF-8 and replacing the encoding declaration inside the XML string solved the problem:

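 library(RODBC)  # provides sqlQuery()
 library(XML)    # provides xmlInternalTreeParse()
 doc <- sqlQuery(myconn, query.text, stringsAsFactors = FALSE)
 doc <- iconv(doc[1,1], to="UTF-8")       # convert the content to UTF-8
 doc <- sub("utf-16", "utf-8", doc)       # make the declaration match the content
 doc <- xmlInternalTreeParse(doc, asText = TRUE)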
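
Note that iconv() without a from argument converts from the encoding of the current locale, so the same two steps may behave differently on another machine. A minimal sketch that states the source encoding explicitly and replaces whatever encoding is declared, not only utf-16, could look like this. The function name normalise_xml_encoding and the Latin-1 default are assumptions for illustration, not part of the original fix, and the pattern only handles a double-quoted declaration.

```r
# Hypothetical helper: convert the content to UTF-8 and make the XML
# declaration agree with it.
normalise_xml_encoding <- function(doc, from = "latin1") {
  # Assumption: `from` names the real encoding of the bytes in `doc`.
  doc <- iconv(doc, from = from, to = "UTF-8")
  # sub() replaces only the first match, i.e. the declaration at the top.
  sub('encoding="[^"]*"', 'encoding="utf-8"', doc, ignore.case = TRUE)
}

# Toy input: declared as UTF-16, but the content is Latin-1 (0xE4 = a-umlaut).
toy <- rawToChar(c(
  charToRaw('<?xml version="1.0" encoding="UTF-16"?><a>M'),
  as.raw(0xE4),
  charToRaw('rz</a>')
))
out <- normalise_xml_encoding(toy)
```

The result can then be passed to xmlInternalTreeParse(out, asText = TRUE) as in the code above.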

Truncation of the XML string during extraction from the database was a separate issue, and it explains the "Premature end of data in tag" errors. The solution is here: How to get a very long XML string from an SQL database with R?
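
Because a truncated string produces "Premature end of data in tag" regardless of its encoding, it can help to test for truncation before looking at encodings. The check below is only a heuristic sketch in base R. The helper looks_truncated is hypothetical, it assumes the string has already been converted to valid UTF-8, and it will misreport a document whose root element is self-closing.

```r
# Hypothetical helper: TRUE if the string does not end with the closing tag
# of its first element, NA if no element is found at all.
looks_truncated <- function(doc) {
  # First "<name": skips "<?xml ...?>", comments and DOCTYPE, because
  # their second character is not a letter or underscore.
  m <- regmatches(doc, regexpr("<[A-Za-z_][^ \t\r\n/>]*", doc))
  if (length(m) == 0) return(NA)
  root <- substring(m, 2)
  !endsWith(trimws(doc), paste0("</", root, ">"))
}

complete <- '<?xml version="1.0"?><root><a>1</a></root>\n'
cut_off  <- '<?xml version="1.0"?><root><a>1</a><b>tex'

complete_flag <- looks_truncated(complete)
cut_off_flag  <- looks_truncated(cut_off)
```

If the check flags the string returned by sqlQuery(), comparing nchar(doc) with the length stored in the database is a reasonable next step.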
