Is there a way to locate an encoding problem in an XML file? I am trying to parse such a file (call it doc) with the XML library in R, but there seems to be an encoding problem.
    xmlInternalTreeParse(doc, asText=TRUE)
    Error: Document labelled UTF-16 but has UTF-8 content.
    Error: Input is not proper UTF-8, indicate encoding!
    Error: Premature end of data in tag ...
and a list of tags where the data supposedly ends prematurely. However, I am sure that no tag in this document ends prematurely.
So I tried the following:
    doc <- iconv(doc, to="UTF-8")
    doc <- sub("utf-16", "utf-8", doc)
    xmlInternalTreeParse(doc, asText=T)
    Error: Premature end of data in tag...
and again a list of tags follows, along with line numbers. I checked those lines and cannot find any errors.
Another suspicion: the "µ" character that appears in the document may be causing the error. So I made the following attempt:
    doc <- iconv(doc, to="UTF-8")
    doc <- gsub("µ", "micro", doc)
    doc <- sub("utf-16", "utf-8", doc)
    xmlInternalTreeParse(doc, asText=T)
    Error: Premature end of data in tag...
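The iconv() calls above do not pass a from argument, so the input is interpreted in the current locale's encoding. A minimal sketch of how the actual bytes could be inspected and converted with an explicit source encoding is shown here. The sample bytes are a hypothetical stand-in for the database string, and Latin-1 as the source encoding is an assumption, not something confirmed for this document.

```r
# Hypothetical stand-in for the database string: the Latin-1 bytes of "Länd. 5 µm"
doc_raw  <- as.raw(c(0x4C, 0xE4, 0x6E, 0x64, 0x2E, 0x20, 0x35, 0x20, 0xB5, 0x6D))
doc_demo <- rawToChar(doc_raw)

# What R believes about the string, and whether its bytes form valid UTF-8
enc_mark   <- Encoding(doc_demo)
utf8_valid <- validUTF8(doc_demo)

# Convert with an explicit source encoding (assumed here to be Latin-1)
doc_utf8        <- iconv(doc_demo, from = "latin1", to = "UTF-8")
utf8_valid_after <- validUTF8(doc_utf8)

# Show the bytes before and after conversion
print(charToRaw(doc_demo))
print(charToRaw(doc_utf8))
```

If validUTF8() returns FALSE for the real string, the replacement of "utf-16" by "utf-8" in the declaration would still leave the parser with bytes that are not UTF-8.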
Any other debugging suggestions?
EDIT: after two days of trying to fix the error, I still have not found a solution. However, I think I have narrowed down the possible causes. Here is what I found:
copying the XML string from the source database into a file and saving it as a separate XML file in Notepad++ -> Document labelled UTF-16 but has UTF-8 content.
changing <?xml version="1.0" encoding="utf-16"?> to <?xml version="1.0" encoding="utf-8"?> (or encoding="latin1" ) in this file -> no errors
reading the XML string from the database using doc <- sqlQuery(myconn, query.text, stringsAsFactors = FALSE); doc <- doc[1,1], changing the declared encoding with str_sub(doc, 35, 36) <- "8" or str_sub(doc, 31, 36) <- "latin1", and then parsing it with xmlInternalTreeParse(doc) -> Premature end of data in tag...
reading the XML string from the database as described above and then parsing it unchanged with xmlInternalTreeParse(doc) -> Document labelled UTF-16 but has UTF-8 content. Input is not proper UTF-8, indicate encoding! Bytes: 0xE4 0x64 0x2E 0x20. Premature end of data in tag... (list of tags follows).
reading the XML string from the database as described above and parsing it with xmlInternalTreeParse(doc, encoding="latin1") -> Premature end of data in tag...
using doc <- iconv(doc[1,1], to="UTF-8") (or to="latin1") before parsing does not change anything
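Since the file saved from Notepad++ parses once the declaration is changed while the string read in R does not, one thing worth ruling out is that the two differ in length. The following is only a sketch of such a check: ends_with_root_close() is a hypothetical helper, not part of the XML package, and it assumes that the first element after the declaration is the root and that the text is valid in the current encoding.

```r
# Hypothetical helper: does the text end with the closing tag of its first element?
ends_with_root_close <- function(xml_text) {
  # Remove the XML declaration (if present) and surrounding whitespace
  body <- trimws(sub("^\\s*<\\?xml.*?\\?>", "", xml_text, perl = TRUE))
  # Extract the name of the first element, assumed to be the root
  m <- regmatches(body, regexpr("^<[^\\s/>]+", body, perl = TRUE))
  if (length(m) == 0) return(FALSE)
  root <- substring(m, 2)
  endsWith(body, paste0("</", root, ">"))
}

# Made-up example documents, one complete and one cut off
complete  <- '<?xml version="1.0" encoding="utf-8"?><root><a>1</a></root>'
truncated <- substr(complete, 1, 50)

n_complete   <- nchar(complete)
n_truncated  <- nchar(truncated)
ok_complete  <- ends_with_root_close(complete)
ok_truncated <- ends_with_root_close(truncated)
```

Comparing nchar(doc) with the size of the file saved from Notepad++, and calling the helper on doc, would show whether the string arrives complete in R.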
I would be very grateful for any suggestions.