Don't split it! Wrap one big tag around everything, and it becomes a single XML file again:
<BIGTAG> <?xml version="1.0" encoding="UTF-8"?> <someData>...</someData> <?xml version="1.0" encoding="UTF-8"?> <someData>...</someData> <?xml version="1.0" encoding="UTF-8"?> <someData>...</someData> </BIGTAG>
Now, using /BIGTAG/someData, you get all the original XML roots.
If the processing instructions get in the way, you can always use RegEx to remove them. It is easier to simply delete all the processing instructions than to use RegEx to find all the root nodes.
If the encoding differs between the documents, remember this: the file as a whole has to be stored in one encoding, so every XML document it contains uses that same encoding, regardless of what each header tells you. If the big file is encoded as UTF-16, then it doesn't matter that the XML processing instructions say the XML is UTF-8. It will not be UTF-8, because the entire file is UTF-16. The encoding given in those XML processing instructions is therefore invalid.
By combining them into one file, you changed the encoding ...
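To illustrate the encoding point, here is a minimal Python sketch. The sample string and the helper name `decodes_as` are made up for this example; it only shows that the bytes of a UTF-16 file stay UTF-16 even when a header inside claims UTF-8.

```python
# A document whose header claims UTF-8 ...
declared_utf8 = '<?xml version="1.0" encoding="UTF-8"?><someData>é</someData>'

# ... but which is stored inside a big file that is actually UTF-16.
raw = declared_utf8.encode("utf-16")


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if the bytes can be decoded with the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


print(decodes_as(raw, "utf-16"))  # True: the real encoding of the file
print(decodes_as(raw, "utf-8"))   # False: the header's claim does not hold
```

So when reading the combined file, decode it with the encoding of the file itself and ignore what the individual headers say.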
By RegEx, I mean regular expressions. You just need to delete all the text between <? and ?>, which should not be too difficult with a regular expression and is a bit more work if you try other string manipulation methods.
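The whole approach can be sketched in a few lines of Python. This is only an illustration under some assumptions: the function names and the sample input are hypothetical, the text has already been decoded into a string, and the data contains no "<?" sequences inside comments or CDATA sections (the simple pattern would strip those as well).

```python
import re
import xml.etree.ElementTree as ET

# Matches anything between "<?" and "?>", including the XML declarations.
PI_PATTERN = re.compile(r"<\?.*?\?>", re.DOTALL)


def wrap_concatenated_xml(text: str, wrapper: str = "BIGTAG") -> str:
    """Remove all processing instructions and wrap the rest in one root tag."""
    stripped = PI_PATTERN.sub("", text)
    return f"<{wrapper}>{stripped}</{wrapper}>"


def load_roots(text: str) -> list:
    """Parse the wrapped text and return the original root elements."""
    big_root = ET.fromstring(wrap_concatenated_xml(text))
    return list(big_root)  # direct children, i.e. /BIGTAG/someData


# Hypothetical input: two XML documents concatenated into one string.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?><someData>a</someData>'
    '<?xml version="1.0" encoding="UTF-8"?><someData>b</someData>'
)

roots = load_roots(sample)
print([r.text for r in roots])  # ['a', 'b']
```

Stripping the declarations first matters because an XML declaration is only allowed at the very start of a document, so a parser would reject the wrapped text if they were left in place.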
Wim ten brink