How to efficiently parse concatenated XML documents from a file

I have a file that consists of concatenated valid XML documents. I would like to efficiently split it into the individual XML documents.

The contents of the concatenated file look like this (so the concatenated file itself is not a valid XML document):

    <?xml version="1.0" encoding="UTF-8"?>
    <someData>...</someData>
    <?xml version="1.0" encoding="UTF-8"?>
    <someData>...</someData>
    <?xml version="1.0" encoding="UTF-8"?>
    <someData>...</someData>

Each individual XML document is about 1-4 KB, but there are potentially several hundred of them. All XML documents follow the same XML schema.

Any suggestions or tools? I work in a Java environment.

Edit: I'm not sure whether the XML declaration will be present in the documents or not.

Edit: Assume the encoding of all the XML documents is UTF-8.

Tags: java, xml, parsing
5 answers

As Eamon says, if you know the <?xml ... ?> declaration will always be there, just split on it.

Otherwise, track the end tag of each document. That is, scan the text, counting how many levels deep you are. Each time you see a tag starting with "<" but not "</", and not ending with "/>", add 1 to your depth count. Each time you see a tag starting with "</", subtract 1. Whenever you subtract 1, check whether you are back at zero. If so, you have reached the end of one XML document. A sketch of this approach follows.
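Here is a minimal Java sketch of this depth-counting idea. It is not from the original answer: the class and method names are made up, it needs Java 11+ for Files.readString, and it assumes no "<" or ">" characters hide inside comments, CDATA sections, or attribute values.

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;

    public class DepthSplitter {

        // Splits concatenated XML documents by tracking element nesting depth.
        // Assumption: no '<' or '>' inside comments, CDATA, or attribute values.
        public static List<String> split(String content) {
            List<String> docs = new ArrayList<>();
            int depth = 0;
            int docStart = 0;
            int i = 0;
            while (i < content.length()) {
                if (content.charAt(i) == '<') {
                    int close = content.indexOf('>', i);
                    if (close < 0) break;                          // malformed tail, stop
                    String tag = content.substring(i, close + 1);
                    if (tag.startsWith("<?") || tag.startsWith("<!")) {
                        // XML declaration, processing instruction, or comment: no depth change
                    } else if (tag.startsWith("</")) {
                        if (--depth == 0) {                        // root element closed
                            docs.add(content.substring(docStart, close + 1).trim());
                            docStart = close + 1;
                        }
                    } else if (!tag.endsWith("/>")) {
                        depth++;                                   // opening tag
                    }
                    i = close + 1;
                } else {
                    i++;
                }
            }
            return docs;
        }

        public static void main(String[] args) throws Exception {
            String content = Files.readString(Path.of(args[0]));
            split(content).forEach(doc -> System.out.println("--- document ---\n" + doc));
        }
    }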


Do not split it! Add one big tag around it! Then it becomes a single XML document again:

    <BIGTAG>
    <?xml version="1.0" encoding="UTF-8"?>
    <someData>...</someData>
    <?xml version="1.0" encoding="UTF-8"?>
    <someData>...</someData>
    <?xml version="1.0" encoding="UTF-8"?>
    <someData>...</someData>
    </BIGTAG>

Now, using the XPath query /BIGTAG/someData, you can get all the document roots.


If the processing instructions get in the way, you can always use a regex to delete them. It is easier to simply delete all the processing instructions than to use a regex to find all the root nodes. And if the declared encodings differ between documents, remember this: the combined file itself is stored in some single encoding, so all of the XML documents inside it use that same encoding, regardless of what each header claims. If the large file is encoded as UTF-16, it does not matter that an XML declaration inside it says UTF-8; it will not be UTF-8, since the entire file is UTF-16. The encodings in those XML declarations are therefore meaningless.

By combining them into one file, you have changed their encoding...


By RegEx, I mean regular expressions. You just need to delete all the text between <? and ?>, which should not be too complicated with a regular expression, and a bit trickier if you try other string manipulation methods.
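For the Java environment the asker mentioned, a minimal sketch of this strip-and-wrap approach might look like the following. This is not the answerer's code: the file name and the BIGTAG wrapper are placeholders, it assumes the whole file fits in memory, and it needs Java 11+ for Files.readString.

    import java.io.StringReader;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import org.xml.sax.InputSource;

    public class WrapAndParse {
        public static void main(String[] args) throws Exception {
            String raw = Files.readString(Path.of("concatenated.xml"));  // placeholder name
            // Delete everything between <? and ?> (the declarations), then wrap in one root.
            String wrapped = "<BIGTAG>" + raw.replaceAll("<\\?.*?\\?>", "") + "</BIGTAG>";
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wrapped)));
            // Each child of BIGTAG is the root element of one original document.
            NodeList roots = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("/BIGTAG/*", doc, XPathConstants.NODESET);
            System.out.println(roots.getLength() + " documents found");
        }
    }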

Since you are not sure the declaration will always be present, you can delete all declarations (a regular expression such as <\?xml version.*?\?> can find them; note the non-greedy .*?, or the match will swallow everything between the first and last declaration), prepend <doc-collection> and append </doc-collection>, so that the resulting string is a valid XML document. From it you can extract the individual documents using (for example) the XPath query /doc-collection/*. If the combined file can be large enough for memory consumption to be a problem, you may need to use a streaming parser such as SAX, but the principle remains the same.

In a similar scenario I came across, I just read the concatenated document directly using an XML parser: although the concatenated file is not a valid XML document, it is a valid XML fragment (barring the repeated declarations). Once you strip the declarations, if your parser supports fragment parsing, you can simply read the result directly; all the top-level elements will then be the root elements of the concatenated documents.

In short, once you strip out all the declarations, you have a valid XML fragment that can be processed trivially, either directly or by wrapping it in some surrounding tag.
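If memory is a concern, a streaming variant of the same idea is possible with StAX from the JDK: stream a synthetic wrapper root around the raw bytes and copy out one top-level element at a time. This is a sketch, not the answerer's code; it assumes the declarations have already been stripped (or were never present), the file name is a placeholder, and some StAX implementations may need the reader advanced slightly differently between documents.

    import java.io.ByteArrayInputStream;
    import java.io.InputStream;
    import java.io.SequenceInputStream;
    import java.io.StringWriter;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Collections;
    import java.util.List;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stax.StAXSource;
    import javax.xml.transform.stream.StreamResult;

    public class FragmentSplitter {
        public static void main(String[] args) throws Exception {
            // Stream "<doc-collection>" + file contents + "</doc-collection>" without
            // reading the whole file into memory.
            InputStream wrapped = new SequenceInputStream(Collections.enumeration(List.of(
                    new ByteArrayInputStream("<doc-collection>".getBytes(StandardCharsets.UTF_8)),
                    Files.newInputStream(Path.of("concatenated-stripped.xml")),  // placeholder
                    new ByteArrayInputStream("</doc-collection>".getBytes(StandardCharsets.UTF_8)))));

            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(wrapped, "UTF-8");
            Transformer copier = TransformerFactory.newInstance().newTransformer();

            reader.nextTag();                                    // move inside <doc-collection>
            while (reader.nextTag() == XMLStreamConstants.START_ELEMENT) {
                // The identity transform serializes exactly one element subtree,
                // i.e. one of the original documents.
                StringWriter out = new StringWriter();
                copier.transform(new StAXSource(reader), new StreamResult(out));
                System.out.println("--- document ---\n" + out);
            }
        }
    }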


Here is my answer as a C# version. Very ugly code, but it works :-\

    public List<T> ParseMultipleDocumentsByType<T>(string documents)
    {
        var cleanParsedDocuments = new List<T>();
        var flag = true;
        while (flag)
        {
            if (documents.Contains(typeof(T).Name))
            {
                // Cut from the next declaration through the matching closing tag of T's root element.
                var startingPoint = documents.IndexOf("<?xml");
                var endingString = "</" + typeof(T).Name + ">";
                var endingPoint = documents.IndexOf(endingString) + endingString.Length;
                var document = documents.Substring(startingPoint, endingPoint - startingPoint);
                var singleDoc = (T)XmlDeserializeFromString(document, typeof(T));
                cleanParsedDocuments.Add(singleDoc);
                // Drop the parsed document from the input and keep going.
                documents = documents.Remove(startingPoint, endingPoint - startingPoint);
            }
            else
            {
                flag = false;
            }
        }
        return cleanParsedDocuments;
    }

    public static object XmlDeserializeFromString(string objectData, Type type)
    {
        var serializer = new XmlSerializer(type);
        object result;
        using (TextReader reader = new StringReader(objectData))
        {
            result = serializer.Deserialize(reader);
        }
        return result;
    }

I don't have a Java answer, but here is how I solved this problem with C #.

I created a class called XmlFileStreams that scans the source document for XML declarations and logically splits it into multiple documents:

    class XmlFileStreams
    {
        List<int> positions = new List<int>();
        byte[] bytes;

        public XmlFileStreams(string filename)
        {
            // Record the byte offset of every "<?xml" declaration in the file.
            bytes = File.ReadAllBytes(filename);
            for (int pos = 0; pos < bytes.Length - 5; ++pos)
                if (bytes[pos] == '<' && bytes[pos + 1] == '?' && bytes[pos + 2] == 'x' &&
                    bytes[pos + 3] == 'm' && bytes[pos + 4] == 'l')
                    positions.Add(pos);
            positions.Add(bytes.Length);
        }

        public IEnumerable<Stream> Streams
        {
            get
            {
                // Each stream covers one declaration up to the start of the next one.
                if (positions.Count > 1)
                    for (int i = 0; i < positions.Count - 1; ++i)
                        yield return new MemoryStream(bytes, positions[i], positions[i + 1] - positions[i]);
            }
        }
    }

To use XmlFileStreams:

    foreach (Stream stream in new XmlFileStreams(@"c:\tmp\test.xml").Streams)
    {
        using (var xr = XmlReader.Create(stream, new XmlReaderSettings() { XmlResolver = null, ProhibitDtd = false }))
        {
            // parse file using xr
        }
    }

There are a few caveats.

  • It reads the entire file into memory for processing. This can be a problem if the file is really large.
  • It uses a simple brute-force search to find the boundaries of each XML document.
