How to preserve whitespace before the document element when parsing XML with Java?

In my application, I modify some XML files that start as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- $Id: version control yadda-yadda $ -->

    <myElement>
    ...

Note the empty line before <myElement>. After loading, modifying and saving, the result is not what I want:

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- $Id: version control yadda-yadda $ --><myElement>
    ...

I found out that the whitespace (one newline) between the comment and the document element is not represented in the DOM at all. The following stand-alone code reliably reproduces the problem:

    import java.io.ByteArrayInputStream;
    import javax.xml.parsers.*;
    import org.w3c.dom.Document;
    import org.w3c.dom.ls.*;

    String source = "<?xml version=\"1.0\" encoding=\"UTF-16\"?>\n<!-- foo -->\n<empty/>";
    byte[] sourceBytes = source.getBytes("UTF-16");
    DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    // Parse from the in-memory bytes using the standard java.io stream class
    Document doc = builder.parse(new ByteArrayInputStream(sourceBytes));
    DOMImplementationLS domImplementation = (DOMImplementationLS) doc.getImplementation();
    LSSerializer lsSerializer = domImplementation.createLSSerializer();
    System.out.println(lsSerializer.writeToString(doc));
    // output: <?xml version="1.0" encoding="UTF-16"?>\n<!-- foo --><empty/>

Does anyone have an idea how to avoid this? Essentially, I want the output to be the same as the input. (I know that the XML declaration will be regenerated because it is not part of the DOM, but that is not a problem.)

+7
java dom xml parsing whitespace
5 answers

The root cause is that standard DOM Level 3 cannot represent text nodes as children of the Document node without violating the specification. Whitespace there will be dropped by any conforming parser.

 Document -- Element (maximum of one), ProcessingInstruction, Comment, DocumentType (maximum of one) 
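The constraint quoted above can be checked directly. This is a minimal sketch, assuming the JDK's built-in parser; the class name `DocumentChildren` and the helper `childNames` are made up for illustration. It lists the node names of the Document's direct children for an input that has newlines around the comment:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class DocumentChildren {

    // Returns the node names of the Document's direct children, comma-separated.
    static String childNames(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        StringBuilder names = new StringBuilder();
        for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (names.length() > 0) {
                names.append(',');
            }
            names.append(n.getNodeName());
        }
        return names.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?>\n<!-- foo -->\n\n<empty/>";
        System.out.println(childNames(xml));
    }
}
```

With the JDK's default parser this should print `#comment,empty`: no `#text` node shows up between the comment and the element, so there is nothing for a serializer to write back.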

If you need a standards-based solution, and the goal is readability rather than 100% reproduction, I would handle it in your output mechanism.
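One way to do that on the output side is to write the Document's top-level nodes one by one and put a newline between them. This is only a sketch under stated assumptions: the class `TopLevelNewlines` and its methods are hypothetical, comments and processing instructions are written by hand, DocumentType nodes are skipped, and it always emits exactly one newline between nodes, so it does not reproduce the original blank line.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Comment;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class TopLevelNewlines {

    // Serializes each direct child of the Document separately, joined by '\n'.
    static String serialize(Document doc) {
        DOMImplementationLS ls = (DOMImplementationLS) doc.getImplementation();
        LSSerializer serializer = ls.createLSSerializer();
        // Suppress the XML declaration; the caller can prepend one if needed.
        serializer.getDomConfig().setParameter("xml-declaration", Boolean.FALSE);

        StringBuilder out = new StringBuilder();
        for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
            String piece = null;
            if (n instanceof Comment) {
                piece = "<!--" + ((Comment) n).getData() + "-->";
            } else if (n instanceof ProcessingInstruction) {
                ProcessingInstruction pi = (ProcessingInstruction) n;
                piece = "<?" + pi.getTarget() + " " + pi.getData() + "?>";
            } else if (n instanceof Element) {
                piece = serializer.writeToString(n);
            }
            // Assumption: DocumentType nodes are not handled in this sketch.
            if (piece != null) {
                if (out.length() > 0) {
                    out.append('\n');
                }
                out.append(piece);
            }
        }
        return out.toString();
    }

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<?xml version=\"1.0\"?>\n<!-- foo -->\n<empty/>");
        System.out.println(serialize(doc));
    }
}
```

The comment and the document element then end up on separate lines, which covers the readability goal without pretending the DOM remembers the original whitespace.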

+2

I had the same problem. My solution was to write my own XML parser: DecentXML

Main feature: it can preserve 100% of the original input: whitespace, entities, everything. It won't bother you with the details, but if your code has to generate XML like this:

  <element attr="some complex value" /> 

then you can.

+6

Why do you want to avoid this?

Whitespace outside of tags/elements is defined as insignificant by the specification. It simply does not exist as far as the information represented by your DOM is concerned.

Therefore, once the DOM is serialized again, it will not be there.

If you are developing something that relies on this empty line... don't.

+3

In general, whitespace is considered insignificant in XML and thus is not preserved when parsing an XML file. Most libraries that output XML can produce nicely formatted, properly indented output, but the formatting will always be fairly generic. There is no way to say "put an extra line right here."
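As an illustration of that generic formatting, here is a small sketch using the standard `javax.xml.transform` API with `OutputKeys.INDENT`. The class name `GenericIndent` and the sample input are assumptions made up for this example, and the exact indentation depends on the JDK version, so no specific output is claimed.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
class GenericIndent {

    // Parses the input and writes it back with the serializer's own line breaks.
    static String indent(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter writer = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(writer));
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(indent("<a><b/><c/></a>"));
    }
}
```

The serializer decides where line breaks go based on element nesting alone; there is no switch for "one blank line before the root element".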

+1

I agree with Chris and Tomalak: the empty line is irrelevant from an XML point of view. If your application needs to produce an empty line in its output, I would suggest reconsidering that requirement.

In any case, if you still want this empty line to appear, I would suggest downloading the source code of the XML parser you are using and changing this behavior. But keep in mind that this is not standard XML behavior, and it will not be compatible with other applications.

0
