Best way to encode text data for XML in Java?

Question


Very similar to this question, except for Java.

What is the recommended way to encode strings for XML output in Java? The strings might contain characters like "&", "<", etc.

+82

java xml encoding

Epaga Jan 13 '09 at 15:15


20 answers

As already mentioned, using an XML library is the easiest way. If you do want to escape the text yourself, you can look at StringEscapeUtils from the Apache Commons Lang library.

+112

Fabian Steeg Jan 13 '09 at 15:53


Just use a CDATA section:

 <![CDATA[ your text here ]]>

This will allow you to use any characters except the closing sequence ]]>.

That way you can include characters that would otherwise be illegal, such as & and >. For example:

 <element><![CDATA[ characters such as & and > are allowed ]]></element>

However, attributes must be escaped because you cannot use CDATA blocks for them.
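A minimal sketch of both cases, assuming the XML is assembled as plain strings. The class and method names (CdataSketch, toCdata, escapeAttribute) are made up for illustration, and the handling of a literal ]]> (closing the section and reopening it) is a common workaround rather than something this answer describes.

```java
final class CdataSketch {

    // Wraps text in a CDATA section. "]]>" cannot appear inside CDATA,
    // so each occurrence is split across two adjacent sections.
    static String toCdata(String text) {
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    // Attribute values cannot use CDATA. Inside a double-quoted value the
    // characters that must be escaped are '&', '<' and the quote itself.
    // The ampersand goes first so the other entities are not escaped again.
    static String escapeAttribute(String value) {
        return value.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        System.out.println("<element title=\"" + escapeAttribute("A & \"B\"") + "\">"
                + toCdata("x < y && y > z") + "</element>");
    }
}
```

Note that CDATA only removes the need to escape markup characters; characters that XML 1.0 forbids outright (most control characters) are still not allowed inside it.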

+18

ng. Jan 13 '09 at 15:48


Try the following:

    String xmlEscapeText(String t) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < t.length(); i++) {
            char c = t.charAt(i);
            switch (c) {
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '\"': sb.append("&quot;"); break;
                case '&':  sb.append("&amp;");  break;
                case '\'': sb.append("&apos;"); break;
                default:
                    if (c > 0x7e) {
                        // everything above printable ASCII becomes a numeric reference
                        sb.append("&#" + ((int) c) + ";");
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

+14

Pointer Null Apr 05 2018-12-12T00:


This worked for me to provide an escaped version of the text string:

    public class XMLHelper {

        /**
         * Returns the string where all non-ascii and <, &, > are encoded as numeric entities. I.e. "&lt;A &amp; B &gt;"
         * .... (insert result here). The result is safe to include anywhere in a text field in an XML-string. If there
         * were no characters to protect, the original string is returned.
         *
         * @param originalUnprotectedString
         *            original string which may contain characters either reserved in XML or with different representation
         *            in different encodings (like 8859-1 and UTF-8)
         * @return
         */
        public static String protectSpecialCharacters(String originalUnprotectedString) {
            if (originalUnprotectedString == null) {
                return null;
            }
            boolean anyCharactersProtected = false;

            StringBuffer stringBuffer = new StringBuffer();
            for (int i = 0; i < originalUnprotectedString.length(); i++) {
                char ch = originalUnprotectedString.charAt(i);

                boolean controlCharacter = ch < 32;
                boolean unicodeButNotAscii = ch > 126;
                boolean characterWithSpecialMeaningInXML = ch == '<' || ch == '&' || ch == '>';

                if (characterWithSpecialMeaningInXML || unicodeButNotAscii || controlCharacter) {
                    stringBuffer.append("&#" + (int) ch + ";");
                    anyCharactersProtected = true;
                } else {
                    stringBuffer.append(ch);
                }
            }
            if (anyCharactersProtected == false) {
                return originalUnprotectedString;
            }

            return stringBuffer.toString();
        }
    }

+13

Thorbjørn Ravn Andersen Jan 13 '09 at 19:00


StringEscapeUtils.escapeXml() does not escape control characters (< 0x20). XML 1.1 allows escaped control characters; XML 1.0 does not. For example, XStream.toXML() will happily serialize a Java object's control characters into XML, which an XML 1.0 parser will reject.

To escape control characters with Apache commons-lang, use

 NumericEntityEscaper.below(0x20).translate(StringEscapeUtils.escapeXml(str))
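For reference, here is a JDK-only sketch of what "legal in XML 1.0" means, following the Char production of the XML 1.0 specification (tab, line feed, carriage return, U+0020 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF). The class and method names are hypothetical, and dropping the offending characters is only one possible policy; it is not what NumericEntityEscaper does.

```java
final class Xml10Chars {

    // True if the code point matches the Char production of XML 1.0.
    static boolean isLegal(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Removes every code point that an XML 1.0 parser rejects,
    // whether it is written literally or as a character reference.
    static String stripIllegal(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().filter(Xml10Chars::isLegal).forEach(sb::appendCodePoint);
        return sb.toString();
    }
}
```

Running a string through stripIllegal before escaping it keeps an XML 1.0 parser from rejecting the document because of a stray control character.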

+7

Steve Mitchell Aug 31 '12 at 1:30


While idealism says to use an XML library, IMHO, if you have a basic idea of XML, then common sense and performance say to template it all the way. It is arguably more readable too. Using a library's escaping routines is probably still a good idea, though.

Consider this: XML was meant to be written by humans.

Use libraries to generate XML when having your XML as an "object" models your problem better. For example, if pluggable modules take part in building this XML.

Edit: as for how to actually escape XML in templates, CDATA and escapeXml(string) from JSTL are two good solutions. escapeXml(string) can be used like this:

    <%@taglib prefix="fn" uri="http://java.sun.com/jsp/jstl/functions"%>
    <item>${fn:escapeXml(value)}</item>

+6

Amr Mostafa May 19, '10 at 7:00 a.m.


The behavior of StringEscapeUtils.escapeXml() changed between Commons Lang 2.5 and 3.0. It no longer escapes Unicode characters greater than 0x7f.

This is a good thing; the old method was a bit too eager to escape characters that could simply be inserted into a UTF-8 document.

The new escapers that will be included in Google Guava 11.0 also seem promising: http://code.google.com/p/guava-libraries/issues/detail?id=799

+6

Jasper Krijgsman Dec 01 '11 at 17:42


    public String escapeXml(String s) {
        return s.replaceAll("&", "&amp;")
                .replaceAll(">", "&gt;")
                .replaceAll("<", "&lt;")
                .replaceAll("\"", "&quot;")
                .replaceAll("'", "&apos;");
    }
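One detail in the chain above is easy to miss: the ampersand has to be replaced first, otherwise the ampersands introduced by the other entities get escaped a second time. The following sketch (class and method names are illustrative) shows the difference, using String.replace, which works on literal text rather than regular expressions.

```java
final class ReplaceOrderSketch {

    // Correct: '&' is handled before any entity is introduced.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    // Wrong order, shown for contrast: "<" becomes "&lt;" and then "&amp;lt;".
    static String escapeWrongOrder(String s) {
        return s.replace("<", "&lt;").replace("&", "&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escape("a < b & c"));
        System.out.println(escapeWrongOrder("a < b & c"));
    }
}
```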

+6

iCrazybest Sep 16 '14 at 9:56 on


Note: your question is about escaping, not encoding. Escaping means using &lt;, etc., so the parser can distinguish between "this is an XML command" and "this is some text". Encoding is the thing you specify in the XML header (UTF-8, ISO-8859-1, etc.).
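To make the distinction concrete, here is a small JDK-only sketch (class and method names are illustrative): escaping changes which characters are in the string, while encoding decides how those characters become bytes.

```java
import java.nio.charset.StandardCharsets;

final class EscapeVsEncode {

    // Escaping: markup characters are replaced by entity references.
    static String escapeText(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // \u00fc and \u00f6 are the umlauts u and o
        String escaped = escapeText("M\u00fcller & S\u00f6hne");

        // Encoding: the same escaped string yields different bytes per charset.
        int utf8Length = escaped.getBytes(StandardCharsets.UTF_8).length;
        int latin1Length = escaped.getBytes(StandardCharsets.ISO_8859_1).length;
        System.out.println(escaped.length() + " chars, " + utf8Length
                + " bytes as UTF-8, " + latin1Length + " bytes as ISO-8859-1");
    }
}
```

The escaped text is identical in both cases; only the byte representation differs, and it has to match the encoding declared in the XML header.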

First of all, as everyone else said, use an XML library. XML looks simple, but the encoding + escaping stuff is dark voodoo (which you'll notice as soon as you run into umlauts, Japanese, and other weird stuff like "full width digits" (&#xFF11; is a full-width 1)). Keeping XML human-readable is a Sisyphean task.

I suggest never trying to be clever about encoding and escaping text in XML. But don't let that stop you from trying; just remember me when it bites you (and it will).

That said, if you use only UTF-8, you can consider this strategy to make things more readable:

If the text contains '<', '>' or '&', wrap it in <![CDATA[ ... ]]>
If the text does not contain these three characters, do not wrap it.

I use this in an SQL editor, and it allows developers to cut and paste SQL from a third-party SQL tool into XML, without worrying about escaping. This works because SQL cannot contain umlauts in our case, so I'm safe.
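The strategy above could be sketched as follows. The names are hypothetical, and the handling of a literal ]]> (closing and reopening the section) is an addition that the answer itself does not mention.

```java
final class CdataIfNeeded {

    // Returns the text unchanged unless it contains '<', '>' or '&',
    // in which case it is wrapped in a CDATA section.
    static String wrap(String text) {
        boolean needsWrapping = text.indexOf('<') >= 0
                || text.indexOf('>') >= 0
                || text.indexOf('&') >= 0;
        if (!needsWrapping) {
            return text;
        }
        // Assumption: "]]>" inside the text is split over two sections,
        // because it would otherwise end the CDATA section early.
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    public static void main(String[] args) {
        System.out.println(wrap("SELECT name FROM t"));
        System.out.println(wrap("SELECT name FROM t WHERE a < 5 AND b > 3"));
    }
}
```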

+5

Aaron Digulla Jan 13 '09 at 16:11


This question is eight years old and still has no fully correct answer! No, you should not have to import an entire third-party API to do this simple task. Bad advice.

The following method will:

correctly handle characters outside the Basic Multilingual Plane
escape the characters that XML requires to be escaped
escape any non-ASCII characters, which is optional but common
replace characters that are illegal in XML 1.0 with the Unicode replacement character. There is no best option here; removing them is just as valid.

I tried to optimize for the most common case, while still ensuring that you could pipe /dev/random through it and get a valid string in XML.

    public static String encodeXML(CharSequence s) {
        StringBuilder sb = new StringBuilder();
        int len = s.length();
        for (int i = 0; i < len; i++) {
            int c = s.charAt(i);
            if (c >= 0xd800 && c <= 0xdbff && i + 1 < len) {
                c = ((c - 0xd7c0) << 10) | (s.charAt(++i) & 0x3ff);    // UTF16 decode
            }
            if (c < 0x80) {      // ASCII range: test most common case first
                if (c < 0x20 && (c != '\t' && c != '\r' && c != '\n')) {
                    // Illegal XML character, even encoded. Skip or substitute
                    sb.append("&#xfffd;");   // Unicode replacement character
                } else {
                    switch (c) {
                        case '&':  sb.append("&amp;"); break;
                        case '>':  sb.append("&gt;"); break;
                        case '<':  sb.append("&lt;"); break;
                        // Uncomment next two if encoding for an XML attribute
                        // case '\'': sb.append("&apos;"); break;
                        // case '\"': sb.append("&quot;"); break;
                        // Uncomment next three if you prefer, but not required
                        // case '\n': sb.append("&#10;"); break;
                        // case '\r': sb.append("&#13;"); break;
                        // case '\t': sb.append("&#9;"); break;
                        default:   sb.append((char) c);
                    }
                }
            } else if ((c >= 0xd800 && c <= 0xdfff) || c == 0xfffe || c == 0xffff) {
                // Illegal XML character, even encoded. Skip or substitute
                sb.append("&#xfffd;");   // Unicode replacement character
            } else {
                sb.append("&#x");
                sb.append(Integer.toHexString(c));
                sb.append(';');
            }
        }
        return sb.toString();
    }

Edit: for those who continue to insist that it is foolish to write your own code for this when there are perfectly good Java APIs for dealing with XML, you may be interested to know that the StAX API included with Oracle Java 8 (I have not tested others) fails to encode CDATA content correctly: it does not escape ]]> sequences in the content. A third-party library, even one that is part of the Java core, is not always the best option.

+5

Mike B 02 Feb. '18 at 17:36


Although I agree with Jon Skeet in principle, sometimes I don't have the option of using an external XML library. And I believe that the two functions to escape/unescape a simple value (an attribute or tag, not a full document) are not available in the standard XML libraries included with Java.

As a result, and based on the various answers I have seen here and elsewhere, here is the solution I ended up creating (nothing worked as a simple copy/paste):

    public final static String ESCAPE_CHARS = "<>&\"\'";
    public final static List<String> ESCAPE_STRINGS = Collections.unmodifiableList(Arrays.asList(new String[] {
        "&lt;", "&gt;", "&amp;", "&quot;", "&apos;"
    }));

    private static String UNICODE_LOW = "" + ((char) 0x20); // space
    private static String UNICODE_HIGH = "" + ((char) 0x7f);

    // should only use for the content of an attribute or tag
    public static String toEscaped(String content) {
        String result = content;

        if ((content != null) && (content.length() > 0)) {
            boolean modified = false;
            StringBuilder stringBuilder = new StringBuilder(content.length());
            for (int i = 0, count = content.length(); i < count; ++i) {
                String character = content.substring(i, i + 1);
                int pos = ESCAPE_CHARS.indexOf(character);
                if (pos > -1) {
                    stringBuilder.append(ESCAPE_STRINGS.get(pos));
                    modified = true;
                } else {
                    if ((character.compareTo(UNICODE_LOW) > -1)
                            && (character.compareTo(UNICODE_HIGH) < 1)) {
                        stringBuilder.append(character);
                    } else {
                        stringBuilder.append("&#" + ((int) character.charAt(0)) + ";");
                        modified = true;
                    }
                }
            }
            if (modified) {
                result = stringBuilder.toString();
            }
        }

        return result;
    }

The above accommodates several different things:

avoids char-based logic until it absolutely has to, which improves Unicode compatibility
attempts to be as efficient as possible, given that the second "if" condition is likely the most used path
is a pure function, i.e. thread-safe
works well with the garbage collector by only returning the contents of the StringBuilder if something actually changed; otherwise the original string is returned

At some point I will write the inverse of this function, toUnescaped(). I just don't have time to do it today. When I do, I will come back and update this answer with the code. :)
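As an illustration of what such an inverse could look like, here is a sketch. It is not the author's code, and the class name is made up. It assumes the input was produced by an escaper like the one above, and it handles the five predefined entities plus decimal and hexadecimal character references; anything it does not recognize is copied through unchanged.

```java
final class XmlUnescapeSketch {

    static String toUnescaped(String content) {
        StringBuilder out = new StringBuilder(content.length());
        int i = 0;
        while (i < content.length()) {
            char c = content.charAt(i);
            // A reference runs from '&' to the next ';'
            int end = (c == '&') ? content.indexOf(';', i) : -1;
            String decoded = (end > i) ? decode(content.substring(i + 1, end)) : null;
            if (decoded != null) {
                out.append(decoded);
                i = end + 1;          // skip past the whole reference
            } else {
                out.append(c);        // ordinary character or unknown reference
                i++;
            }
        }
        return out.toString();
    }

    // Maps the name between '&' and ';' to its text, or null if unrecognized.
    private static String decode(String name) {
        switch (name) {
            case "lt":   return "<";
            case "gt":   return ">";
            case "amp":  return "&";
            case "quot": return "\"";
            case "apos": return "'";
            default:     break;
        }
        try {
            if (name.startsWith("#x")) {
                return new String(Character.toChars(Integer.parseInt(name.substring(2), 16)));
            }
            if (name.startsWith("#")) {
                return new String(Character.toChars(Integer.parseInt(name.substring(1))));
            }
        } catch (IllegalArgumentException e) {
            // Malformed number or invalid code point: treat as unrecognized.
        }
        return null;
    }
}
```

Because the scan moves strictly left to right and never re-reads its own output, a doubly escaped input such as &amp;lt; comes back as &lt; rather than as <.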

+4

chaotic3quilibrium Dec 19 '13 at 23:09


For those looking for the quickest solution to write: use the methods from Apache commons-lang:

StringEscapeUtils.escapeXml10() for xml 1.0
StringEscapeUtils.escapeXml11() for xml 1.1
StringEscapeUtils.escapeXml() now deprecated, but used commonly in the past

Remember to include the dependency:

    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
        <version>3.5</version> <!-- check current version! -->
    </dependency>

+4

Dariusz Mar 27 '17 at 13:16


To escape XML characters, the easiest way is to use the Apache Commons Lang project, whose JAR can be downloaded from: http://commons.apache.org/lang/

The class is this: org.apache.commons.lang3.StringEscapeUtils;

It has a method called "escapeXml" that will return the corresponding escaped string.

+3

Greg Burdett Aug 31 '11 at


Here's a simple solution, and it works great for encoding accented characters too!

    String in = "Hi Lârry & Môe!";

    StringBuilder out = new StringBuilder();
    for (int i = 0; i < in.length(); i++) {
        char c = in.charAt(i);
        if (c < 31 || c > 126 || "<>\"'\\&".indexOf(c) >= 0) {
            out.append("&#" + (int) c + ";");
        } else {
            out.append(c);
        }
    }

    System.out.printf("%s%n", out);

Outputs

 Hi L&#226;rry &#38; M&#244;e!

+1

Mike Oct 26


Use JAXP and forget about text handling; it will be done for you automatically.
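A minimal sketch of that approach, using only JAXP classes that ship with the JDK (DOM plus an identity Transformer). The class name, the element name "item" and the attribute name "name" are assumptions made for the example.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class JaxpEscapeSketch {

    // Builds <item name="...">...</item> and lets the serializer do the escaping.
    static String toXml(String attributeValue, String text) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element item = doc.createElement("item");
            item.setAttribute("name", attributeValue);   // raw, unescaped value
            item.setTextContent(text);                   // raw, unescaped text
            doc.appendChild(item);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Checked JAXP exceptions are wrapped to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml("a & b", "x < y & z"));
    }
}
```

The calling code never sees an entity; it passes raw strings and the serializer decides what needs escaping in attribute and text context.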

0

Fernando Miguélez Jan 13 '09 at 15:18


Try encoding the XML using the Apache XML serializer:

    // Serialize DOM
    OutputFormat format = new OutputFormat(doc);
    // as a String
    StringWriter stringOut = new StringWriter();
    XMLSerializer serial = new XMLSerializer(stringOut, format);
    serial.serialize(doc);
    // Display the XML
    System.out.println(stringOut.toString());

0

K Victor Rajan Apr 09 '14 at 9:53 on


You can use the Enterprise Security API (ESAPI) library, which provides methods such as encodeForXML and encodeForXMLAttribute. Take a look at the documentation of the Encoder interface; it also contains examples of how to create an instance of DefaultEncoder.

0

Vivit Mar 07 '18 at 9:13


If you are looking for a library to get the job done, try:

Guava 26.0 (documented here):
    return XmlEscapers.xmlContentEscaper().escape(text);
    Note: there is also xmlAttributeEscaper()
Apache Commons Text 1.4 (documented here):
    StringEscapeUtils.escapeXml11(text)
    Note: there is also an escapeXml10() method

0

jschnasse Sep 17 '18 at 9:46


Just replace

    & with &amp;

Do this one first, so that the ampersands of the other entities are not escaped again. And for the other characters:

    > with &gt;
    < with &lt;
    " with &quot;
    ' with &apos;

-1

raman rayat Aug 17 '18 at 7:33


Jon Skeet · Accepted Answer · 2009-01-13 15:18

Very simply: use an XML library. That way it will actually be right, instead of requiring detailed knowledge of bits of the XML spec.
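As one example of what that can look like with nothing but the JDK, here is a sketch using the StAX XMLStreamWriter. The class name, the element name "item" and the attribute name "id" are assumptions for the example, not part of this answer.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class StaxEscapeSketch {

    // Writes <item id="...">...</item>; the writer escapes both values.
    static String item(String id, String text) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartElement("item");
            writer.writeAttribute("id", id);   // attribute value is escaped by the writer
            writer.writeCharacters(text);      // so is character data
            writer.writeEndElement();
            writer.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(item("1", "a < b & c"));
    }
}
```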
