How to check if an XML text node has Chinese characters with RegEx in XSLT

On http://gskinner.com/RegExr/ (a regex testing site), this regular expression works:

Match: [^\x00-\xff]
Sample text: test123 或元件数据不可用

But if I have this input XML:

    <?xml version="1.0" encoding="UTF-8" ?>
    <root>
      <node>test123 或元件数据不可用</node>
    </root>

and I try this XSLT 2.0 stylesheet with Saxon 9:

    <?xml version="1.0" encoding="UTF-8" ?>
    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/root/node">
        <xsl:if test="matches(., '[^\x00-\xff]')">
          <xsl:text>Text has chinese characters!</xsl:text>
        </xsl:if>
      </xsl:template>
    </xsl:stylesheet>

Saxon 9 gives me the following error output:

    FORX0002: Error at character 3 in regular expression "[^\x00-\xff]": invalid escape sequence
    Failed to compile stylesheet. 1 error detected.

How can I check for Chinese characters in XSLT 2.0?

2 answers

With the help of Michael Kay, I can answer my own question. Thanks Michael! The solution works, but in my opinion, these long Unicode ranges do not look very pretty.

This XSLT prints a message if the regular expression finds any Chinese character in the XML above:

    <?xml version="1.0" encoding="UTF-8" ?>
    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/root/node">
        <xsl:if test="matches(.,'[&#x4E00;-&#x9FFF;&#x3400;-&#x4DFF;&#x20000;-&#x2A6DF;&#xF900;-&#xFAFF;&#x2F800;-&#x2FA1F;]')">
          <xsl:text>Text has chinese characters!</xsl:text>
        </xsl:if>
      </xsl:template>
    </xsl:stylesheet>

Solution with a Unicode named block:

    <?xml version="1.0" encoding="UTF-8" ?>
    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/root/node">
        <xsl:if test="matches(., '[\p{IsCJKUnifiedIdeographs}\p{IsCJKUnifiedIdeographsExtensionA}\p{IsCJKUnifiedIdeographsExtensionB}\p{IsCJKCompatibilityIdeographs}\p{IsCJKCompatibilityIdeographsSupplement}]')">
          <xsl:text>Text has chinese characters!</xsl:text>
        </xsl:if>
      </xsl:template>
    </xsl:stylesheet>

The regex dialect supported by XPath is based on the one defined in XSD: you can find the full specification in the W3C docs or, if you prefer something more readable, in my XSLT 2.0 Programming Reference. Do not assume that all regular expression dialects are the same. There is no \x escape in XPath regexes because the dialect is designed to be embedded in XML, which already offers &#xHHHH; character references.
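To illustrate the point about character references, here is a minimal sketch of how the original `[^\x00-\xff]` test might be written in the XPath dialect. It assumes XML 1.0 input, where U+0009 (tab) is the lowest character allowed, so the range starts at `&#x9;` instead of U+0000, which cannot appear in an XML 1.0 document even as a reference. Note that this pattern matches any character above U+00FF, not only Chinese ones, so the message text has been changed accordingly.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="/root/node">
    <!-- The XML parser resolves the two character references before the
         regex engine sees the pattern, so the class is "not TAB through U+00FF". -->
    <xsl:if test="matches(., '[^&#x9;-&#xFF;]')">
      <xsl:text>Text has characters above U+00FF!</xsl:text>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

This is only a literal translation of the RegExr pattern; a test aimed at Chinese text specifically would use the ideograph ranges or named blocks instead.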

Instead of spelling out the numeric ranges, you may find it more convenient to use a named Unicode block, for example \p{IsCJKUnifiedIdeographs}.
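If the check is needed in more than one place, the named-block test could be wrapped in a stylesheet function. The following is a sketch only: the function name `my:has-cjk` and the namespace URI `urn:example:cjk` are made up for illustration, and the pattern covers just the main CJK Unified Ideographs block (U+4E00 to U+9FFF).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="urn:example:cjk"
    exclude-result-prefixes="xs my">

  <!-- Hypothetical helper: returns true if $text contains at least one
       character from the CJK Unified Ideographs block. -->
  <xsl:function name="my:has-cjk" as="xs:boolean">
    <xsl:param name="text" as="xs:string?"/>
    <!-- matches() treats an empty sequence as a zero-length string. -->
    <xsl:sequence select="matches($text, '\p{IsCJKUnifiedIdeographs}')"/>
  </xsl:function>

  <xsl:template match="/root/node">
    <xsl:if test="my:has-cjk(.)">
      <xsl:text>Text has chinese characters!</xsl:text>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

To cover the extension and compatibility blocks as well, the pattern can be widened to a character class listing several `\p{Is...}` blocks.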

See also: What is the full range for Chinese characters in Unicode?
