C# regular expressions with \Uxxxxxxxx characters in a pattern

Regex.IsMatch( "foo", "[\U00010000-\U0010FFFF]" ) 

Throws: System.ArgumentException: parsing "[-]" - [x-y] range in reverse order.

Looking at the hexadecimal values for \U00010000 and \U0010FFFF, I get 0xd800 0xdc00 for the first character and 0xdbff 0xdfff for the second.

So I really have just one question: why are Unicode characters written with \U split into two chars in the string?

c# regex unicode astral-plane
3 answers

They are surrogate pairs. Look at the values: they exceed 65535, and a char is only a 16-bit value. How would you express 65536 in just 16 bits?
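As a rough illustration of where the values 0xd800 0xdc00 and 0xdbff 0xdfff come from, here is a minimal sketch of the standard UTF-16 surrogate arithmetic. The class and method names (SurrogateDemo, ToSurrogates) are made up for this example and are not part of .NET; the built-in char.ConvertFromUtf32 does the same job.

```csharp
using System;

public static class SurrogateDemo
{
    // Hypothetical helper: splits a code point above U+FFFF into the two
    // 16-bit UTF-16 code units (high and low surrogate) that encode it.
    public static (char High, char Low) ToSurrogates(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int offset = codePoint - 0x10000;              // 20-bit value
        char high = (char)(0xD800 + (offset >> 10));   // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return (high, low);
    }
}
```

With this, ToSurrogates(0x10000) gives (0xD800, 0xDC00) and ToSurrogates(0x10FFFF) gives (0xDBFF, 0xDFFF), which is why "\U00010000".Length is 2 in C#.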

Unfortunately, it's not clear from the documentation how (or whether) the regular expression engine in .NET handles characters outside the Basic Multilingual Plane. (The \uxxxx pattern in the regular expression documentation covers only 0-65535, just like \uxxxx as a C# escape sequence.)

Is your real regex big, or are you just trying to see whether there are any non-BMP characters in there?


To work around this in the .NET regex engine, I use the following trick: "[\U00010000-\U0010FFFF]" is replaced by

 [\uD800-\uDBFF][\uDC00-\uDFFF]

The idea is that, since .NET regular expressions operate on UTF-16 code units rather than code points, we give the engine the surrogate ranges as if they were ordinary characters. You can also specify narrower ranges by handling the edges separately. For example, [\U00011DEF-\U00013E07] is equivalent to

 (?:\uD807[\uDDEF-\uDFFF])|(?:[\uD808-\uD80E][\uDC00-\uDFFF])|(?:\uD80F[\uDC00-\uDE07])

It's harder to read and work with, and it's less flexible, but it still serves as a workaround.
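To check the narrower range against its boundaries, something like the following sketch could be used. The class and member names (AstralRangeDemo, AnyAstral, Narrow, InNarrowRange) are hypothetical; only Regex and char.ConvertFromUtf32 come from .NET. The patterns are anchored with ^ and \z here so that they test a whole string, which is an addition to the patterns in the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AstralRangeDemo
{
    // Any code point U+10000..U+10FFFF: a high surrogate followed by a low surrogate.
    public static readonly Regex AnyAstral =
        new Regex(@"^(?:[\uD800-\uDBFF][\uDC00-\uDFFF])\z");

    // U+11DEF..U+13E07, split into three pieces by high surrogate:
    //   D807 with the upper part of the low range,
    //   D808..D80E with the full low range,
    //   D80F with the lower part of the low range.
    public static readonly Regex Narrow = new Regex(
        @"^(?:\uD807[\uDDEF-\uDFFF]|[\uD808-\uD80E][\uDC00-\uDFFF]|\uD80F[\uDC00-\uDE07])\z");

    // Hypothetical helper: true if the code point lies in U+11DEF..U+13E07.
    public static bool InNarrowRange(int codePoint) =>
        Narrow.IsMatch(char.ConvertFromUtf32(codePoint));
}
```

Calling InNarrowRange on 0x11DEF and 0x13E07 should succeed, while 0x11DEE and 0x13E08 should fail, which is a quick way to confirm that the edge ranges were derived correctly.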


@Jon Skeet

So are you telling me that there is no way to use the Regex tools in .NET to match characters outside the 16-bit range?

Full regex:

 ^(\u0009|[\u0020-\u007E]|\u0085|[\u00A0-\uD7FF]|[\uE000-\uFFFD]|[\U00010000-\U0010FFFF])+$ 

I am trying to check whether a string contains only what the YAML specification defines as printable Unicode characters.
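One possible way to make the full regex work is to swap the final \U range for an explicit surrogate pair, as in the sketch below. The class and method names (YamlPrintable, IsPrintable) are invented for illustration. It keeps the character set from the regex above unchanged, but ends with \z instead of $, on the assumption that a trailing newline should be rejected: in .NET, $ also matches just before a final newline.

```csharp
using System;
using System.Text.RegularExpressions;

public static class YamlPrintable
{
    // Same alternatives as the regex in the question, except that
    // [\U00010000-\U0010FFFF] is written as high surrogate + low surrogate.
    private static readonly Regex Printable = new Regex(
        @"^(?:\u0009|[\u0020-\u007E]|\u0085|[\u00A0-\uD7FF]|[\uE000-\uFFFD]" +
        @"|[\uD800-\uDBFF][\uDC00-\uDFFF])+\z");

    // Hypothetical helper: true if every character of s is in the allowed set.
    // Unpaired surrogates are rejected because no alternative matches them alone.
    public static bool IsPrintable(string s) => Printable.IsMatch(s);
}
```

For example, IsPrintable("foo") and IsPrintable(char.ConvertFromUtf32(0x1F600)) should both be true, while a string containing \u0001 or a lone \uD800 should be rejected.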
