How to check if an instance of CharSequence is a sequence of scalar Unicode values?

Question

I have an instance of java.lang.CharSequence. I need to determine whether this instance is a sequence of Unicode scalar values (that is, whether it is in the UTF-16 encoding form). Despite the assurances of java.lang.String, a Java string is not required to be in UTF-16 (at least not according to the latest Unicode specification, currently 6.2), because it may contain isolated surrogate code units. (A Java string is, however, a Unicode 16-bit string.)
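To illustrate the point about isolated surrogates, here is a minimal sketch. The class name LoneSurrogateDemo and the helper roundTrip are hypothetical, and UTF-8 is chosen only as a convenient encoding for the round trip.

```java
import java.nio.charset.StandardCharsets;

final class LoneSurrogateDemo {

    // Hypothetical helper: encode to UTF-8 and decode again.
    // String.getBytes(Charset) replaces malformed input instead of failing,
    // so an isolated surrogate does not survive the round trip.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A high surrogate with no low surrogate after it.
        String lone = "a\uD800b";
        System.out.println(lone.length());                        // 3
        System.out.println(Character.isSurrogate(lone.charAt(1))); // true
        System.out.println(roundTrip(lone).equals(lone));          // false
    }
}
```

The string is accepted without complaint, which is why a separate well-formedness check is needed.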

There are several obvious ways to do this, including:

Iterate over the code points of the sequence, explicitly checking that each one is a Unicode scalar value.
Use a regular expression to search for isolated surrogate code points.
Run the character sequence through a character encoder that reports encoding errors.
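As a rough sketch of the first and third approaches (the class and method names are hypothetical, and UTF-8 is an arbitrary choice of target charset; any Unicode encoding whose encoder rejects unpaired surrogates should behave the same way):

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class ScalarValueChecks {

    // Approach 1: walk the sequence by code point. Character.codePointAt
    // combines a valid surrogate pair into one supplementary code point, so
    // any code point still in U+D800..U+DFFF came from an unpaired code unit.
    static boolean byCodePoints(CharSequence cs) {
        int i = 0;
        while (i < cs.length()) {
            int cp = Character.codePointAt(cs, i);
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    // Approach 3: encode with an encoder configured to report errors
    // rather than substitute a replacement character.
    static boolean byEncoder(CharSequence cs) {
        try {
            StandardCharsets.UTF_8.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(cs));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(byCodePoints("a\uD83D\uDE00b")); // valid pair
        System.out.println(byCodePoints("a\uD800b"));       // isolated high surrogate
        System.out.println(byEncoder("a\uD83D\uDE00b"));
        System.out.println(byEncoder("a\uD800b"));
    }
}
```

The encoder version allocates an output buffer only to discard it, so the code point loop is probably the cheaper of the two.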

It seems like something like this should already exist as a library function. I just can't find it in the standard API. Did I miss it or do I need to implement it?

+4

java unicode utf-16 charsequence surrogate-pairs

Nathan Ryan Apr 4 '13 at 10:41

1 answer

Evgeniy Dorofeev · Accepted Answer · 2013-04-04T11:05:57+0000

Try this function:

    static boolean isValidUTF16(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLowSurrogate(s.charAt(i))
                    && (i == 0 || !Character.isHighSurrogate(s.charAt(i - 1)))
                    || Character.isHighSurrogate(s.charAt(i))
                    && (i == s.length() - 1 || !Character.isLowSurrogate(s.charAt(i + 1)))) {
                return false;
            }
        }
        return true;
    }

Here is a test. It prints false for the reversed pair and true for the well-formed pair:

    public static void main(String args[]) {
        System.out.println(isValidUTF16("\uDC00\uDBFF"));
        System.out.println(isValidUTF16("\uDBFF\uDC00"));
    }
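The function above takes a String, while the question asks about CharSequence. Since only length() and charAt() are used, the same check can be written against the interface. The following is a sketch under that assumption; the class name WellFormedUtf16 and the method name isWellFormed are hypothetical, and the loop is restructured to skip over each valid pair instead of looking backwards.

```java
final class WellFormedUtf16 {

    // Returns true if every surrogate code unit in cs is part of a
    // high-then-low pair, i.e. cs contains no isolated surrogates.
    static boolean isWellFormed(CharSequence cs) {
        int n = cs.length();
        for (int i = 0; i < n; i++) {
            char c = cs.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // A high surrogate must be followed immediately by a low one.
                if (i + 1 == n || !Character.isLowSurrogate(cs.charAt(i + 1))) {
                    return false;
                }
                i++; // skip the low surrogate just checked
            } else if (Character.isLowSurrogate(c)) {
                // A low surrogate reached here was not preceded by a high one.
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("\uDC00\uDBFF"));                       // false
        System.out.println(isWellFormed("\uDBFF\uDC00"));                       // true
        System.out.println(isWellFormed(new StringBuilder("a\uD83D\uDE00b"))); // true
    }
}
```

Because the parameter is a CharSequence, the check also works on StringBuilder, CharBuffer and other implementations without first copying them to a String.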
