Regexp check if code contains non-UTF-8 characters?

Question


I use PMD, Checkstyle, FindBugs, etc. in Sonar. I would like a rule that verifies that the Java code contains no characters that are not valid UTF-8.

For example, a character that is not valid UTF-8 must not be allowed.

I could not find such a rule in the plugins above, but I assume a custom rule can be written in Sonar.

+7

java regex utf-8 sonarqube

user1340582 Oct 29 '12 at 6:12


1 answer

kshepherd · Accepted Answer · 2012-10-31T17:45:54+0000

Here is a regular expression that will only match valid UTF-8 byte sequences:

/^([\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|[\xEE-\xEF][\x80-\xBF]{2}|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})*$/

I derived it from RFC 3629, "UTF-8, a transformation format of ISO 10646", section 4 (Syntax of UTF-8 Byte Sequences).
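Since the question is about Java sources, the same grammar can be applied to raw file bytes from Java. The following is only a minimal sketch, not a Sonar rule: the class name `Utf8RegexCheck` and the method `firstInvalidOffset` are hypothetical. It assumes the bytes are first decoded as ISO-8859-1, so that every byte becomes exactly one `char` that `java.util.regex` can match. It also matches one sequence at a time instead of wrapping the whole expression in `(...)*`, partly because repeated alternations can recurse deeply in `java.util.regex` on long inputs, and partly because this yields the offset of the first offending byte.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Sonar, PMD, or Checkstyle.
class Utf8RegexCheck {
    // Exactly one well-formed UTF-8 sequence, following RFC 3629 section 4.
    private static final Pattern ONE_SEQUENCE = Pattern.compile(
          "[\\x00-\\x7F]"
        + "|[\\xC2-\\xDF][\\x80-\\xBF]"
        + "|\\xE0[\\xA0-\\xBF][\\x80-\\xBF]"
        + "|[\\xE1-\\xEC][\\x80-\\xBF]{2}"
        + "|\\xED[\\x80-\\x9F][\\x80-\\xBF]"
        + "|[\\xEE-\\xEF][\\x80-\\xBF]{2}"
        + "|\\xF0[\\x90-\\xBF][\\x80-\\xBF]{2}"
        + "|[\\xF1-\\xF3][\\x80-\\xBF]{3}"
        + "|\\xF4[\\x80-\\x8F][\\x80-\\xBF]{2}");

    /** Returns the offset of the first byte that starts no valid sequence, or -1 if all bytes are valid. */
    static int firstInvalidOffset(byte[] bytes) {
        // Assumption: ISO-8859-1 maps each byte 0x00..0xFF to the char with the same value.
        String oneCharPerByte = new String(bytes, StandardCharsets.ISO_8859_1);
        Matcher m = ONE_SEQUENCE.matcher(oneCharPerByte);
        int pos = 0;
        while (pos < oneCharPerByte.length()) {
            m.region(pos, oneCharPerByte.length());
            if (!m.lookingAt()) {
                return pos; // no alternative matches at this position
            }
            pos = m.end(); // continue right after the sequence just matched
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] good = "h\u00E9llo".getBytes(StandardCharsets.UTF_8);
        byte[] bad = {0x41, (byte) 0xC0, (byte) 0x80}; // 0xC0 never occurs in valid UTF-8
        System.out.println(firstInvalidOffset(good)); // -1
        System.out.println(firstInvalidOffset(bad));  // 1
    }
}
```

For a source file, the input could come from `Files.readAllBytes(path)`; how to wire such a check into a Sonar rule is not covered here.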

Factoring the shared trailing continuation bytes out of the regular expression makes it a little shorter:

 /^([\x00-\x7F]|([\xC2-\xDF]|\xE0[\xA0-\xBF]|\xED[\x80-\x9F]|([\xE1-\xEC]|[\xEE-\xEF]|\xF0[\x90-\xBF]|\xF4[\x80-\x8F]|[\xF1-\xF3][\x80-\xBF])[\x80-\xBF])[\x80-\xBF])*$/
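A factored expression is easy to get subtly wrong, so it is worth checking it against the unfactored one. The sketch below does that in Java under stated assumptions: `FactoredRegexCheck`, `matches`, and `countDisagreements` are hypothetical names, bytes are mapped to chars via ISO-8859-1, and only one- and two-byte inputs are compared exhaustively, so longer sequences still need spot checks. The factored form is written here without an empty alternative in the innermost group, because an empty branch would also accept two bare continuation bytes such as `\x80\x80`.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical comparison harness for illustration only.
class FactoredRegexCheck {
    // Unfactored grammar from RFC 3629 section 4.
    static final Pattern FULL = Pattern.compile(
          "(?:[\\x00-\\x7F]"
        + "|[\\xC2-\\xDF][\\x80-\\xBF]"
        + "|\\xE0[\\xA0-\\xBF][\\x80-\\xBF]"
        + "|[\\xE1-\\xEC][\\x80-\\xBF]{2}"
        + "|\\xED[\\x80-\\x9F][\\x80-\\xBF]"
        + "|[\\xEE-\\xEF][\\x80-\\xBF]{2}"
        + "|\\xF0[\\x90-\\xBF][\\x80-\\xBF]{2}"
        + "|[\\xF1-\\xF3][\\x80-\\xBF]{3}"
        + "|\\xF4[\\x80-\\x8F][\\x80-\\xBF]{2})*");

    // Factored grammar: trailing continuation bytes are shared between alternatives.
    static final Pattern FACTORED = Pattern.compile(
          "(?:[\\x00-\\x7F]"
        + "|(?:[\\xC2-\\xDF]|\\xE0[\\xA0-\\xBF]|\\xED[\\x80-\\x9F]"
        + "|(?:[\\xE1-\\xEC]|[\\xEE-\\xEF]|\\xF0[\\x90-\\xBF]|\\xF4[\\x80-\\x8F]"
        + "|[\\xF1-\\xF3][\\x80-\\xBF])[\\x80-\\xBF])[\\x80-\\xBF])*");

    /** Matches the whole byte sequence against p, one char per byte. */
    static boolean matches(Pattern p, byte... bytes) {
        return p.matcher(new String(bytes, StandardCharsets.ISO_8859_1)).matches();
    }

    /** Counts one- and two-byte inputs on which the two patterns give different answers. */
    static int countDisagreements() {
        int count = 0;
        for (int a = 0; a < 256; a++) {
            if (matches(FULL, (byte) a) != matches(FACTORED, (byte) a)) {
                count++;
            }
            for (int b = 0; b < 256; b++) {
                if (matches(FULL, (byte) a, (byte) b) != matches(FACTORED, (byte) a, (byte) b)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("disagreements: " + countDisagreements());
        System.out.println("accepts 80 80: " + matches(FACTORED, (byte) 0x80, (byte) 0x80));
    }
}
```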

This simple Perl script demonstrates the use of the regular expression:

 #!/usr/bin/perl
 use strict;
 use warnings;

 # Matches only strings made up entirely of valid UTF-8 byte sequences (RFC 3629, section 4).
 my $valid_utf8 = qr/^([\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|[\xEE-\xEF][\x80-\xBF]{2}|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})*$/;

 # \xEF\xBF\xBD is the three-byte UTF-8 encoding of U+FFFD.
 my $passstring = "This string \xEF\xBF\xBD ==   is valid UTF-8";
 # \x{FFFD} is a single character above 0xFF, so no byte class in the regex can match it.
 my $failstring = "This string \x{FFFD} ==   is not valid UTF-8";

 if ($passstring =~ $valid_utf8) {
     print "Passstring passed\n";
 } else {
     print "Passstring did not pass\n";
 }

 if ($failstring =~ $valid_utf8) {
     print "Failstring passed\n";
 } else {
     print "Failstring did not pass\n";
 }

 exit;

It produces the following output:

 Passstring passed
 Failstring did not pass
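For comparison, a similar pass/fail check can be done in Java without any regex, by asking the JDK's UTF-8 decoder to report malformed input instead of replacing it. This is a different technique from the regular expression in this answer, shown only as a cross-check. The class name `StrictUtf8Decode` and the sample byte arrays are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; uses only standard java.nio.charset classes.
class StrictUtf8Decode {
    /** Returns true if the bytes decode as UTF-8 without any malformed input. */
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)        // throw instead of substituting U+FFFD
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "ok " followed by the UTF-8 encoding of U+FFFD.
        byte[] pass = {0x6F, 0x6B, 0x20, (byte) 0xEF, (byte) 0xBF, (byte) 0xBD};
        // "caf" followed by a lone 0xE9, as produced by saving an accented e in Latin-1.
        byte[] fail = {0x63, 0x61, 0x66, (byte) 0xE9};
        System.out.println(isValidUtf8(pass) ? "Passstring passed" : "Passstring did not pass");
        System.out.println(isValidUtf8(fail) ? "Failstring passed" : "Failstring did not pass");
    }
}
```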
