UTF-8 data type validation with 3-byte or 4-byte Unicode

Question

I get an error from my database:

com.mysql.jdbc.MysqlDataTruncation: Data truncation: Data too long for column

I use Java and MySQL 5. As far as I know, 4-byte Unicode is legal in Java but illegal in MySQL 5. I think this may be causing my problem, and I want to check my data, so my question is: how can I verify whether my UTF-8 data is 3-byte or 4-byte Unicode?

+7

java mysql unicode utf-8 character-encoding

akuzma Feb 20 '13 at 13:31

3 answers

If you do not need to support characters outside the BMP, you can simply strip them before passing the string to MySQL:

 public static String withNonBmpStripped( String input ) {
     if( input == null ) throw new IllegalArgumentException("input");
     return input.replaceAll("[^\\u0000-\\uFFFF]", "");
 }

If you want to support more than the BMP, you need MySQL 5.5+, and you need to change everything from utf8 to utf8mb4 (collations, character sets, ...). You also need support for this in the driver, which I am not familiar with. Handling these characters in Java is also a pain, because each one is spread over 2 chars and therefore requires special handling in many operations.
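
To illustrate the "2 chars" point, here is a minimal sketch, assuming Java 8+ for `String.codePoints()`. The class and method names are made up for this example. It shows that `length()` counts UTF-16 code units, while the code point methods treat a surrogate pair as one character:

```java
class SurrogatePairDemo {
    // U+1F600, written as its UTF-16 surrogate pair (2 chars in Java).
    static final String GRINNING_FACE = "\uD83D\uDE00";

    // Counts Unicode code points rather than UTF-16 code units.
    static int countCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Lists each code point in U+XXXX notation, separated by spaces.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "a" + GRINNING_FACE;
        System.out.println(s.length());          // 3 (code units)
        System.out.println(countCodePoints(s));  // 2 (code points)
        System.out.println(describe(s));         // U+0061 U+1F600
    }
}
```

Operations that index by `char` (such as `charAt`, `substring`, or reversing a string) can split such a pair in half, which is why iterating by code point is usually the safer choice.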

+10

Esailija Feb 20 '13 at 15:29

The best approach I have found for replacing non-BMP characters in Java is the following:

 inputString.replaceAll("[^\\u0000-\\uFFFF]", "\uFFFD");

+3

verglor Nov 18 '13 at 4:39

Jon Skeet · Accepted Answer · 2013-02-20T13:37:08+0000

UTF-8 encodes everything in the Basic Multilingual Plane (i.e., U+0000 to U+FFFF inclusive) in 1-3 bytes. So you just need to check whether everything in your string is in the BMP.

In Java, this means checking whether any char (which is a UTF-16 code unit) is a high or low surrogate, since Java uses surrogate pairs to encode non-BMP characters:

 public static boolean isEntirelyInBasicMultilingualPlane(String text) {
     for (int i = 0; i < text.length(); i++) {
         if (Character.isSurrogate(text.charAt(i))) {
             return false;
         }
     }
     return true;
 }
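
If you want to see the byte widths directly, the following is a rough sketch of the standard UTF-8 ranges behind the "1-3 bytes for the BMP" statement. The class and method names are hypothetical, and it assumes Java 8+ and a well-formed string (an unpaired surrogate would be reported as 3 bytes here, which is not how Java's encoder treats it):

```java
class Utf8Width {
    // Number of bytes UTF-8 needs for a single code point,
    // following the standard UTF-8 range boundaries.
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;      // U+0000..U+007F
        if (codePoint < 0x800) return 2;     // U+0080..U+07FF
        if (codePoint < 0x10000) return 3;   // U+0800..U+FFFF (rest of the BMP)
        return 4;                            // U+10000 and above (non-BMP)
    }

    // Widest code point in the string; a result of 4 means the string
    // contains at least one character outside the BMP. Returns 0 for "".
    static int maxUtf8Width(String text) {
        return text.codePoints().map(Utf8Width::utf8Length).max().orElse(0);
    }

    public static void main(String[] args) {
        System.out.println(maxUtf8Width("abc"));            // 1
        System.out.println(maxUtf8Width("a\u20AC"));        // 3 (euro sign)
        System.out.println(maxUtf8Width("a\uD83D\uDE00"));  // 4 (U+1F600)
    }
}
```

A result of 3 or less from `maxUtf8Width` corresponds to `isEntirelyInBasicMultilingualPlane` returning true for a well-formed string.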
