UTF-8 data type validation with 3-byte or 4-byte Unicode

I get an error in my database

com.mysql.jdbc.MysqlDataTruncation: Data truncation: Data too long for column 

I use Java and MySQL 5. As far as I know, 4-byte UTF-8 characters are legal in Java but not in MySQL 5. I think this may be causing my problem, so I want to check my data. My question is: how can I verify whether my UTF-8 data contains only characters of up to 3 bytes, or also 4-byte ones?

+7
3 answers

UTF-8 encodes everything in the Basic Multilingual Plane (BMP, i.e. U+0000 to U+FFFF inclusive) in 1-3 bytes. So you just need to check whether everything in your string is in the BMP.

In Java, this means checking whether any char (which is a UTF-16 code unit) is a high or low surrogate, since Java uses surrogate pairs to encode non-BMP characters:

    public static boolean isEntirelyInBasicMultilingualPlane(String text) {
        for (int i = 0; i < text.length(); i++) {
            // Any surrogate char means a non-BMP character (or malformed UTF-16).
            if (Character.isSurrogate(text.charAt(i))) {
                return false;
            }
        }
        return true;
    }
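To see where the 1-3 byte versus 4 byte boundary falls, the following sketch walks a string by code point instead of by char. The class and method names (`Utf8WidthCheck`, `utf8Length`, `indexOfFirstFourByteCharacter`) are made up for illustration. Note one difference from the method above: an unpaired surrogate is not reported here, because on its own it is not a supplementary code point.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
final class Utf8WidthCheck {

    /** Number of bytes UTF-8 needs for one valid Unicode code point. */
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;      // U+0000..U+007F
        if (codePoint < 0x800) return 2;     // U+0080..U+07FF
        if (codePoint < 0x10000) return 3;   // U+0800..U+FFFF (rest of the BMP)
        return 4;                            // U+10000..U+10FFFF
    }

    /** Char index of the first code point that needs 4 bytes in UTF-8, or -1 if none. */
    static int indexOfFirstFourByteCharacter(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                return i;
            }
            i += Character.charCount(cp);    // advance by 1 or 2 chars
        }
        return -1;
    }

    public static void main(String[] args) {
        String sample = "abc\uD83D\uDE00";   // "abc" followed by U+1F600
        System.out.println(indexOfFirstFourByteCharacter(sample));          // prints 3
        System.out.println(sample.getBytes(StandardCharsets.UTF_8).length); // prints 7
    }
}
```

Returning the index rather than a boolean makes it easier to log which value in a row would be rejected.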
+15

If you do not need to support characters outside the BMP, you can simply strip them before passing the data to MySQL:

    public static String withNonBmpStripped(String input) {
        if (input == null) {
            throw new IllegalArgumentException("input");
        }
        // Java regex matches by code point, so a surrogate pair is removed as a unit.
        return input.replaceAll("[^\\u0000-\\uFFFF]", "");
    }

If you want to support more than the BMP, you need MySQL 5.5+, and you need to change everything from utf8 to utf8mb4 (character sets, collations, ...). You also need support for this in the driver, which I am not familiar with. Handling these characters in Java is also a pain, because each one is spread over two chars and therefore requires special handling in many operations.
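As an illustration of the "special handling" mentioned above, here is a minimal sketch of two operations that behave differently once a string contains surrogate pairs: measuring length and truncating. The class and method names are hypothetical. It assumes `maxCodePoints` is non-negative.

```java
// Hypothetical helper class, not part of any library.
final class CodePointOps {

    /** Length in Unicode code points rather than UTF-16 chars. */
    static int codePointLength(String text) {
        return text.codePointCount(0, text.length());
    }

    /** Keeps at most maxCodePoints code points without splitting a surrogate pair. */
    static String truncateByCodePoints(String text, int maxCodePoints) {
        if (codePointLength(text) <= maxCodePoints) {
            return text;
        }
        // Translate a code point count into a char index.
        int end = text.offsetByCodePoints(0, maxCodePoints);
        return text.substring(0, end);
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b";                 // 'a', U+1F600, 'b'
        System.out.println(s.length());              // prints 4 (chars)
        System.out.println(codePointLength(s));      // prints 3 (code points)
        System.out.println(truncateByCodePoints(s, 2).length()); // prints 3
    }
}
```

A plain `s.substring(0, 2)` on the same input would cut the pair in half and leave an unpaired high surrogate at the end of the string.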

+10

The best approach I have found for marking non-BMP characters in Java is to replace each one with U+FFFD (the replacement character):

 inputString.replaceAll("[^\\u0000-\\uFFFF]", "\uFFFD"); 
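A small usage sketch of the one-liner above, wrapped in a hypothetical `replaceNonBmp` method. It shows that each non-BMP character (two chars in Java) becomes a single U+FFFD, which is itself in the BMP and takes 3 bytes in UTF-8.

```java
// Hypothetical wrapper class for the replaceAll call shown above.
final class NonBmpReplacementDemo {

    static String replaceNonBmp(String input) {
        // The negated class matches whole supplementary code points.
        return input.replaceAll("[^\\u0000-\\uFFFF]", "\uFFFD");
    }

    public static void main(String[] args) {
        String in = "ok \uD83D\uDE00!";              // contains U+1F600
        String out = replaceNonBmp(in);
        System.out.println(in.length());             // prints 6
        System.out.println(out.length());            // prints 5
        System.out.println(out.equals("ok \uFFFD!")); // prints true
    }
}
```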
+3
