Java: removing non-Basic-Latin characters from a string

Question

Let's say I have the following code:

String description = "★★★★★ ♫ ♬ This description ✔✔ ▬ █ ✖ is a mess. ♫ ♬ ★★★★★";

I want to remove non-latin characters: ✔ , ▬ , █ , ✖ , ♫ , ♬ and ★ .

So that it becomes the following: This description is a mess.

I know there are probably many more symbols like these, so instead of specifying what I want to delete, I think it's better to list what I want to keep: Basic Latin and Latin-1 Supplement characters.

I found that I can use the following code to remove everything except the Basic Latin characters:

String clean_description = description.replaceAll("[^\\x00-\\x7F]", "").trim();

But is there a way to also keep the Latin-1 Supplement characters?

+6

java regex unicode

RoboticR Mar 16 '16 at 14:40

2 answers

If you want to use a more descriptive expression, use this:

 description.replaceAll( "[^\\p{InBasic_Latin}\\p{InLatin-1Supplement}]", "" );

or the intersection of the negated blocks, [\P{InBasic_Latin}&&\P{InLatin-1Supplement}] (not that it is any more readable ;) )
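A minimal sketch of the block-name form, for anyone who wants to try it. The class name `LatinBlockFilter` and the method `keepLatin1` are made up for illustration; the pattern is the first one shown above. The sample text is written with Unicode escapes so that it compiles regardless of the source file encoding.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class LatinBlockFilter {
    // Matches any character that is outside both the Basic Latin
    // and the Latin-1 Supplement Unicode blocks.
    private static final Pattern OUTSIDE_LATIN1 =
            Pattern.compile("[^\\p{InBasic_Latin}\\p{InLatin-1Supplement}]");

    // Removes every character the pattern matches; whitespace is left untouched.
    static String keepLatin1(String input) {
        return OUTSIDE_LATIN1.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // "cafe" with e-acute (U+00E9, Latin-1 Supplement), then a black star (U+2605)
        String sample = "caf\u00E9 \u2605 ok";
        // The star is removed, the accented letter stays,
        // and the two spaces around the star are both kept.
        System.out.println(keepLatin1(sample));
    }
}
```

Compiling the pattern once into a constant is only a design preference here; `String.replaceAll` with the same expression should behave the same way.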

+3

Thomas Mar 16 '16 at 14:56

resueman · Accepted Answer · 2016-03-16T14:52:40+0000

From a glance at the ranges of characters you mentioned, it looks like the Basic Latin and Latin-1 Supplement blocks are adjacent ( 0x00 - 0x7F and 0x80 - 0xFF ).

That way, you can use the same regular expression that you provided, just extended to include the Latin-1 Supplement characters. It will look like this:

 String clean_description = description.replaceAll("[^\\x00-\\xFF]", "").trim();
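As a rough illustration of the difference between the two ranges, the sketch below runs both expressions on a short string that contains a Latin-1 Supplement letter. The class and method names are hypothetical; only the two regular expressions come from this thread.

```java
// Hypothetical demo class comparing the two character ranges.
class Latin1RangeDemo {
    // Keeps Basic Latin only (the expression from the question).
    static String keepBasicLatin(String s) {
        return s.replaceAll("[^\\x00-\\x7F]", "").trim();
    }

    // Keeps Basic Latin plus Latin-1 Supplement (the extended range).
    static String keepLatin1(String s) {
        return s.replaceAll("[^\\x00-\\xFF]", "").trim();
    }

    public static void main(String[] args) {
        // "Espana" with n-tilde (U+00F1), followed by a black star (U+2605)
        String sample = "Espa\u00F1a \u2605";
        System.out.println(keepBasicLatin(sample)); // n-tilde and star both removed
        System.out.println(keepLatin1(sample));     // only the star removed
    }
}
```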

As pointed out in Quinn's comment, this does not remove the spaces between the deleted sections, so the result contains redundant spaces (which may or may not be what you want). If you want those spaces removed as well, Quinn's regex ( [^(\\x00-\\xFF)]+(?:$|\\s*) , quoted here in case the comment is deleted) may work for you.
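The sketch below tries both ways of dealing with the leftover spaces on the string from the question: Quinn's expression as quoted, and a two-step alternative (an assumption of this sketch, not something proposed in the thread) that removes the characters first and then collapses runs of whitespace. The class and method names are hypothetical. Note that the parentheses inside Quinn's character class are literal characters that already fall within 0x00 - 0xFF, so they do not change what is matched.

```java
// Hypothetical demo class; the symbols are written as Unicode escapes
// so the source compiles regardless of file encoding.
class Latin1WhitespaceDemo {
    // The description from the question: stars (U+2605), notes (U+266B, U+266C),
    // check marks (U+2714), bars/blocks (U+25AC, U+2588) and a cross (U+2716).
    static final String DESCRIPTION =
            "\u2605\u2605\u2605\u2605\u2605 \u266B \u266C This description "
            + "\u2714\u2714 \u25AC \u2588 \u2716 is a mess. "
            + "\u266B \u266C \u2605\u2605\u2605\u2605\u2605";

    // Quinn's expression: each removed run also takes the whitespace after it.
    static String stripWithTrailingSpaces(String s) {
        return s.replaceAll("[^(\\x00-\\xFF)]+(?:$|\\s*)", "").trim();
    }

    // Alternative: remove the characters, then collapse whitespace runs.
    // Unlike the version above, this also collapses runs of spaces
    // that were already present in the input.
    static String stripThenCollapse(String s) {
        return s.replaceAll("[^\\x00-\\xFF]", "")
                .replaceAll("\\s+", " ")
                .trim();
    }

    public static void main(String[] args) {
        System.out.println(stripWithTrailingSpaces(DESCRIPTION));
        System.out.println(stripThenCollapse(DESCRIPTION));
    }
}
```

For this particular input both methods should print "This description is a mess."; they can differ on text that contains intentional runs of whitespace.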
