Java: remove non-Latin characters from a string

Let's say I have the following code:

String description = "★★★★★ ♫ ♬ This description ✔✔ ▬ █ ✖ is a mess. ♫ ♬ ★★★★★";

I want to remove the non-Latin characters: ✔, ▬, █, ✖, ♫, ♬ and ★.

So that it becomes the following: This description is a mess.

I know there are probably many more symbol characters like these, so instead of specifying what I want to delete, I think it's better to list what I want to keep: Basic Latin and Latin-1 Supplement characters.

I found that I can use the following code to remove everything except the Basic Latin characters:

String clean_description = description.replaceAll("[^\\x00-\\x7F]", "").trim();

But is there a way to keep the Latin-1 Supplement characters as well?

2 answers

From a glance at the ranges of the character blocks you mentioned, it seems that "Basic Latin" and "Latin-1 Supplement" are adjacent (0x00-0x7F and 0x80-0xFF).

So you can use the same regular expression that you provided, just extended to include the Latin-1 Supplement characters. It would look like this:

 String clean_description = description.replaceAll("[^\\x00-\\xFF]", "").trim(); 

As pointed out in Quinn's comments, this does not get rid of the spaces between the removed sections, so the result has redundant spaces (which may or may not be what you want). If you want these spaces removed, Quinn's regex ( [^(\\x00-\\xFF)]+(?:$|\\s*) , quoted here in case the comment gets deleted) may work for you.
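To make the redundant-space behaviour concrete, here is a minimal sketch. The class and method names (`LatinFilter`, `stripNonLatin1`, `stripAndCollapse`) are made up for illustration, and the second method uses a plain `\\s+` collapse as an alternative way to tidy the spaces; it is not Quinn's regex. The sample string uses Unicode escapes so it compiles regardless of source-file encoding.

```java
public class LatinFilter {
    // Keeps only U+0000..U+00FF (Basic Latin + Latin-1 Supplement), then trims the ends.
    static String stripNonLatin1(String s) {
        return s.replaceAll("[^\\x00-\\xFF]", "").trim();
    }

    // Same removal, then collapses every run of whitespace into a single space.
    // (Assumption: collapsing ALL whitespace runs is acceptable for your text.)
    static String stripAndCollapse(String s) {
        return s.replaceAll("[^\\x00-\\xFF]", "")
                .replaceAll("\\s+", " ")
                .trim();
    }

    public static void main(String[] args) {
        // Stars, notes, check marks and a bar, written as \\u escapes.
        String description =
            "\u2605\u2605 \u266B This description \u2714\u2714 \u25AC is a mess. \u266C \u2605";

        // Brackets make the leftover inner spaces visible.
        System.out.println("[" + stripNonLatin1(description) + "]");
        System.out.println("[" + stripAndCollapse(description) + "]");
    }
}
```

With this input the first call leaves three spaces between "description" and "is", while the second yields single spaces throughout. Latin-1 letters such as é (U+00E9) survive both calls.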


If you want to use a more descriptive expression, use this:

 description.replaceAll( "[^\\p{InBasic_Latin}\\p{InLatin-1Supplement}]", "" ); 

or the intersection of the negated blocks, [\P{InBasic_Latin}&&\P{InLatin-1Supplement}] (not that it is any more readable ;))
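As a quick check that the two block-based forms behave the same, here is a small sketch. The class and method names (`LatinBlockFilter`, `viaNegatedUnion`, `viaIntersection`) are hypothetical; the patterns are the ones given above, written with Java string escaping and precompiled so they can be reused.

```java
import java.util.regex.Pattern;

public class LatinBlockFilter {
    // "Anything NOT in Basic Latin or Latin-1 Supplement".
    private static final Pattern NEGATED_UNION =
        Pattern.compile("[^\\p{InBasic_Latin}\\p{InLatin-1Supplement}]");

    // "Not in Basic Latin AND not in Latin-1 Supplement" (same set, by De Morgan).
    private static final Pattern INTERSECTION =
        Pattern.compile("[\\P{InBasic_Latin}&&\\P{InLatin-1Supplement}]");

    static String viaNegatedUnion(String s) {
        return NEGATED_UNION.matcher(s).replaceAll("");
    }

    static String viaIntersection(String s) {
        return INTERSECTION.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute) is in Latin-1 Supplement; U+2605 (star) is not.
        String sample = "caf\u00E9 \u2605 ok";
        System.out.println(viaNegatedUnion(sample));
        System.out.println(viaIntersection(sample));
    }
}
```

Both methods drop the star and keep the é. Neither touches whitespace, so the spaces on either side of the removed star remain.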

