Removing accent marks (diacritics) from Latin characters for comparison

I need to compare the names of European places that are written in the Latin alphabet with accent marks (diacritics) on some characters. Many names in Central and Eastern Europe contain accented Latin characters such as ž and ü, but some people write the same names with plain Latin characters, such as z and u.

I need my system to recognize that, for example, mšk žilina is the same as msk zilina, and likewise for all other accented characters. Is there an easy way to do this?

java string diacritics transliteration
1 answer

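You can use java.text.Normalizer and a small regular expression to strip diacritics. Normalizing to NFD splits each accented character into its base letter followed by separate combining marks, and the regular expression then deletes those marks.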

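    import java.text.Normalizer;
    import java.text.Normalizer.Form;

    // Decompose to NFD, then delete the combining marks that were split off.
    public static String removeDiacriticalMarks(String string) {
        return Normalizer.normalize(string, Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }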

Usage example:

    String text = "mšk žilina";
    String normalized = removeDiacriticalMarks(text);
    System.out.println(normalized); // msk zilina
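
Since the goal is comparison rather than display, one possible way to build on this is to derive a comparison key from each name and compare the keys. The following is only a sketch: the class name PlaceNames and the helpers comparisonKey and sameName are made up for illustration, and folding case and collapsing whitespace are assumptions about what "the same name" should mean, not something the question specifies.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper class; the names here are illustrative only.
final class PlaceNames {

    // Same technique as above: decompose to NFD, then drop combining marks.
    static String removeDiacriticalMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Assumed comparison rules: ignore diacritics, letter case,
    // leading/trailing whitespace, and repeated whitespace.
    static String comparisonKey(String name) {
        return removeDiacriticalMarks(name)
                .toLowerCase(Locale.ROOT)
                .trim()
                .replaceAll("\\s+", " ");
    }

    static boolean sameName(String a, String b) {
        return comparisonKey(a).equals(comparisonKey(b));
    }

    public static void main(String[] args) {
        System.out.println(sameName("mšk žilina", "MSK Zilina")); // true
        System.out.println(sameName("Köln", "Koln"));             // true
        // Ł has no canonical decomposition, so NFD leaves it untouched.
        System.out.println(sameName("Łódź", "Lodz"));             // false
    }
}
```

The last call shows a limitation to keep in mind: letters such as ł, đ, ø and ß are standalone characters rather than a base letter plus a combining mark, so this approach does not map them to l, d, o or ss. If such names matter, an explicit replacement table applied before or after normalization would be needed.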