Regex to find all variations of a specific character within text

I am trying to find Unicode variants of a user-entered string in a text so that I can highlight them. For instance, if the user enters Beyonce, I'd like to highlight every occurrence of variants such as Beyoncé, Beyônce, or Beyonce in the text. Currently, the only idea I have is to build a regular expression by replacing each character of the input string with a set of its variants, such as:

"Beyonce" => "B[eêéè]y[oóòôö]nc[eéèê]"

But this seems like a very tedious and error-prone way to do it. Basically, I'm looking for a regular expression character class that matches all variants of a given input character, something like \p{M}, but with the ability to specify a base letter. Is there something like that in Java regex? And if not, how can the process of building the regular expression be improved? I don't think that listing all the variants by hand will hold up in the long run.

1 answer

There are several ways an accented character can be represented. The Javadoc for java.text.Normalizer gives a good example:

For example, take the character A-acute. In Unicode, this can be encoded
as a single character (the "composed" form):

  U+00C1    LATIN CAPITAL LETTER A WITH ACUTE

or as two separate characters (the "decomposed" form):

  U+0041    LATIN CAPITAL LETTER A
  U+0301    COMBINING ACUTE ACCENT 

The second form makes it easier to get at the unaccented base character, and, fortunately, Normalizer can convert text to that form for you:

Normalizer.normalize(text, Form.NFD); // NFD = "Canonical decomposition"
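
To make the effect of the decomposition visible, here is a minimal sketch that prints the code points of the NFD form of a string. The class name DecompositionDemo and the helper describeDecomposed are made up for illustration; only Normalizer.normalize comes from the JDK API mentioned above.

```java
import java.text.Normalizer;

final class DecompositionDemo {

    /** Returns the code points of the NFD form of the text, separated by spaces. */
    static String describeDecomposed(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        StringBuilder out = new StringBuilder();
        decomposed.codePoints().forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Composed A-acute (U+00C1) is split into the base letter and the combining accent.
        System.out.println(describeDecomposed("\u00C1")); // prints "U+0041 U+0301"
    }
}
```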

You can then use a regular expression to ignore (or remove) any non-ASCII characters:

[^\p{ASCII}]
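
Putting the two steps together might look like the sketch below. The class and method names (AccentVariants, stripNonAscii, variantPattern, findVariants) are hypothetical. stripNonAscii follows the approach described above. variantPattern is a variation that is not part of the answer: instead of removing characters, it allows optional combining marks (\p{M}*) after every base letter of the query, which addresses the highlighting use case from the question and does not drop non-ASCII base letters.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AccentVariants {

    /** Decomposes the text (NFD) and removes everything outside ASCII, as described above. */
    static String stripNonAscii(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }

    /**
     * Assumption, not from the answer: builds a pattern in which every base
     * character of the query may be followed by any number of combining marks.
     * The pattern is meant to be applied to NFD-normalized text.
     */
    static Pattern variantPattern(String query) {
        String base = Normalizer.normalize(query, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
        StringBuilder regex = new StringBuilder();
        base.codePoints().forEach(cp -> {
            regex.append(Pattern.quote(new String(Character.toChars(cp))));
            regex.append("\\p{M}*");
        });
        return Pattern.compile(regex.toString());
    }

    /** Returns every variant of the query found in the text, recomposed to NFC. */
    static List<String> findVariants(String query, String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        Matcher matcher = variantPattern(query).matcher(decomposed);
        List<String> hits = new ArrayList<>();
        while (matcher.find()) {
            hits.add(Normalizer.normalize(matcher.group(), Normalizer.Form.NFC));
        }
        return hits;
    }

    public static void main(String[] args) {
        // The text contains Beyonc + e-acute, Bey + o-circumflex + nce, and plain Beyonce.
        String text = "Beyonc\u00e9, Bey\u00f4nce and Beyonce";
        System.out.println(stripNonAscii(text));
        System.out.println(findVariants("Beyonce", text));
    }
}
```

Note that the match offsets refer to the decomposed text, not the original, so highlighting in the original string would need an additional mapping between the two.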