How to write a regular expression for a unicode name in Java?

Question

I need to write a regex that replaces invalid characters in user input before the input is passed on. I think I need to use string.replaceAll("regex", "replacement") for this. This one line of code should replace every character that is not a Unicode letter, so it acts as a whitelist of Unicode characters. In short, it checks a username and replaces its invalid characters.

What I have found so far is \p{L}\p{M}, but I'm not sure how to use it in a regex so that it works as described above. Would this be a negated expression?

+4

java regex unicode character-properties

Rihards Jun 27 '11 at 13:54


2 answers

I do not believe that Java's default regex library (read: short of linking against ICU, which I suggest doing even if it requires JNI) supports the Unicode properties you need for this.

If it did, you would include \p{Diacritic} in your pattern. But for that you need full property support.

I suppose you could shoot for (\pL\pM*)+, but that does not work for the various diacritics: what if someone's name is not just Étoile, but L'étoile?
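To make the limitation concrete, here is a minimal sketch of that pattern applied to the two names mentioned above. The class name NamePatternCheck and the method isLettersWithMarks are made up for this illustration; only java.util.regex from the standard library is used.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class NamePatternCheck {
    // One or more letters, each optionally followed by combining marks.
    private static final Pattern LETTERS_WITH_MARKS = Pattern.compile("(\\pL\\pM*)+");

    // True if the whole string consists of letters plus their combining marks.
    static boolean isLettersWithMarks(String s) {
        return LETTERS_WITH_MARKS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isLettersWithMarks("\u00c9toile"));    // precomposed É: true
        System.out.println(isLettersWithMarks("E\u0301toile"));   // E + combining acute: true
        System.out.println(isLettersWithMarks("L'\u00e9toile"));  // apostrophe is punctuation: false
    }
}
```

The apostrophe is neither a letter nor a mark, so a name like L'étoile is rejected even though it is a perfectly plausible name.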

In addition, I thought that the problem of validating people's names was considered nearly insoluble, and that you should therefore just let people use whatever they like, perhaps cleaned up by RFC 3454's "stringprep" algorithm.

+2

tchrist Jun 27 '11 at 14:12


Paŭlo Ebermann · Accepted Answer · 2011-06-27T14:13:13+0000

Yes, you need a negation. The regular expression is [^\p{L}] for everything except letters. Another way to write this is \P{L}.

\p{M} means "all marks", so [^\p{L}\p{M}] means everything that is neither a letter nor a mark. It can also be written as [\P{L}&&[\P{M}]], but that is no better.

In a Java string literal every \ must be doubled, so you write string.replaceAll("[^\\p{L}\\p{M}]", "replacement").
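As a minimal sketch of how this could look in practice: the class NameSanitizer and its sanitize method are names invented for this example. The pattern is compiled once and reused, and Matcher.quoteReplacement is used so that a replacement containing $ or \ is taken literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class NameSanitizer {
    // Any character that is neither a letter (L) nor a mark (M).
    private static final Pattern NOT_LETTER_OR_MARK = Pattern.compile("[^\\p{L}\\p{M}]");

    // Replaces every disallowed character with the given literal replacement.
    static String sanitize(String input, String replacement) {
        return NOT_LETTER_OR_MARK.matcher(input)
                .replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // The underscore, the digits and the exclamation mark are removed.
        System.out.println(sanitize("J\u00fcrgen_42!", ""));
        // The combining acute accent (U+0301) is a mark, so it is kept.
        System.out.println(sanitize("Ame\u0301lie 7", ""));
    }
}
```

Whether to delete invalid characters or substitute a placeholder is a design choice the question leaves open; the sketch simply passes the replacement through.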

From the comment:

By the way, regarding your answer, what is included in the Mark category? Do I need it too? Will a first name contain anything more than letters?

This category consists of three subcategories:

Mn: Mark, Non-Spacing
An example is U+0300 ( ̀ ), COMBINING GRAVE ACCENT, which can be used together with a letter (the letter before it) to create accented characters. For commonly used accented characters there is already a precomposed form (e.g. é), but not for others.
Mc: Mark, Spacing Combining
These are quite rare. I found them mainly in South Asian scripts and for musical notes. For example, there is U+1D165, MUSICAL SYMBOL COMBINING STEM, 𝅥, which can be combined with U+1D15D, MUSICAL SYMBOL WHOLE NOTE, 𝅝, to give something like 𝅝𝅥. (Hm, these do not display properly here. I believe my browser does not support these characters. Look at the charts if you cannot see them.)
Me: Mark, Enclosing
These are marks that somehow enclose the base letter (the preceding one, if I understand correctly). One example is U+20DD, ⃝, COMBINING ENCLOSING CIRCLE, which allows you to create things like A⃝. (This should display as an A enclosed in a circle, if I understand correctly. It does not in my browser.) Another is U+20E3, ⃣, COMBINING ENCLOSING KEYCAP, which should display as a key cap with a letter on it (A⃣). (These do not display in my browser. Look at the charts if you do not see them.)

You can find all of them by searching UnicodeData.txt for ;Mn;, ;Mc; or ;Me;, respectively. For more information, see the Unicode FAQ on characters and combining marks.
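If you would rather check a character's category from Java than search the data file, Character.getType reports it. The following is a small sketch; the class MarkCategories and the method markCategory are hypothetical names, and the result depends on the Unicode version shipped with the JDK in use.

```java
// Hypothetical helper, for illustration only.
class MarkCategories {
    // Returns the two-letter mark subcategory of a code point, or "not a mark".
    static String markCategory(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.NON_SPACING_MARK:       return "Mn";
            case Character.COMBINING_SPACING_MARK: return "Mc";
            case Character.ENCLOSING_MARK:         return "Me";
            default:                               return "not a mark";
        }
    }

    public static void main(String[] args) {
        int[] samples = { 0x0300, 0x1D165, 0x20DD, 0x20E3, 'A' };
        for (int cp : samples) {
            System.out.printf("U+%04X %s%n", cp, markCategory(cp));
        }
    }
}
```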

Do you need them? I'm not sure. It seems that most common names (at least in Latin-based alphabets) will use precomposed letters. But users may enter them in decomposed form; I think that on Mac OS X this is actually the default. In that case you will have to run a normalization algorithm before filtering out unknown characters. (Normalizing seems like a good idea anyway if you want to compare names and not only display them on screen.)
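A minimal sketch of that idea, assuming NFC (canonical composition) is the normalization form you want: java.text.Normalizer is part of the standard library, while the class NormalizedNameFilter and the method clean are names made up for this example.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class NormalizedNameFilter {
    // Any character that is not a letter.
    private static final Pattern NOT_LETTER = Pattern.compile("\\P{L}");

    // Composes the input to NFC first, then removes everything that is not a letter.
    static String clean(String input) {
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        return NOT_LETTER.matcher(composed).replaceAll("");
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301toile!";  // e + COMBINING ACUTE ACCENT
        // Without normalization the accent is stripped along with the "!".
        System.out.println(decomposed.replaceAll("\\P{L}", ""));
        // With NFC the accent is first merged into the precomposed letter and survives.
        System.out.println(clean(decomposed));
    }
}
```

Marks that have no precomposed form stay separate even after NFC, so a letters-only filter would still remove them; keeping \p{M} in the whitelist avoids that.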

Edit: not directly related to the question, but related to the discussion in the comments:

I wrote a quick test program to show that [^\pL\pM] is not equivalent to [\PL\PM]:

    package de.fencing_game.paul.examples;

    import java.util.regex.*;

    public class RegexSample {

        // The regular expressions to compare, one table column each.
        static String[] regexps = {
            "[^\\pL\\pM]", "[\\PL\\PM]",
            ".", "\\pL", "\\pM",
            "\\PL", "\\PM"
        };

        // One-character test strings, one table row each.
        static String[] strings = {
            "x", "A", "3", "\n", ".", "\t", "\r", "\f",
            " ", "-", "!", "»", "›", "‹", "«",
            "ͳ", "Θ", "Σ", "Ϫ", "Ж", "ؤ",
            "༬", "༺", "༼", "ང", "⃓", "✄",
            "⟪", "や", "゙",
            "+", "→", "∑", "∢", "※", "⁉", "⧓", "⧻",
            "⑪", "⒄", "⒰", "ⓛ", "⓶",
            "\u0300" /* COMBINING GRAVE ACCENT, Mn */,
            "\u0BCD" /* TAMIL SIGN VIRAMA, Mn */,
            "\u20DD" /* COMBINING ENCLOSING CIRCLE, Me */,
            "\u2166" /* ROMAN NUMERAL SEVEN, Nl */,
        };

        public static void main(String[] params) {
            Pattern[] patterns = new Pattern[regexps.length];

            // Header row: compile each pattern and print it as a column title.
            System.out.print("       ");
            for (int i = 0; i < regexps.length; i++) {
                patterns[i] = Pattern.compile(regexps[i]);
                System.out.print("| " + patterns[i] + " ");
            }
            System.out.println();

            // Separator row, as wide as the column titles.
            System.out.print("-------");
            for (int i = 0; i < regexps.length; i++) {
                System.out.print("|-" +
                                 "--------------".substring(0, regexps[i].length()) +
                                 "-");
            }
            System.out.println();

            // One row per test string: does each pattern match the whole string?
            for (int j = 0; j < strings.length; j++) {
                System.out.printf("U+%04x ", (int) strings[j].charAt(0));
                for (int i = 0; i < regexps.length; i++) {
                    boolean match = patterns[i].matcher(strings[j]).matches();
                    System.out.print("| " + (match ? "✔" : "-") +
                                     "              ".substring(0, regexps[i].length()));
                }
                System.out.println();
            }
        }
    }

Here is the result (from OpenJDK 1.6.0_20 on OpenSUSE):

           | [^\pL\pM] | [\PL\PM] | . | \pL | \pM | \PL | \PM
    -------|-----------|----------|---|-----|-----|-----|-----
    U+0078 | -         | ✔        | ✔ | ✔   | -   | -   | ✔
    U+0041 | -         | ✔        | ✔ | ✔   | -   | -   | ✔
    U+0033 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
    U+000a | ✔         | ✔        | - | -   | -   | ✔   | ✔
    U+002e | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
    U+0009 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
    U+000d | ✔         | ✔        | - | -   | -   | ✔   | ✔
    U+000c | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
    U+0020 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
    U+002d | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
    U+0021 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
    U+00bb | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
    U+203a | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
    U+2039 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
    U+00ab | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
    U+0373 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
    U+0398 | -         | ✔        | ✔ | ✔   | -   | -   | ✔
    U+03a3 | -         | ✔        | ✔ | ✔   | -   | -   | ✔
    U+03ea | -         | ✔        | ✔ | ✔   | -   | -   | ✔
    U+0416 | -         | ✔        | ✔ | ✔   | -   | -   | ✔
    U+0624 | -         | ✔        | ✔ | ✔   | -   | -   | ✔
    U+0f2c | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
    U+0f3a | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
    U+0f3c | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
    U+0f44 | -         | ✔        | ✔ | ✔   | -   | -   | ✔
    U+20d3 | -         | ✔        | ✔ | -   | ✔   | ✔   | -
    U+2704 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
    U+27ea | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
    U+3084 | -         | ✔        | ✔ | ✔   | -   | -   | ✔
    U+3099 | -         | ✔        | ✔ | -   | ✔   | ✔   | -
    U+002b | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
    U+2192 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
    U+2211 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
    U+2222 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
    U+203b | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
    U+2049 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
    U+29d3 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
    U+29fb | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
    U+246a | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
    U+2484 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
    U+24b0 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
    U+24db | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
    U+24f6 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔
    U+0300 | -         | ✔        | ✔ | -   | ✔   | ✔   | -
    U+0bcd | -         | ✔        | ✔ | -   | ✔   | ✔   | -
    U+20dd | -         | ✔        | ✔ | -   | ✔   | ✔   | -
    U+2166 | ✔         | ✔        | ✔ | -   | -   | ✔   | ✔

We can see that:

[^\pL\pM] is not equivalent to [\PL\PM];
[\PL\PM] really matches everything; but
[\PL\PM] is still not equivalent to . since . does not match \n and \r.

The second point holds because [\PL\PM] is the union of \PL and \PM: \PL contains the characters of all categories other than L (including M), and \PM contains the characters of all categories other than M (including L), so together they contain the entire character repertoire.

[^\pL\pM], on the other hand, is the complement of the union of \pL and \pM, which is equivalent to the intersection of \PL and \PM.
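This equivalence can be checked directly with Java's character-class intersection operator (&&). The sketch below compares the complement, the intersection and the union on a few samples; the class name ClassAlgebraDemo and the helper matches are invented for the illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class ClassAlgebraDemo {
    static final Pattern COMPLEMENT   = Pattern.compile("[^\\pL\\pM]");      // not (L or M)
    static final Pattern INTERSECTION = Pattern.compile("[\\PL&&[\\PM]]");   // (not L) and (not M)
    static final Pattern UNION        = Pattern.compile("[\\PL\\PM]");       // (not L) or (not M)

    // True if the pattern matches the whole string.
    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        // A letter, a digit, a combining mark and a line feed.
        String[] samples = { "x", "3", "\u0300", "\n" };
        for (String s : samples) {
            System.out.printf("U+%04x complement=%b intersection=%b union=%b%n",
                    (int) s.charAt(0),
                    matches(COMPLEMENT, s),
                    matches(INTERSECTION, s),
                    matches(UNION, s));
        }
    }
}
```

For every sample the complement and intersection columns agree, while the union column is always true.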
