Yes, you need negation. The regular expression is [^\p{L}] for everything except letters. Another way to write this is \P{L}.
\p{M} means "all marks", so [^\p{L}\p{M}] means everything that is neither a letter nor a mark. It can also be written as [\P{L}&&[\P{M}]], but that is no better.
In a Java string literal, every \ must be doubled, so you write string.replaceAll("[^\\p{L}\\p{M}]", "replacement").
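As a minimal sketch of that call, the class below strips everything that is neither a letter nor a mark. The class name NameFilter, the method name keepLettersAndMarks and the sample input are made up for illustration; only the regular expression comes from the answer. Note that spaces, digits and hyphens are removed as well, since they are neither letters nor marks.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class NameFilter {
    // Matches any single character that is neither a letter (L) nor a mark (M).
    private static final Pattern NOT_LETTER_OR_MARK =
            Pattern.compile("[^\\p{L}\\p{M}]");

    // Removes every character that is not a letter or a mark.
    static String keepLettersAndMarks(String input) {
        return NOT_LETTER_OR_MARK.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Sample input: "R2-D2 Jos" followed by U+00E9 (precomposed e with acute) and "!".
        String input = "R2-D2 Jos\u00e9!";
        // Digits, the hyphen, the space and the exclamation mark are dropped.
        System.out.println(keepLettersAndMarks(input));
    }
}
```

Precompiling the pattern is only a convenience here; calling replaceAll on the string directly, as in the answer, gives the same result.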
From the comment:
By the way, regarding your answer, what falls into the category of marks? Do I need that too? Will a first name contain more than just letters?
This category consists of the following subcategories:
Mn: Mark, Non-Spacing
An example is U+0300, COMBINING GRAVE ACCENT. It can be combined with the preceding letter to create accented characters. For commonly used accented characters there is already a precomposed form (e.g. é), but not for others.
Mc: Mark, Spacing Combining.
These are quite rare. I found them mainly in South Asian scripts and in musical notation. For example, U+1D165, MUSICAL SYMBOL COMBINING STEM, can be combined with U+1D15D, MUSICAL SYMBOL WHOLE NOTE, to give a whole note with a stem. (My browser does not display these characters. Look at the code charts to see them.)
Me: Mark, Enclosing
These are marks that somehow enclose the base letter (the preceding one, if I understand correctly). One example is U+20DD, COMBINING ENCLOSING CIRCLE (⃝), which lets you create things like A⃝. (This should display as an A enclosed in a circle; it does not in my browser.) Another is U+20E3, COMBINING ENCLOSING KEYCAP (⃣), which should display as a key cap with the letter on it (A⃣). (They do not appear in my browser. Look at the code charts if you do not see them.)
You can find all of them by searching UnicodeData.txt for ;Mn;, ;Mc; or ;Me;, respectively. For more information, see the Unicode FAQ: Characters and Combining Marks.
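The three mark subcategories can also be checked from Java itself. The following is only a sketch: the class name MarkCategories and the method name markCategory are invented for this example, while Character.getType and the category constants are standard JDK API. The result depends on the Unicode version your JDK ships with.

```java
// Hypothetical helper, not part of any library.
class MarkCategories {
    // Returns the two-letter general category for marks, or "other" for anything else.
    static String markCategory(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.NON_SPACING_MARK:
                return "Mn";
            case Character.COMBINING_SPACING_MARK:
                return "Mc";
            case Character.ENCLOSING_MARK:
                return "Me";
            default:
                return "other";
        }
    }

    public static void main(String[] args) {
        // The mark code points mentioned in this answer, plus a plain letter.
        int[] samples = { 0x0300, 0x1D165, 0x20DD, 0x20E3, 'A' };
        for (int cp : samples) {
            System.out.printf("U+%04X %s%n", cp, markCategory(cp));
        }
    }
}
```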
Do you need them? I'm not sure. I think most common names (at least in Latin alphabets) will use precomposed letters. But the user might enter them in decomposed form; I think on Mac OS X this is actually the default. You would have to run a normalization algorithm before filtering out unknown characters. (Normalizing seems like a good idea anyway if you want to compare names and not only display them on the screen.)
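One way to do the normalization in Java is java.text.Normalizer. The sketch below assumes that NFC (composed form) is what you want before filtering; the class name NormalizeThenFilter, the method name clean and the sample strings are invented for illustration.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of any library.
class NormalizeThenFilter {
    // Composes the input (NFC), then removes everything that is
    // neither a letter nor a mark.
    static String clean(String input) {
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        return composed.replaceAll("[^\\p{L}\\p{M}]", "");
    }

    public static void main(String[] args) {
        String decomposed = "Jose\u0301 42";  // e followed by U+0301 COMBINING ACUTE ACCENT
        String precomposed = "Jos\u00e9 42";  // U+00E9, the precomposed letter
        // After cleaning, both spellings yield the same string, so they compare as equal.
        System.out.println(clean(decomposed).equals(clean(precomposed)));
    }
}
```

Without the normalization step the two inputs would both survive the filter but as different char sequences, so a plain equals comparison of the names would fail.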
Edit: not directly related to the question, but related to the discussion in the comments:
I wrote a quick test program to show that [^\pL\pM] is not equivalent to [\PL\PM]:
    package de.fencing_game.paul.examples;

    import java.util.regex.*;

    public class RegexSample {

        // The regular expressions to compare, one table column each.
        static String[] regexps = {
            "[^\\pL\\pM]", "[\\PL\\PM]", ".", "\\pL", "\\pM", "\\PL", "\\PM"
        };

        // One test character per string, one table row each.
        // Non-ASCII characters are written as Unicode escapes.
        static String[] strings = {
            "x", "A", "3", "\n", ".", "\t", "\r", "\f", " ", "-", "!",
            "\u00bb", "\u203a", "\u2039", "\u00ab", "\u0373",
            "\u0398", "\u03a3", "\u03ea", "\u0416", "\u0624",
            "\u0f2c", "\u0f3a", "\u0f3c", "\u0f44",
            "\u20d3", "\u2704", "\u27ea", "\u3084", "\u3099",
            "+", "\u2192", "\u2211", "\u2222", "\u203b", "\u2049",
            "\u29d3", "\u29fb",
            "\u246a", "\u2484", "\u24b0", "\u24db", "\u24f6",
            "\u0300",
            "\u0BCD",
            "\u20DD",
            "\u2166",
        };

        public static void main(String[] params) {
            Pattern[] patterns = new Pattern[regexps.length];

            // Header row: compile each pattern and print it as a column title.
            System.out.print("       ");
            for (int i = 0; i < regexps.length; i++) {
                patterns[i] = Pattern.compile(regexps[i]);
                System.out.print("| " + patterns[i] + " ");
            }
            System.out.println();

            // Separator row, as wide as the header.
            System.out.print("-------");
            for (int i = 0; i < regexps.length; i++) {
                System.out.print("|-" + "--------------".substring(0, regexps[i].length()) + "-");
            }
            System.out.println();

            // One row per test character: a check mark where the pattern matches.
            for (int j = 0; j < strings.length; j++) {
                System.out.printf("U+%04x ", (int) strings[j].charAt(0));
                for (int i = 0; i < regexps.length; i++) {
                    boolean match = patterns[i].matcher(strings[j]).matches();
                    System.out.print("| " + (match ? "✓" : "-")
                            + "              ".substring(0, regexps[i].length()));
                }
                System.out.println();
            }
        }
    }
Here is the result (from OpenJDK 1.6.0_20 on OpenSUSE):
           | [^\pL\pM] | [\PL\PM] | . | \pL | \pM | \PL | \PM
    -------|-----------|----------|---|-----|-----|-----|-----
    U+0078 | -         | ✓        | ✓ | ✓   | -   | -   | ✓
    U+0041 | -         | ✓        | ✓ | ✓   | -   | -   | ✓
    U+0033 | ✓         | ✓        | ✓ | -   | -   | ✓   | ✓
    U+000a | ✓         | ✓        | - | -   | -   | ✓   | ✓
    U+002e | ✓         | ✓        | ✓ | -   | -   | ✓   | ✓
    U+0009 | ✓         | ✓        | ✓ | -   | -   | ✓   | ✓
    U+000d | ✓         | ✓        | - | -   | -   | ✓   | ✓
    U+000c | ✓         | ✓        | ✓ | -   | -   | ✓   | ✓
    U+0020 | ✓         | ✓        | ✓ | -   | -   | ✓   | ✓
    U+002d | ✓         | ✓        | ✓ | -   | -   | ✓   | ✓
    U+0021 | ✓         | ✓        | ✓ | -   | -   | ✓   | ✓
    U+00bb | ✓         | ✓        | ✓ | -   | -   | ✓   | ✓
    U+203a | ✓         | ✓        | ✓ | -   | -   | ✓   | ✓
    U+2039 | ✓         | ✓        | ✓ | -   | -   | ✓   | ✓
    U+00ab | ✓         | ✓        | ✓ | -   | -   | ✓   | ✓
    U+0373 | ✓         | ✓        | ✓ | -   | -   | ✓   | ✓
    U+0398 | -         | ✓        | ✓ | ✓   | -   | -   | ✓
    U+03a3 | -         | ✓        | ✓ | ✓   | -   | -   | ✓
    U+03ea | -         | ✓        | ✓ | ✓   | -   | -   | ✓
    U+0416 | -         | ✓        | ✓ | ✓   | -   | -   | ✓
    U+0624 | -         | ✓        | ✓ | ✓   | -   | -   | ✓
    U+0f2c | ✓         | ✓        | ✓ | -   | -   | ✓   | ✓
    U+0f3a | ✓         | ✓        | ✓ | -   | -   | ✓   | ✓
    U+0f3c | ✓         | ✓        | ✓ | -   | -   | ✓   | ✓
    U+0f44 | -         | ✓        | ✓ | ✓   | -   | -   | ✓
    U+20d3 | -         | ✓        | ✓ | -   | ✓   | ✓   | -
    U+2704 | ✓         | ✓        | ✓ | -   | -   | ✓   | ✓
    U+27ea | ✓         | ✓        | ✓ | -   | -   | ✓   | ✓
    U+3084 | -         | ✓        | ✓ | ✓   | -   | -   | ✓
    U+3099 | -         | ✓        | ✓ | -   | ✓   | ✓   | -
    U+002b | ✓         | ✓        | ✓ | -   | -   | ✓   | ✓
    U+2192 | ✓         | ✓        | ✓ | -   | -   | ✓   | ✓
    U+2211 | ✓         | ✓        | ✓ | -   | -   | ✓   | ✓
    U+2222 | ✓         | ✓        | ✓ | -   | -   | ✓   | ✓
    U+203b | ✓         | ✓        | ✓ | -   | -   | ✓   | ✓
    U+2049 | ✓         | ✓        | ✓ | -   | -   | ✓   | ✓
    U+29d3 | ✓         | ✓        | ✓ | -   | -   | ✓   | ✓
    U+29fb | ✓         | ✓        | ✓ | -   | -   | ✓   | ✓
    U+246a | ✓         | ✓        | ✓ | -   | -   | ✓   | ✓
    U+2484 | ✓         | ✓        | ✓ | -   | -   | ✓   | ✓
    U+24b0 | ✓         | ✓        | ✓ | -   | -   | ✓   | ✓
    U+24db | ✓         | ✓        | ✓ | -   | -   | ✓   | ✓
    U+24f6 | ✓         | ✓        | ✓ | -   | -   | ✓   | ✓
    U+0300 | -         | ✓        | ✓ | -   | ✓   | ✓   | -
    U+0bcd | -         | ✓        | ✓ | -   | ✓   | ✓   | -
    U+20dd | -         | ✓        | ✓ | -   | ✓   | ✓   | -
    U+2166 | ✓         | ✓        | ✓ | -   | -   | ✓   | ✓
We can see that:
1. [^\pL\pM] is not equivalent to [\PL\PM].
2. [\PL\PM] really matches everything, but
3. [\PL\PM] is still not equivalent to . since . does not match \n and \r.
The second point holds because [\PL\PM] is the union of \PL and \PM: \PL contains the characters from all categories other than L (including M), and \PM contains the characters from all categories other than M (including L). Together they contain the entire character repertoire.
[^\pL\pM], on the other hand, is the complement of the union of \pL and \pM, which is equivalent to the intersection of \PL and \PM.
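The set argument above can be checked by brute force. This sketch compares the patterns on every code point except lone surrogates; the class name CharClassEquivalence and the method name countMismatches are made up for the example, and the intersection is written in the [\PL&&[\PM]] form mentioned at the top of this answer.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class CharClassEquivalence {
    // Counts the code points (surrogates skipped) on which the two patterns disagree.
    static int countMismatches(Pattern a, Pattern b) {
        int mismatches = 0;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                continue; // skip lone surrogates
            }
            String s = new String(Character.toChars(cp));
            if (a.matcher(s).matches() != b.matcher(s).matches()) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        Pattern negatedUnion = Pattern.compile("[^\\pL\\pM]");
        Pattern intersection = Pattern.compile("[\\PL&&[\\PM]]");
        Pattern unionOfNegations = Pattern.compile("[\\PL\\PM]");

        // Complement of the union versus intersection of the complements:
        // by De Morgan's law these should agree everywhere, so this should print 0.
        System.out.println(countMismatches(negatedUnion, intersection));

        // The union of the complements matches every character, so it disagrees
        // with the negated union on every letter and mark; this prints a positive count.
        System.out.println(countMismatches(negatedUnion, unionOfNegations));
    }
}
```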