I searched Qaru (replacing characters, how JavaScript does not conform to the Unicode standard with regard to RegExp, etc.) and did not find a specific answer to this question:
How can I match accented characters (those with diacritical marks) in JavaScript?
I am making a field in a user interface conform to the format last_name, first_name (last [comma space] first), and I want to support diacritics, but this is apparently a bit more complicated in JavaScript than in other languages and platforms.
This was my original version until I wanted to add diacritical support:
/^[a-zA-Z]+,\s[a-zA-Z]+$/
I am currently debating between three ways to add support, all of which I have tested and found to work (at least to some extent; I do not know what the "extent" of the second approach is). Here they are:
An explicit listing of all accented characters that I would like to accept as valid (lame and too complex):
var accentedCharacters = "àèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ";
- This matches the last / first name correctly with any of the supported accented characters in accentedCharacters.
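For reference, here is a minimal sketch of how such a list could be turned into the full pattern. The names nameChars and explicitRegex are made up for this example, and the character list is deliberately shortened, so treat it as an illustration rather than the exact code I run:

```javascript
// Shortened list for illustration only; a real list would hold every accepted character.
const accentedCharacters = "àèìòùáéíóúâêîôûãñõäëïöüçÀÈÌÒÙÁÉÍÓÚÑÇ";

// None of these characters is special inside a character class,
// so the string can be placed between [ and ] without escaping.
const nameChars = "[a-zA-Z" + accentedCharacters + "]+";
const explicitRegex = new RegExp("^" + nameChars + ",\\s" + nameChars + "$");

console.log(explicitRegex.test("Muñoz, José"));     // true
console.log(explicitRegex.test("Smith, John"));     // true
console.log(explicitRegex.test("Dvořák, Antonín")); // false: ř is not in the list
```

The last case shows the maintenance problem: any character missing from the list is rejected until someone adds it.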
My other approach was to use the . metacharacter to get a simpler expression:
var regex = /^.+,\s.+$/;
- This will match just about anything, as long as it has the form something, something. That is fine, I suppose ...
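A quick sketch of how permissive this is. The variable name looseRegex and the sample inputs are only for illustration:

```javascript
const looseRegex = /^.+,\s.+$/;

console.log(looseRegex.test("Muñoz, José")); // true
console.log(looseRegex.test("12345, !!!"));  // true: digits and punctuation pass too
console.log(looseRegex.test("a, b, c, d"));  // true: extra commas are swallowed by .+
console.log(looseRegex.test("Muñoz,José"));  // false: no whitespace after the comma
```

So the only thing this pattern really enforces is "something, then a comma and one whitespace character, then something".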
The last approach I just found might be simpler ...
/^[a-zA-Z\u00C0-\u017F]+,\s[a-zA-Z\u00C0-\u017F]+$/
- This matches a range of Unicode characters. I have tested it and it works, although I have not tried anything crazy, just the normal material I see in our language department for the names of faculty members.
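This is roughly the kind of check I ran. The variable name rangeRegex and the sample names are invented for this example, not real faculty data:

```javascript
const rangeRegex = /^[a-zA-Z\u00C0-\u017F]+,\s[a-zA-Z\u00C0-\u017F]+$/;

const samples = [
  "Muñoz, José",     // ñ and é fall in U+00C0-U+00FF
  "Dvořák, Antonín", // ř (U+0159) falls in U+0100-U+017F
  "Müller, Jürgen",
  "O'Brien, Seán"    // the apostrophe is not in the class
];

const results = samples.map(function (name) {
  return rangeRegex.test(name);
});

console.log(results); // [ true, true, true, false ]
```

The last sample fails only because of the apostrophe, which the original ASCII-only pattern rejects as well.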
Here are my problems:
- The first solution is too limited, sloppy, and convoluted. It would have to be changed whenever I found I had forgotten a character or two, and that is just not very practical.
- The second solution is better and concise, but it probably matches far more than it should. I could not find any real documentation on exactly what . matches, just the generalization "any character except the newline character" (from a table on MDN).
- The third solution seems the most precise, but are there any gotchas? I am not very familiar with Unicode, at least in practice, but judging from the code table and the continuation of that table, \u00C0-\u017F looks pretty solid, at least for my expected input.
- Faculty will not be submitting forms with their names in their native scripts (for example Arabic, Chinese, Japanese, etc.), so I do not need to worry about non-Latin characters.
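Two possible gotchas with the third approach that I can read off the code chart, sketched below. The name latinRange is made up for the example, and I am assuming an engine that supports String.prototype.normalize (ES2015 or later):

```javascript
const latinRange = /^[a-zA-Z\u00C0-\u017F]+,\s[a-zA-Z\u00C0-\u017F]+$/;

// 1. The range also covers two non-letters: U+00D7 (multiplication sign)
//    and U+00F7 (division sign).
console.log(latinRange.test("\u00D7, \u00F7")); // true

// 2. Input in decomposed form (base letter + combining mark) does not match,
//    because combining marks such as U+0301 lie outside the range.
const decomposed = "Muñoz, José".normalize("NFD");
console.log(latinRange.test(decomposed)); // false

// Normalizing to the composed form (NFC) before testing avoids that.
console.log(latinRange.test(decomposed.normalize("NFC"))); // true
```

Whether either case matters depends on how the form input arrives, which I have not verified.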
So the real question(s): Which of these three approaches is best suited to the task? Or are there better solutions?
javascript regex unicode
Chris Cirefice, Dec 19 '13 at 19:54