Specific Javascript Regex for accented characters (diacritics)

Question


I have looked on Qaru (replacing characters, how JavaScript does not conform to the Unicode standard regarding RegExp, etc.) and have not really found a concrete answer to the question:

How can JavaScript match for accented characters (those with diacritical marks)?

I am forcing a field in a UI to match the format: last_name, first_name (last [comma space] first), and I want to provide support for diacritics, but evidently this is a bit more difficult in JavaScript than in other languages/platforms.

This was my original version until I wanted to add diacritical support:

/^[a-zA-Z]+,\s[a-zA-Z]+$/

I am currently debating between three ways to add support, all of which I have tested and which work (at least to some extent; I do not know how far the second approach really reaches). Here they are:

An explicit listing of all accented characters that I would like to accept as valid (lame and too complex):

 var accentedCharacters = "àèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ";
 // Build the full regex
 var regex = "^[a-zA-Z" + accentedCharacters + "]+,\\s[a-zA-Z" + accentedCharacters + "]+$";
 // Create a RegExp from the string version
 var regexCompiled = new RegExp(regex);
 // regexCompiled = /^[a-zA-ZàèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ]+,\s[a-zA-ZàèìòùÀÈÌÒÙáéíóúýÁÉÍÓÚÝâêîôûÂÊÎÔÛãñõÃÑÕäëïöüÿÄËÏÖÜŸçÇßØøÅåÆæœ]+$/

This correctly matches a last/first name containing any of the supported accented characters in accentedCharacters.

My other approach was to use a character class `.` to have a simpler expression:

 var regex = /^.+,\s.+$/;

This matches just about anything, at least in the form: something, something. That is alright, I suppose...
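To make "just about anything" concrete, here is a small sketch that runs the second pattern against a few inputs. The sample strings are made up for illustration and are not from the original form.

```javascript
// The question's second pattern, copied as-is.
const dotPattern = /^.+,\s.+$/;

// Hypothetical sample inputs, chosen only to show how permissive `.` is.
const samples = [
  "N\u00FA\u00F1ez, Jos\u00E9", // the intended kind of input
  "12345, !!!",                 // digits and punctuation pass too
  "Smith, John, Jr",            // so does more than one comma
  "Smith,\nJohn",               // \s also accepts a newline after the comma
  "Smith\nJones, John",         // rejected: `.` cannot cross the newline
];

for (const s of samples) {
  console.log(JSON.stringify(s), dotPattern.test(s));
}
```

Only the last sample is rejected, so this pattern checks the "something, something" shape and nothing else.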

The last approach I just found might be simpler ...

 /^[a-zA-Z\u00C0-\u017F]+,\s[a-zA-Z\u00C0-\u017F]+$/

It matches a range of Unicode characters. I have tested it and it works, although I have not tried anything crazy, just the normal material that I see in our language department for the names of faculty members.

Here are my problems:

The first solution is far too limiting, and sloppy and convoluted at that. It would have to be changed if I forgot a character or two, and that is just not very practical.
The second solution is better and concise, but it probably matches far more than it should. I could not find any real documentation on what exactly `.` matches, just the generalization "any character except the newline character" (from a table on MDN).
The third solution seems the most precise, but are there any gotchas? I am not very familiar with Unicode, at least in practice, but looking at the code table and its continuation, \u00C0-\u017F seems pretty solid, at least for the expected input.
- The faculty will not submit forms with their names in their native script (for example, Arabic, Chinese, Japanese, etc.), so I do not need to worry about non-Latin characters.
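One gotcha with the third pattern that is easy to check: the same visible name can arrive precomposed or as base letters plus combining marks, and the combining marks (U+0300 and up) fall outside \u00C0-\u017F. The sketch below assumes that normalizing to NFC before testing is acceptable for this form; `isValidName` is a hypothetical helper name, not something from the question.

```javascript
// The question's third pattern, copied as-is.
const rangePattern = /^[a-zA-Z\u00C0-\u017F]+,\s[a-zA-Z\u00C0-\u017F]+$/;

// Hypothetical helper: normalize to NFC so that a base letter followed by a
// combining accent becomes one precomposed code point before testing.
function isValidName(input) {
  return rangePattern.test(input.normalize("NFC"));
}

const composed = "N\u00FA\u00F1ez, Jos\u00E9"; // precomposed accented letters
const decomposed = composed.normalize("NFD");  // same text, combining marks

console.log(rangePattern.test(composed));   // precomposed form passes
console.log(rangePattern.test(decomposed)); // combining marks are rejected
console.log(isValidName(decomposed));       // passes once normalized
console.log(isValidName("O'Brien, Se\u00E1n")); // apostrophe is not in the class
```

The last line is a reminder that apostrophes, hyphens and spaces inside a name are rejected as well, independent of diacritics.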

So the real question(s): Which of these three approaches is most suitable for the task? Or are there better solutions?

+115

javascript regex unicode

Chris Cirefice Dec 19 '13 at 19:54


7 answers

An easier way to accept all accents is this:

 [A-zÀ-ú] // accepts lowercase and uppercase characters
 [A-zÀ-ÿ] // as above, but including letters with an umlaut (also includes [ \ ] ^ _ ` × ÷)
 [A-Za-zÀ-ÿ] // as above, but not including [ \ ] ^ _ `
 [A-Za-zÀ-ÖØ-öø-ÿ] // as above, but not including [ \ ] ^ _ ` × ÷

See https://unicode-table.com/en/ for characters listed in numerical order.
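A quick way to see the difference between these four classes is to probe them with a handful of characters. This is only a sketch; the probe list is my own choice, and the classes are anchored with ^ and $ so that each probe has to match as a whole.

```javascript
// The four classes from the answer, anchored so the whole string must match.
const classes = [
  /^[A-zÀ-ú]+$/,
  /^[A-zÀ-ÿ]+$/,
  /^[A-Za-zÀ-ÿ]+$/,
  /^[A-Za-zÀ-ÖØ-öø-ÿ]+$/,
];

// Probe characters: e-acute, y-diaeresis, two ASCII symbols that sit
// between "Z" and "a", and the multiplication and division signs.
const probes = ["\u00E9", "\u00FF", "_", "^", "\u00D7", "\u00F7"];

for (const re of classes) {
  const accepted = probes.filter((ch) => re.test(ch));
  console.log(re.source.padEnd(24), accepted.join(" "));
}
```

Only the last class accepts the two letters and rejects all four symbols.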

+178

Maycow Moura Nov 13 '14 at 2:02


The accented Latin range \u00C0-\u017F was not quite enough for my database of names, so I extended the regex to

 [a-zA-Z\u00C0-\u024F]
 [a-zA-Z\u00C0-\u024F\u1E00-\u1EFF] // includes even more Latin chars

I added these code blocks (\u00C0-\u024F covers three adjacent blocks at once):

\u00C0-\u00FF Latin-1 Supplement
\u0100-\u017F Latin Extended-A
\u0180-\u024F Latin Extended-B
\u1E00-\u1EFF Latin Extended Additional

Note that \u00C0-\u00FF is actually only part of the Latin-1 Supplement. This range skips the unprintable control signals and all symbols except for the awkwardly placed multiplication sign × \u00D7 and division sign ÷ \u00F7.

\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF can replace \u00C0-\u00FF to exclude × ÷

If you need more code points, you can find more ranges in Wikipedia's List of Unicode characters. For example, you could also add Latin Extended-C, D and E, but I left them out because only historians are interested in them now, and the D and E sets do not even render correctly in my browser.

The original regular expression, which stopped at \u017F, choked on the name "Șenol". According to the FontSpace Unicode Analyzer, the first character is \u0218, LATIN CAPITAL LETTER S WITH COMMA BELOW. (Yes, it is usually spelled with a cedilla-S \u015E, "Şenol". But I am not flying to Turkey to tell him, "You are spelling your name wrong!")
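The effect of each range boundary can be checked directly. In this sketch the variable names are just labels I picked, the inputs are written with \u escapes to avoid any encoding ambiguity, and the Vietnamese surname is an extra example of my own to exercise the Latin Extended Additional block.

```javascript
// Ranges taken from the answer; the constant names are hypothetical labels.
const upTo017F = /^[a-zA-Z\u00C0-\u017F]+$/;
const upTo024F = /^[a-zA-Z\u00C0-\u024F]+$/;
const withAdditional = /^[a-zA-Z\u00C0-\u024F\u1E00-\u1EFF]+$/;

const cedilla = "\u015Eenol";     // S with cedilla, Latin Extended-A
const commaBelow = "\u0218enol";  // S with comma below, Latin Extended-B
const vietnamese = "Nguy\u1EC5n"; // U+1EC5, Latin Extended Additional

console.log(upTo017F.test(cedilla));          // inside \u00C0-\u017F
console.log(upTo017F.test(commaBelow));       // U+0218 is past \u017F
console.log(upTo024F.test(commaBelow));       // covered once Extended-B is added
console.log(upTo024F.test(vietnamese));       // U+1EC5 is past \u024F
console.log(withAdditional.test(vietnamese)); // covered by \u1E00-\u1EFF
```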

+24

Chaim Leib Halbert Aug 24 '16 at 23:38


The XRegExp library has a plugin called Unicode that helps solve such problems.

 <script src="xregexp.js"></script>
 <script src="addons/unicode/unicode-base.js"></script>
 <script>
   var unicodeWord = XRegExp("^\\p{L}+$");
   unicodeWord.test(""); // false: + needs at least one letter
   unicodeWord.test("日本語"); // true
   unicodeWord.test("العربية"); // true
 </script>

It was mentioned in the comments on the question, but it is easy to miss. I noticed it only after I had submitted this answer.
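For engines that support ES2018 Unicode property escapes, roughly the same check can be written without a library. This is a different technique from the XRegExp plugin above, shown only as a sketch: it needs the `u` flag, and the `\p{Script=Latin}` variant is my own addition for the Latin-only case described in the question.

```javascript
// Native Unicode property escapes; they only work with the `u` flag.
const unicodeWord = /^\p{L}+$/u;              // any letter, any script
const latinWord = /^\p{Script=Latin}+$/u;     // Latin-script characters only

console.log(unicodeWord.test("日本語"));       // letters from any script pass
console.log(unicodeWord.test("العربية"));
console.log(unicodeWord.test("Jos\u00E9"));
console.log(unicodeWord.test("abc123"));      // digits are not letters
console.log(unicodeWord.test(""));            // + needs at least one letter

console.log(latinWord.test("Jos\u00E9"));     // accented Latin passes
console.log(latinWord.test("日本語"));         // other scripts are rejected
```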

+14

thorn̈ Jan 10 '15 at 15:50


How about this?

 /^[a-zA-ZÀ-ÖØ-öø-ÿ]+$/

+9

alchn Jul 07 '17 at 3:37


How about this?

 ^([a-zA-Z]|[à-ú]|[À-Ú])+$

It will match any word, with or without accented characters.
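The ranges à-ú and À-Ú stop at U+00FA and U+00DA, so a few common letters fall outside them while × and ÷ fall inside. A short sketch, with sample words of my own choosing, shows where that bites:

```javascript
// The pattern from this answer, copied as-is.
const pattern = /^([a-zA-Z]|[à-ú]|[À-Ú])+$/;

console.log(pattern.test("Jos\u00E9"));    // e-acute (U+00E9) is inside à-ú
console.log(pattern.test("M\u00FCller"));  // u-diaeresis (U+00FC) is past ú
console.log(pattern.test("\u00DCber"));    // U-diaeresis (U+00DC) is past Ú
console.log(pattern.test("\u00D7"));       // multiplication sign is inside À-Ú
console.log(pattern.test("\u00F7"));       // division sign is inside à-ú
```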

+6

Javier Pallarés Dec 05 '18 at 11:56


Based on this Wikipedia list: https://en.wikipedia.org/wiki/List_of_Unicode_characters#Basic_Latin

For Latin letters I use

 /^[A-zÀ-ÖØ-öø-ÿ]+$/

It avoids hyphens and most special characters (note that A-z still admits [ \ ] ^ _ `; use A-Za-z to exclude them).

+5

fdsfdsfdsfds Apr 27 '17 at 6:57


Bergi · Accepted Answer · 2013-12-19 21:40

Which of these three approaches is most suitable for the task?

Depends on the task :-) To match exactly all Latin characters and their accented versions, the Unicode ranges probably provide the best solution. They could be extended to all non-whitespace characters, which can be done using the \S character class.

I make the field in the user interface conform to the format: last_name, first_name (last [comma space])

The main problem I see here is not diacritics but whitespace. Quite a few names consist of several words, e.g. because of titles. So you should go with the most generic approach, that is, allow everything except the comma that separates the last name from the first name:

 /[^,]+,\s[^,]+/

But your second solution, with the `.` character class, is just as fine; you might only need to take care of multiple commas then.
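The difference between the two patterns on multiple commas can be sketched as follows. Note that the regex above has no anchors, so I have added an anchored variant as an assumption about how it would be used for form validation; `parseName` is a hypothetical helper, and the sample names are made up.

```javascript
const unanchored = /[^,]+,\s[^,]+/;   // as written in the answer
const anchored = /^[^,]+,\s[^,]+$/;   // assumed variant: whole input must match
const dot = /^.+,\s.+$/;              // the question's second pattern

// Hypothetical helper that also extracts the two parts.
function parseName(input) {
  const m = /^([^,]+),\s([^,]+)$/.exec(input);
  return m ? { last: m[1], first: m[2] } : null;
}

const multiWord = "de la Cruz, Mar\u00EDa Jos\u00E9";
const twoCommas = "Smith, John, Jr";

console.log(anchored.test(multiWord), dot.test(multiWord)); // both accept
console.log(unanchored.test(twoCommas)); // accepts: it matches a substring
console.log(anchored.test(twoCommas));   // rejects the second comma
console.log(dot.test(twoCommas));        // accepts: `.` also matches commas
console.log(parseName(multiWord));
```

With the anchored negated class, exactly one comma is allowed, which is what makes splitting into last and first name unambiguous.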
