Unicode and [:alpha:]

Question

Why is this false:

 iex(1)> String.match?("汉语漢語", ~r/^[[:alpha:]]+$/)
 false

But this is true:

 iex(2)> String.match?("汉语漢語", ~r/[[:alpha:]]/)
 true

Is [:alpha:] sometimes Unicode-aware, and sometimes not?

EDIT:

I do not think my original example was clear enough.

Why is this false:

 iex(1)> String.match?("汉", ~r/^[[:alpha:]]+$/)
 false

But this is true:

 iex(2)> String.match?("汉", ~r/[[:alpha:]]/)
 true

regex elixir

mwoods79 Nov 07 '15 at 18:46

1 answer

Wiktor Stribiżew · Accepted Answer · 2015-11-07T20:25:31+0000

When you pass a string to a regular expression in non-Unicode mode, it is treated as an array of bytes, not as a Unicode string. Compare IO.puts byte_size("汉语漢語") (12, the bytes the input consists of: 230,177,137,232,175,173,230,188,162,232,170,158) with IO.puts String.length("汉语漢語") (4, the Unicode "letters"). Some of those bytes cannot be matched by the POSIX character class [:alpha:]. Thus the first expression fails, because the anchored ^...+$ pattern requires every byte to match, while the second succeeds, because it needs only one matching byte to return a match.
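The byte-versus-character difference can be checked directly. This is a minimal sketch; the variable name s is only for illustration, and the two regex results are the ones reported in the question rather than something guaranteed across Erlang/OTP versions.

```elixir
s = "汉语漢語"

# UTF-8 encodes each of these four characters as three bytes.
IO.inspect(byte_size(s))            # 12
IO.inspect(String.length(s))        # 4
IO.inspect(:binary.bin_to_list(s))  # [230, 177, 137, 232, 175, 173, ...]

# Without the u modifier the pattern is applied to those bytes.
# The question reports false for the anchored pattern and true
# for the unanchored one.
IO.inspect(String.match?(s, ~r/^[[:alpha:]]+$/))
IO.inspect(String.match?(s, ~r/[[:alpha:]]/))
```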

To correctly match Unicode strings with the PCRE regular expression library (which is used by Elixir), you need to enable Unicode mode with the /u modifier:

 IO.puts String.match?("汉语漢語", ~r/^[[:alpha:]]+$/u)

See the IDEONE demo (prints true).

See the Elixir Regex reference:

unicode (u) - enables Unicode-specific patterns like \p and changes modifiers like \w, \W, \s and friends to also match on Unicode. It expects valid Unicode strings to be given on match.
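As a small illustration of the u modifier described above, the sketch below uses the answer's [[:alpha:]] pattern and the Unicode letter property \p{L}. The variable name s and the second test string are assumptions made for the example.

```elixir
s = "汉语漢語"

# With u, the subject is read as Unicode code points, so the anchored
# POSIX class from the answer matches the whole string.
IO.inspect(String.match?(s, ~r/^[[:alpha:]]+$/u))   # true

# \p{L} ("any letter") gives the same result.
IO.inspect(String.match?(s, ~r/^\p{L}+$/u))         # true

# A space and digits are not letters, so the anchored match fails.
IO.inspect(String.match?("汉语 123", ~r/^\p{L}+$/u)) # false
```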
