Unicode and [:alpha:]
Why is this false:
iex(1)> String.match?("汉语漢語", ~r/^[[:alpha:]]+$/)
false
But this is true?
iex(2)> String.match?("汉语漢語", ~r/[[:alpha:]]/)
true
Is [:alpha:] sometimes Unicode-aware and sometimes not?
EDIT:
I do not think my original example was clear enough.
Why is this false:
iex(1)> String.match?("汉", ~r/^[[:alpha:]]+$/)
false
But this is true?
iex(2)> String.match?("汉", ~r/[[:alpha:]]/)
true

When you pass a string to a regular expression without Unicode mode enabled, it is treated as a sequence of bytes, not as a Unicode string. Compare IO.puts byte_size("汉语漢語") (12, the number of bytes the input consists of: 230,177,137,232,175,173,230,188,162,232,170,158) with IO.puts String.length("汉语漢語") (4, the number of Unicode "letters"). The string contains bytes that the POSIX character class [:alpha:] does not match. So the first expression, which requires everything between ^ and $ to match, fails, while the second succeeds because it only needs a single match anywhere in the string.
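The byte-versus-character distinction described above can be checked directly. This is a minimal sketch that uses only standard Elixir and Erlang functions; the variable name s is just for illustration.

```elixir
s = "汉语漢語"

# Size of the UTF-8 encoding in bytes.
IO.inspect(byte_size(s))

# Number of Unicode graphemes ("letters").
IO.inspect(String.length(s))

# The raw bytes that a regex without the u modifier operates on.
# charlists: :as_lists forces the list to print as integers.
IO.inspect(:binary.bin_to_list(s), charlists: :as_lists)
```

Each of the four characters is encoded as three bytes in UTF-8, which is why the two counts differ.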
To correctly match Unicode strings with the PCRE regular expression library (which is used by Elixir), you need to enable Unicode mode with the /u modifier:
IO.puts String.match?("汉语漢語", ~r/^[[:alpha:]]+$/u)
See the IDEONE demo (prints true).
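As a side-by-side sketch, the same anchored pattern can be run with and without the u modifier. The \p{L} variant is an addition not taken from the original answer; it is a Unicode property class for letters that is available in Unicode mode.

```elixir
s = "汉语漢語"

# Without u: the pattern is applied to raw bytes
# (the question reports false for this call).
IO.inspect(String.match?(s, ~r/^[[:alpha:]]+$/))

# With u: the pattern is applied to Unicode code points
# (the answer reports true for this call).
IO.inspect(String.match?(s, ~r/^[[:alpha:]]+$/u))

# Alternative for Unicode mode: match any Unicode letter via \p{L}.
IO.inspect(String.match?(s, ~r/^\p{L}+$/u))
```

Either form works once Unicode mode is on; \p{L} simply makes the intent ("any letter in any script") explicit.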
See the Elixir Regex documentation:
unicode (u) - enables Unicode-specific patterns like \p and changes modifiers like \w, \W, \s and friends to also match on Unicode. It expects valid Unicode strings to be given on match.