How do I specify a regular expression character range that works for European languages other than English?

I work with the Ruby regex engine. I need to write a regex that does this

WIKI_WORD = /\b([a-z][\w_]+\.)?[A-Z][a-z]+[A-Z]\w*\b/

but that also works for European languages other than English. I do not think the character range [a-z] covers lowercase letters in German, etc.

+6
2 answers
 WIKI_WORD = /\b(\p{Ll}\w+\.)?\p{Lu}\p{Ll}+\p{Lu}\w*\b/u 

should work in Ruby 1.9. \p{Lu} and \p{Ll} are shorthand for Unicode uppercase and lowercase letters. (\w already includes the underscore, so [\w_] reduces to \w.)

See also this answer - you may need to run Ruby in UTF-8 mode for this to work, and your script might need to be encoded in UTF-8 as well.
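As a rough illustration of the difference, the sketch below compares an ASCII-only pattern (the one from the question, with the range hyphens written out and the redundant underscore dropped) against the Unicode-aware pattern above. The constant ASCII_WIKI_WORD, the helper wiki_word? and the sample word "GrößeWiki" are made up for this example, and it assumes the script and its strings are UTF-8.

```ruby
# encoding: UTF-8

# ASCII-only pattern from the question ([\w_] shortened to \w).
ASCII_WIKI_WORD = /\b([a-z]\w+\.)?[A-Z][a-z]+[A-Z]\w*\b/

# Unicode-aware pattern from this answer.
WIKI_WORD = /\b(\p{Ll}\w+\.)?\p{Lu}\p{Ll}+\p{Lu}\w*\b/u

# Hypothetical helper: true when the pattern matches somewhere in the text.
def wiki_word?(pattern, text)
  !(pattern =~ text).nil?
end

puts wiki_word?(ASCII_WIKI_WORD, "WikiWord")   # true
puts wiki_word?(WIKI_WORD, "WikiWord")         # true

# "ö" and "ß" fall outside [a-z], so the ASCII pattern gives up here.
puts wiki_word?(ASCII_WIKI_WORD, "GrößeWiki")  # false
puts wiki_word?(WIKI_WORD, "GrößeWiki")        # true
```

Note that the trailing \w* in both patterns is left as written in the answer; whether \w itself accepts non-ASCII letters is a separate question from \p{Ll} and \p{Lu}.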

+7
James Gray has written articles on Unicode, UTF-8, and Ruby 1.8.7 and 1.9.2. They are essential reading.

With Ruby 1.8.7, we could add:

 #!/usr/bin/ruby -KU
 require 'jcode'

and get partial support for UTF-8.

From 1.9.2 you can use:

 # encoding: UTF-8 

as the second line of your source file (after the shebang), and that tells Ruby to default to UTF-8 for that file. Gray recommends doing this in all source we write from now on.

This will not affect the external encoding when reading / writing text, but only the encoding of the source code.
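A minimal sketch of that distinction, assuming Ruby 2.1 or later (for the block form of Tempfile.create): the magic comment only determines __ENCODING__ for the source file, while the encoding of data read or written is chosen per IO object, here with the "w:UTF-8" and "r:UTF-8" mode strings. The temporary file name prefix and the sample text are invented for the example.

```ruby
# encoding: UTF-8
require 'tempfile'

# Source encoding: governed by the magic comment on the first line.
source_encoding = __ENCODING__

# External encoding: chosen per IO object, independent of the magic comment.
text = Tempfile.create('wiki') do |f|
  File.open(f.path, 'w:UTF-8') { |io| io.write("Größe") }
  File.open(f.path, 'r:UTF-8') { |io| io.read }
end

puts source_encoding   # encoding of string literals in this file
puts text.encoding     # encoding tagged on the data read back
```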

Ruby 1.9.2 does not extend the usual character classes \w, \W and \s to handle UTF-8 or Unicode. As the other comments and answers say, only the POSIX and Unicode character classes in regex do that.
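A small check along these lines, assuming a UTF-8 source file and Ruby 2.1 or later (for Array#to_h); the CLASSES table and the sample letter are only for illustration:

```ruby
# encoding: UTF-8

letter = "ä"  # U+00E4, a lowercase letter outside ASCII

# Hypothetical lookup table: a label for each character class and its regex.
CLASSES = {
  '\w'          => /\w/,           # shorthand class
  '[[:alpha:]]' => /[[:alpha:]]/,  # POSIX bracket class
  '\p{L}'       => /\p{L}/,        # Unicode property: any letter
  '\p{Ll}'      => /\p{Ll}/        # Unicode property: lowercase letter
}

# Map each label to whether its regex matches the sample letter.
results = CLASSES.map { |name, re| [name, !(re =~ letter).nil?] }.to_h

results.each { |name, matched| puts "#{name}: #{matched}" }
```

On the Ruby versions this was written against, only the \w entry reports false, which is the behaviour described above.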

+1