How do I specify a range of regular expression characters that will work in European languages other than English?

Question

I work with the Ruby regex engine. I need to write a regex that does this

WIKI_WORD = /\b([a-z][\w_]+\.)?[A-Z][a-z]+[A-Z]\w*\b/

but will also work in European languages other than English. I don't think the character range [a-z] will cover lowercase letters in German, etc.

+6

ruby regex unicode internationalization

dan Feb 15 '11 at 14:16

2 answers

James Gray has written a series of articles on Unicode, UTF-8, and Ruby 1.8.7 and 1.9.2. They are important reading.

With Ruby 1.8.7, we could add:

 #!/usr/bin/ruby -Ku
 require 'jcode'

and get partial support for UTF-8.

From 1.9.2 on, you can use:

 # encoding: UTF-8

as the second line of your source file, and this tells Ruby to default to UTF-8. Gray recommends doing this in all the source code we write from now on.

This does not affect the external encoding used when reading or writing text, only the encoding of the source code.
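
To illustrate the difference between source encoding and external encoding, here is a minimal sketch that reads a file stored as ISO-8859-1 and transcodes it to UTF-8 on the way in. The helper name `read_as_utf8`, the sample word, and the choice of ISO-8859-1 are assumptions made for illustration; they do not come from the articles mentioned above.

```ruby
require "tempfile"

# Hypothetical helper: read a file stored in the `external` encoding and
# transcode it to UTF-8 while reading. This is independent of the
# "# encoding:" comment, which only covers literals in the source file.
def read_as_utf8(path, external)
  File.read(path, encoding: "#{external}:UTF-8")
end

# "Größe", written with \u escapes so the literal is UTF-8 in any source file.
sample = "Gr\u00F6\u00DFe"

decoded = nil
Tempfile.create("sample") do |file|
  file.binmode
  file.write(sample.encode("ISO-8859-1")) # simulate a Latin-1 file on disk
  file.flush
  decoded = read_as_utf8(file.path, "ISO-8859-1")
end

puts decoded.encoding   # UTF-8
puts decoded == sample  # true
puts decoded[/\p{Ll}+/] # röße
```

Once the text is UTF-8 in memory, the Unicode-aware regex constructs can be applied to it regardless of how the file was stored.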

Ruby 1.9.2 does not extend the usual character classes \w, \W, and \s to handle UTF-8 or Unicode. As the other comments and answers point out, only the POSIX bracket expressions and Unicode character properties in regexes do that.
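
A quick way to see this is to test one character against each class. The sketch below assumes a recent Ruby (`String#match?` needs 2.4 or later); the helper name `class_matches` is made up for this example.

```ruby
# Hypothetical helper: report which regex character classes accept `char`.
def class_matches(char)
  {
    '\w'          => char.match?(/\w/),
    '[[:alpha:]]' => char.match?(/[[:alpha:]]/),
    '[[:lower:]]' => char.match?(/[[:lower:]]/),
    '\p{L}'       => char.match?(/\p{L}/),
    '\p{Ll}'      => char.match?(/\p{Ll}/),
    '\p{Lu}'      => char.match?(/\p{Lu}/)
  }
end

p class_matches("a")      # ASCII letter: \w is true
p class_matches("\u00E4") # ä: \w is false, [[:lower:]] and \p{Ll} are true
p class_matches("\u00C4") # Ä: \w is false, \p{Lu} is true
```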

+1

the tin man Feb 15 '11 at 18:46

Tim Pietzcker · Accepted Answer · 2011-02-15T14:51:36+0000

 WIKI_WORD = /\b(\p{Ll}\w+\.)?\p{Lu}\p{Ll}+\p{Lu}\w*\b/u

should work in Ruby 1.9. \p{Lu} and \p{Ll} are shorthands for Unicode uppercase and lowercase letters. (\w already includes the underscore, so [\w_] is redundant.)

See also this answer - you may need to run Ruby in UTF-8 mode for this to work, and your script might need to be encoded in UTF-8 as well.
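
As a rough check, the sketch below compares the ASCII-only pattern from the question with the Unicode-property pattern from this answer on a few sample words. The constant names, the `first_match` helper, and the sample words are assumptions for illustration. The third pattern is not from either answer: it swaps \w for the POSIX class [[:word:]] so that the optional prefix and the tail of the word can also contain non-ASCII letters, since \w itself stays ASCII-only.

```ruby
# Pattern from the question (ASCII ranges only).
ASCII_WIKI_WORD = /\b([a-z][\w_]+\.)?[A-Z][a-z]+[A-Z]\w*\b/

# Pattern from the answer (Unicode upper/lowercase properties).
UNICODE_WIKI_WORD = /\b(\p{Ll}\w+\.)?\p{Lu}\p{Ll}+\p{Lu}\w*\b/u

# Assumed variant (not from the answers): [[:word:]] instead of \w.
WORD_CLASS_WIKI_WORD = /\b(\p{Ll}[[:word:]]+\.)?\p{Lu}\p{Ll}+\p{Lu}[[:word:]]*\b/

# Hypothetical helper: return the first matched text, or nil.
def first_match(pattern, text)
  m = pattern.match(text)
  m && m[0]
end

german = "Gr\u00F6\u00DFeWert"              # "GrößeWert"
tail   = "Stra\u00DFenBahnh\u00F6fe"        # "StraßenBahnhöfe"

p first_match(ASCII_WIKI_WORD, "WikiWord")   # "WikiWord"
p first_match(ASCII_WIKI_WORD, german)       # nil
p first_match(UNICODE_WIKI_WORD, german)     # "GrößeWert"
p first_match(WORD_CLASS_WIKI_WORD, tail)    # "StraßenBahnhöfe"
```

Whether the [[:word:]] variant is wanted depends on what should count as part of a wiki word; it also admits non-ASCII digits and connector punctuation.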
