Proper utf8 regex for CamelCase (WikiWord) in perl

Question

An earlier question here asked about a CamelCase regex. Combined with tchrist's post, it left me wondering what the correct UTF-8 CamelCase regex is.

Starting from brian d foy's regex:

    /
        \b          # start at word boundary
        [A-Z]       # start with upper
        [a-zA-Z]*   # followed by any alpha

        (?:  # non-capturing grouping for alternation precedence
            [a-z][a-zA-Z]*[A-Z]   # next bit is lower, any zero or more, ending with upper
          |                       # or
            [A-Z][a-zA-Z]*[a-z]   # next bit is upper, any zero or more, ending with lower
        )

        [a-zA-Z]*   # anything that's left
        \b          # end at word
    /x

and changing it to:

    /
        \b                      # start at word boundary
        \p{Uppercase_Letter}    # start with upper
        \p{Alphabetic}*         # followed by any alpha

        (?:  # non-capturing grouping for alternation precedence
            \p{Lowercase_Letter}[a-zA-Z]*\p{Uppercase_Letter}   ### next bit is lower, any zero or more, ending with upper
          |                                                     # or
            \p{Uppercase_Letter}[a-zA-Z]*\p{Lowercase_Letter}   ### next bit is upper, any zero or more, ending with lower
        )

        \p{Alphabetic}*         # anything that's left
        \b                      # end at word
    /x

I have a problem with the lines marked "###".

Also, how should the regex be changed if digits and the underscore are treated as equivalent to lowercase letters, so that W2X3 is a valid CamelCase word?

Update (following ysth's comment):

In what follows, "any" means "an uppercase or lowercase letter, a digit, or an underscore".

The regex should match CamelWord and CaW, that is:

- start with an uppercase letter
- optional any
- a lowercase letter, digit, or underscore
- optional any
- an uppercase letter
- optional any

Please do not mark this as a duplicate, because it is not one. The original question (and its answers) considered only ASCII.

regex perl unicode utf-8 camelcasing

jm666 Jun 12 '11 at 15:52

1 answer

tchrist · Accepted Answer · 2011-06-12T18:19:13+0000

I really can't tell what you are trying to do, but this should be closer to what your original intent seems to have been. I still can't tell what you mean to do with it, though.

    m{
        \b
        \p{Upper}       # start with uppercase code point (NOT LETTER)

        \w*             # optional ident chars

        # note that upper and lower are not related to letters
        (?:  \p{Lower} \w* \p{Upper}
          |  \p{Upper} \w* \p{Lower}
        )

        \w*

        \b
    }x
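To see how this pattern behaves, here is a minimal sketch that runs it over a few sample words. The names `$camel`, `is_camel`, and the word list are made up for illustration, and the `\A` ... `\z` anchoring is an added assumption so that each whole word has to match.

```perl
use strict;
use warnings;

binmode(STDOUT, ':encoding(UTF-8)');

# The pattern from the answer, stored with qr so it can be reused.
my $camel = qr{
    \b
    \p{Upper}       # uppercase code point
    \w*
    (?: \p{Lower} \w* \p{Upper}
      | \p{Upper} \w* \p{Lower}
    )
    \w*
    \b
}x;

# Hypothetical helper: true when the whole string is one CamelCase word.
sub is_camel {
    my ($word) = @_;
    return $word =~ /\A$camel\z/ ? 1 : 0;
}

# Sample words (illustrative only). The Cyrillic word is written with
# escapes so the script does not depend on the source file encoding.
my $cyrillic = "\x{41F}\x{440}\x{438}\x{432}\x{435}\x{442}\x{41C}\x{438}\x{440}";
my @words = ('CamelCase', 'CaW', $cyrillic, 'Camel', 'camelCase', 'W2X3');

for my $word (@words) {
    printf "%-12s %s\n", $word, is_camel($word) ? 'match' : 'no match';
}
```

Note that W2X3 is rejected here: the alternation still needs one code point with the Lower property, and digits do not have it.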

Never use [a-z]. In fact, do not use \p{Lowercase_Letter} or \p{Ll} either, since they are not the same as the more desirable and more correct \p{Lowercase} and \p{Lower}.
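The difference shows up with code points that have a case property without being cased letters by general category. As a small sketch, the Roman numeral characters U+2170 and U+2160 are Letter_Number (Nl), yet they have the Lowercase and Uppercase properties respectively. The variable names and the `%result` hash are illustrative only.

```perl
use strict;
use warnings;

# U+2170 SMALL ROMAN NUMERAL ONE: General_Category is Nl (Letter_Number),
# yet it has the Lowercase property.
my $roman_small = "\x{2170}";

# U+2160 ROMAN NUMERAL ONE: also Nl, and it has the Uppercase property.
my $roman_cap = "\x{2160}";

my %result = (
    small_is_Lower => ($roman_small =~ /\p{Lower}/            ? 1 : 0),
    small_is_Ll    => ($roman_small =~ /\p{Lowercase_Letter}/ ? 1 : 0),
    cap_is_Upper   => ($roman_cap   =~ /\p{Upper}/            ? 1 : 0),
    cap_is_Lu      => ($roman_cap   =~ /\p{Uppercase_Letter}/ ? 1 : 0),
);

printf "%-15s %d\n", $_, $result{$_} for sort keys %result;
```

The two `_Lower` / `_Upper` checks succeed while the `_Ll` / `_Lu` checks fail, which is why a pattern built on \p{Ll} and \p{Lu} would miss such characters.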

And remember that \w is really just an alias for

 [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Letter_Number}\p{Connector_Punctuation}]
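The pattern above does not cover the variant from the question's update, where digits and the underscore count like lowercase letters. One possible reading of that step list is sketched below. It is not part of the answer; `$lowerish`, `$wiki`, and `is_wiki` are hypothetical names, and "any" is taken to mean \w.

```perl
use strict;
use warnings;

# Assumption: "lowercase letter or number or underscore" is modelled as
# a Lower code point, a decimal digit, or a literal underscore.
my $lowerish = qr/[\p{Lower}\p{Nd}_]/;

# Follows the step list in the question's update, one step per line.
my $wiki = qr{
    \b
    \p{Upper}       # start with uppercase
    \w*             # optional any
    $lowerish       # lowercase, digit, or underscore
    \w*             # optional any
    \p{Upper}       # uppercase
    \w*             # optional any
    \b
}x;

# Hypothetical helper: true when the whole string satisfies the step list.
sub is_wiki {
    my ($word) = @_;
    return $word =~ /\A$wiki\z/ ? 1 : 0;
}

for my $word ('CamelWord', 'CaW', 'W2X3', 'W_X', 'Camel', 'WX') {
    printf "%-10s %s\n", $word, is_wiki($word) ? 'match' : 'no match';
}
```

Under this reading W2X3 and W_X are accepted, while Camel (no second uppercase) and WX (nothing lowercase-like in between) are not.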
