Proper UTF-8 regex for CamelCase (WikiWord) in Perl

An earlier question here asked about a CamelCase regex. Combining that with tchrist's post, I wonder what the correct UTF-8 CamelCase regex is.

Starting with brian d foy's regex:

    /
        \b          # start at word boundary
        [A-Z]       # start with upper
        [a-zA-Z]*   # followed by any alpha

        (?:  # non-capturing grouping for alternation precedence
           [a-z][a-zA-Z]*[A-Z]   # next bit is lower, any zero or more, ending with upper
              |                  # or
           [A-Z][a-zA-Z]*[a-z]   # next bit is upper, any zero or more, ending with lower
        )

        [a-zA-Z]*   # anything that's left
        \b          # end at word
    /x

and changing it to:

    /
        \b                     # start at word boundary
        \p{Uppercase_Letter}   # start with upper
        \p{Alphabetic}*        # followed by any alpha

        (?:  # non-capturing grouping for alternation precedence
           \p{Lowercase_Letter}[a-zA-Z]*\p{Uppercase_Letter}   ### next bit is lower, any zero or more, ending with upper
              |                                                # or
           \p{Uppercase_Letter}[a-zA-Z]*\p{Lowercase_Letter}   ### next bit is upper, any zero or more, ending with lower
        )

        \p{Alphabetic}*        # anything that's left
        \b                     # end at word
    /x

I have a problem with the lines marked "###".

Also, how should the regex be modified if digits and the underscore are treated as equivalent to lowercase letters, so that W2X3 is a valid CamelCase word?

Update (in response to ysth's comment):

In the list below,

  • any : means "uppercase or lowercase letter, or number, or underscore"

The regex should match CamelWord and CaW:

  • start with an uppercase letter
  • optional any
  • lowercase letter or number or underscore
  • optional any
  • uppercase letter
  • optional any

Please do not mark this as a duplicate, because it is not. The original question (and its answers) considered only ASCII.

1 answer

I really can't tell what you are trying to do, but this should be closer to what your original intent seems to have been. I still can't tell what you mean to do with it, though.

    m{
        \b
        \p{Upper}      # start with uppercase code point (NOT LETTER)

        \w*            # optional ident chars

        # note that upper and lower are not related to letters
        (?:  \p{Lower} \w* \p{Upper}
          |  \p{Upper} \w* \p{Lower}
        )

        \w*

        \b
    }x
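To see how the pattern above behaves, here is a minimal sketch that runs it over a handful of sample words. The word list and the names `$camel`, `@words`, and `%is_camel` are made up for illustration. Note that W2X3 does not match, because digits are not `\p{Lower}`.

```perl
use v5.14;      # also turns on strict and the unicode_strings feature
use warnings;

# The pattern from above, stored with qr// so it can be reused.
my $camel = qr{
    \b
    \p{Upper}      # start with an uppercase code point
    \w*            # optional identifier characters
    (?:  \p{Lower} \w* \p{Upper}
      |  \p{Upper} \w* \p{Lower}
    )
    \w*
    \b
}x;

# Hypothetical sample words; "\x{3A9}" is GREEK CAPITAL LETTER OMEGA.
my @words = ('CamelCase', 'CaW', 'HTMLParser', "\x{3A9}megaPoint",
             'Camel', 'camelCase', 'ALLCAPS', 'W2X3');

# Record 1 for a match and 0 otherwise.
my %is_camel;
$is_camel{$_} = /$camel/ ? 1 : 0 for @words;

binmode STDOUT, ':encoding(UTF-8)';
printf "%-12s %s\n", $_, ($is_camel{$_} ? 'match' : 'no match') for @words;
```

The first four words match; Camel, camelCase, ALLCAPS, and W2X3 do not, since each either lacks a second uppercase code point, lacks a lowercase one, or does not start with uppercase.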

Never use [a-z]. And in fact, do not use \p{Lowercase_Letter} or \p{Ll}, since those are not the same as the more desirable and more correct \p{Lowercase} and \p{Lower}.
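The difference is easy to check. The sketch below tests three code points, picked only as illustrations, against both properties. The small Roman numeral and the circled letter carry the Lowercase property even though their general category is not Ll.

```perl
use v5.14;
use warnings;

# Hypothetical sample code points chosen for illustration.
my @samples = (
    [0x0061, 'LATIN SMALL LETTER A'],          # general category Ll
    [0x2170, 'SMALL ROMAN NUMERAL ONE'],       # general category Nl
    [0x24D0, 'CIRCLED LATIN SMALL LETTER A'],  # general category So
);

my %result;
for my $sample (@samples) {
    my ($cp, $name) = @$sample;
    my $ch = chr $cp;
    $result{$cp} = {
        Ll    => ($ch =~ /\A\p{Ll}\z/    ? 1 : 0),
        Lower => ($ch =~ /\A\p{Lower}\z/ ? 1 : 0),
    };
    printf "U+%04X %-30s Ll=%d Lower=%d\n",
        $cp, $name, $result{$cp}{Ll}, $result{$cp}{Lower};
}
```

Only U+0061 matches `\p{Ll}`, while all three match `\p{Lower}`.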

And remember that \w is really just an alias for

 [\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Letter_Number}\p{Connector_Punctuation}] 
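As a rough spot check rather than a proof, the sketch below compares `\w` with that expanded class on a few hand-picked code points. The sample list and variable names are assumptions made for illustration, and agreement on these samples says nothing about code points not listed.

```perl
use v5.14;
use warnings;

# The expanded class from above, anchored to match exactly one character.
my $expanded = qr/\A[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Letter_Number}\p{Connector_Punctuation}]\z/u;

# Hypothetical samples: A, Greek alpha, underscore, digit 7,
# combining acute accent, Roman numeral one, undertie, hyphen, space.
my @samples = (0x41, 0x03B1, 0x5F, 0x37, 0x0301, 0x2160, 0x203F, 0x2D, 0x20);

my $mismatches = 0;
for my $cp (@samples) {
    my $ch  = chr $cp;
    my $w   = $ch =~ /\A\w\z/u  ? 1 : 0;
    my $exp = $ch =~ $expanded ? 1 : 0;
    $mismatches++ if $w != $exp;
    printf "U+%04X  \\w=%d  expanded=%d\n", $cp, $w, $exp;
}
print "mismatches: $mismatches\n";
```

On these samples the two agree: everything except the hyphen and the space counts as a word character.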