Why does the underscore fall under \w?

This may be a theoretical question.

Why is the underscore _ matched by \w in regex, and not by \W?

I hope this is not based primarily on opinions, because there must be a reason.

A citation would be great, if one is available.

+7
regex
2 answers

From the Wikipedia article on regular expressions (emphasis mine):

An additional non-POSIX class understood by some tools is [:word:], which is usually defined as [:alnum:] plus underscore. This reflects the fact that in many programming languages these are the characters that may be used in identifiers.

In Perl, Tcl, and Vim, this non-standard class is represented by \w (and characters outside this class are represented by \W).
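To make the quoted definition concrete, here is a minimal sketch in Python (not one of the tools named in the quote). It assumes the re.ASCII flag, under which Python's \w is documented as equivalent to [a-zA-Z0-9_]; the helper name classify and the sample characters are only illustrative.

```python
import re

# With re.ASCII, Python's \w is equivalent to [a-zA-Z0-9_],
# i.e. ASCII alphanumerics plus underscore.
WORD = re.compile(r"\w", re.ASCII)

def classify(ch: str) -> str:
    """Return 'word' if the single character matches \\w, else 'non-word'."""
    return "word" if WORD.fullmatch(ch) else "non-word"

samples = ["a", "Z", "7", "_", "-", " ", "$"]
results = {ch: classify(ch) for ch in samples}

for ch, kind in results.items():
    print(repr(ch), kind)
```

The underscore lands in the same class as letters and digits, while the hyphen, space, and dollar sign fall under \W.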

+8

\w matches any single code point that has any of the following properties:

  • \p{GC=Alphabetic} (letters and some more Unicode code points)

  • \p{GC=Mark} (Mark: spacing, nonspacing, enclosing)

  • \p{GC=Connector_Punctuation} (for example, underscore)

  • \p{GC=Decimal_Number} (digits and other variants of digits)

  • \p{Join_Control} (code points U+200C and U+200D)
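As a rough illustration of the properties listed above, the following Python sketch looks up the Unicode general category of a few sample characters with the standard unicodedata module. It inspects categories directly rather than using a regex, because Python's own \w is defined differently from the list above. The sample characters are my own choice.

```python
import unicodedata

# Hypothetical sample characters, one per kind mentioned in the list.
samples = {
    "letter a": "a",
    "digit 5": "5",
    "underscore": "_",
    "undertie U+203F": "\u203f",
    "combining acute U+0301": "\u0301",
    "zero width joiner U+200D": "\u200d",
}

# Map each sample to its two-letter general category code.
categories = {name: unicodedata.category(ch) for name, ch in samples.items()}

for name, cat in categories.items():
    print(f"{name}: {cat}")
```

The underscore and the undertie both report Pc (Connector_Punctuation), the digit reports Nd (Decimal_Number), and the combining accent reports Mn (a Mark subcategory). The zero width joiner reports Cf (Format), which is why Join_Control appears in the list as a separate property rather than as a general category.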

These properties cover the characters used in programming language identifiers across scripts. For example [1]:

Connector punctuation ( \p{GC=Connector_Punctuation} ) is added for programming language identifiers, thus adding "_" and similar characters.

And [2]:

the general intention is that an identifier consists of a string of characters starting with a letter or ideograph, followed by any number of letters, ideographs, numbers or underscores.
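The intention quoted above can be approximated with \w itself. The sketch below is a simplification using Python's re module, not the actual definition from [2]: the pattern [^\W\d_] means "a word character that is neither a digit nor an underscore", which stands in for "letter or ideograph". The function name is_simple_identifier is hypothetical.

```python
import re

# First character: a word character that is not a digit or underscore
# (approximating "letter or ideograph"); the rest: any word characters.
IDENTIFIER = re.compile(r"[^\W\d_]\w*")

def is_simple_identifier(text: str) -> bool:
    """Check text against the simplified identifier shape described above."""
    return IDENTIFIER.fullmatch(text) is not None

for candidate in ["foo_bar1", "変数_1", "1abc", "foo-bar"]:
    print(candidate, is_simple_identifier(candidate))
```

Because the underscore is part of \w, an identifier such as foo_bar1 matches as a single run of word characters, whereas foo-bar does not.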

\p{Join_Control} has recently been added to the \w character class, and here is the message that the Perl developers exchanged about implementing it, which supports my earlier point that \w is meant to match identifiers.

+2
