What counts as a single character for regcomp? Which locale setting actually determines the multibyte encoding?

regcomp (from glibc) is a POSIX function for compiling regular expressions.

  int regcomp(regex_t *restrict preg, const char *restrict pattern, int cflags); 

Some constructs in regular expressions depend on the notion of a single character, for example the bracket expression [abc].

If a multibyte encoding is in use and the expression contains a multibyte letter, the result differs depending on whether the pattern is interpreted as a sequence of bytes or as a sequence of multibyte characters.

Here I illustrate the idea with grep (which does not necessarily behave the same way as the C regcomp function in this respect):

 $ { echo ; echo ; } | egrep '[]'
 $ { echo ; echo ; } | LANG=C egrep '[]'
 $

LANG is only the fallback used when a more specific locale variable is not set, so the question is: which specific locale category determines the encoding that regcomp assumes?

 $ locale
 LANG=ru_RU.utf8
 LC_CTYPE="ru_RU.utf8"
 LC_NUMERIC="ru_RU.utf8"
 LC_TIME="ru_RU.utf8"
 LC_COLLATE="ru_RU.utf8"
 LC_MONETARY="ru_RU.utf8"
 LC_MESSAGES=POSIX
 LC_PAPER="ru_RU.utf8"
 LC_NAME="ru_RU.utf8"
 LC_ADDRESS="ru_RU.utf8"
 LC_TELEPHONE="ru_RU.utf8"
 LC_MEASUREMENT="ru_RU.utf8"
 LC_IDENTIFICATION="ru_RU.utf8"
 LC_ALL=
 $
1 answer

For grep at least (which need not behave the same way as regcomp), LC_CTYPE appears to be the deciding variable:

 $ { echo ; echo ; } | LANG=en_US.utf8 egrep '[]'
 $ { echo ; echo ; } | LANG=en_US.utf8 LC_COLLATE=C egrep '[]'
 $ { echo ; echo ; } | LANG=en_US.utf8 LC_CTYPE=C egrep '[]'
 $