What is a single character for regcomp? What multibyte encoding really defines this?

Question

regcomp (from glibc) is a POSIX function for compiling regular expressions.

  int regcomp(regex_t *restrict preg, const char *restrict pattern, int cflags);

Some constructs in regular expressions depend on the notion of a single character, for example the bracket expression [abc].

If the locale uses a multibyte encoding and the expression contains a multibyte letter, the result differs depending on whether the pattern is interpreted as a sequence of bytes or as a sequence of multibyte characters: a bracket expression holding one two-byte letter matches either that whole letter or each of its two bytes separately.
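To make the byte-versus-character distinction concrete, here is a minimal sketch that counts how many characters the C library sees in a string under the current LC_CTYPE. The helper name count_mb_chars is hypothetical, and the sketch only shows what mbrlen reports; it does not by itself prove that regcomp uses the same decoding.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the multibyte characters in s as the C
   library decodes them under the current LC_CTYPE setting.
   Returns -1 if s holds an invalid or incomplete sequence. */
long count_mb_chars(const char *s)
{
    mbstate_t state;
    size_t remaining;
    long count = 0;

    assert(s != NULL);
    remaining = strlen(s);
    memset(&state, 0, sizeof state);   /* start in the initial shift state */

    while (remaining > 0) {
        size_t n = mbrlen(s, remaining, &state);
        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;                 /* invalid or truncated sequence */
        if (n == 0)
            n = 1;                     /* NUL byte; not expected, strlen stops before it */
        s += n;
        remaining -= n;
        count++;
    }
    return count;
}
```

After setlocale(LC_CTYPE, "") in a UTF-8 environment, a two-byte Cyrillic letter should count as one character, while strlen still reports two bytes.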

Here I illustrate the idea with grep (which need not behave the same way as the C regcomp function in this respect):

 $ { echo ; echo ; } | egrep '[]'
 $ { echo ; echo ; } | LANG=C egrep '[]'
 $

LANG supplies the default for any locale category (LC_*) that is not set explicitly, so the question is: which specific category determines what regcomp treats as a single character?

 $ locale
 LANG=ru_RU.utf8
 LC_CTYPE="ru_RU.utf8"
 LC_NUMERIC="ru_RU.utf8"
 LC_TIME="ru_RU.utf8"
 LC_COLLATE="ru_RU.utf8"
 LC_MONETARY="ru_RU.utf8"
 LC_MESSAGES=POSIX
 LC_PAPER="ru_RU.utf8"
 LC_NAME="ru_RU.utf8"
 LC_ADDRESS="ru_RU.utf8"
 LC_TELEPHONE="ru_RU.utf8"
 LC_MEASUREMENT="ru_RU.utf8"
 LC_IDENTIFICATION="ru_RU.utf8"
 LC_ALL=
 $

regex posix glibc multibyte locale

imz - Ivan Zakharyaschev Nov 25 '16 at 16:47

1 answer

imz - Ivan Zakharyaschev · Answer 1 · 2016-11-25T16:47:03+0000

As for grep (which need not behave the same way as regcomp), LC_CTYPE appears to be the category that decides this:

 $ { echo ; echo ; } | LANG=en_US.utf8 egrep '[]'
 $ { echo ; echo ; } | LANG=en_US.utf8 LC_COLLATE=C egrep '[]'
 $ { echo ; echo ; } | LANG=en_US.utf8 LC_CTYPE=C egrep '[]'
 $
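The same experiment can be repeated against regcomp itself. The sketch below resets every category to "C", switches exactly one category to a chosen locale, and then compiles and runs a pattern. The names regex_matches and probe_category are hypothetical helpers, not part of any library, and the sketch makes no claim about what regcomp will report on a given system.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Compile pattern as a POSIX extended regex under the current locale and
   test it against text.
   Returns 1 on match, 0 on no match, -1 if regcomp rejects the pattern. */
static int regex_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}

/* Hypothetical probe: put all categories in the "C" locale, then set only
   `category` (e.g. LC_CTYPE or LC_COLLATE) to locale_name before matching.
   Returns -2 if either setlocale call fails, otherwise the result of
   regex_matches. The process locale stays changed afterwards. */
int probe_category(int category, const char *locale_name,
                   const char *pattern, const char *text)
{
    assert(locale_name != NULL && pattern != NULL && text != NULL);

    if (setlocale(LC_ALL, "C") == NULL)
        return -2;
    if (setlocale(category, locale_name) == NULL)
        return -2;
    return regex_matches(pattern, text);
}
```

Assuming a UTF-8 source file and an installed en_US.utf8 locale, calling probe_category once with LC_CTYPE and once with LC_COLLATE, using an anchored pattern such as ^[x]$ with a single Cyrillic letter in place of x and the same letter as text, would show which category changes the outcome. If regcomp behaves like grep here, only the LC_CTYPE call would be expected to report a match.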
