Does POSIX regex.h support Unicode or, more generally, non-ASCII characters?

Hi, I am using the standard regex library (regcomp, regexec, ...). Now a requirement has come up, and I have to add Unicode support to my regex code.

Does the standard regex library support Unicode or, more generally, non-ASCII characters? I searched the Internet, and I don't think it does.

My project is resource-critical, so I do not want to pull in a large library such as ICU or Boost.Regex.

Any help would be appreciated.

+7
3 answers

POSIX regex seems to work correctly in a UTF-8 locale. I just wrote a simple test (see below) and used it to match a string containing Cyrillic characters against the regular expression "[[:alpha:]]" (for example), and everything works fine.

Note: The main thing to remember is that the regex functions are locale-dependent, so you must call setlocale() before using them.
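As a small illustration of the note above, here is a minimal sketch of checking that the locale actually in effect is a UTF-8 one before relying on multibyte matching. The helper names locale_is_utf8 and use_environment_locale are hypothetical (they are not part of regex.h), and the codeset spellings compared against are an assumption about what common systems report.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the active LC_CTYPE codeset is UTF-8.
 * The accepted spellings are an assumption; systems may report others. */
int locale_is_utf8(void)
{
    const char *codeset = nl_langinfo(CODESET);
    return strcmp(codeset, "UTF-8") == 0 || strcmp(codeset, "utf8") == 0;
}

/* Hypothetical helper: adopt the locale named by the environment
 * (LANG, LC_ALL, ...) and report whether it is a UTF-8 locale.
 * Returns 0 if the locale could not be set; "C" then stays active. */
int use_environment_locale(void)
{
    if (setlocale(LC_ALL, "") == NULL)
        return 0;
    return locale_is_utf8();
}
```

A program could call use_environment_locale() once at startup and warn the user if it returns 0, since in the default "C" locale the regex functions treat every byte as a separate character.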

    #include <sys/types.h>
    #include <string.h>
    #include <regex.h>
    #include <stdio.h>
    #include <locale.h>

    int main(int argc, char **argv)
    {
        int ret;
        regex_t reg;
        regmatch_t matches[10];

        if (argc != 3) {
            fprintf(stderr, "Usage: %s regex string\n", argv[0]);
            return 1;
        }

        setlocale(LC_ALL, ""); /* Use system locale instead of default "C" */

        if ((ret = regcomp(&reg, argv[1], 0)) != 0) {
            char buf[256];
            regerror(ret, &reg, buf, sizeof(buf));
            fprintf(stderr, "regcomp() error (%d): %s\n", ret, buf);
            return 1;
        }

        if ((ret = regexec(&reg, argv[2], 10, matches, 0)) == 0) {
            size_t i;
            char buf[256];

            for (i = 0; i < sizeof(matches) / sizeof(matches[0]); i++) {
                /* rm_so/rm_eo are byte offsets of type regoff_t */
                int so = (int)matches[i].rm_so;
                int eo = (int)matches[i].rm_eo;
                int size = eo - so;

                if (so == -1) /* no more (sub)matches */
                    break;
                if ((size_t)size >= sizeof(buf)) {
                    fprintf(stderr, "match (%d-%d) is too long (%d)\n", so, eo, size);
                    continue;
                }
                /* Copy the matched bytes and terminate the string */
                memcpy(buf, argv[2] + so, (size_t)size);
                buf[size] = '\0';
                printf("%d: %d-%d: '%s'\n", (int)i, so, eo, buf);
            }
        }

        regfree(&reg); /* release the compiled pattern */
        return 0;
    }

Usage example:

    $ locale
    LANG=ru_RU.UTF-8
    LC_CTYPE="ru_RU.UTF-8"
    LC_COLLATE="ru_RU.UTF-8"
    ... (skip)
    LC_ALL=
    $ ./reg '[[:alpha:]]' ' 359 '
    0: 5-7: ''
    $

The match covers bytes 5 to 7, i.e. the first Cyrillic letter that follows the leading " 359 ". It is two bytes long because each Cyrillic letter takes two bytes in UTF-8.
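Since rm_so and rm_eo are byte offsets, a caller that needs character positions has to convert them. The following is only a sketch of one way to do that for UTF-8 input; the helper name utf8_chars_before is hypothetical, and it assumes the string is valid UTF-8 and that the offset falls on a character boundary (as regexec offsets do in a UTF-8 locale).

```c
#include <stddef.h>

/* Hypothetical helper: number of UTF-8 characters contained in the
 * first `nbytes` bytes of `s`. Assumes `s` is valid UTF-8 and that
 * `nbytes` falls on a character boundary. */
size_t utf8_chars_before(const char *s, size_t nbytes)
{
    size_t count = 0;
    size_t i;

    for (i = 0; i < nbytes && s[i] != '\0'; i++) {
        /* Continuation bytes look like 10xxxxxx; every other byte
         * starts a new character. */
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For a match reported as 5-7 in the example above, utf8_chars_before(str, 5) gives the character index where the match starts, and utf8_chars_before(str, 7) minus that value gives its length in characters.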

+6

Basically, POSIX regular expressions are not Unicode-aware. You can try to use them on Unicode text, but there may be problems with glyphs that have multiple encodings, and other similar issues that Unicode-aware libraries handle for you.

From the IEEE Std 1003.1-2008 standard:

Matching shall be based on the bit pattern used for encoding the character, not on the graphic representation of the character. This means that if a character set contains two or more encodings for a graphic symbol, or if the strings searched contain text encoded in more than one codeset, no attempt is made to search for any other representation of the encoded symbol. If that is required, the user can specify equivalence classes containing all variations of the desired graphic symbol.
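To make the quoted rule concrete, here is a small sketch that uses the letter "é" as an illustrative case (my choice of example, not taken from the standard): Unicode can encode it either as the single code point U+00E9 or as "e" followed by the combining accent U+0301. The two look identical on screen but are different byte sequences, so a pattern written with one form is not expected to match text written with the other. The helper name regex_matches and the two constants are hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `pattern` matches somewhere in
 * `text`, 0 if it does not, and -1 if the pattern fails to compile. */
int regex_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* The same visible letter in two Unicode forms, as UTF-8 bytes. */
static const char E_ACUTE_PRECOMPOSED[] = "\xC3\xA9";  /* U+00E9 */
static const char E_ACUTE_DECOMPOSED[]  = "e\xCC\x81"; /* U+0065 U+0301 */
```

Calling regex_matches(E_ACUTE_PRECOMPOSED, E_ACUTE_DECOMPOSED) compares bit patterns only, so no normalization takes place. If both forms must be accepted, the text would have to be normalized beforehand, or the pattern would have to list both variants explicitly.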

Maybe libpcre will work for you? It is a bit heavier than POSIX regular expressions, but I would think it is lighter than ICU or Boost.

+6

If you really mean "Standard", i.e. std::regex from C++11, then all you have to do is switch to std::wregex (and std::wstring, of course).

0
