Multibyte characters in libc regcomp and regexec

Question

Is there a way to get the libc6 regex functions regcomp and regexec to work correctly with multibyte characters?

For example, if my pattern is the UTF-8 string 猫机+猫, searching for a match in the UTF-8 encoded string 猫机机机猫 fails, where it should succeed.

I think this is because the byte representation of the character 机 is \xe6\x9c\xba, and the + matches one or more of only the last byte, \xba. I can make this instance work by putting parentheses around each multibyte character in the pattern, but since this is for an application, I cannot require that of users.
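The byte-level behaviour described above can be reproduced with a minimal sketch. The helper name matches_in_c_locale is hypothetical, and the sketch assumes the program never calls setlocale(), so it runs in the default "C" locale where the pattern is handled byte by byte.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 on match, 0 on no match, -1 if the
 * pattern does not compile. setlocale() is never called here, so the
 * process is assumed to be in the default "C" locale. */
int matches_in_c_locale(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);

    /* REG_EXTENDED so that + and ( ) have their ERE meaning. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Called with the pattern 猫机+猫 and the text 猫机机机猫, this is expected to report no match, because the + binds only to the byte \xba. With the pattern 猫(机)+猫 it is expected to report a match, since the parentheses group all three bytes of 机.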

Is there a way to flag a pattern or string as containing UTF-8 characters? Perhaps by telling libc to store the pattern as wchar instead of char?

+7

regex glibc utf-8 libc

bill_e Jan 23 '15 at 17:52


2 answers

Is there a way to flag a pattern or string as containing UTF-8 characters?

I suspect that the LC_CTYPE environment variable (or another relevant locale setting) is the way to make regcomp/regexec understand your encoding.

At least the grep program seems to take this into account, as shown in /questions/821983/what-does-constitute-one-character-for-regcomp-which-multibyte-encoding-does-determine-this/3000020#3000020; I have not tested this with the regcomp function itself.
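A minimal sketch of how this suggestion could be tried from C. A C program starts in the "C" locale regardless of the environment, so the locale has to be selected with setlocale() before the pattern is compiled. The helper name matches_in_utf8_locale and the list of locale names are assumptions; which UTF-8 locales are installed varies from system to system.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 on match, 0 on no match, -1 if no UTF-8
 * locale could be selected or the pattern does not compile.
 * Note: it changes the process-wide LC_CTYPE locale and leaves it set. */
int matches_in_utf8_locale(const char *pattern, const char *text)
{
    /* Assumed locale names; adjust to what the target system provides. */
    static const char *const candidates[] = {
        "C.UTF-8", "C.utf8", "en_US.UTF-8"
    };
    regex_t re;
    size_t i;
    int have_locale = 0;
    int rc;

    assert(pattern != NULL && text != NULL);

    for (i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        if (setlocale(LC_CTYPE, candidates[i]) != NULL) {
            have_locale = 1;
            break;
        }
    }
    if (!have_locale)
        return -1;

    /* Compile only after the locale is set, so that the pattern is
     * interpreted as multibyte characters rather than single bytes. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

An application would more typically call setlocale(LC_ALL, "") once at startup, which picks up LC_CTYPE and the other locale variables from the user's environment, instead of hard-coding locale names as this sketch does.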

0

imz - Ivan Zakharyaschev Nov 26 '16 at 23:35


Regular joe · Accepted Answer · 2015-02-21T09:02:22+0000

Could you use a regex to rewrite the regex? Here is a JavaScript example (although I know you are not using JS):

 function Examp() {
     // Wrap the character that precedes each + or * in parentheses.
     var uString = "猫机+猫+猫ymg+sah猫";
     var plussed = uString.replace(/(.)(?=[\+\*])/ig, "($1)");
     console.log("Starting with string: " + uString + "\r\n" + "Result: " + plussed);

     // Same idea, but keep a leading backslash together with the escaped character.
     uString = "猫机+猫*猫ymg+s\\a+I+h猫";
     plussed = uString.replace(/(\\?.)(?=[\+\*])/ig, "($1)");
     console.log("You can even take this a step further and account for a character being escaped, if that is a consideration.");
     console.log("Starting with string: " + uString + "\r\n" + "Result: " + plussed);
 }

 <input type="button" value="Run" onclick="Examp()" />
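The same rewriting idea can be sketched in C, working on UTF-8 bytes directly. This is an illustration under stated assumptions, not a tested solution: the names utf8_len and wrap_quantified are hypothetical, the input is assumed to be a valid UTF-8 extended regular expression, only the quantifiers *, + and ? are handled, and escapes and bracket expressions are not treated specially. Unlike the JavaScript version, it wraps only multibyte characters, since single-byte characters already behave correctly.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Length in bytes of the UTF-8 sequence introduced by lead byte c
 * (1 for ASCII and for bytes that are not valid lead bytes). */
size_t utf8_len(unsigned char c)
{
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Returns a newly allocated copy of pattern in which every multibyte
 * character directly followed by *, + or ? is wrapped in parentheses.
 * The caller frees the result. Returns NULL on allocation failure. */
char *wrap_quantified(const char *pattern)
{
    size_t n, i = 0, o = 0;
    char *out;

    assert(pattern != NULL);
    n = strlen(pattern);

    /* A k-byte character (k >= 2) grows to k + 2 <= 2k bytes, so twice
     * the input length plus the terminator is always enough. */
    out = malloc(2 * n + 1);
    if (out == NULL)
        return NULL;

    while (i < n) {
        size_t len = utf8_len((unsigned char)pattern[i]);
        int quantified;

        if (len > n - i)
            len = n - i;            /* truncated sequence: copy what is left */

        /* i + len < n guarantees the next byte is not the terminator. */
        quantified = len > 1 && i + len < n &&
                     strchr("*+?", pattern[i + len]) != NULL;

        if (quantified) out[o++] = '(';
        memcpy(out + o, pattern + i, len);
        o += len;
        if (quantified) out[o++] = ')';
        i += len;
    }
    out[o] = '\0';
    return out;
}
```

The result can then be passed to regcomp with REG_EXTENDED in place of the user's original pattern. One consequence of this design is that the added parentheses create extra subexpressions, so the numbering of any groups the user wrote shifts in the regmatch_t array.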
