How to use a Unicode range in a C++ regex

I need to use a Unicode range in a regex in C++. Basically, I need the regex to accept all valid Unicode characters. I tried a test expression and ran into some problems with it.


std::regex reg("^[\\u0080-\\uDB7Fa-z0-9!#$%&'*+/=?^_`{|}~-]+$");

Is the problem with \\u?

1 answer

This should work fine, but you need to use std::wregex and std::wsmatch. You will need to convert the source string and the regular expression to wide-character Unicode (UTF-32 on Linux, UTF-16(ish) on Windows) to make it work.
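When the pattern and the text can be written as wide literals, no conversion step is needed at all. The sketch below illustrates the point with a hypothetical helper, first_non_ascii_run, which is not part of the answer; it assumes the characters of interest are in the Basic Multilingual Plane, so that each one fits in a single wchar_t on both Linux and Windows.

```cpp
#include <regex>
#include <string>

// Hypothetical helper for illustration only: returns the first run of
// characters in the range U+0080..U+DB7F, or an empty string if there is none.
inline std::wstring first_non_ascii_run(const std::wstring& text)
{
    // With std::wregex the range is compared against whole wchar_t units,
    // not against the individual bytes of a UTF-8 encoding.
    static const std::wregex re(L"[\\u0080-\\uDB7F]+");

    std::wsmatch m;
    if (std::regex_search(text, m, re))
        return m.str(0);
    return std::wstring();
}
```

Calling first_non_ascii_run(L"john.doe@\u795E\u8C15.com") should pick out the two CJK characters and skip the ASCII ones around them.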

The following works for me where the source code is UTF-8:

#include <iostream>
#include <regex>
#include <string>

inline std::wstring from_utf8(const std::string& utf8)
{
    // code to convert from utf8 to utf32/utf16
}

inline std::string to_utf8(const std::wstring& ws)
{
    // code to convert from utf32/utf16 to utf8
}

int main()
{
    std::string test = "john.doe@神谕.com"; // utf8
    std::string expr = "[\\u0080-\\uDB7F]+"; // utf8

    // convert both the subject and the pattern to wide strings
    std::wstring wtest = from_utf8(test);
    std::wstring wexpr = from_utf8(expr);

    std::wregex we(wexpr);
    std::wsmatch wm;
    if(std::regex_search(wtest, wm, we))
    {
        // convert the match back to utf8 before printing
        std::cout << to_utf8(wm.str(0)) << '\n';
    }
}

Output:

神谕
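The bodies of from_utf8 and to_utf8 are left as placeholders above. A minimal hand-written sketch of what they could contain follows. It is an assumption, not the conversion code the answer actually used: it checks sequence structure and the U+10FFFF upper bound, but it does not reject overlong encodings or lone surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Sketch: decode UTF-8 into wchar_t units (UTF-32 where wchar_t is 32 bits,
// UTF-16 surrogate pairs where it is 16 bits).
inline std::wstring from_utf8(const std::string& utf8)
{
    std::wstring out;
    for (std::size_t i = 0; i < utf8.size();)
    {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp = 0;
        std::size_t len = 0;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > utf8.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k)
        {
            const unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");
        i += len;

        if (sizeof(wchar_t) >= 4 || cp < 0x10000)
        {
            out.push_back(static_cast<wchar_t>(cp));
        }
        else
        {
            // 16-bit wchar_t: split into a UTF-16 surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Sketch: encode wchar_t units back to UTF-8, joining surrogate pairs if any.
inline std::string to_utf8(const std::wstring& ws)
{
    std::string out;
    for (std::size_t i = 0; i < ws.size(); ++i)
    {
        std::uint32_t cp = static_cast<std::uint32_t>(ws[i]);
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < ws.size())
        {
            const std::uint32_t lo = static_cast<std::uint32_t>(ws[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF)
            {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i; // the low surrogate has been consumed
            }
        }
        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");

        if (cp < 0x80)
        {
            out.push_back(static_cast<char>(cp));
        }
        else if (cp < 0x800)
        {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
        else if (cp < 0x10000)
        {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
        else
        {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

A hand-written decoder is used here rather than std::wstring_convert because that facility is deprecated since C++17. A dedicated Unicode library would be the more robust choice for production code.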

