Boost regex: [:alpha:] and accented characters

I am trying to replace every non-alphabetic character in a string with " " using Boost:

    #include <iostream>
    #include <locale>
    #include <string>
    #include <boost/regex.hpp>

    std::string sanitize(std::string &str) {
        boost::regex re;
        re.imbue(std::locale("fr_FR.UTF-8"));
        re.assign("[^[:alpha:]]");
        str = boost::regex_replace(str, re, " ");
        return str;
    }

    int main() {
        std::string test = "(ça) /.2424,@ va très bien ?";
        std::cout << sanitize(test) << std::endl;
        return 0;
    }

The result is "a va tr s bien", but I would like to get "ça va très bien".

What am I missing?

1 answer

boost::regex::imbue does not do what you are hoping for. In particular, it will not make boost::regex work with UTF-8, because boost::regex matches one char at a time and an accented letter takes more than one char in UTF-8. (Perhaps you could make it work with ISO 8859-1 or a similar single-byte character encoding, but that doesn't look like what you want here.)
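To illustrate why the accented letters disappear, here is a minimal sketch using only the standard library. The function names count_code_points and sanitize_bytewise are hypothetical, and sanitize_bytewise is only a rough stand-in for what a byte-oriented "[^[:alpha:]]" match appears to do in the question, not Boost's actual implementation.

```cpp
#include <cstddef>
#include <string>

// Count Unicode code points in a UTF-8 string by skipping continuation
// bytes (those of the form 10xxxxxx). Assumes the input is valid UTF-8.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Hypothetical stand-in for byte-wise matching: every byte that is not an
// ASCII letter is replaced by a space, one space per byte.
std::string sanitize_bytewise(const std::string& in) {
    std::string out = in;
    for (char& c : out) {
        const bool ascii_letter =
            (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
        if (!ascii_letter) c = ' ';
    }
    return out;
}
```

For "ça" encoded as UTF-8 (the bytes 0xC3 0xA7 followed by 'a'), size() is 3 while count_code_points returns 2, and sanitize_bytewise turns both bytes of "ç" into spaces. That matches the "a" left over from "ça" in the question's output.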

To support UTF-8 you will need to use one of the Boost.Regex classes that work with Unicode; see http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/boost_regex/unicode.html.

Here is some code that I think does what you want (it needs Boost.Regex built with ICU support):

    #include <iostream>
    #include <string>
    #include <boost/regex/icu.hpp>

    std::string sanitize(std::string &str) {
        // u32regex treats the std::string as UTF-8 and matches whole code points
        boost::u32regex re = boost::make_u32regex("[^[:alpha:]]");
        str = boost::u32regex_replace(str, re, " ");
        return str;
    }

    int main() {
        std::string test = "(ça) /.2424,@ va très bien ?";
        std::cout << test << "\n" << sanitize(test) << std::endl;
        return 0;
    }

See http://www.boost.org/doc/libs/1_55_0/libs/regex/doc/html/boost_regex/ref/non_std_strings/icu/unicode_algo.html for more details.
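Note that the pattern replaces each non-alphabetic character with its own space, so the result contains runs of spaces rather than the single-spaced "ça va très bien" from the question. If that matters, a small post-processing step can tidy it up. This is a sketch with a hypothetical helper name, collapse_spaces, and it is not part of Boost:

```cpp
#include <string>

// Collapse runs of ' ' into a single space and drop leading and trailing
// spaces. Safe on UTF-8 input, since no byte of a multi-byte sequence
// equals 0x20.
std::string collapse_spaces(const std::string& in) {
    std::string out;
    bool pending_space = false;
    for (char c : in) {
        if (c == ' ') {
            // Remember the space only if something was already emitted,
            // which drops leading spaces.
            pending_space = !out.empty();
        } else {
            if (pending_space) out += ' ';
            pending_space = false;
            out += c;
        }
    }
    // A pending space at the end is never written, which trims the tail.
    return out;
}
```

Calling collapse_spaces(sanitize(test)) would then give the single-spaced form. Changing the pattern to "[^[:alpha:]]+" should also collapse each run into one space, though it would still leave one at each end of the string.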
