How can I use UTF-8 encoded Unicode characters with boost::spirit?
For example, I want to recognize all the characters in this line:
    $ echo "На берегу пустынных волн" | ./a.out
When I try this simple boost::spirit program, it does not match the Unicode characters correctly:

    #include <boost/spirit/include/qi.hpp>
    #include <boost/spirit/include/support_istream_iterator.hpp>
    #include <boost/foreach.hpp>
    #include <iostream>
    #include <vector>

    namespace qi = boost::spirit::qi;

    int main() {
        // Keep whitespace in the stream; the skipper handles it instead.
        std::cin.unsetf(std::ios::skipws);
        boost::spirit::istream_iterator begin(std::cin);
        boost::spirit::istream_iterator end;

        std::vector<char> letters;
        bool result = qi::phrase_parse(
            begin, end,   // input
            +qi::char_,   // match every character
            qi::space,    // skip whitespace
            letters);     // result

        BOOST_FOREACH(char letter, letters) {
            std::cout << letter << " ";
        }
        std::cout << std::endl;
    }
It behaves like this:
    $ echo "На берегу пустынных волн" | ./a.out | less
    <D0> <9D> <D0> <B0> <D0> <B1> <D0> <B5> <D1> <80> <D0> <B5> <D0> <B3> <D1> <83> <D0> <BF> <D1> <83> <D1> <81> <D1> <82> <D1> <8B> <D0> <BD> <D0> <BD> <D1> <8B> <D1> <85> <D0> <B2> <D0> <BE> <D0> <BB> <D0> <BD>
UPDATE:
OK, I worked on this a bit more, and the following code works. It first wraps the input in an iterator that yields 32-bit Unicode code points (as recommended here):

    #include <boost/spirit/include/qi.hpp>
    #include <boost/spirit/include/support_istream_iterator.hpp>
    #include <boost/foreach.hpp>
    #include <boost/regex/pending/unicode_iterator.hpp>
    #include <iostream>
    #include <string>
    #include <vector>

    namespace qi = boost::spirit::qi;

    int main() {
        std::string str = "На берегу пустынных волн";
        // Decode the UTF-8 bytes into 32-bit code points on the fly.
        boost::u8_to_u32_iterator<std::string::const_iterator>
            begin(str.begin()), end(str.end());

        typedef boost::uint32_t uchar;  // a Unicode code point
        std::vector<uchar> letters;
        bool result = qi::phrase_parse(
            begin, end,                 // input
            +qi::standard_wide::char_,  // match every character
            qi::space,                  // skip whitespace
            letters);                   // result

        BOOST_FOREACH(uchar letter, letters) {
            std::cout << letter << " ";
        }
        std::cout << std::endl;
    }
The code prints the Unicode code points:

    $ ./a.out
    1053 1072 1073 1077 1088 1077 1075 1091 1087 1091 1089 1090 1099 1085 1085 1099 1093 1074 1086 1083 1085
which seems correct, according to the official Unicode table.
Now, can someone tell me how to print the actual characters instead, given this vector of Unicode code points?
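One possible direction, sketched here under the assumption that the terminal expects UTF-8, is to encode each code point back into UTF-8 bytes before writing it to std::cout. The helper names `append_utf8` and `to_utf8` are hypothetical and not part of Boost; the same Boost.Regex header also appears to provide a `boost::u32_to_u8_iterator` that could probably do this job instead. The sketch does not validate its input (surrogates and values above 0x10FFFF are encoded as-is).

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not part of Boost): append one Unicode code point
// to `out` as 1 to 4 UTF-8 bytes. No validation is performed.
void append_utf8(std::uint32_t cp, std::string& out) {
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: encode a whole vector of code points as one UTF-8 string.
std::string to_utf8(const std::vector<std::uint32_t>& code_points) {
    std::string out;
    for (std::size_t i = 0; i < code_points.size(); ++i) {
        append_utf8(code_points[i], out);
    }
    return out;
}
```

With the `letters` vector from the program above (assuming `boost::uint32_t` and `std::uint32_t` are the same type on the platform), `std::cout << to_utf8(letters) << std::endl;` would write the letters as UTF-8 text rather than as numbers.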
c++ boost parsing boost-spirit
Frank