How to match Unicode characters with boost::spirit?

Question

How can I match UTF-8 encoded Unicode characters with boost::spirit?

For example, I want to recognize all the characters in this line:

    $ echo "На берегу пустынных волн" | ./a.out

When I try this simple boost::spirit program, it does not match Unicode characters correctly:

    #include <boost/spirit/include/qi.hpp>
    #include <boost/spirit/include/support_istream_iterator.hpp>
    #include <boost/foreach.hpp>
    #include <iostream>
    #include <vector>

    namespace qi = boost::spirit::qi;

    int main() {
        std::cin.unsetf(std::ios::skipws);
        boost::spirit::istream_iterator begin(std::cin);
        boost::spirit::istream_iterator end;

        std::vector<char> letters;
        bool result = qi::phrase_parse(
            begin, end,   // input
            +qi::char_,   // match every character
            qi::space,    // skip whitespace
            letters);     // result

        BOOST_FOREACH(char letter, letters) {
            std::cout << letter << " ";
        }
        std::cout << std::endl;
    }

It behaves like this:

    $ echo "На берегу пустынных волн" | ./a.out | less
    <D0> <9D> <D0> <B0> <D0> <B1> <D0> <B5> <D1> <80> <D0> <B5> <D0> <B3> <D1> <83> <D0> <BF> <D1> <83> <D1> <81> <D1> <82> <D1> <8B> <D0> <BD> <D0> <BD> <D1> <8B> <D1> <85> <D0> <B2> <D0> <BE> <D0> <BB> <D0> <BD>

UPDATE:

OK, I worked on this a bit more, and the following code works. First, it wraps the input in an iterator that yields 32-bit Unicode code points (as recommended here):

    #include <boost/spirit/include/qi.hpp>
    #include <boost/spirit/include/support_istream_iterator.hpp>
    #include <boost/foreach.hpp>
    #include <boost/regex/pending/unicode_iterator.hpp>
    #include <iostream>
    #include <string>
    #include <vector>

    namespace qi = boost::spirit::qi;

    int main() {
        std::string str = "На берегу пустынных волн";
        boost::u8_to_u32_iterator<std::string::const_iterator>
            begin(str.begin()), end(str.end());

        typedef boost::uint32_t uchar;  // a Unicode code point
        std::vector<uchar> letters;
        bool result = qi::phrase_parse(
            begin, end,                 // input
            +qi::standard_wide::char_,  // match every character
            qi::space,                  // skip whitespace
            letters);                   // result

        BOOST_FOREACH(uchar letter, letters) {
            std::cout << letter << " ";
        }
        std::cout << std::endl;
    }

The code prints the Unicode code points:

    $ ./a.out
    1053 1072 1073 1077 1088 1077 1075 1091 1087 1091 1089 1090 1099 1085 1085 1099 1093 1074 1086 1083 1085

which seems correct, according to the official Unicode table.

Now, can someone tell me how to print the actual characters instead, given this vector of Unicode code points?

+8

c++ boost parsing boost-spirit

Frank May 6, '12 at 21:45

3 answers

You can't. The problem is not boost::spirit but that Unicode is complex. char does not mean "character"; it means "byte". And even if you work at the code point level, a user-perceived character can still be represented by more than one code point. (For example, пусты́нных, written with a combining stress mark, is 9 characters but 10 code points. This may not be obvious in Russian, since it does not use diacritics widely, but other languages do.)

To actually iterate over user-perceived characters (grapheme clusters, in Unicode terminology), you will need a specialized Unicode library, namely ICU.

However, what is the real-world use of iterating over characters?

+1

ybungalobill May 6 '12 at 10:14

In Boost 1.58, I can match any Unicode character with this:

 *boost::spirit::qi::unicode::char_

I do not know how to define a specific range of Unicode characters.

0

Sergey Oct 6 '16 at 20:23

sehe · Accepted Answer · 2012-05-07T07:31:46+0000

I don't have much experience with it, but apparently Spirit (the SVN trunk version) supports Unicode.

 #define BOOST_SPIRIT_UNICODE // We'll use unicode (UTF8) all throughout

See, for example, the sexpr parser sample in the scheme demo:

 BOOST_ROOT/libs/spirit/example/scheme

I believe this is based on the demo from a presentation by Bryce Lelbach¹, which specifically showcases:

wchar support
utree attributes (still experimental)
S-expressions

There is an online article about S-expressions and variants.

¹ If so, here are the videos from that presentation and the slides (pdf), as found here (odp).
