Boost Spirit: sub-grammar appending to a string?

I am playing with Boost.Spirit. As part of a larger project, I am trying to build a grammar for parsing C / C++ style string literals. I ran into a problem:

How do I create a sub-grammar that appends a std::string() result to the calling grammar's std::string() attribute (instead of just a char)?

Here is my code, which works so far. (Actually, I already had a lot more, including handling of things like '\n', but I cut it down to the essentials.)

    #define BOOST_SPIRIT_UNICODE

    #include <iostream>
    #include <string>

    #include <boost/spirit/include/qi.hpp>
    #include <boost/spirit/include/phoenix_operator.hpp>

    using namespace boost;
    using namespace boost::spirit;
    using namespace boost::spirit::qi;

    template < typename Iterator >
    struct EscapedUnicode : grammar< Iterator, char() > // <-- should be std::string
    {
        EscapedUnicode() : EscapedUnicode::base_type( escaped_unicode )
        {
            escaped_unicode %= "\\" > ( ( "u" >> uint_parser< char, 16, 4, 4 >() )
                                      | ( "U" >> uint_parser< char, 16, 8, 8 >() ) );
        }

        rule< Iterator, char() > escaped_unicode; // <-- should be std::string
    };

    template < typename Iterator >
    struct QuotedString : grammar< Iterator, std::string() >
    {
        QuotedString() : QuotedString::base_type( quoted_string )
        {
            quoted_string %= '"' >> *( escaped_unicode | ( char_ - ( '"' | eol ) ) ) >> '"';
        }

        EscapedUnicode< Iterator > escaped_unicode;
        rule< Iterator, std::string() > quoted_string;
    };

    int main()
    {
        std::string input = "\"foo\u0041\"";
        typedef std::string::const_iterator iterator_type;
        QuotedString< iterator_type > qs;
        std::string result;
        bool r = parse( input.cbegin(), input.cend(), qs, result );
        std::cout << result << std::endl;
    }

This prints fooA: the QuotedString grammar calls the EscapedUnicode grammar, which results in a char being appended to the std::string attribute of QuotedString (the A, 0x41).

But of course, I will need to generate a sequence of chars (bytes) for anything beyond 0x7f. EscapedUnicode would have to produce a std::string, which would then have to be appended to the string generated by QuotedString.

And this is where I hit a roadblock. I don't understand what Boost.Spirit does in concert with Boost.Phoenix, and every attempt I made resulted in lengthy and pretty much undecipherable template-related compiler errors.

So how can I do this? The answer does not need to perform a proper Unicode conversion; it is the std::string issue I need a solution for.

1 answer

A few points:

  • Please don't use using namespace with highly generic code. ADL will ruin your day unless you control it.
  • The %= operator is auto-rule assignment, which means that automatic attribute propagation is forced even in the presence of semantic actions. You don't want that here, because the attribute exposed by uint_parser will not (correctly) propagate automatically if you want to encode it into a multi-byte string representation.
  • The input string

        std::string input = "\"foo\u0041\"";

    needs to be

        std::string input = "\"foo\\u0041\"";

    otherwise the compiler has already processed the escape before the parser even runs :)

Now for the specific tricks for the meat of the task:

  • You will want to change the rule's declared attribute to something that Spirit will automatically "flatten" in simple sequences. For example,

        quoted_string = '"' >> *(escaped_unicode | (qi::char_ - ('"' | qi::eol))) >> '"';

    will not append, because the first branch of the alternative results in a sequence of char, and the second in a single char. The following, equivalent spelling:

        quoted_string = '"' >> *(escaped_unicode | +(qi::char_ - ('"' | qi::eol | "\\u" | "\\U"))) >> '"';

    subtly triggers the appending heuristic in Spirit, so we can achieve what we want without involving semantic actions.

The rest is straightforward:

  • Implement the actual encoding with a Phoenix function object:

        struct encode_f {
            template <typename...> struct result { using type = void; };

            template <typename V, typename CP>
            void operator()(V& a, CP codepoint) const {
                // TODO implement desired encoding (e.g. UTF-8)
                bio::stream<bio::back_insert_device<V> > os(a);
                os << "[0x" << std::hex
                   << std::setw(std::numeric_limits<CP>::digits/4) << std::setfill('0')
                   << codepoint << "]";
            }
        };
        boost::phoenix::function<encode_f> encode;

    Then you can use it like this:

        escaped_unicode = '\\' > ( ("u" >> uint_parser<uint16_t, 16, 4, 4>() [ encode(_val, _1) ])
                                 | ("U" >> uint_parser<uint32_t, 16, 8, 8>() [ encode(_val, _1) ]) );

    Since you mentioned that you are not interested in a particular encoding, I chose to write the raw code point in 16-bit or 32-bit hexadecimal representation, for example [0x0041]. I pragmatically used Boost Iostreams, which can write directly into the attribute's container type.

  • Use the BOOST_SPIRIT_DEBUG* macros.

Live on coliru

    //#define BOOST_SPIRIT_UNICODE
    //#define BOOST_SPIRIT_DEBUG
    #include <cstdint>
    #include <iomanip>
    #include <iostream>
    #include <limits>
    #include <string>
    #include <vector>

    #include <boost/spirit/include/qi.hpp>
    #include <boost/spirit/include/phoenix.hpp>

    // for demo re-encoding
    #include <boost/iostreams/device/back_inserter.hpp>
    #include <boost/iostreams/stream.hpp>

    namespace qi  = boost::spirit::qi;
    namespace bio = boost::iostreams;
    namespace phx = boost::phoenix;

    template <typename Iterator, typename Attr = std::vector<char> > // or std::string for that matter
    struct EscapedUnicode : qi::grammar<Iterator, Attr()>
    {
        EscapedUnicode() : EscapedUnicode::base_type(escaped_unicode)
        {
            using namespace qi;

            // Plain '=' (not '%='): the semantic actions fill _val themselves.
            escaped_unicode = '\\' > ( ("u" >> uint_parser<uint16_t, 16, 4, 4>() [ encode(_val, _1) ])
                                     | ("U" >> uint_parser<uint32_t, 16, 8, 8>() [ encode(_val, _1) ]) );

            BOOST_SPIRIT_DEBUG_NODES((escaped_unicode))
        }

        struct encode_f {
            template <typename...> struct result { using type = void; };

            template <typename V, typename CP>
            void operator()(V& a, CP codepoint) const {
                // TODO implement desired encoding (e.g. UTF-8)
                bio::stream<bio::back_insert_device<V> > os(a);
                os << "[0x" << std::hex
                   << std::setw(std::numeric_limits<CP>::digits/4) << std::setfill('0')
                   << codepoint << "]";
            }
        };
        boost::phoenix::function<encode_f> encode;

        qi::rule<Iterator, Attr()> escaped_unicode;
    };

    template <typename Iterator>
    struct QuotedString : qi::grammar<Iterator, std::string()>
    {
        QuotedString() : QuotedString::base_type(start)
        {
            start = quoted_string;
            quoted_string = '"' >> *(escaped_unicode | +(qi::char_ - ('"' | qi::eol | "\\u" | "\\U"))) >> '"';

            BOOST_SPIRIT_DEBUG_NODES((start)(quoted_string))
        }

        EscapedUnicode<Iterator> escaped_unicode;
        qi::rule<Iterator, std::string()> start;
        qi::rule<Iterator, std::vector<char>()> quoted_string;
    };

    int main()
    {
        std::string input = "\"foo\\u0041\\U00000041\"";

        typedef std::string::const_iterator iterator_type;
        QuotedString<iterator_type> qs;
        std::string result;
        bool r = qi::parse( input.cbegin(), input.cend(), qs, result );
        std::cout << std::boolalpha << r << ": '" << result << "'\n";
    }

Prints

 true: 'foo[0x0041][0x00000041]' 