How to parse CSV using boost::spirit

I have this csv line

std::string s = R"(1997,Ford,E350,"ac, abs, moon","some "rusty" parts",3000.00)"; 

I can parse it using boost::tokenizer:

    typedef boost::tokenizer<boost::escaped_list_separator<char>,
                             std::string::const_iterator, std::string> Tokenizer;
    boost::escaped_list_separator<char> seps('\\', ',', '\"');
    Tokenizer tok(s, seps);
    for (auto i : tok) {
        std::cout << i << std::endl;
    }

That works, but the double quotes around rusty in the fifth token are lost.

Here is my attempt using boost::spirit:

    boost::spirit::classic::rule<> list_csv_item =
        !(boost::spirit::classic::confix_p('\"', *boost::spirit::classic::c_escape_ch_p, '\"')
          | boost::spirit::classic::longest_d[boost::spirit::classic::real_p | boost::spirit::classic::int_p]);

    std::vector<std::string> vec_item;
    std::vector<std::string> vec_list;

    boost::spirit::classic::rule<> list_csv =
        boost::spirit::classic::list_p(
            list_csv_item[boost::spirit::classic::push_back_a(vec_item)],
            ',')[boost::spirit::classic::push_back_a(vec_list)];

    boost::spirit::classic::parse_info<> result = parse(s.c_str(), list_csv);
    if (result.hit) {
        for (auto i : vec_item) {
            cout << i << endl;
        }
    }

Problems:

  • does not work, prints only the first token

  • why boost::spirit::classic? I can't find examples using Spirit V2

  • the setup is brutal, but I can live with it

I really want to use boost::spirit because it is pretty fast.

Expected Result:

    1997
    Ford
    E350
    ac, abs, moon
    some "rusty" parts
    3000.00

+7
c++ boost csv boost-spirit boost-spirit-qi
2 answers

Sehe's post is honestly cleaner than mine, but I had already put this together, so here it is anyway:

    #include <cassert>
    #include <iostream>
    #include <string>
    #include <vector>

    #include <boost/tokenizer.hpp>
    #include <boost/spirit/include/qi.hpp>

    namespace qi = boost::spirit::qi;

    int main()
    {
        const std::string s = R"(1997,Ford,E350,"ac, abs, moon",""rusty"",3000.00)";

        // Tokenizer
        typedef boost::tokenizer<boost::escaped_list_separator<char>,
                                 std::string::const_iterator, std::string> Tokenizer;
        boost::escaped_list_separator<char> seps('\\', ',', '\"');
        Tokenizer tok(s, seps);
        for (auto i : tok)
            std::cout << i << "\n";
        std::cout << "\n";

        // Boost Spirit Qi
        // A quoted string: everything between a pair of double quotes.
        qi::rule<std::string::const_iterator, std::string()> quoted_string = '"' >> *(qi::char_ - '"') >> '"';
        // Any single character that is neither a quote nor the separator.
        qi::rule<std::string::const_iterator, std::string()> valid_characters = qi::char_ - '"' - ',';
        // A field is any mix of quoted strings and plain characters.
        qi::rule<std::string::const_iterator, std::string()> item = *(quoted_string | valid_characters);
        // A line is a comma-separated list of fields.
        qi::rule<std::string::const_iterator, std::vector<std::string>()> csv_parser = item % ',';

        std::string::const_iterator s_begin = s.begin();
        std::string::const_iterator s_end = s.end();
        std::vector<std::string> result;

        bool r = boost::spirit::qi::parse(s_begin, s_end, csv_parser, result);
        assert(r == true);
        assert(s_begin == s_end);

        for (auto i : result)
            std::cout << i << std::endl;
        std::cout << "\n";
    }

This prints:

    1997
    Ford
    E350
    ac, abs, moon
    rusty
    3000.00

    1997
    Ford
    E350
    ac, abs, moon
    rusty
    3000.00

One thing worth noting: this does not implement a full CSV parser. You will also want to look at escape characters and whatever else your implementation requires.
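To make the escape-character point concrete, here is a minimal sketch in plain standard C++ (no Spirit) of what honoring backslash escapes involves. It assumes the same convention as the tokenizer call above, where '\\' is the escape character, and it assumes the input writes the inner quotes as \". The function name split_escaped is hypothetical and not part of Boost.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not a Boost API): split one CSV line on ',' while
// honoring double-quoted fields and backslash escapes.
std::vector<std::string> split_escaped(const std::string& line)
{
    std::vector<std::string> fields(1);  // start with one empty field
    bool in_quotes = false;

    for (std::size_t i = 0; i < line.size(); ++i) {
        const char c = line[i];
        if (c == '\\' && i + 1 < line.size()) {
            fields.back() += line[++i];  // keep the escaped character verbatim
        } else if (c == '"') {
            in_quotes = !in_quotes;      // a bare quote toggles state and is dropped
        } else if (c == ',' && !in_quotes) {
            fields.emplace_back();       // an unquoted comma starts a new field
        } else {
            fields.back() += c;          // ordinary character
        }
    }
    return fields;
}
```

Called on R"(1997,Ford,E350,"ac, abs, moon","some \"rusty\" parts",3000.00)", this keeps the quotes around rusty in the fifth field, because they are escaped rather than bare. A Qi version would need an equivalent rule for the escape sequence.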

Also, if you are reading the documentation: in a Qi expression, 'a' is equivalent to boost::spirit::qi::lit('a'), and "abc" is equivalent to boost::spirit::qi::lit("abc").

On double quotes: as sehe points out in the comment above, it is not clear what the rules for "" in the input text are supposed to be. If you want every "" that is not inside a quoted string to be converted to ", then something like the following would work.

    qi::rule<std::string::const_iterator, std::string()> double_quote_char = "\"\"" >> qi::attr('"');
    qi::rule<std::string::const_iterator, std::string()> item = *(double_quote_char | quoted_string | valid_characters);
+5

For background on parsing (optionally) quoted, delimited fields, including different quotation marks ( ' , " ), see here:

  • Parse lines with boost::spirit

For a very, very, very complete example, with support for partially quoted values and a

 splitInto(input, output, ' '); 

which accepts "arbitrary" output containers and delimiter expressions, see here:

  • How to make my split work only on one real line and be able to skip the quoted parts of the line?

Turning to your exact question, assuming fields are either quoted or unquoted (no partial quotes inside field values), using Spirit V2:

Take the simplest "abstract data type" that could work:

    using Column  = std::string;
    using Columns = std::vector<Column>;
    using CsvLine = Columns;
    using CsvFile = std::vector<CsvLine>;

Assuming a repeated double quote escapes a double quote (as I pointed out in the comment), you should use something like:

    static const char colsep = ',';

    start  = -line % eol;
    line   = column % colsep;
    column = quoted | *~char_(colsep);
    quoted = '"' >> *("\"\"" | ~char_('"')) >> '"';

The following full test program prints

 [1997][Ford][E350][ac, abs, moon][rusty][3001.00] 

(Note the BOOST_SPIRIT_DEBUG define for easy debugging.) See it live on Coliru.

Full demo

    //#define BOOST_SPIRIT_DEBUG
    #include <iostream>
    #include <string>
    #include <vector>

    #include <boost/spirit/include/qi.hpp>

    namespace qi = boost::spirit::qi;

    using Column  = std::string;
    using Columns = std::vector<Column>;
    using CsvLine = Columns;
    using CsvFile = std::vector<CsvLine>;

    template <typename It>
    struct CsvGrammar : qi::grammar<It, CsvFile(), qi::blank_type>
    {
        CsvGrammar() : CsvGrammar::base_type(start)
        {
            using namespace qi;

            static const char colsep = ',';

            start  = -line % eol;                // lines separated by end-of-line
            line   = column % colsep;            // columns separated by colsep
            column = quoted | *~char_(colsep);   // a quoted field or a bare field
            quoted = '"' >> *("\"\"" | ~char_('"')) >> '"';

            BOOST_SPIRIT_DEBUG_NODES((start)(line)(column)(quoted));
        }
      private:
        qi::rule<It, CsvFile(), qi::blank_type> start;
        qi::rule<It, CsvLine(), qi::blank_type> line;
        qi::rule<It, Column(),  qi::blank_type> column;
        qi::rule<It, std::string()> quoted;      // no skipper: whitespace inside quotes is kept
    };

    int main()
    {
        const std::string s = R"(1997,Ford,E350,"ac, abs, moon","""rusty""",3001.00)";

        auto f(begin(s)), l(end(s));
        CsvGrammar<std::string::const_iterator> p;

        CsvFile parsed;
        bool ok = qi::phrase_parse(f, l, p, qi::blank, parsed);

        if (ok) {
            for (auto& line : parsed) {
                for (auto& col : line)
                    std::cout << '[' << col << ']';
                std::cout << std::endl;
            }
        } else {
            std::cout << "Parse failed\n";
        }

        if (f != l)
            std::cout << "Remaining unparsed: '" << std::string(f, l) << "'\n";
    }
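For comparison, the line, column and quoted rules can be read as an ordinary hand-written parser. The following is only a sketch in plain standard C++, not the Spirit implementation: the function names parse_quoted, parse_column and parse_line are hypothetical, it handles a single line, and it does no blank skipping. It also assumes the convention that each doubled quote inside a quoted field stands for one literal quote, so it produces "rusty" with the quotes kept, unlike the [rusty] in the output printed above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

using Column  = std::string;
using CsvLine = std::vector<Column>;

// Mirrors: quoted = '"' >> *( "\"\"" | ~char_('"') ) >> '"'
// Assumption: each "" inside the quotes contributes one literal '"'.
// Returns false and leaves pos untouched if no complete quoted field starts at pos.
bool parse_quoted(const std::string& in, std::size_t& pos, Column& out)
{
    std::size_t i = pos;
    if (i >= in.size() || in[i] != '"') return false;
    ++i;  // skip the opening quote

    Column text;
    while (i < in.size()) {
        if (in[i] != '"') {
            text += in[i++];                  // ordinary character
        } else if (i + 1 < in.size() && in[i + 1] == '"') {
            text += '"';                      // doubled quote -> one quote
            i += 2;
        } else {
            pos = i + 1;                      // consume the closing quote
            out = text;
            return true;
        }
    }
    return false;  // unterminated quoted field
}

// Mirrors: column = quoted | *~char_(colsep)
Column parse_column(const std::string& in, std::size_t& pos, char colsep)
{
    Column col;
    if (parse_quoted(in, pos, col)) return col;
    while (pos < in.size() && in[pos] != colsep) col += in[pos++];
    return col;
}

// Mirrors: line = column % colsep
CsvLine parse_line(const std::string& in, char colsep = ',')
{
    CsvLine cols;
    std::size_t pos = 0;
    cols.push_back(parse_column(in, pos, colsep));
    while (pos < in.size() && in[pos] == colsep) {
        ++pos;  // skip the separator
        cols.push_back(parse_column(in, pos, colsep));
    }
    return cols;
}
```

Calling parse_line on the demo input returns six columns. Anything left after a closing quote that is not a separator is simply not consumed, much like the "Remaining unparsed" case in the demo.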
+8