How to parse a sequence of integers stored in a text buffer?

Parsing text consisting of a sequence of integers from a stream in C++ is easy enough: just decode them. The situation is slightly different when the data is already readily accessible within the program, for example because it was received as base64-encoded text (decoding that is not the problem). The data then sits in a buffer inside the program and only needs to be decoded, not read. Of course, one can use a std::istringstream:

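    #include <iterator>
    #include <sstream>
    #include <string>
    #include <vector>

    std::vector<int> parse_text(char* begin, char* end) {
        // Copies the buffer into a std::string owned by the stream.
        std::istringstream in(std::string(begin, end));
        return std::vector<int>(std::istream_iterator<int>(in),
                                std::istream_iterator<int>());
    }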

Since many of these buffers are received and they can be fairly large, it is desirable not to copy the contents of the character array and, ideally, not to create a stream for each buffer either. So the question is:

Given a char buffer containing a sequence of integers (space-separated; other delimiters are easily handled, for example with a suitable manipulator), how can they be decoded without copying the sequence and, if possible, without even creating a std::istream?

1 answer

Avoiding a copy of the buffer is trivial with a custom stream buffer that simply sets its get area to use the buffer. The stream buffer doesn't even need to override any of the virtual functions; it just sets up the internal buffer pointers:

    #include <istream>
    #include <iterator>
    #include <streambuf>
    #include <vector>

    class imemstream
        : private virtual std::streambuf
        , public std::istream
    {
    public:
        imemstream(char* begin, char* end)
            : std::streambuf()
            , std::istream(static_cast<std::streambuf*>(this)) {
            // Use [begin, end) directly as the get area: nothing is copied.
            this->setg(begin, begin, end);
        }
    };

    std::vector<int> parse_data_via_istream(char* begin, char* end) {
        imemstream in(begin, end);
        return std::vector<int>(std::istream_iterator<int>(in),
                                std::istream_iterator<int>());
    }

This approach avoids copying the buffer and uses the ready-made functions of std::istream. It does, however, create a stream object. With a suitable reset function, the stream/stream buffer can be extended to switch to a new buffer and so process multiple buffers.
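A minimal sketch of such a reset function, assuming a single stream object is reused for all buffers. The names resettable_imemstream, reset() and parse_buffer() are made up for illustration and are not part of the answer's code:

```cpp
#include <cassert>
#include <istream>
#include <iterator>
#include <streambuf>
#include <vector>

// Hypothetical variant of imemstream that can be pointed at a new buffer.
class resettable_imemstream
    : private virtual std::streambuf
    , public std::istream
{
public:
    resettable_imemstream()
        : std::streambuf()
        , std::istream(static_cast<std::streambuf*>(this)) {
    }

    // Use [begin, end) as the new get area and clear the eofbit/failbit
    // left behind by reading the previous buffer to its end.
    void reset(char* begin, char* end) {
        assert(begin <= end);
        this->setg(begin, begin, end);
        this->clear();
    }
};

// Parse one buffer with an existing stream object; nothing is copied.
std::vector<int> parse_buffer(resettable_imemstream& in, char* begin, char* end) {
    in.reset(begin, end);
    return std::vector<int>(std::istream_iterator<int>(in),
                            std::istream_iterator<int>());
}
```

The stream is constructed once and parse_buffer() is then called for each buffer as it arrives, so the cost of creating the stream is paid only once.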

To avoid creating a stream, the underlying functionality of std::num_get<...> can be used directly. The actual parsing is done by one of the std::locale facets. Numeric parsing for std::istream is done by std::num_get<char, std::istreambuf_iterator<char>>. That facet is not much help because it reads a sequence delimited by std::istreambuf_iterator<char> objects, but a std::num_get<char, char const*> can be instantiated instead. It is not part of a default std::locale, but it is easy to create a corresponding std::locale and install it, for example as the global std::locale object, first thing in main():

    int main() {
        std::locale::global(std::locale(std::locale(), new std::num_get<char, char const*>()));
        ...

Note that the std::locale object cleans up the added facet, i.e. there is no need for any clean-up code: facets are reference counted and released when the last std::locale holding a particular facet goes away. To actually use the facet it unfortunately needs a std::ios_base object, which can really only be obtained from some stream object. Any stream will do, though (in a multi-threaded program it should probably be a separate stream object per thread to avoid accidental race conditions):

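    #include <algorithm>
    #include <cctype>
    #include <ios>
    #include <locale>
    #include <vector>

    // Return the first non-whitespace character in [it, end).
    char const* skipspace(char const* it, char const* end) {
        return std::find_if(it, end, [](unsigned char c){ return !std::isspace(c); });
    }

    std::vector<int> parse_data_via_istream(std::ios_base& fmt, char const* it, char const* end) {
        std::vector<int> rc;
        std::num_get<char, char const*> const& ng
            = std::use_facet<std::num_get<char, char const*>>(std::locale());
        // get() is not required to assign the state on success, so start from goodbit.
        std::ios_base::iostate error = std::ios_base::goodbit;
        // Stop on failbit only: eofbit alone just means that the last number
        // ended exactly at the end of the buffer, and that number is still valid.
        for (long tmp;
             (it = ng.get(skipspace(it, end), end, fmt, error, tmp)),
             !(error & std::ios_base::failbit); ) {
            rc.push_back(tmp);
        }
        return rc;
    }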

Most of this is a bit of error handling and skipping leading whitespace: std::istream essentially provides the machinery to skip whitespace automatically for formatted input and to set the stream state according to the error protocol. The approach above has a potentially small advantage in that the facet is obtained only once per buffer and no std::istream::sentry is created, in addition to avoiding the creation of a stream. Of course, the code assumes that some stream is available so that its std::ios_base& base-class subobject can be passed in to supply the parsing flags, such as the numeric base to use.
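To illustrate how the std::ios_base object supplies the parsing flags, here is a small sketch that parses a single value. The helper parse_long() is hypothetical; unlike the code above, it keeps the facet in a private std::locale instead of the global one, which is an assumption made here so that the example does not depend on what main() installed:

```cpp
#include <cassert>
#include <ios>
#include <locale>

// Hypothetical helper: parse one integer from [begin, end) using the
// formatting flags (e.g. the numeric base) stored in fmt.
bool parse_long(std::ios_base& fmt, char const* begin, char const* end, long& value) {
    assert(begin <= end);
    typedef std::num_get<char, char const*> facet_type;
    // A private locale owning the facet; the global locale stays untouched.
    static std::locale const loc(std::locale::classic(), new facet_type());
    std::ios_base::iostate error = std::ios_base::goodbit;
    std::use_facet<facet_type>(loc).get(begin, end, fmt, error, value);
    // eofbit alone only says that the number ended at the end of the range.
    return !(error & std::ios_base::failbit);
}
```

A std::ios object constructed with a null stream buffer (std::ios fmt(nullptr);) is enough to carry the flags. After fmt.setf(std::ios_base::hex, std::ios_base::basefield), the same call interprets its input as hexadecimal.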

OK, this is an awful lot of code for something that could basically be done with strtol(). The approach using std::num_get<char, char const*> does offer some flexibility that strtol() does not:

  • Because a std::locale facet is used, and facets can be overridden to parse arbitrary representations such as Roman numerals, it is more flexible with respect to input formats.
  • It is easy to set up the use of thousands separators or to change the decimal point (just change the std::numpunct<char> in the std::locale used by fmt).
  • The buffer does not need to be null-terminated. For example, a contiguous sequence of 8 digits can be parsed by passing it and it+8 as the range when calling std::num_get<char, char const*>::get().
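The last two points can be sketched together. The names comma_grouping and parse_fixed_width() are invented for this illustration, and the facet is again kept in a private std::locale rather than the global one:

```cpp
#include <cassert>
#include <cstddef>
#include <ios>
#include <locale>
#include <string>
#include <vector>

typedef std::num_get<char, char const*> range_num_get;

// Hypothetical numpunct: ',' as thousands separator, groups of three digits.
struct comma_grouping : std::numpunct<char> {
    char do_thousands_sep() const override { return ','; }
    std::string do_grouping() const override { return "\3"; }
};

// Parse back-to-back fields of exactly `width` characters each. No
// delimiters and no terminating null character are required.
std::vector<long> parse_fixed_width(std::ios_base& fmt, char const* it,
                                    char const* end, std::ptrdiff_t width) {
    assert(width > 0 && it <= end);
    static std::locale const loc(std::locale::classic(), new range_num_get());
    range_num_get const& ng = std::use_facet<range_num_get>(loc);
    std::vector<long> rc;
    for (; end - it >= width; it += width) {
        std::ios_base::iostate error = std::ios_base::goodbit;
        long value = 0;
        char const* stop = ng.get(it, it + width, fmt, error, value);
        // Give up on a malformed field or one that was not fully consumed.
        if ((error & std::ios_base::failbit) || stop != it + width) break;
        rc.push_back(value);
    }
    return rc;
}
```

To accept grouped input such as 1,234,567, the std::numpunct<char> is replaced in the locale of the object passed as fmt, for example with fmt.imbue(std::locale(fmt.getloc(), new comma_grouping));. The parsing function itself does not change.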

That said, strtol() is probably a good approach for most uses. The above merely provides an alternative that may be useful in some contexts.
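For comparison, a rough sketch of the strtol() route. The function name parse_data_via_strtol() is made up, and the sketch assumes the buffer is null-terminated, which the facet-based version does not require:

```cpp
#include <cassert>
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <vector>

// Hypothetical helper: parse whitespace-separated integers with strtol().
// text must point to a null-terminated string.
std::vector<int> parse_data_via_strtol(char const* text) {
    assert(text != nullptr);
    std::vector<int> rc;
    for (;;) {
        char* stop = nullptr;
        errno = 0;
        long const value = std::strtol(text, &stop, 10);  // skips leading whitespace
        if (stop == text) break;  // nothing converted: end of input or not a number
        if (errno == ERANGE || value < INT_MIN || value > INT_MAX) break;  // out of range
        rc.push_back(static_cast<int>(value));
        text = stop;
    }
    return rc;
}
```

Parsing stops at the first token that is not an integer, which roughly mirrors what the loop over std::num_get<char, char const*>::get() does when it reports a failure.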
