In C++11, what is the most efficient way to return a reference/pointer to a position in a std::string?

I am creating a text parser that uses std::string as the main storage for strings.

I know that this is not optimal, and that the parsers inside compilers use optimized approaches for this. In my project, I do not mind losing some performance in exchange for greater clarity and ease of maintenance.

First, I read a huge text into memory, and then I scan each character to create an ordered set of tokens; it's a simple lexer. I am currently using std::string to represent the token text, but I would like to improve this a bit by using a reference/pointer into the source text.

From what I have read, it is bad practice to return and hold iterators, and also bad practice to refer to the internal buffer of a std::string.

Any suggestions on how to do this in a "clean" way?

+7
c++ stdstring c++11
4 answers

There is a proposal to add string_view to C++ in an upcoming standard.

A string_view is a non-owning iterable range over characters with many of the utilities and properties you would expect from a string class, except that you cannot insert or delete characters (and editing characters is often blocked in some subtypes).

I would suggest trying this approach: write your own (in your own namespace). (You should have your own namespace for utility code anyway.)

The core data is a pair of char* or std::string::iterator values (or their const versions). If the user needs a null-terminated buffer, a to_string method allocates one. I would start with non-mutable (const) character data. Don't forget begin and end: they make your view iterable with for(:) loops.

This design carries the danger that the original std::string must persist long enough to outlive all of the views.
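The design described above might look roughly like the following. This is a minimal sketch under stated assumptions, not the proposed standard class: the namespace `myutil`, the member names, the `substr` clamping behavior, and the helper `count_vowels` are all invented here for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

namespace myutil {  // hypothetical namespace name

// Non-owning, read-only view over a range of characters.
// The viewed std::string must outlive every view that points into it.
class string_view {
public:
    string_view(const char* first, const char* last)
        : first_(first), last_(last) {
        assert(first <= last);
    }

    // View over a whole string; in C++11 the buffer is contiguous.
    string_view(const std::string& s)
        : first_(s.data()), last_(s.data() + s.size()) {}

    // begin/end make the view usable in range-based for loops.
    const char* begin() const { return first_; }
    const char* end() const { return last_; }
    std::size_t size() const { return static_cast<std::size_t>(last_ - first_); }
    bool empty() const { return first_ == last_; }

    // Sub-view of at most `count` characters starting at `pos`,
    // clamped to the bounds of this view. No characters are copied.
    string_view substr(std::size_t pos, std::size_t count) const {
        if (pos > size()) pos = size();
        if (count > size() - pos) count = size() - pos;
        return string_view(first_ + pos, first_ + pos + count);
    }

    // Allocates a copy, for callers that need a null-terminated buffer.
    std::string to_string() const { return std::string(first_, last_); }

private:
    const char* first_;
    const char* last_;
};

}  // namespace myutil

// Usage sketch (hypothetical helper): iterate over a view without copying it.
inline std::size_t count_vowels(myutil::string_view v) {
    std::size_t n = 0;
    for (char c : v) {
        if (c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u') ++n;
    }
    return n;
}
```

A lexer could hand out `myutil::string_view` values as token text and only call `to_string()` when a real copy is needed, provided the source string stays alive and unmodified.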

If you are willing to give up some performance for safety, have the view own a std::shared_ptr<const std::string> that it can move a std::string into; as a first step, move the entire buffer into it, and then start chopping/parsing it down. (Child views make a new shared pointer to the same data.) Then your view class is more like an immutable string with shared storage.

The upsides of the shared_ptr<const> version include safety, a longer lifetime for the views (they no longer depend on the lifetime of the original string), and the fact that you can easily forward your const "substring"-type methods to std::string, so you can write less code.

The downsides include possible incompatibility with the incoming standard one¹ and lower performance, because you are dragging a shared_ptr around.
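A rough sketch of the shared-ownership variant, assuming C++11. The class name `shared_string_view`, the offset/length representation, and the helper `first_word` are hypothetical choices made for this illustration; only `find` is forwarded here, but other const std::string methods could be forwarded the same way.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

namespace myutil {  // hypothetical namespace name

// Immutable string slice that shares ownership of its backing buffer,
// so a slice stays valid for as long as any slice exists.
class shared_string_view {
public:
    // Takes the text by value and moves it into shared storage.
    explicit shared_string_view(std::string text)
        : data_(std::make_shared<std::string>(std::move(text))),
          pos_(0), len_(data_->size()) {}

    // Child view: a new shared pointer to the same data, no character copy.
    shared_string_view substr(std::size_t pos, std::size_t count) const {
        if (pos > len_) pos = len_;
        if (count > len_ - pos) count = len_ - pos;
        return shared_string_view(data_, pos_ + pos, count);
    }

    const char* begin() const { return data_->data() + pos_; }
    const char* end() const { return begin() + len_; }
    std::size_t size() const { return len_; }

    // Forwarded to std::string::find, then restricted to this slice.
    std::size_t find(char c) const {
        std::size_t at = data_->find(c, pos_);
        if (at == std::string::npos || at >= pos_ + len_) return std::string::npos;
        return at - pos_;
    }

    std::string to_string() const { return data_->substr(pos_, len_); }

private:
    shared_string_view(std::shared_ptr<const std::string> data,
                       std::size_t pos, std::size_t len)
        : data_(std::move(data)), pos_(pos), len_(len) {
        assert(pos_ + len_ <= data_->size());
    }

    std::shared_ptr<const std::string> data_;  // shared, immutable storage
    std::size_t pos_;                          // slice start within *data_
    std::size_t len_;                          // slice length
};

}  // namespace myutil

// Usage sketch (hypothetical helper): the returned slice outlives the
// local view it was cut from, because both share the buffer.
inline myutil::shared_string_view first_word(std::string text) {
    myutil::shared_string_view whole(std::move(text));
    std::size_t space = whole.find(' ');
    return whole.substr(0, space == std::string::npos ? whole.size() : space);
}
```

The cost mentioned above is visible here: every slice carries a shared_ptr plus two sizes, and each `substr` call touches the reference count.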

I suspect that views and ranges will become increasingly important in modern C++ with the upcoming and recent improvements to the language.

boost::string_ref is apparently an implementation of a proposal for the C++1y standard.


¹ However, given how easy it is to add capabilities with template metaprogramming, having a "resource owner" template argument for the view type may be a good design decision. Then you can have owning and non-owning string_views with otherwise identical semantics...

+10

A few notes:

- The internal representation of the string lives only as long as the string itself. If you keep pointers or iterators into the string for later use (for example, printing reports, post-processing, etc.) after the string has gone out of scope, you get invalid memory access. Typically, in this type of processing, the text is kept alive for the whole process.
- Iterators are a good choice (for extreme performance and generality, I suggest using a raw const pointer, const char*, because the source can be almost anything: a string, a buffer, a memory-mapped buffer, data read from a stream, etc.)
- Good practice is not to copy tokens, but to store a pair (token begin iterator, token end iterator) in the token collection.
- For performance, avoid making many allocations (allocation is one of the most expensive operations in any language).
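The points above could be combined along these lines. This is only a sketch, assuming whitespace-separated tokens for simplicity; the names `Token` and `tokenize` are hypothetical and not taken from lexertl or Spirit.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// A token is a half-open range [first, last) into the source text.
// Creating a token copies no characters and allocates no string.
struct Token {
    const char* first;
    const char* last;

    // Copies the token text on demand (e.g. for reports).
    std::string text() const {
        assert(first <= last);
        return std::string(first, last);
    }
};

// Splits `source` into whitespace-separated tokens.
// The returned tokens are valid only while `source` is alive and unmodified.
inline std::vector<Token> tokenize(const std::string& source) {
    std::vector<Token> tokens;
    const char* p = source.data();
    const char* const end = p + source.size();
    while (p != end) {
        // Skip whitespace between tokens.
        while (p != end && std::isspace(static_cast<unsigned char>(*p))) ++p;
        const char* start = p;
        // Consume the token body.
        while (p != end && !std::isspace(static_cast<unsigned char>(*p))) ++p;
        if (start != p) tokens.push_back(Token{start, p});
    }
    return tokens;
}
```

The only allocations here are those made by the vector as it grows; because the scan works on plain const char* pointers, the same loop would work on a memory-mapped buffer or any other contiguous character source.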

You can look at lexertl (for more ideas, or to use it directly): http://www.benhanson.net/lexertl.html and Boost.Spirit (more complete): http://www.boost.org/doc/libs/release/libs/spirit/

+6

Returning and using iterators is not bad practice. This assumes, of course, that you are not modifying the input buffer, but it doesn't sound like you are.

+4

I may be considered a heretic here, but as long as you are working with a const reference to the actual string, I see no reason not to use a const char* into the string data (as long as you are using C++11).

According to the C++11 standard, the internal data of a std::string must be contiguous, and no pointers can be invalidated unless the string is subjected to operations through a non-const reference.

21.4.1 basic_string general requirements

5 The char-like objects in a basic_string object shall be stored contiguously. That is, for any basic_string object s, the identity &*(s.begin() + n) == &*s.begin() + n shall hold for all values of n such that 0 <= n < s.size().

6 References, pointers, and iterators referring to the elements of a basic_string sequence may be invalidated by the following uses of that basic_string object:

- as an argument to any standard library function taking a reference to non-const basic_string as an argument.

- Calling non-const member functions, except operator[], at, front, back, begin, rbegin, end, and rend.

So instead of using s.data(), use &*s.begin() to get the actual internal buffer.
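As a small illustration of the quoted identity, here is a sketch of a helper that turns a position into a raw pointer. The name `char_at` is hypothetical. The precondition matters: dereferencing s.begin() + n is only valid for n < s.size(), so this expression should not be used on an empty string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns a pointer to the n-th character of `s`, using the
// &*(s.begin() + n) form from the quoted paragraph.
// Precondition: n < s.size().
// The pointer stays valid as long as `s` is only accessed through
// const operations (or the exempted members such as begin and operator[]).
inline const char* char_at(const std::string& s, std::size_t n) {
    assert(n < s.size());
    return &*(s.begin() + n);
}
```

In C++11, s.data() + n yields the same address for every valid n, so either form can serve as the start of a token.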

NOTE: I am sure that these guarantees do not hold for earlier versions of the standard.

0
