Choosing a string delimiter in C

I am wondering what would be a good, effective way of delimiting strings that can contain basically any character. For example, I need to concatenate n strings that might look like this:

char *str_1 = "foo; for|* 1.234+\"@!`";
char *str_n = "bar; for|* 1.234+%\"@`";

into a final string like:

 char *str_final = "foo; for|* 1.234+\"@!`bar; for|* 1.234+%\"@`"; // split? 

What separator can be used to properly separate it?

Note that there can be more than two strings to concatenate.

I am open to suggestions.

thanks

+4
6 answers

Since my comments kept getting longer, here is a complete answer:

Your char * buffer should store the length of the string in the first X bytes (as Pascal does, for example). After this length comes the string data, which can contain any characters you like. After that, the next X bytes give the length of the next string, and so on until the end, which is marked by an empty string (i.e., the last X bytes say that the next string has zero length, and your application takes this as the signal to stop looking for more strings).

One advantage is that you do not need to scan the string data: finding the next string from the start of the current one takes O(1) time. Counting how many strings are in the list is O(n), but it will still be very fast (if O(n) is unacceptable you can work around it, but I don’t think that is worth going into right now).

Another advantage is that the string data can contain any character you like. This can also be a drawback: if your string can contain a NUL character, you can still extract it safely, but you must be careful not to pass it to a C string function (e.g. strlen() or strcat()), which will treat the first NUL character as the end of your data (which it may or may not be). You will have to rely on memcpy() and pointer arithmetic.

The problem is the value of X (the number of bytes you use to store the string length). The simplest choice is X = 1, which sidesteps all endianness and alignment problems but limits your strings to 255 characters. If you can live with that limitation, fine, but 255 seems a little low to me.
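A minimal sketch of the X = 1 layout described above may help. The helper names (lp_append, lp_finish, lp_next, lp_count) are hypothetical and not from the answer, and the sketch assumes the caller supplies a buffer that is large enough. Because a zero length marks the end of the list, empty strings cannot be stored in this layout.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append one record (1-byte length, then the raw bytes) at buf[pos].
   Returns the new write position. Hypothetical helper: the caller
   guarantees that buf has room and that len is in the range 1..255. */
static size_t lp_append(unsigned char *buf, size_t pos,
                        const void *data, size_t len)
{
    assert(len >= 1 && len <= 255);   /* 0 is reserved for the end marker */
    buf[pos++] = (unsigned char)len;
    memcpy(buf + pos, data, len);     /* data may contain NUL bytes */
    return pos + len;
}

/* Write the zero-length record that terminates the list. */
static size_t lp_finish(unsigned char *buf, size_t pos)
{
    buf[pos++] = 0;
    return pos;
}

/* Return a pointer to the next record's length byte: O(1), no scanning. */
static const unsigned char *lp_next(const unsigned char *rec)
{
    return rec + 1 + rec[0];
}

/* Count records by hopping from length byte to length byte: O(n). */
static size_t lp_count(const unsigned char *buf)
{
    size_t n = 0;
    for (const unsigned char *p = buf; p[0] != 0; p = lp_next(p))
        n++;
    return n;
}
```

A record's payload starts one byte after its length byte, so it can be copied out with memcpy() without ever calling strlen() on it.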

X could be 2 or 4 bytes, but you will need an (unsigned) data type that holds at least that many bytes (uint16_t or uint32_t from stdint.h, or perhaps uint_least16_t or uint_least32_t). The best solution would be to make X = sizeof(size_t), since size_t is guaranteed to be able to hold the length of any string you could store.

X > 1 introduces alignment issues and, if network portability matters, endianness. The easiest way to read the first X bytes as a size_t value is to cast your char * data to size_t * and simply dereference it. However, if you cannot guarantee that your char * data is correctly aligned, this will break on some systems. Even if you do guarantee alignment, you will have to waste a few bytes at the end of most strings so that the next length value is aligned.

The easiest way to get around alignment is to manually convert the first sizeof(size_t) bytes to a size_t value. You will need to decide whether to store the data little-endian or big-endian. Most machines are natively little-endian, but for manual conversion it does not matter; just pick one. The number 65538 (2^16 + 2) stored in 4 bytes looks like { 0, 1, 0, 2 } big-endian and { 2, 0, 1, 0 } little-endian.

Once you have decided (it doesn’t matter, choose whichever you like), you simply cast the first X bytes to unsigned char, then to size_t, then bit-shift each one by the appropriate amount to put it in the right place, and add them all together. In the big-endian example above, 0 is multiplied by 2^24, 1 by 2^16, 0 by 2^8 and 2 by 2^0 (i.e. 1), giving 0 + 65536 + 0 + 2 = 65538. There is probably no difference in efficiency between big- and little-endian if you do the conversion manually; I want to point out (again) that the choice is completely arbitrary, as far as I can tell.
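The manual conversion can be sketched as follows. This version assumes a fixed 4-byte, big-endian length held in a uint32_t rather than a size_t, and the function names put_len_be and get_len_be are made up for illustration.

```c
#include <stdint.h>

/* Store a 32-bit length in 4 bytes, most significant byte first. */
static void put_len_be(unsigned char out[4], uint32_t len)
{
    out[0] = (unsigned char)((len >> 24) & 0xFF);
    out[1] = (unsigned char)((len >> 16) & 0xFF);
    out[2] = (unsigned char)((len >> 8) & 0xFF);
    out[3] = (unsigned char)(len & 0xFF);
}

/* Rebuild the length byte by byte. Only single bytes are read, so the
   buffer needs no particular alignment and the host's endianness is
   irrelevant. */
static uint32_t get_len_be(const unsigned char in[4])
{
    return ((uint32_t)in[0] << 24) |
           ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |
            (uint32_t)in[3];
}
```

For little-endian storage the shift amounts would simply be applied in the opposite order.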

Doing the conversion manually avoids alignment problems and completely sidesteps endianness differences between systems, so data transferred from a little-endian machine to a big-endian one is read the same way. There is still a potential problem when transferring data from a system where sizeof(size_t) == 4 to one where sizeof(size_t) == 8. If that is a concern, you can either a) give up on size_t and pick a fixed size, or b) encode the sender's sizeof(size_t) value as the first byte of the data (one byte is all you need) and have the receiver make the necessary adjustments. Option a) may be simpler, but it can cause problems (what if you pick a size small enough to accommodate the older machines on your network and then, as they are phased out, you lose out on storage space for your data?), so I would prefer b), because it scales with whatever system you are running on (16-bit, 32-bit, 64-bit, maybe even 128-bit in the future). You may not need any of this, though.

</vomit> I leave it as an exercise for the reader to make sense of everything I just wrote.

+3

Perhaps you could encode the length of the string, followed by a special character, before each string? That way you do not need to worry about which characters appear in the next N characters. It might also be a good idea to null-terminate each substring.

The only advantage of this approach is that you can parse the string pretty quickly.

EDIT: An even better approach is to use the first 2-4 bytes, as suggested by Chris in the comment below, instead of the encoded length + special character.

+3

One option is to use a null character as the separator, with a double null terminating the list of strings. It would look something like this:

 const char* str_final = "foo; for|* 1.234+\"@!`\0bar; for|* 1.234+%\"@`\0";
 //                                   delimiter ^             delimiter ^

Raymond Chen gave a good overview of double-null-terminated strings in a blog post. The format is used by several functions in the Windows API.
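Walking such a list takes only strlen() and pointer arithmetic. Here is a rough sketch; the names multi_sz_count and multi_sz_get are hypothetical, and it assumes no entry is empty, since an empty entry is indistinguishable from the end of the list.

```c
#include <stddef.h>
#include <string.h>

/* Count the strings in a double-null-terminated list. */
static size_t multi_sz_count(const char *list)
{
    size_t n = 0;
    /* Each entry is an ordinary C string; skip it and its terminator. */
    for (const char *p = list; *p != '\0'; p += strlen(p) + 1)
        n++;
    return n;
}

/* Return the i-th string (0-based), or NULL if the list is shorter. */
static const char *multi_sz_get(const char *list, size_t i)
{
    const char *p = list;
    while (*p != '\0') {
        if (i-- == 0)
            return p;
        p += strlen(p) + 1;
    }
    return NULL;
}
```

Each pointer returned by multi_sz_get() is a normal null-terminated string, so it can be handed directly to printf() or the str functions without copying.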

+2

If you know that your strings will always be valid UTF-8 (or ASCII) text, you can use a byte that cannot appear in valid UTF-8 (or ASCII) as the delimiter. In UTF-8, the bytes C0, C1, F5, F6, F7, F8, F9, FA, FB, FC, FD, FE, and FF are invalid. In ASCII, any byte with the high bit set is invalid.
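As an illustration, here is a sketch that joins strings with the byte FF and then walks the fields again. The choice of FF among the invalid bytes, and the names DELIM, join_with_delim and next_field, are assumptions made for this example.

```c
#include <stddef.h>
#include <string.h>

#define DELIM '\xFF'   /* never occurs in valid UTF-8 or 7-bit ASCII */

/* Join n strings into out, separated by DELIM and NUL-terminated.
   Returns the length written, or (size_t)-1 if out is too small. */
static size_t join_with_delim(char *out, size_t cap,
                              const char *const *strs, size_t n)
{
    size_t pos = 0;
    if (cap == 0)
        return (size_t)-1;
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(strs[i]);
        size_t need = len + (i > 0 ? 1 : 0);
        if (need >= cap - pos)        /* keep room for the final NUL */
            return (size_t)-1;
        if (i > 0)
            out[pos++] = DELIM;
        memcpy(out + pos, strs[i], len);
        pos += len;
    }
    out[pos] = '\0';
    return pos;
}

/* Return the length of the field starting at s, and set *next to the
   start of the following field, or to NULL after the last one. */
static size_t next_field(const char *s, const char **next)
{
    const char *d = strchr(s, DELIM);
    if (d == NULL) {
        *next = NULL;
        return strlen(s);
    }
    *next = d + 1;
    return (size_t)(d - s);
}
```

No escaping is needed here, but the scheme silently breaks if an input ever contains the delimiter byte, so it is only suitable when the inputs are validated first.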

+2

One solution is to choose an escape character and a separator. The backslash \ is the usual escape character, but that can be confusing because it is already the escape character in string literals. The choice doesn’t really matter, so let / be the escape character and the semicolon ; the separator. Ideally, choose the two characters that are least likely to occur in your strings.

When you concatenate strings, the first step is to search the unescaped strings for both characters and replace them with their escaped versions:

 str1 = "foo;bar;baz";
 str2 = "foo/bar/baz";

becomes

 estr1 = "foo/;bar/;baz";
 estr2 = "foo//bar//baz";

Then they are concatenated with a separator:

 res = "foo/;bar/;baz;foo//bar//baz"; 

That's it. Splitting is done by searching for a separator that has no leading escape character, and then replacing the escaped characters in each resulting string with their unescaped versions.

This is a good choice if you want to process the combined string with functions that expect a single null-terminated string, e.g. the str functions, or print it with the printf functions. If you can guarantee that only your own functions will work on these strings, then the \0-separated approach mentioned in the other answers is more efficient, especially since you do not really need to split anything: you can pass a pointer into the full buffer to use one of its substrings with the str or printf functions.
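The escape, join, and split steps could look roughly like this in C. It is a sketch under the same choice of / as escape and ; as separator; the function names are hypothetical, and the caller is assumed to provide buffers of the sizes stated in the comments.

```c
#include <stddef.h>
#include <string.h>

#define ESC '/'
#define SEP ';'

/* Copy src to dst, prefixing every ESC or SEP with ESC.
   Returns the new end of dst (no NUL is written here). */
static char *append_escaped(char *dst, const char *src)
{
    for (; *src != '\0'; src++) {
        if (*src == ESC || *src == SEP)
            *dst++ = ESC;
        *dst++ = *src;
    }
    return dst;
}

/* Join n strings into out with an unescaped SEP between them.
   out must hold at least 2 * (total input length) + n bytes. */
static void join_escaped(char *out, const char *const *strs, size_t n)
{
    char *p = out;
    for (size_t i = 0; i < n; i++) {
        if (i > 0)
            *p++ = SEP;
        p = append_escaped(p, strs[i]);
    }
    *p = '\0';
}

/* Decode the field starting at s into out (at least strlen(s) + 1 bytes).
   Returns the start of the next field, or NULL after the last one. */
static const char *split_next(const char *s, char *out)
{
    while (*s != '\0' && *s != SEP) {
        if (*s == ESC && s[1] != '\0')
            s++;                 /* drop the escape, keep the next char */
        *out++ = *s++;
    }
    *out = '\0';
    return (*s == SEP) ? s + 1 : NULL;
}
```

Joining "foo;bar;baz" and "foo/bar/baz" this way produces the same combined string as in the example above, and calling split_next() repeatedly until it returns NULL recovers the two originals.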

+2

2 ideas:

1) Use the standard "escape" approach, similar to how a string literal is written in C.

2) Use a single '\0' character as the delimiter, and two of them as the end-of-list marker.

+1
