UTF-8 aware strncpy

I find it hard to believe that I am the first person to run into this problem, but I searched for quite a while and did not find a solution.

I would like to use strncpy, but have it be UTF-8 aware, so that it never writes a partial UTF-8 character into the destination string.

Otherwise you can never be sure that the resulting string is valid UTF-8, even if you know the source is (whenever the source string is longer than the maximum length).
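To make the problem concrete, here is a minimal C sketch. The helper names utf8_ends_mid_character and strncpy_splits are made up for this illustration, and the check assumes the bytes were cut from valid UTF-8 (4-byte maximum, per RFC 3629).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the n bytes at s stop in the middle of
 * a UTF-8 sequence, 0 otherwise. Assumes s was cut from valid UTF-8. */
int utf8_ends_mid_character(const char *s, size_t n)
{
    size_t i = n;
    size_t need;
    unsigned char lead;

    /* Step back over trailing continuation bytes (10xxxxxx). */
    while (i > 0 && ((unsigned char)s[i - 1] & 0xC0) == 0x80)
        i--;
    if (i == 0)
        return n > 0;              /* nothing but continuation bytes */

    lead = (unsigned char)s[i - 1];
    if (lead < 0x80)       need = 1;
    else if (lead >= 0xF0) need = 4;
    else if (lead >= 0xE0) need = 3;
    else                   need = 2;

    /* Bytes actually present, counting from the lead byte. */
    return (n - (i - 1)) < need;
}

/* Hypothetical demo: copy at most n bytes with plain strncpy and report
 * whether the cut landed inside a character. dst must hold n bytes. */
int strncpy_splits(char *dst, const char *src, size_t n)
{
    size_t len = strlen(src);
    size_t copied = len < n ? len : n;

    strncpy(dst, src, n);
    return utf8_ends_mid_character(dst, copied);
}
```

With the source "h\xc3\xa9llo" ("héllo"), a limit of 2 bytes leaves 'h' followed by a lone 0xC3 lead byte, so strncpy_splits reports 1; a limit of 3 bytes ends on a character boundary and reports 0.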

Checking the resulting string afterwards may work, but if this needs to be called a lot, it would be better to have a strncpy function that does the check itself.

glib has g_utf8_strncpy, but it copies a given number of Unicode characters, whereas I am looking for a copy function that limits by byte length.

To be clear, "UTF-8 aware" means that it should not exceed the limit of the destination buffer and should never copy only part of a UTF-8 character (that is, valid UTF-8 input should never lead to invalid UTF-8 output).


Note:

Some answers point out that strncpy zero-fills all trailing bytes and does not guarantee NUL termination. In retrospect I should have asked for a UTF-8 aware strlcpy, but at the time I did not know that function existed.

c++ c utf-8 strncpy
6 answers

To answer my own question, here is the C function I ended up with (not using C++ for this project):

Notes:

  • This is not a strncpy clone for UTF-8; it behaves more like strlcpy from OpenBSD.
  • utf8_skip_data is copied from glib's gutf8.c.
  • It does not validate the UTF-8; this is intentional.

I hope this is useful to others. I am interested in feedback, but please no pedantic zealotry about NULL behavior unless it is an actual bug or misleading/incorrect behavior.

Thanks to James Kanze, whose answer provided the basis for this, though it was incomplete and in C++ (I needed a C version).

    /* Length in bytes of the UTF-8 sequence introduced by each lead byte. */
    static const size_t utf8_skip_data[256] = {
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
        3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,6,6,1,1
    };

    char *strlcpy_utf8(char *dst, const char *src, size_t maxncpy)
    {
        char *dst_r = dst;
        size_t utf8_size;

        if (maxncpy > 0) {
            /* Copy whole characters while one more still leaves room for the NUL. */
            while (*src != '\0' &&
                   (utf8_size = utf8_skip_data[*((const unsigned char *)src)]) < maxncpy)
            {
                maxncpy -= utf8_size;
                switch (utf8_size) {
                    case 6: *dst++ = *src++; /* fall through */
                    case 5: *dst++ = *src++; /* fall through */
                    case 4: *dst++ = *src++; /* fall through */
                    case 3: *dst++ = *src++; /* fall through */
                    case 2: *dst++ = *src++; /* fall through */
                    case 1: *dst++ = *src++;
                }
            }
            *dst = '\0';
        }
        return dst_r;
    }

I'm not sure what you mean by UTF-8 aware; strncpy copies bytes, not characters, and the size of the buffer is also given in bytes. If you mean that it should only copy complete UTF-8 characters, stopping, for example, when there is no room for the next character, I am not aware of such a function, but it should not be hard to write:

    int utf8Size( char ch )
    {
        static int const sizeTable[] =
        {
            // ...
        };
        return sizeTable[ static_cast<unsigned char>( ch ) ];
    }

    char* stru8ncpy( char* dest, char const* source, int n )
    {
        while ( *source != '\0' && utf8Size( *source ) < n ) {
            n -= utf8Size( *source );
            switch ( utf8Size( *source ) ) {
            case 6: *dest ++ = *source ++;
            case 5: *dest ++ = *source ++;
            case 4: *dest ++ = *source ++;
            case 3: *dest ++ = *source ++;
            case 2: *dest ++ = *source ++;
            case 1: *dest ++ = *source ++;
                break;
            default:
                throw IllegalUTF8();
            }
        }
        *dest = '\0';
        return dest;    // note: points at the terminating '\0'
    }

(The contents of the table in utf8Size are a little painful to generate, but this is a function you will use a lot if you are dealing with UTF-8, and you only need to do it once.)
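If typing the table by hand is the painful part, it can be derived from the lead-byte bit patterns. The sketch below is one way to do that in C; the names utf8_lead_len and utf8_fill_skip_table are hypothetical, and it follows the convention of the glib-style table in the question author's answer (continuation bytes and the never-valid bytes 0xFE/0xFF count as 1), which is an assumption about what the elided table contains.

```c
#include <stddef.h>

/* Hypothetical helper: sequence length implied by a lead byte.
 * Stray continuation bytes and invalid bytes are treated as length 1. */
size_t utf8_lead_len(unsigned char b)
{
    if (b < 0xC0) return 1;   /* ASCII, or a continuation byte 10xxxxxx */
    if (b < 0xE0) return 2;   /* 110xxxxx */
    if (b < 0xF0) return 3;   /* 1110xxxx */
    if (b < 0xF8) return 4;   /* 11110xxx */
    if (b < 0xFC) return 5;   /* 111110xx, obsolete 5-byte form */
    if (b < 0xFE) return 6;   /* 1111110x, obsolete 6-byte form */
    return 1;                 /* 0xFE and 0xFF never occur in UTF-8 */
}

/* Fill a 256-entry lookup table once, e.g. at program start. */
void utf8_fill_skip_table(size_t table[256])
{
    int i;

    for (i = 0; i < 256; i++)
        table[i] = utf8_lead_len((unsigned char)i);
}
```

A table that should reject illegal lead bytes instead (as the default branch above suggests) would store a sentinel such as 0 for them rather than 1.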


I tested this on many examples of UTF-8 strings with multibyte characters. If the source is too long, it does a reverse search (starting at the null terminator) and works backward to find the last complete UTF-8 character that fits into the destination buffer. It always ensures that the destination is null-terminated.

    #include <string.h>

    char* utf8cpy(char* dst, const char* src, size_t sizeDest)
    {
        if (sizeDest) {
            size_t sizeSrc = strlen(src); // number of bytes not including null
            while (sizeSrc >= sizeDest) {
                const char* lastByte = src + sizeSrc; // Initially, pointing to the null terminator.
                while (lastByte-- > src)
                    if ((*lastByte & 0xC0) != 0x80) // Found the initial byte of the (potentially) multi-byte character (or found null).
                        break;
                sizeSrc = lastByte - src;
            }
            memcpy(dst, src, sizeSrc);
            dst[sizeSrc] = '\0';
        }
        return dst;
    }
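A variation on the same backward-scan idea, as a sketch rather than a drop-in replacement: instead of calling strlen on the whole source, scan forward at most as far as the destination can hold, then back up over continuation bytes. The names utf8_fit_len and utf8_copy_fit are hypothetical, and the input is assumed to be valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: largest byte count <= limit at which src can be cut
 * without splitting a character. Assumes src is valid, NUL-terminated UTF-8. */
size_t utf8_fit_len(const char *src, size_t limit)
{
    size_t n = 0;

    /* Look at no more than limit bytes of the source. */
    while (n < limit && src[n] != '\0')
        n++;
    if (src[n] == '\0')
        return n;                       /* the whole string fits */

    /* src[n] is the first byte left out; if it is a continuation byte
     * (10xxxxxx), the cut is inside a character, so back up to its lead. */
    while (n > 0 && ((unsigned char)src[n] & 0xC0) == 0x80)
        n--;
    return n;
}

/* Copy as many whole characters as fit in dst_size - 1 bytes, then terminate. */
char *utf8_copy_fit(char *dst, const char *src, size_t dst_size)
{
    if (dst_size > 0) {
        size_t n = utf8_fit_len(src, dst_size - 1);

        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return dst;
}
```

For example, copying "ab\xc3\xa9" ("abé") into a 4-byte buffer yields "ab", because the two-byte "é" does not fit next to the terminator. The work is bounded by the destination size rather than by the source length, which matters when the source is much longer than the buffer.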

strncpy() is a terrible function:

  • If there is not enough space, the resulting string will not be NUL-terminated.
  • If there is enough space, the remainder is filled with NULs. This can be painful if the destination buffer is very large.

Even if the characters stay in the ASCII range (0x7f and below), the resulting string will not be what you want. In the UTF-8 case, it may be both unterminated and end in an invalid UTF-8 sequence.

The best advice is to avoid strncpy().

EDIT: ad 1):

    #include <stdio.h>
    #include <string.h>

    int main (void)
    {
        char buff [4];

        strncpy (buff, "hello world!\n", sizeof buff );
        printf("%s\n", buff );

        return 0;
    }

Agreed, the buffer will not be overrun. But the result is still undesirable: strncpy() solves only part of the problem, which makes it misleading.
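Both behaviors can be checked without printing an unterminated buffer (which is undefined behavior). The two helpers below, has_terminator and trailing_nuls, are hypothetical names for this sketch and use only standard C.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if a NUL terminator appears within the first size bytes of buf. */
int has_terminator(const char *buf, size_t size)
{
    return memchr(buf, '\0', size) != NULL;
}

/* Counts the NUL bytes at the end of buf[0 .. size-1]. */
size_t trailing_nuls(const char *buf, size_t size)
{
    size_t n = 0;

    while (n < size && buf[size - 1 - n] == '\0')
        n++;
    return n;
}
```

After strncpy(small, "hello", 4) into a 4-byte buffer, has_terminator(small, 4) is 0: the buffer holds 'h', 'e', 'l', 'l' and nothing else. After strncpy(big, "hi", 16) into a 16-byte buffer, trailing_nuls(big, 16) is 14: everything after the two copied bytes has been zero-filled.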

UPDATE (2012-10-31): Since this is a nasty problem, I decided to hack up my own version, mimicking the ugly strncpy() behavior. The return value is the number of bytes copied, though.

    #include <stdio.h>
    #include <string.h>

    size_t utf8ncpy(char *dst, char *src, size_t todo);
    static int cnt_utf8(unsigned ch, size_t len);

    static int cnt_utf8(unsigned ch, size_t len)
    {
        if (!len) return 0;

        if ((ch & 0x80) == 0x00) return 1;
        else if ((ch & 0xe0) == 0xc0) return 2;
        else if ((ch & 0xf0) == 0xe0) return 3;
        else if ((ch & 0xf8) == 0xf0) return 4;
        else if ((ch & 0xfc) == 0xf8) return 5;
        else if ((ch & 0xfe) == 0xfc) return 6;
        else return -1; /* Default (Not in the spec) */
    }

    size_t utf8ncpy(char *dst, char *src, size_t todo)
    {
        size_t done, idx, chunk, srclen;

        srclen = strlen(src);
        for (done=idx=0; idx < srclen; idx+=chunk) {
            int ret;
            for (chunk=0; done+chunk < todo; chunk++) {
                ret = cnt_utf8( src[idx+chunk], srclen - (idx+chunk) );
                if (ret ==1) continue;   /* Normal character: collect it into chunk */
                if (ret < 0) continue;   /* Bad stuff: treat as normal char */
                if (ret ==0) break;      /* EOF */
                if (!chunk) chunk = ret; /* a UTF-8 multibyte character */
                else ret = 1;            /* we already collected a number (chunk) of normal characters */
                break;
            }
            if (ret > 1 && done+chunk > todo) break;
            if (done+chunk > todo) chunk = todo - done;
            if (!chunk) break;
            memcpy( dst+done, src+idx, chunk);
            done += chunk;
            if (ret < 1) break;
        }
        /* This is part of the dreaded strncpy() behavior:
        ** pad the destination string with NULs
        ** up to its intended size
        */
        if (done < todo) memset(dst+done, 0, todo-done);
        return done;
    }

    int main(void)
    {
        char *string = "Hell\xc3\xb6 \xf1\x82\x82\x82, world\xc2\xa1!";
        char buffer[30];
        unsigned result, len;

        for (len = sizeof buffer-1; len < sizeof buffer; len -=3) {
            result = utf8ncpy(buffer, string, len);
            /* remove the following line to get the REAL strncpy() behaviour */
            buffer[result] = 0;
            printf("Chop @%u\n", len );
            printf("Org:[%s]\n", string );
            printf("Res:%u\n", result );
            printf("New:[%s]\n", buffer );
        }

        return 0;
    }

Here is a C++ solution:

u8string.h:

    #ifndef U8STRING_H
    #define U8STRING_H 1

    #include <stddef.h>

    #ifdef __cplusplus
    extern "C" {
    #endif

    /**
     * Copies the first few characters of the UTF-8-encoded string pointed to by
     * \p src into \p dest_buf, as many UTF-8-encoded characters as can be written in
     * <code>dest_buf_len - 1</code> bytes or until the NUL terminator of the string
     * pointed to by \p src is reached.
     *
     * The string of bytes that are written into \p dest_buf is NUL terminated
     * if \p dest_buf_len is greater than 0.
     *
     * \returns \p dest_buf
     */
    char * u8slbcpy(char *dest_buf, const char *src, size_t dest_buf_len);

    #ifdef __cplusplus
    }
    #endif

    #endif

u8slbcpy.cpp:

    #include "u8string.h"

    #include <cstring>
    #include <utf8.h>

    char * u8slbcpy(char *dest_buf, const char *src, size_t dest_buf_len)
    {
        if (dest_buf_len <= 0) {
            return dest_buf;
        } else if (dest_buf_len == 1) {
            dest_buf[0] = '\0';
            return dest_buf;
        }

        size_t num_bytes_remaining = dest_buf_len - 1;
        utf8::unchecked::iterator<const char *> it(src);
        const char * prev_base = src;
        while (*it++ != '\0') {
            const char *base = it.base();
            ptrdiff_t diff = (base - prev_base);
            if (num_bytes_remaining < diff) {
                break;
            }
            num_bytes_remaining -= diff;
            prev_base = base;
        }

        size_t n = dest_buf_len - 1 - num_bytes_remaining;
        std::memmove(dest_buf, src, n);
        dest_buf[n] = '\0';

        return dest_buf;
    }

The u8slbcpy() function has a C interface, but it is implemented in C++. My implementation uses only the UTF8-CPP library.

I think this is pretty much what you are looking for, but note that combining characters remain a problem. Suppose one or more combining characters apply to the nth character (itself not a combining character), and the destination buffer is large enough to hold the UTF-8 encoding of characters 1 through n, but not the combining characters of character n. In that case, the bytes representing characters 1 through n are written, but none of the combining characters of n are. In effect, you could say that the nth character is partially written.
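One way to soften the combining-character caveat is to back off further when the first character left out is a combining mark. The C sketch below only recognizes the Combining Diacritical Marks block (U+0300 to U+036F, encoded as 0xCC 0x80 through 0xCD 0xAF); real grapheme-cluster segmentation is far more involved, and both function names are hypothetical.

```c
#include <stddef.h>

/* Returns 1 if p points at a UTF-8 encoded code point in U+0300..U+036F.
 * This covers one Unicode block only; it is not a general combining test. */
int starts_with_combining_mark(const char *p)
{
    unsigned char b0 = (unsigned char)p[0];
    unsigned char b1;

    if (b0 == 0xCC) {               /* U+0300..U+033F */
        b1 = (unsigned char)p[1];
        return b1 >= 0x80 && b1 <= 0xBF;
    }
    if (b0 == 0xCD) {               /* U+0340..U+036F */
        b1 = (unsigned char)p[1];
        return b1 >= 0x80 && b1 <= 0xAF;
    }
    return 0;
}

/* Given a cut point n that lies on a character boundary of src
 * (n <= strlen(src)), move it back so that a base character is not
 * separated from the combining marks that follow it. */
size_t utf8_back_off_combining(const char *src, size_t n)
{
    if (!starts_with_combining_mark(src + n))
        return n;                   /* nothing is being split */

    while (n > 0) {
        /* Step back to the start of the previous character. */
        do {
            n--;
        } while (n > 0 && ((unsigned char)src[n] & 0xC0) == 0x80);

        if (!starts_with_combining_mark(src + n))
            break;                  /* reached the base character; drop it too */
    }
    return n;
}
```

For "ae\xcc\x81" ("a" followed by "e" plus U+0301 COMBINING ACUTE ACCENT), a cut point of 2 would keep the bare "e"; utf8_back_off_combining moves it back to 1, so only "a" is copied.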


To comment on the answer above that says "strncpy() is a terrible function": I hate to even comment on such blanket statements at the risk of starting yet another Internet programming jihad, but I will anyway, since statements like this mislead those who might come here looking for answers.

Okay, maybe the C string functions are "old school". Maybe all strings in C/C++ should live in some kind of smart container, and so on. Maybe one should use C++ instead of C (when there is a choice). These are more matters of preference and arguments for other topics.

I came here looking for a UTF-8 strncpy() myself. Not that I could not write one (the encoding is IMHO simple and elegant), but I wanted to see how others had done theirs, and perhaps find one optimized in ASM.

To the "God's gift to the programming world" people: put your hubris down for a moment and look at some facts.

There is nothing wrong with strncpy(), or with any of the similar functions that have the same side effects and issues, such as _snprintf().

I say: "strncpy() is not terrible", but rather "terrible programmers use it terribly".

What is "terrible" is not knowing the rules. Furthermore, on the whole subject of security (buffer overruns, for example) and program stability, there would be no need for, say, Microsoft to add "Safe String Functions" to its CRT library if the rules were simply followed.

The main ones:

  • "sizeof()" returns the length of a static string with the terminator.
  • "strlen()" returns the length of a string without the terminator.
  • Most, if not all, of the "n" functions just clamp to "n" without adding a terminator.
  • There is implicit ambiguity about what "buffer size" means in functions that also take the size of an input buffer, i.e. the "(char *pszBuffer, int iBufferSize)" kind. It is safer to assume the worst: pass a size one smaller than the actual buffer size, and add a terminator at the end to be sure.
  • For string inputs, buffers, etc., set and use a reasonable size limit based on the expected average and maximum, to hopefully avoid truncating input and to eliminate buffer overruns, period.
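A short C sketch of the first, second and fourth rules. The wrapper name copy_clamped and the kGreeting constant are made up for the example; everything else is standard C.

```c
#include <stddef.h>
#include <string.h>

static const char kGreeting[] = "hello";

/* sizeof on a char array counts the terminator: 6 for "hello". */
size_t greeting_size(void)
{
    return sizeof kGreeting;
}

/* strlen does not count the terminator: 5 for "hello". */
size_t greeting_length(void)
{
    return strlen(kGreeting);
}

/* Hypothetical wrapper for the "pass one less and terminate yourself" rule.
 * buf_size is the real size of buf in bytes. */
char *copy_clamped(char *buf, size_t buf_size, const char *input)
{
    if (buf_size == 0)
        return buf;
    strncpy(buf, input, buf_size - 1);  /* may leave buf unterminated ... */
    buf[buf_size - 1] = '\0';           /* ... so terminate explicitly */
    return buf;
}
```

Note that copy_clamped clamps at a byte count, so it is safe against overruns but can still cut a multibyte UTF-8 character in half, which is exactly the problem the question is about.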

This is how I personally handle such things, along with other rules that you simply have to know and practice.

A handy macro for the size of a static string:

    // Size of a string without terminator
    #define SIZESTR(x) (sizeof(x) - 1)

When declaring local/stack string buffers:

A) Size the buffer to, for example, 1023 + 1 for the terminator, allowing strings of up to 1023 characters.

B) I initialize the string to zero length, and also terminate it at the very end to cover a possible "n" truncation.

    char szBuffer[1024];
    szBuffer[0] = szBuffer[SIZESTR(szBuffer)] = 0;

Alternatively, one could of course just write char szBuffer[1024] = {0}; but that has a performance cost, since the compiler generates a memset()-like call to zero the whole buffer. It does make things cleaner for debugging, though, and I prefer this style for static (vs. local/stack) string buffers.

Now a "strncpy()" that follows the rules:

    char szBuffer[1024];
    szBuffer[0] = szBuffer[SIZESTR(szBuffer)] = 0;
    strncpy(szBuffer, pszSomeInput, SIZESTR(szBuffer));

There are other "rules" and issues, of course, but these are the main ones that come to mind. You just have to know how the library functions work and use safe practices like these.

Finally, I was already using ICU in my project, so I decided to go with it and use the macros in "utf8.h" to make my own "strncpy()".

