Here is a function you might find useful: utf8_to_latin9(). It converts to ISO-8859-15 (including the EURO sign, which ISO-8859-1 does not have), but also works correctly for the UTF-8 → ISO-8859-1 conversion part of an ISO-8859-1 → UTF-8 → ISO-8859-1 round-trip.
The function ignores invalid code points, similar to the //IGNORE flag for iconv, but does not recompose decomposed UTF-8 sequences; that is, it will not turn U+006E U+0303 into U+00F1. I do not bother recomposing, because iconv does not do it either.
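To make the recomposition point concrete, here is a small sketch. The helper name decode_utf8_2byte() is hypothetical and not part of the function below. Precomposed ñ is the byte pair C3 B1, which decodes to U+00F1 and fits in Latin-9. The decomposed form is the byte 6E (n) followed by CC 83, which decodes to U+0303; that code point has no Latin-9 equivalent, so the conversion keeps only the plain n.

```c
#include <limits.h>

/* Hypothetical helper for illustration only: decode one two-byte UTF-8
 * sequence (110xxxxx 10xxxxxx) into its code point.
 * Returns UINT_MAX if the two bytes do not form such a sequence. */
static unsigned int decode_utf8_2byte(unsigned char b0, unsigned char b1)
{
    if ((b0 & 0xE0U) != 0xC0U || (b1 & 0xC0U) != 0x80U)
        return UINT_MAX;
    return ((unsigned int)(b0 & 0x1FU) << 6) | (unsigned int)(b1 & 0x3FU);
}
```

Calling decode_utf8_2byte(0xC3, 0xB1) gives 0x00F1, while decode_utf8_2byte(0xCC, 0x83) gives 0x0303, which is above 255 and not in the Latin-9 special-case table.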
The function is very careful about string access and never scans beyond the input buffer. The output buffer must be one byte longer than length, because the function always appends the terminating NUL byte. It returns the number of characters (bytes) written to the output, not counting that terminating NUL byte.
    #include <stddef.h>   /* size_t */

    /* Map a Unicode code point to ISO-8859-15.
     * Returns 256 if the code point has no Latin-9 equivalent. */
    static inline unsigned int to_latin9(const unsigned int code)
    {
        /* Code points 0 to 255 are preserved as-is. */
        if (code < 256U)
            return code;
        switch (code) {
        case 0x0152U: return 188U; /* U+0152 = 0xBC: OE ligature */
        case 0x0153U: return 189U; /* U+0153 = 0xBD: oe ligature */
        case 0x0160U: return 166U; /* U+0160 = 0xA6: S with caron */
        case 0x0161U: return 168U; /* U+0161 = 0xA8: s with caron */
        case 0x0178U: return 190U; /* U+0178 = 0xBE: Y with diaeresis */
        case 0x017DU: return 180U; /* U+017D = 0xB4: Z with caron */
        case 0x017EU: return 184U; /* U+017E = 0xB8: z with caron */
        case 0x20ACU: return 164U; /* U+20AC = 0xA4: Euro */
        default:      return 256U;
        }
    }

    /* Convert a UTF-8 string to ISO-8859-15.
     * The output buffer must have room for length + 1 bytes.
     * Input and output may point to the same buffer. */
    size_t utf8_to_latin9(char *const output, const char *const input, const size_t length)
    {
        unsigned char             *out = (unsigned char *)output;
        const unsigned char       *in  = (const unsigned char *)input;
        const unsigned char *const end = (const unsigned char *)input + length;
        unsigned int               c;

        while (in < end)
            if (*in < 128)
                *(out++) = *(in++); /* ASCII, copied as-is */
            else
            if (*in < 192)
                in++;               /* 10xxxxxx: stray continuation byte, skipped */
            else
            if (*in < 224) {        /* 110xxxxx 10xxxxxx */
                if (in + 1 >= end)
                    break;
                if ((in[1] & 192U) == 128U) {
                    c = to_latin9( (((unsigned int)(in[0] & 0x1FU)) << 6U)
                                 |  ((unsigned int)(in[1] & 0x3FU)) );
                    if (c < 256)
                        *(out++) = c;
                }
                in += 2;
            } else
            if (*in < 240) {        /* 1110xxxx 10xxxxxx 10xxxxxx */
                if (in + 2 >= end)
                    break;
                if ((in[1] & 192U) == 128U &&
                    (in[2] & 192U) == 128U) {
                    c = to_latin9( (((unsigned int)(in[0] & 0x0FU)) << 12U)
                                 | (((unsigned int)(in[1] & 0x3FU)) << 6U)
                                 |  ((unsigned int)(in[2] & 0x3FU)) );
                    if (c < 256)
                        *(out++) = c;
                }
                in += 3;
            } else
            if (*in < 248) {        /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
                if (in + 3 >= end)
                    break;
                if ((in[1] & 192U) == 128U &&
                    (in[2] & 192U) == 128U &&
                    (in[3] & 192U) == 128U) {
                    c = to_latin9( (((unsigned int)(in[0] & 0x07U)) << 18U)
                                 | (((unsigned int)(in[1] & 0x3FU)) << 12U)
                                 | (((unsigned int)(in[2] & 0x3FU)) << 6U)
                                 |  ((unsigned int)(in[3] & 0x3FU)) );
                    if (c < 256)
                        *(out++) = c;
                }
                in += 4;
            } else
            if (*in < 252) {        /* 111110xx followed by four 10xxxxxx bytes */
                if (in + 4 >= end)
                    break;
                if ((in[1] & 192U) == 128U &&
                    (in[2] & 192U) == 128U &&
                    (in[3] & 192U) == 128U &&
                    (in[4] & 192U) == 128U) {
                    c = to_latin9( (((unsigned int)(in[0] & 0x03U)) << 24U)
                                 | (((unsigned int)(in[1] & 0x3FU)) << 18U)
                                 | (((unsigned int)(in[2] & 0x3FU)) << 12U)
                                 | (((unsigned int)(in[3] & 0x3FU)) << 6U)
                                 |  ((unsigned int)(in[4] & 0x3FU)) );
                    if (c < 256)
                        *(out++) = c;
                }
                in += 5;
            } else
            if (*in < 254) {        /* 1111110x followed by five 10xxxxxx bytes */
                if (in + 5 >= end)
                    break;
                if ((in[1] & 192U) == 128U &&
                    (in[2] & 192U) == 128U &&
                    (in[3] & 192U) == 128U &&
                    (in[4] & 192U) == 128U &&
                    (in[5] & 192U) == 128U) {
                    c = to_latin9( (((unsigned int)(in[0] & 0x01U)) << 30U)
                                 | (((unsigned int)(in[1] & 0x3FU)) << 24U)
                                 | (((unsigned int)(in[2] & 0x3FU)) << 18U)
                                 | (((unsigned int)(in[3] & 0x3FU)) << 12U)
                                 | (((unsigned int)(in[4] & 0x3FU)) << 6U)
                                 |  ((unsigned int)(in[5] & 0x3FU)) );
                    if (c < 256)
                        *(out++) = c;
                }
                in += 6;
            } else
                in++;               /* 0xFE and 0xFF are never valid in UTF-8 */

        /* Always terminate the output string. */
        *out = '\0';

        return (size_t)(out - (unsigned char *)output);
    }
Note that you can add custom transliteration for specific code points in the to_latin9() function, but you are limited to one-character replacements.
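As a sketch of what such a customization might look like, the variant below adds a few extra cases to the same switch. The function name to_latin9_custom() and the choice of extra mappings (curly quotes and dashes to their ASCII look-alikes) are assumptions made for illustration; only the first eight cases come from to_latin9() above. As in the original, a return value of 256 means the code point is dropped.

```c
/* Sketch: to_latin9() extended with custom one-character transliterations.
 * The name and the extra mappings are illustrative assumptions. */
static unsigned int to_latin9_custom(const unsigned int code)
{
    /* Code points 0 to 255 are preserved as-is. */
    if (code < 256U)
        return code;
    switch (code) {
    /* Mappings taken from the original to_latin9() */
    case 0x0152U: return 188U;
    case 0x0153U: return 189U;
    case 0x0160U: return 166U;
    case 0x0161U: return 168U;
    case 0x0178U: return 190U;
    case 0x017DU: return 180U;
    case 0x017EU: return 184U;
    case 0x20ACU: return 164U;
    /* Custom transliterations, one output character each */
    case 0x2018U: case 0x2019U: return 39U; /* curly single quotes -> ' */
    case 0x201CU: case 0x201DU: return 34U; /* curly double quotes -> " */
    case 0x2013U: case 0x2014U: return 45U; /* en dash, em dash    -> - */
    default:      return 256U;              /* no equivalent: dropped */
    }
}
```

A replacement that needs more than one output character (for example, an ellipsis to three dots) does not fit this scheme, because the converter writes at most one byte per decoded code point.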
As currently written, the function can safely do in-place conversion: the input and output pointers can be the same. The output string will never be longer than the input string. If your input string has room for one extra byte (for example, it already has the NUL terminating the string), you can safely use the above function to convert it from UTF-8 to ISO-8859-1/15. I deliberately wrote it this way, because it should save you some effort in an embedded environment, although this approach is a bit limited with respect to customization and extension.
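The in-place use can be sketched as follows. To keep the example self-contained, it uses a shortened stand-in, utf8_to_latin9_short() (a hypothetical name), which decodes only sequences of up to three bytes and drops every other non-ASCII byte one at a time. That is enough for well-formed input, since every Latin-9 character fits in three bytes, but it is not the author's function, and real code should call the full utf8_to_latin9() above. The wrapper convert_in_place() is likewise an assumption for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Same mapping as to_latin9() above; 256 means "no Latin-9 equivalent". */
static unsigned int to_latin9(const unsigned int code)
{
    if (code < 256U)
        return code;
    switch (code) {
    case 0x0152U: return 188U;
    case 0x0153U: return 189U;
    case 0x0160U: return 166U;
    case 0x0161U: return 168U;
    case 0x0178U: return 190U;
    case 0x017DU: return 180U;
    case 0x017EU: return 184U;
    case 0x20ACU: return 164U;
    default:      return 256U;
    }
}

/* Shortened stand-in (hypothetical): handles 1- to 3-byte sequences and
 * drops any other byte. The output never gets ahead of the input, so
 * output and input may be the same buffer. */
static size_t utf8_to_latin9_short(char *const output, const char *const input,
                                   const size_t length)
{
    unsigned char             *out = (unsigned char *)output;
    const unsigned char       *in  = (const unsigned char *)input;
    const unsigned char *const end = in + length;
    unsigned int               c;

    while (in < end) {
        if (*in < 128) {
            *(out++) = *(in++);                 /* ASCII, copied as-is */
        } else if (*in >= 192 && *in < 224 && end - in >= 2 &&
                   (in[1] & 192U) == 128U) {    /* 110xxxxx 10xxxxxx */
            c = to_latin9(((unsigned int)(in[0] & 0x1FU) << 6) |
                           (unsigned int)(in[1] & 0x3FU));
            if (c < 256U)
                *(out++) = (unsigned char)c;
            in += 2;
        } else if (*in >= 224 && *in < 240 && end - in >= 3 &&
                   (in[1] & 192U) == 128U &&
                   (in[2] & 192U) == 128U) {    /* 1110xxxx 10xxxxxx 10xxxxxx */
            c = to_latin9(((unsigned int)(in[0] & 0x0FU) << 12) |
                          ((unsigned int)(in[1] & 0x3FU) << 6) |
                           (unsigned int)(in[2] & 0x3FU));
            if (c < 256U)
                *(out++) = (unsigned char)c;
            in += 3;
        } else {
            in++;   /* stray, truncated, or longer sequence: drop this byte */
        }
    }
    *out = '\0';    /* needs one byte beyond length; here the old NUL slot */
    return (size_t)(out - (unsigned char *)output);
}

/* In-place use: the same buffer is passed as both output and input.
 * The existing terminating NUL provides the required extra byte. */
static size_t convert_in_place(char *text)
{
    return utf8_to_latin9_short(text, text, strlen(text));
}
```

For a writable buffer holding the UTF-8 bytes of "10 €" (31 30 20 E2 82 AC), convert_in_place() returns 4 and leaves the Latin-9 bytes 31 30 20 A4 in the same buffer, followed by the NUL.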
Edit:
In an edit to this answer, I included a pair of conversion functions for Latin-1/9 to/from UTF-8 (ISO-8859-1 or -15 to/from UTF-8); the main difference is that those functions return a dynamically allocated copy and keep the original string intact.
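Those functions are not reproduced here. Purely to illustrate the "dynamically allocated copy" idea, and not as the author's code, a Latin-9 to UTF-8 conversion might be sketched like this. The function name latin9_to_utf8() and its exact interface are assumptions; the eight special cases are simply the inverse of the to_latin9() table above.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: convert a NUL-terminated ISO-8859-15 string into a
 * newly allocated UTF-8 string. The input is left untouched.
 * Returns NULL if allocation fails; the caller must free() the result. */
static char *latin9_to_utf8(const char *const input)
{
    const unsigned char *in = (const unsigned char *)input;
    const size_t length = strlen(input);
    /* Worst case: every input byte becomes three bytes, plus the NUL. */
    unsigned char *const result = malloc(3 * length + 1);
    unsigned char *out = result;
    unsigned int c;

    if (!result)
        return NULL;

    for (; *in; in++) {
        /* The eight positions where Latin-9 differs from Latin-1. */
        switch (*in) {
        case 164: c = 0x20ACU; break; /* Euro */
        case 166: c = 0x0160U; break; /* S with caron */
        case 168: c = 0x0161U; break; /* s with caron */
        case 180: c = 0x017DU; break; /* Z with caron */
        case 184: c = 0x017EU; break; /* z with caron */
        case 188: c = 0x0152U; break; /* OE ligature */
        case 189: c = 0x0153U; break; /* oe ligature */
        case 190: c = 0x0178U; break; /* Y with diaeresis */
        default:  c = *in;     break; /* same code point as in Latin-1 */
        }
        if (c < 0x80U) {
            *(out++) = (unsigned char)c;
        } else if (c < 0x800U) {
            *(out++) = (unsigned char)(0xC0U | (c >> 6));
            *(out++) = (unsigned char)(0x80U | (c & 0x3FU));
        } else {
            *(out++) = (unsigned char)(0xE0U | (c >> 12));
            *(out++) = (unsigned char)(0x80U | ((c >> 6) & 0x3FU));
            *(out++) = (unsigned char)(0x80U | (c & 0x3FU));
        }
    }
    *out = '\0';
    return (char *)result;
}
```

Unlike the in-place function, this direction can grow the string (up to three bytes per input byte), which is one reason to return a separately allocated copy.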
Nominal animal