Is there any way to convert from UTF8 to iso-8859-1?

Question

My software receives some strings in UTF-8 that I need to convert to ISO 8859-1. I know that the UTF-8 domain is bigger than ISO 8859-1, but the UTF-8 data was previously converted from ISO 8859-1, so I should not lose anything.

I would like to know if there is a simple / direct way to convert from UTF8 to iso-8859-1.

thanks

+7

c linux embedded utf-8 character-encoding

fazineroso Jun 22 '12 at 12:47

2 answers

iconv - perform character set conversion
size_t iconv(iconv_t cd, char **inbuf, size_t *inbytesleft, char **outbuf, size_t *outbytesleft);
iconv_t iconv_open(const char *tocode, const char *fromcode);

Here tocode is "ISO_8859-1" and fromcode is "UTF-8".

Working example:

    #include <iconv.h>
    #include <stdio.h>

    int main (void) {
        iconv_t cd = iconv_open("ISO_8859-1", "UTF-8");
        if (cd == (iconv_t) -1) {
            perror("iconv_open failed!");
            return 1;
        }

        char input[] = "Test äöü";
        char *in_buf = &input[0];
        size_t in_left = sizeof(input) - 1;

        /* Leave room for the terminating NUL byte. */
        char output[32];
        char *out_buf = &output[0];
        size_t out_left = sizeof(output) - 1;

        do {
            if (iconv(cd, &in_buf, &in_left, &out_buf, &out_left) == (size_t) -1) {
                perror("iconv failed!");
                return 1;
            }
        } while (in_left > 0 && out_left > 0);
        *out_buf = 0;

        iconv_close(cd);

        printf("%s -> %s\n", input, output);
        return 0;
    }
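If the conversion is needed in more than one place, the same calls can be wrapped in a function. The following is only a sketch under assumptions: the name utf8_to_latin1_iconv and its parameters are made up for illustration, and it uses the "ISO-8859-1" spelling of the charset name. It reports failure through the return value instead of exiting, and always NUL-terminates whatever was converted.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper (not part of iconv itself): convert in_len bytes of
 * UTF-8 from `in` into ISO-8859-1 in `out`, which holds out_size bytes.
 * Returns 0 on success, -1 on failure with errno set by iconv_open()/iconv().
 * The output is always NUL-terminated; *out_len receives its length. */
int utf8_to_latin1_iconv(const char *in, size_t in_len,
                         char *out, size_t out_size, size_t *out_len)
{
    iconv_t cd;
    char   *in_ptr = (char *)in;  /* iconv() takes char **, but only reads the input */
    char   *out_ptr = out;
    size_t  in_left = in_len;
    size_t  out_left;
    int     saved_errno = 0;

    if (out_size == 0) {
        errno = E2BIG;
        return -1;
    }
    out_left = out_size - 1;      /* reserve one byte for the NUL */

    cd = iconv_open("ISO-8859-1", "UTF-8");
    if (cd == (iconv_t)-1)
        return -1;

    if (iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left) == (size_t)-1)
        saved_errno = errno;      /* typically E2BIG, EILSEQ or EINVAL */

    iconv_close(cd);

    *out_ptr = '\0';
    if (out_len)
        *out_len = (size_t)(out_ptr - out);

    if (saved_errno) {
        errno = saved_errno;
        return -1;
    }
    return 0;
}
```

A caller would pass strlen(input) as in_len and sizeof(output) as out_size, then check the return value before using the output.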

+11

kay Jun 22 '12 at 12:49

Nominal animal · Accepted Answer · 2012-06-23T22:31:24+0000

Here is a function you might find useful: utf8_to_latin9(). It converts to ISO-8859-15 (including the EURO sign, which ISO-8859-1 does not have), but also works correctly for the UTF-8 → ISO-8859-1 conversion part of an ISO-8859-1 → UTF-8 → ISO-8859-1 round-trip.

The function ignores invalid code points, similar to the //IGNORE flag for iconv, but does not recompose decomposed UTF-8 sequences; that is, it will not turn U+006E U+0303 into U+00F1. I did not bother with recomposition, because iconv does not do it either.
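If recomposition were needed, one option would be a small lookup applied to decoded code points before mapping them to Latin-1. The sketch below is an assumption of mine, not part of the function in this answer: compose_latin1() is a hypothetical helper, and its table lists only a handful of base + combining mark pairs as examples.

```c
#include <stddef.h>

/* Hypothetical helper: return the precomposed Latin-1 code point for a
 * base character followed by a combining mark, or 0 if the pair is not
 * in the (deliberately tiny, illustrative) table. */
unsigned int compose_latin1(unsigned int base, unsigned int mark)
{
    static const struct {
        unsigned int base, mark, composed;
    } table[] = {
        { 0x0061U, 0x0308U, 0x00E4U }, /* a + combining diaeresis -> ä */
        { 0x006FU, 0x0308U, 0x00F6U }, /* o + combining diaeresis -> ö */
        { 0x0075U, 0x0308U, 0x00FCU }, /* u + combining diaeresis -> ü */
        { 0x0065U, 0x0301U, 0x00E9U }, /* e + combining acute     -> é */
        { 0x006EU, 0x0303U, 0x00F1U }, /* n + combining tilde     -> ñ */
    };
    size_t i;

    for (i = 0; i < sizeof table / sizeof table[0]; i++)
        if (table[i].base == base && table[i].mark == mark)
            return table[i].composed;

    return 0; /* no known composition */
}
```

A converter using this would have to look one code point ahead, which is why it does not fit the simple byte-by-byte loop used here.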

The function is very careful about string access. It will never scan beyond the buffer. The output buffer must be one byte longer than length, because the function always appends the end-of-string NUL byte. The function returns the number of characters (bytes) in the output, not counting the end-of-string NUL byte.

    /* UTF-8 to ISO-8859-1/ISO-8859-15 mapper.
     * Return 0..255 for valid ISO-8859-15 code points, 256 otherwise.
    */
    static inline unsigned int to_latin9(const unsigned int code)
    {
        /* Code points 0 to U+00FF are passed through unchanged.
         * (They are identical in ISO-8859-1; ISO-8859-15 redefines eight of them.) */
        if (code < 256U)
            return code;
        switch (code) {
        case 0x0152U: return 188U; /* U+0152 = 0xBC: OE ligature */
        case 0x0153U: return 189U; /* U+0153 = 0xBD: oe ligature */
        case 0x0160U: return 166U; /* U+0160 = 0xA6: S with caron */
        case 0x0161U: return 168U; /* U+0161 = 0xA8: s with caron */
        case 0x0178U: return 190U; /* U+0178 = 0xBE: Y with diaeresis */
        case 0x017DU: return 180U; /* U+017D = 0xB4: Z with caron */
        case 0x017EU: return 184U; /* U+017E = 0xB8: z with caron */
        case 0x20ACU: return 164U; /* U+20AC = 0xA4: Euro */
        default:      return 256U;
        }
    }

    /* Convert an UTF-8 string to ISO-8859-15.
     * All invalid sequences are ignored.
     * Note: output == input is allowed,
     * but   input < output < input + length
     * is not.
     * Output has to have room for (length+1) chars, including the trailing NUL byte.
    */
    size_t utf8_to_latin9(char *const output, const char *const input, const size_t length)
    {
        unsigned char             *out = (unsigned char *)output;
        const unsigned char       *in  = (const unsigned char *)input;
        const unsigned char *const end = (const unsigned char *)input + length;
        unsigned int               c;

        while (in < end) {
            if (*in < 128)
                *(out++) = *(in++); /* Valid codepoint */
            else if (*in < 192)
                in++;               /* 10000000 .. 10111111 are invalid */
            else if (*in < 224) {   /* 110xxxxx 10xxxxxx */
                if (in + 1 >= end)
                    break;
                if ((in[1] & 192U) == 128U) {
                    c = to_latin9( (((unsigned int)(in[0] & 0x1FU)) << 6U)
                                 |  ((unsigned int)(in[1] & 0x3FU)) );
                    if (c < 256)
                        *(out++) = c;
                }
                in += 2;
            } else if (*in < 240) { /* 1110xxxx 10xxxxxx 10xxxxxx */
                if (in + 2 >= end)
                    break;
                if ((in[1] & 192U) == 128U &&
                    (in[2] & 192U) == 128U) {
                    c = to_latin9( (((unsigned int)(in[0] & 0x0FU)) << 12U)
                                 | (((unsigned int)(in[1] & 0x3FU)) << 6U)
                                 |  ((unsigned int)(in[2] & 0x3FU)) );
                    if (c < 256)
                        *(out++) = c;
                }
                in += 3;
            } else if (*in < 248) { /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
                if (in + 3 >= end)
                    break;
                if ((in[1] & 192U) == 128U &&
                    (in[2] & 192U) == 128U &&
                    (in[3] & 192U) == 128U) {
                    c = to_latin9( (((unsigned int)(in[0] & 0x07U)) << 18U)
                                 | (((unsigned int)(in[1] & 0x3FU)) << 12U)
                                 | (((unsigned int)(in[2] & 0x3FU)) << 6U)
                                 |  ((unsigned int)(in[3] & 0x3FU)) );
                    if (c < 256)
                        *(out++) = c;
                }
                in += 4;
            } else if (*in < 252) { /* 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx */
                if (in + 4 >= end)
                    break;
                if ((in[1] & 192U) == 128U &&
                    (in[2] & 192U) == 128U &&
                    (in[3] & 192U) == 128U &&
                    (in[4] & 192U) == 128U) {
                    c = to_latin9( (((unsigned int)(in[0] & 0x03U)) << 24U)
                                 | (((unsigned int)(in[1] & 0x3FU)) << 18U)
                                 | (((unsigned int)(in[2] & 0x3FU)) << 12U)
                                 | (((unsigned int)(in[3] & 0x3FU)) << 6U)
                                 |  ((unsigned int)(in[4] & 0x3FU)) );
                    if (c < 256)
                        *(out++) = c;
                }
                in += 5;
            } else if (*in < 254) { /* 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx */
                if (in + 5 >= end)
                    break;
                if ((in[1] & 192U) == 128U &&
                    (in[2] & 192U) == 128U &&
                    (in[3] & 192U) == 128U &&
                    (in[4] & 192U) == 128U &&
                    (in[5] & 192U) == 128U) {
                    c = to_latin9( (((unsigned int)(in[0] & 0x01U)) << 30U)
                                 | (((unsigned int)(in[1] & 0x3FU)) << 24U)
                                 | (((unsigned int)(in[2] & 0x3FU)) << 18U)
                                 | (((unsigned int)(in[3] & 0x3FU)) << 12U)
                                 | (((unsigned int)(in[4] & 0x3FU)) << 6U)
                                 |  ((unsigned int)(in[5] & 0x3FU)) );
                    if (c < 256)
                        *(out++) = c;
                }
                in += 6;
            } else
                in++;               /* 11111110 and 11111111 are invalid */
        }

        /* Terminate the output string. */
        *out = '\0';

        return (size_t)(out - (unsigned char *)output);
    }

Please note that you can add custom transliteration for specific code points in the to_latin9() function, but you are limited to a one-character replacement.
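As an illustration of such a transliteration, here is a sketch of a mapper with a few extra cases. The name to_latin9_translit() and the choice of replacements (typographic quotes and dashes mapped to their ASCII look-alikes) are my assumptions, not something the original function does; most of the Latin-9 cases are left out to keep it short.

```c
/* Hypothetical variant of a Latin-9 mapper with custom transliteration.
 * Returns 0..255 for code points it can represent, 256 otherwise.
 * Only the Euro case is shown from the Latin-9 specific mappings. */
unsigned int to_latin9_translit(const unsigned int code)
{
    /* Code points below 256 are passed through unchanged. */
    if (code < 256U)
        return code;

    switch (code) {
    case 0x20ACU: return 164U;  /* Euro sign -> 0xA4 */

    /* Custom one-character transliterations (illustrative choices): */
    case 0x2018U:               /* left single quotation mark  */
    case 0x2019U: return 0x27U; /* right single quotation mark -> ' */
    case 0x201CU:               /* left double quotation mark  */
    case 0x201DU: return 0x22U; /* right double quotation mark -> " */
    case 0x2013U:               /* en dash */
    case 0x2014U: return 0x2DU; /* em dash -> - */

    default:      return 256U;  /* not representable: caller skips it */
    }
}
```

Because each case returns exactly one byte, the guarantee that the output is never longer than the input still holds.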

As currently written, the function can safely do in-place conversion: the input and output pointers can be the same. The output string will never be longer than the input string. If your input string has room for one extra byte (for example, it already has the NUL terminating the string), you can safely use the above function to convert it from UTF-8 to ISO-8859-1/15. I deliberately wrote it this way, because it should save you some effort in an embedded environment, although this approach is a bit limited with regard to customization and extension.
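To make the in-place idea concrete, here is a stripped-down sketch for the pure ISO-8859-1 case. It is not the function above: utf8_to_latin1_inplace() is a hypothetical name, and it relies on the fact that code points up to U+00FF need at most two UTF-8 bytes, with lead byte 0xC2 or 0xC3 for U+0080 to U+00FF. Everything else is skipped one byte at a time.

```c
#include <stddef.h>

/* Hypothetical, simplified in-place converter: UTF-8 -> ISO-8859-1 only.
 * `s` must have room for length + 1 bytes (the NUL is always written).
 * Returns the number of bytes in the result, not counting the NUL. */
size_t utf8_to_latin1_inplace(char *s, size_t length)
{
    unsigned char       *in  = (unsigned char *)s;
    unsigned char       *out = (unsigned char *)s;
    const unsigned char *end = in + length;

    while (in < end) {
        if (*in < 0x80) {
            /* Plain ASCII: copy as is. */
            *out++ = *in++;
        } else if ((*in == 0xC2 || *in == 0xC3) &&
                   in + 1 < end && (in[1] & 0xC0) == 0x80) {
            /* 110000xx 10xxxxxx encodes U+0080..U+00FF. */
            *out++ = (unsigned char)(((in[0] & 0x1F) << 6) | (in[1] & 0x3F));
            in += 2;
        } else {
            /* Stray continuation byte, or part of a sequence that
             * Latin-1 cannot represent: skip this byte. */
            in++;
        }
    }

    *out = '\0';
    return (size_t)(out - (unsigned char *)s);
}
```

The write position never gets ahead of the read position, which is what makes converting within the same buffer safe. For a NUL-terminated string, pass strlen(s) as length; the terminator's own byte provides the extra room.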

Edit:

In an edit to this answer, I included a pair of functions for converting Latin-1/9 to and from UTF-8 (ISO-8859-1 or -15 to/from UTF-8); the main difference is that those functions return a dynamically allocated copy and keep the original string intact.
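Those functions are not reproduced here. As a rough idea of what the reverse direction involves, the following is a minimal sketch of my own for ISO-8859-1 only (not Latin-9); the name latin1_to_utf8_dup() is hypothetical. Each Latin-1 byte becomes at most two UTF-8 bytes, so allocating twice the input length plus one is always enough.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a newly allocated UTF-8 copy of a
 * NUL-terminated ISO-8859-1 string, or NULL if allocation fails.
 * The caller owns the result and must free() it. */
char *latin1_to_utf8_dup(const char *input)
{
    const unsigned char *in = (const unsigned char *)input;
    const size_t         len = strlen(input);
    unsigned char       *result = malloc(2 * len + 1); /* worst case: 2 bytes each */
    unsigned char       *out = result;

    if (!result)
        return NULL;

    for (; *in; in++) {
        if (*in < 0x80) {
            *out++ = *in;                                /* ASCII: one byte */
        } else {
            *out++ = (unsigned char)(0xC0 | (*in >> 6)); /* 110000xx */
            *out++ = (unsigned char)(0x80 | (*in & 0x3F)); /* 10xxxxxx */
        }
    }
    *out = '\0';

    return (char *)result;
}
```

Unlike the in-place conversion, this direction can grow the string, which is why a separate buffer is needed.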
