Multibyte (non-ASCII) character handling in C

I am trying to implement my own version of wc (the Unix filter), but I have a problem with non-ASCII characters. I made a hex dump of a text file and found that these characters occupy more than one byte, so they will not fit in a char. Is there any way in C to read these characters from a file and treat each one as a single character (so that I can count the characters in the file)? I searched a bit and found a type called wchar_t, but there was no simple example of how to use it with files.

+5
5 answers

wchar_t, , .

. , , , .

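In short: if the file is UTF-8 (the most common Unicode encoding, and backward-compatible with ASCII), C itself knows nothing specific about UTF-8, so you need a library (or your own code) to decode it.

The usual answer to "How do I handle Unicode in C?" is ICU. Its ustdio.h header provides u_fgetc, which reads Unicode characters from a file, along with other stdio-like functions, all prefixed with u_.

Also, if you have not already, read the (excellent!) article on Unicode at Joel On Software.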

ICU, , , : -)

+8

If you want to write wc in standard C, you can use the wchar_t versions of the stdio functions. At program startup, call setlocale():

setlocale(LC_CTYPE, "");

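This makes the wide-character functions use the character set defined by the environment; on Unix-like systems that is the LANG environment variable. For example, if LANG is set to a UTF8 locale, the wide-character functions will read and write UTF8. (This is how the POSIX wc utility is specified to work.)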

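You can then use the wide-character versions of the standard functions. For example, if you have code like this: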

long words = 0;
int in_word = 0;
int c;

while ((c = getchar()) != EOF)
{
    if (isspace(c))
    {
        if (in_word)
        {
            in_word = 0;
            words++;
        }
    }
    else
    {
        in_word = 1;
    }
}

if (in_word)
    words++;    /* count a final word not followed by whitespace */

... you can convert it to the wide-character version by changing c to wint_t, getchar() to getwchar(), EOF to WEOF and isspace() to iswspace():

long words = 0;
int in_word = 0;
wint_t c;

while ((c = getwchar()) != WEOF)
{
    if (iswspace(c))
    {
        if (in_word)
        {
            in_word = 0;
            words++;
        }
    }
    else
    {
        in_word = 1;
    }
}

if (in_word)
    words++;    /* count a final word not followed by whitespace */
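
As a rough sketch of how the wide-character loop could grow into the counting core of a wc clone, the function below tallies lines, words and characters from a stream. The names wc_counts and wc_count_stream are made up for this illustration and are not part of any library; the sketch assumes the caller has already called setlocale(LC_CTYPE, "") so that multibyte input is decoded according to the environment.

```c
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical result type for this sketch. */
struct wc_counts {
    long lines;
    long words;
    long chars;   /* wide characters, not bytes */
};

/* Hypothetical helper: reads wide characters from `in` until WEOF.
 * Note that fgetwc() also returns WEOF on an encoding error, so a real
 * tool would check ferror(in) and errno afterwards. */
struct wc_counts wc_count_stream(FILE *in)
{
    struct wc_counts n = {0, 0, 0};
    int in_word = 0;
    wint_t c;

    while ((c = fgetwc(in)) != WEOF) {
        n.chars++;
        if (c == L'\n')
            n.lines++;
        if (iswspace(c)) {
            if (in_word) {
                in_word = 0;
                n.words++;
            }
        } else {
            in_word = 1;
        }
    }
    if (in_word)          /* input ended in the middle of a word */
        n.words++;
    return n;
}
```

A main() built on this would call setlocale(LC_CTYPE, "") once, then wc_count_stream(stdin), and print the three fields.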
+4

Go with ICU.

+2

Which solution fits depends on what you want to do:

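  • To read characters in the encoding of the current locale, getwchar() is the right tool.
  • If you already have the bytes in a buffer, mbrtowc will convert them.
  • If you only care about UTF-8, it is simpler still. To count characters in UTF-8, count the bytes in the ranges 00-7F and C2-F4 and skip all others. This gives the number of Unicode code points.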

, .

+1

Are you sure you really need the number of characters? wc counts the number of bytes.

~$ echo 'דניאל' > hebrew.txt
~$ wc hebrew.txt 
 1  1 11 hebrew.txt

(11 bytes = 5 characters × 2 bytes each + 1 byte for '\n')

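However, if you really do need to count characters rather than bytes, and you can assume the text is UTF-8, the easiest approach is to count every byte that is not a continuation byte (i.e. not in the range 0x80 to 0xBF).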
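
Counting UTF-8 characters by skipping continuation bytes (0x80 to 0xBF) could be sketched as below. The function name utf8_count_codepoints is made up for this example, and the code assumes the input is valid UTF-8; it performs no validation.

```c
#include <stddef.h>

/* Hypothetical helper: counts code points in a buffer assumed to hold
 * valid UTF-8. Continuation bytes have the bit pattern 10xxxxxx
 * (0x80..0xBF); every other byte starts a new code point. */
size_t utf8_count_codepoints(const unsigned char *buf, size_t len)
{
    size_t count = 0;

    for (size_t i = 0; i < len; i++) {
        if ((buf[i] & 0xC0) != 0x80)   /* not a continuation byte */
            count++;
    }
    return count;
}
```

Applied to the 11 bytes of hebrew.txt above, this counts 6 code points: the five letters plus the newline.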

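If you cannot assume UTF-8, but can assume that any non-UTF-8 file is in a single-byte encoding, run a UTF-8 validity check on the data. If it passes, return the number of UTF-8 lead bytes. If it fails, return the total number of bytes.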

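(Note that this approach is specific to wc. If you are doing anything with the characters beyond counting them, you will need to know the encoding.)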

0
