Multibyte (non-ASCII) character handling in C

Question

Multibyte (non-ASCII) character handling in C

I am trying to write my own version of wc (the Unix filter), but I have a problem with non-ASCII characters. I made a hex dump of a text file and found that these characters occupy more than one byte, so they will not fit in a char. Is there any way I can read these characters from a file and treat each one as a single character (so that I can count the characters in the file) in C? I searched a bit and found a type called wchar_t, but there was no simple example of how to use it with files.

+5

c string file character

user561838 Jan 03 '11 at 22:17


5 answers

Joey Adams · Answer 1 · 2011-01-03T22:48:28+0000

wchar_t, , .

. , , , .

: UTF-8 ( Unicode, , ASCII), C UTF-8, ( ) .

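If your question is "How do I handle Unicode in C?", take a look at ICU. Its ustdio.h header provides u_fgetc for reading Unicode characters from a file, along with other stdio-style functions prefixed with u_.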

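Also, if you have not read it yet, see "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" on Joel On Software.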

ICU, , , : -)

caf · Answer 2 · 2011-01-03T23:09:50+0000

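If you want to write a standard C version of wc, you can use the wide-character (wchar_t) versions of the stdio functions. At program startup, call setlocale():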

setlocale(LC_CTYPE, "");

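This makes the wide-character functions use the character encoding specified by the environment the program runs in. On Unix-like systems that usually comes from the LANG environment variable. For example, if LANG names a UTF-8 locale, the wide-character functions will decode the input as UTF-8. (This is how the POSIX wc utility is specified to behave.)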

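You can then use the wide-character versions of the stdio functions. For example, if you have a word-counting loop like this: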

long words = 0;
int in_word = 0;
int c;

while ((c = getchar()) != EOF)
{
    if (isspace(c))
    {
        if (in_word)
        {
            in_word = 0;
            words++;
        }
    }
    else
    {
        in_word = 1;
    }
}

... then change c to wint_t, getchar() to getwchar(), EOF to WEOF and isspace() to iswspace():

long words = 0;
int in_word = 0;
wint_t c;

while ((c = getwchar()) != WEOF)
{
    if (iswspace(c))
    {
        if (in_word)
        {
            in_word = 0;
            words++;
        }
    }
    else
    {
        in_word = 1;
    }
}
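
To tie the setlocale() call and the wide-character loop together, here is a minimal sketch that counts characters, words and lines from any stream. The names struct wc_counts and count_wide are hypothetical helpers invented for this illustration; they do not come from the answer. Unlike the loops above, this sketch counts a word when it starts, so a final word with no trailing whitespace is still counted.

```c
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical result type for this sketch. */
struct wc_counts {
    long chars;
    long words;
    long lines;
};

/*
 * Read wide characters from fp until WEOF and tally them.
 * The caller is assumed to have called setlocale(LC_CTYPE, "")
 * beforehand if the input may contain non-ASCII text.
 */
struct wc_counts count_wide(FILE *fp)
{
    struct wc_counts n = {0, 0, 0};
    int in_word = 0;
    wint_t c;

    while ((c = getwc(fp)) != WEOF) {
        n.chars++;
        if (c == L'\n')
            n.lines++;
        if (iswspace(c)) {
            in_word = 0;
        } else if (!in_word) {
            /* Transition from whitespace into a word. */
            in_word = 1;
            n.words++;
        }
    }
    return n;
}
```

A main() would typically call setlocale(LC_CTYPE, "") (declared in locale.h) and then count_wide(stdin). Note that getwc() returns WEOF both at end of file and on an encoding error, so a careful program would check ferror(fp) afterwards to tell the two apart.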

bmargulies · Answer 3 · 2011-01-03T22:26:04+0000

Go look at ICU.

R.. · Answer 4 · 2011-01-04T02:42:45+0000

, , , :

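If you are happy to work with wide characters, getwchar() and the related functions will do.
If you would rather read bytes and decode them yourself, mbrtowc is the function to use.
If you only need to handle UTF-8, it is simpler still. To count characters in UTF-8, count the bytes in the ranges 00-7F and C2-F4, which are the only bytes that can begin a character. For valid UTF-8, that count is the number of Unicode code points.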
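
As an illustration of the mbrtowc option, the following sketch counts the characters in a byte buffer using the encoding of the current locale. The function name count_mb_chars and its error convention (returning (size_t)-1) are assumptions made for this example only.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Count the multibyte characters in buf[0..len) according to the
 * current LC_CTYPE locale. Returns (size_t)-1 if the buffer contains
 * an invalid sequence or ends in the middle of a character.
 */
size_t count_mb_chars(const char *buf, size_t len)
{
    mbstate_t state;
    size_t count = 0;

    memset(&state, 0, sizeof state);   /* start in the initial shift state */

    while (len > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, buf, len, &state);

        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;         /* invalid or truncated sequence */
        if (n == 0)
            n = 1;                     /* embedded NUL: assumed to occupy one byte */

        buf += n;
        len -= n;
        count++;
    }
    return count;
}
```

When reading a file in chunks, a real program would keep the mbstate_t across calls and treat a (size_t)-2 result as "need more bytes" rather than as an error; this sketch treats the buffer as the complete input to stay short.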

, .

dan04 · Answer 5 · 2011-01-04T00:37:58+0000

Are you sure you need the number of characters? wc counts bytes.

~$ echo 'דניאל' > hebrew.txt
~$ wc hebrew.txt 
 1  1 11 hebrew.txt

(11 bytes = 5 two-byte characters + 1 byte for '\n')

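That said, if you do want to count characters rather than bytes, and your text is UTF-8, the easiest way is to count every byte that is not a continuation byte (i.e. not in the range 0x80 to 0xBF).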
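
The rule about bytes in the range 0x80 to 0xBF can be sketched in a few lines of C. The function name utf8_count_chars is hypothetical, and the sketch assumes the input is already valid UTF-8; it does no validation, and it counts code points rather than user-perceived characters.

```c
#include <stddef.h>

/*
 * Count UTF-8 code points in buf[0..len) by skipping continuation
 * bytes, which always have the bit pattern 10xxxxxx (0x80 to 0xBF).
 * Assumes the input is valid UTF-8.
 */
size_t utf8_count_chars(const unsigned char *buf, size_t len)
{
    size_t count = 0;

    for (size_t i = 0; i < len; i++) {
        if ((buf[i] & 0xC0) != 0x80)   /* not a continuation byte */
            count++;
    }
    return count;
}
```

For the hebrew.txt example above, the 11 bytes contain 5 lead bytes, 5 continuation bytes and 1 newline, so this function would report 6 characters.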

UTF-8, , , UTF-8, , UTF-8 . , UTF-8. , .

( , wc. - , , .)
