Questions When Updating Scanner Code to Use ICU

I am working on a rudimentary, hand-coded lexical scanner and want to support UTF-8 input (it's not 1970 anymore!). Input characters are read one at a time from stdin or a file and buffered until whitespace appears, and so on. I considered writing my own wrapper for fgetc() that would instead return a char[] of the bytes making up one UTF-8 character, and then working with the result as a string. That would be easy enough, but it would be a slippery slope. I would rather not waste time reinventing the wheel and instead use an existing, tested library such as ICU. So I now have working non-UTF-8 code that uses fgetc(), isspace(), strcmp(), etc., which I am trying to upgrade to use ICU. This is my first foray into ICU. I have been reading the documentation and looking for usage examples with Google Code Search, but a few points still confuse me, and I hope someone can clarify them.

The function u_fgetc() returns a UChar, but u_fgetcx() returns a UChar32. The documentation recommends u_fgetcx() for reading code points, so that is where I am starting. I follow the same approach as above, but push UChar32 values onto the buffer instead of chars.

  • What is the correct way to compare a character with a known value? Initially, I could write if (c == '+') to check whether a plus sign had been read from the input. GCC does not complain when c is a UChar32 (which makes this a comparison between a UChar32 and a char), but is that really correct?

  • I was able to use strcmp() to compare the buffered characters with a known value, for example if (strcmp(buf, "else") == 0). ICU provides u_strcmp(), and I think I may need the macros U_STRING_DECL and U_STRING_INIT to specify a known literal, but I'm not sure. The documentation shows that they produce a UChar[], whereas I assume I need a UChar32[], and I'm not sure how to use them correctly. Any guidance here would be welcome.

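  • After reading a series of digits, I converted them from a string with strtol(). Is there a comparable function in ICU that can convert a run of numeric characters held in a UChar32[]?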

+5
2 answers

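A UChar holds a UTF-16 code unit, while a UChar32 holds a whole code point. If you do not need characters outside the Basic Multilingual Plane (BMP), you can stay with UChar, which is convenient because ICU's string functions work with UChar[].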
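To make the code unit versus code point distinction concrete, here is a minimal sketch in plain C that encodes one code point as UTF-16. It deliberately does not use ICU: the typedefs `unit16` and `codepoint` and the function `encode_utf16` are hypothetical stand-ins for UChar, UChar32, and whatever ICU does internally. It only illustrates why a BMP character fits in one 16-bit unit while anything above U+FFFF needs a surrogate pair.

```c
#include <stdint.h>

/* Hypothetical stand-ins, so the sketch builds without ICU headers:
   unit16 plays the role of UChar, codepoint the role of UChar32. */
typedef uint16_t unit16;
typedef int32_t  codepoint;

/* Encode one code point as UTF-16. Writes 1 or 2 units to out and
   returns the count, or 0 if cp is not a valid Unicode scalar value. */
static int encode_utf16(codepoint cp, unit16 out[2])
{
    if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                              /* out of range or a surrogate */
    if (cp < 0x10000) {                        /* BMP: one unit is enough */
        out[0] = (unit16)cp;
        return 1;
    }
    cp -= 0x10000;                             /* 20 bits remain */
    out[0] = (unit16)(0xD800 + (cp >> 10));    /* high (lead) surrogate */
    out[1] = (unit16)(0xDC00 + (cp & 0x3FF));  /* low (trail) surrogate */
    return 2;
}
```

A scanner that only ever sees BMP input can treat each 16-bit unit as a character. One that must handle supplementary characters has to be prepared for the two-unit case.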

The ICU User Guide is worth reading for the details.

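  • How do you compare a Unicode character with a known value? Comparing against a character literal (with either UChar or UChar32) works as long as the source character set is ASCII-compatible. Alternatively, C99 (section 6.4.3) provides universal character names: \u followed by four hex digits or \U followed by eight, which identify a character by its ISO/IEC 10646 short identifier. The value must not be less than 0x00a0 (except for 0x0024 '$', 0x0040 '@', and 0x0060 (backtick)), and it has to fit in the type you compare against (such as UChar). It also must not lie in the range 0xd800 to 0xdfff (the UTF-16 surrogates).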

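  • How do you compare a string with a known Unicode value? U_STRING_DECL and U_STRING_INIT are indeed the macros for this, and u_strcmp() does the comparison. (As noted above, ICU works with UChar[].) If you are using C++ rather than C, UNICODE_STRING_SIMPLE (together with getTerminatedBuffer(), if you need a UChar[]) builds a Unicode string from a literal.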

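  • As for converting digits to a number, is there an ICU counterpart to strtol()? unum_parse() in unum.h looks like what you need.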
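As a rough illustration of the keyword comparison discussed above, the following plain C sketch compares a NUL-terminated buffer of 16-bit code units against an ASCII keyword by widening each byte. It is not ICU code: `unit16` is a hypothetical stand-in for UChar and `equals_ascii` is a made-up helper. With ICU itself you would more likely declare the literal with U_STRING_DECL / U_STRING_INIT and call u_strcmp().

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t unit16;   /* hypothetical stand-in for ICU's UChar */

/* Return 1 if the NUL-terminated 16-bit buffer equals the ASCII keyword,
   0 otherwise. Assumes keyword contains only ASCII characters. */
static int equals_ascii(const unit16 *buf, const char *keyword)
{
    size_t i = 0;
    for (; keyword[i] != '\0'; i++) {
        /* ASCII characters have the same numeric value in Unicode,
           so widening the byte gives the matching code unit. A buffer
           that ends early fails here because 0 never equals the byte. */
        if (buf[i] != (unit16)(unsigned char)keyword[i])
            return 0;
    }
    return buf[i] == 0;    /* the buffer must end where the keyword ends */
}
```

In a scanner this would be called on the token buffer once whitespace is seen, for example `equals_ascii(buf, "else")`.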

+5
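  • The Unicode code point for PLUS SIGN is U+002B, and the ASCII (Latin-1) '+' is also 0x2B (octal 053, decimal 43). So the comparison is correct as long as your source character set is ASCII or one of the ISO-8859-x sets. C99 provides Unicode (universal) character names such as \u0123 and \U00102345 (with 4 or 8 hex digits), but, as noted above, only from \u00A0 upward, so you cannot write \u002B. For ASCII characters, then, the plain character literal is what you use.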

    If you are concerned about this, you could define named constants with an enum, for example:

     enum { PLUS_SIGN = '+' };
    

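    The named constants can then be used in place of the literals.

    Also note that, according to the Strings chapter of the ICU User Guide, ICU strings are UTF-16, and UTF-32 is rarely used for strings.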

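  • In C you might be tempted to use something like wcscmp(buf, L"else"), but that only works where wchar_t is the same type as uint32_t / UChar32. Otherwise, in C++, you can build a UnicodeString with UNICODE_STRING("...") and call toUTF32() to obtain UTF-32. That is more work than staying with UChar[].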

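  • As for converting "numeric" strings, ICU does not appear to have a direct counterpart to strtol(). For parsing numbers, look at NumberFormat.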
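The two points above (named constants for ASCII characters, and turning a run of digits into a number) can be sketched together in plain C. This is only an illustration under stated assumptions: `codepoint` is a hypothetical stand-in for UChar32, `parse_decimal` is a made-up helper, and it recognizes ASCII digits only. It is not a replacement for ICU's locale-aware number parsing (unum_parse() or NumberFormat).

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t codepoint;   /* hypothetical stand-in for ICU's UChar32 */

/* Named constants for the ASCII characters the scanner cares about. */
enum { PLUS_SIGN = '+', DIGIT_ZERO = '0', DIGIT_NINE = '9' };

/* Parse leading ASCII digits from buf[0..len) into *value.
   Returns the number of code points consumed; returns 0 (and leaves
   *value untouched) if there are no digits or the result would
   overflow a long. */
static size_t parse_decimal(const codepoint *buf, size_t len, long *value)
{
    long acc = 0;
    size_t i = 0;
    while (i < len && buf[i] >= DIGIT_ZERO && buf[i] <= DIGIT_NINE) {
        int d = (int)(buf[i] - DIGIT_ZERO);
        if (acc > (LONG_MAX - d) / 10)
            return 0;                 /* acc * 10 + d would overflow */
        acc = acc * 10 + d;
        i++;
    }
    if (i > 0)
        *value = acc;
    return i;
}
```

Given the code points for `42+7`, the helper consumes two code points, and the scanner can then test the next one with `buf[i] == PLUS_SIGN`.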

+2
