How do you deal with signed char → int problems with the standard library?

This is a long-standing problem in my work, and I realize I still don't have a good solution to it...

C naively defined all of its character classification functions to take an int:

int isspace(int ch); 

But char is often signed, and a full character often doesn't fit in an int, or in any single storage unit used for strings.**

And these functions have been the logical template for current C++ functions and methods, and have set the stage for the current standard library. In fact, they're still supported, afaict.

So if you hand isspace(*pchar) a value, you can run into sign-extension problems. They're hard to spot, and from there they're hard to guard against, in my experience.
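For instance, a minimal sketch of the failure mode (the byte 0xE9 is just an example; it's 'é' in Latin-1):

    #include <cctype>
    #include <cstdio>

    int main()
    {
        char ch = '\xE9';          // value -23 where char is a signed 8-bit type
        // std::isspace(ch);       // WRONG: sign-extends to -23, undefined behavior
        std::printf("%d\n", std::isspace((unsigned char)ch));  // OK: zero-extends to 233
    }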

Similarly, because isspace() and its ilk all take ints, and because the actual width of a character is often unknown without string analysis (meaning that modern character libraries should essentially never pass around chars or wchar_ts, but only pointers/iterators, since only by analyzing the character stream can you know how many units compose a single logical character), I am at a bit of a loss as to how best to approach these issues.

I keep expecting a genuinely robust library that abstracts away the size of any one character and works only with strings (providing things like isspace, etc.), but either I've missed it, or there's another, simpler solution staring me in the face that all of you (who know what you're doing) are using...


** These issues don't come up with fixed-size encodings that can wholly contain a full character. UTF-32 is apparently about the only option with that property (or specialized environments that restrict themselves to ASCII or some such).


So my question is:

"How do you check spaces, fingerprints, etc. so that you don't suffer from two problems:

1) sign extension, and
2) variable-width character issues?

After all, most commonly used character encodings are variable-width: UTF-7, UTF-8, UTF-16, as well as older standards such as Shift-JIS. Even extended ASCII has the simple sign-extension problem if the compiler treats char as a signed 8-bit unit.

Note:

No matter what size your char_type is, it's wrong for most character-encoding schemes.

This problem exists in the standard C library as well as in the standard C++ library, which still passes around char and wchar_t, rather than string iterators, in its various implementations of isspace, isprint, etc.

Actually, it's precisely those types of functions that break the generality of std::string. If it only worked in storage units, and didn't try to pretend to understand the meaning of the storage units as logical characters (e.g. isspace), then the abstraction would be much more honest, and would force us programmers to look elsewhere for valid solutions...

Thank you to everyone who participated. Between this discussion and WChars, encodings, standards, and portability, I have a much better handle on the issues. Although there are no easy answers, every bit of understanding helps.

+8
c++ c special-characters character-encoding
8 answers

How do you test for whitespace, printability, etc. in a way that doesn't suffer from two issues:
1) sign extension, and
2) variable-width character issues?
After all, all commonly used Unicode encodings are variable-width, whether programmers realize it or not: UTF-7, UTF-8, UTF-16, as well as older standards such as Shift-JIS...

Obviously, you need to use a Unicode-capable library, since you have demonstrated (correctly) that the C++03 standard library is not one. The C++11 library is improved, but still not good enough for most uses. Yes, some OSes have a 32-bit wchar_t, which lets them handle UTF-32 correctly, but that is implementation-defined rather than guaranteed by C++, and it is not remotely sufficient for many Unicode tasks, such as iterating over graphemes (letters). Libraries that can help:

  • ICU
  • libiconv
  • microUTF-8
  • UTF8-CPP version 1.0
  • utf8proc
and more at http://unicode.org/resources/libraries.html.

If the question is less about specific character testing and more about coding practices in general: do whatever your framework does. If you're coding for Linux/Qt/networking, keep everything in UTF-8. If you're coding for Windows, keep everything in UTF-16. If you need to fiddle with code points, keep everything in UTF-32. Otherwise (for portable, generic code), do whatever you want, because no matter what, you'll have to translate for some OS or another anyway.

+10

I think you are mixing up a number of unrelated concepts.

First of all, char is just a data type. Its first and foremost meaning is "the system's basic storage unit", that is, "one byte". Its signedness is intentionally left to the implementation, so that each implementation can choose the most suitable (i.e., hardware-supported) variant. The name, which suggests "character", is quite possibly the worst decision in the design of the C programming language.

The next concept is that of a text string. At its foundation, a text is a sequence of units often called "characters", but it can be more involved than that. To this end, the Unicode standard coins the term "code point" to denote the most basic unit of text. For the moment, for us programmers, "text" is a sequence of code points.

The problem is that there are more code points than possible byte values. This can be solved in two ways: 1) use a multibyte encoding to represent code point sequences as byte sequences; or 2) use a wider underlying data type. C and C++ actually offer both solutions: the native host interfaces (command-line arguments, file contents, environment variables) are given as byte sequences; but the language also provides an opaque type wchar_t for "the system's character set", plus translation functions between the two (mbstowcs/wcstombs).

Unfortunately, there's nothing definite about "the system's character set" or "the system's multibyte encoding", so you, like many SO users before you, are left puzzling over what to do with those mysterious wide characters. What people want nowadays is a definite encoding they can use across platforms. The single useful encoding we have for this purpose is Unicode, which assigns textual meaning to a large number of code points (currently up to 2^21). Along with the textual encoding comes a family of byte-string encodings: UTF-8, UTF-16 and UTF-32.

The first step in examining the content of a given text string is therefore to convert it from whatever input encoding you have into a string of a definite (Unicode) encoding. This Unicode string may itself be encoded in any of the transformation formats, but the simplest is just a sequence of raw code points (typically UTF-32, since we don't have a useful 21-bit data type).

Performing this conversion is already outside the scope of the C++ standard (even the new one), so we need a library. Since we also don't know anything about our "system character set", we need a library to handle that, too.

One popular library of choice is iconv(); the typical sequence goes from the input multibyte char* through mbstowcs() to a std::wstring / wchar_t* wide string, and then through iconv()'s WCHAR_T-to-UTF-32 conversion to a std::u32string / uint32_t* raw Unicode code point sequence.
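A sketch of that pipeline, assuming glibc/GNU iconv (the "WCHAR_T" encoding name is a GNU extension, the UTF-32LE choice assumes a little-endian host, and error handling is minimal):

    #include <iconv.h>
    #include <cstdlib>
    #include <cstring>
    #include <string>
    #include <vector>

    // Multibyte char* -> wide string -> raw UTF-32 code points.
    // Assumes setlocale(LC_ALL, "") has already been called.
    std::u32string to_code_points(const char* mb)
    {
        // Step 1: system multibyte encoding -> wchar_t, via the standard library.
        std::vector<wchar_t> wide(std::strlen(mb) + 1);
        std::size_t n = std::mbstowcs(wide.data(), mb, wide.size());
        if (n == static_cast<std::size_t>(-1)) return std::u32string();  // invalid input

        // Step 2: wchar_t -> UTF-32, via iconv.
        iconv_t cd = iconv_open("UTF-32LE", "WCHAR_T");
        if (cd == (iconv_t)-1) return std::u32string();

        std::vector<char32_t> out(n + 1);
        char* in_p = reinterpret_cast<char*>(wide.data());
        std::size_t in_left = n * sizeof(wchar_t);
        char* out_p = reinterpret_cast<char*>(out.data());
        std::size_t out_left = out.size() * sizeof(char32_t);
        iconv(cd, &in_p, &in_left, &out_p, &out_left);
        iconv_close(cd);

        std::size_t produced = out.size() - out_left / sizeof(char32_t);
        return std::u32string(out.data(), produced);
    }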

At this point, our journey ends. We can now either examine the text code point by code point (which may be enough to tell whether something is a space), or invoke a heavier text-processing library to perform intricate textual operations on our Unicode code point stream (normalization, canonicalization, presentation transformation, etc.). That's far beyond the scope of the general-purpose programmer, and the realm of text-processing specialists.

+7

It is never valid to pass a negative value, other than EOF, to isspace and the other character macros. If you have a char c and want to test whether it is a space or not, do isspace((unsigned char)c). That deals with the extension (by zero-extending). isspace(*pchar) is flat-out wrong: don't write it, and don't let it stand when you see it. If you train yourself to panic when you see it, then it's less hard to spot.
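One way to enforce that rule is to funnel every test through a tiny wrapper, so the cast lives in exactly one place (a minimal sketch):

    #include <cctype>

    // The cast to unsigned char happens here and nowhere else.
    inline bool is_space(char ch)
    {
        return std::isspace(static_cast<unsigned char>(ch)) != 0;
    }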

fgetc (for example) already returns either EOF or a character read as an unsigned char and then converted to int, so there's no sign-extension problem with values from it.

That said, the standard character macros only go so far: they don't cover Unicode or multibyte encodings. If you want to handle Unicode properly, you need a Unicode library. I haven't looked into what C++11 or C1X provide in this regard, beyond that C++11 has std::u32string, which sounds promising. Before that, the answer is to use something implementation-specific or third-party. (Un)fortunately, there are lots of libraries to choose from.

It may be (I speculate) that a "complete" Unicode classification database is so large, and so subject to change, that it would be impractical for the C++ standard to mandate "full" support anyway. It depends somewhat on what operations you need, but you can't get around the fact that Unicode has gone through 6 major versions in 20 years (since the first standard version), while C++ has had two major versions in 13 years. As far as C++ is concerned, the set of Unicode characters is a rapidly moving target, so it's always going to be implementation-defined which code points the system knows about.

In general, there are three correct ways to process Unicode text:

  • At all inputs and outputs (including system calls that return or accept strings), convert everything between the externally used encoding and a fixed-width internal encoding. Think of this as "deserializing" input and "serializing" output. If you had some object type with functions for converting it to/from a byte stream, you would never mix the byte stream up with the objects, nor scan sections of the byte stream for pieces of serialized data that you think you recognize. It needn't be any different for this internal Unicode string class. Note that the class cannot be std::string, and might not be std::wstring either, depending on the implementation. Just pretend the standard library doesn't provide strings, if that helps, or use a std::basic_string of something big as the container, but a Unicode-aware library to do anything sophisticated. You may also need to understand Unicode normalization, to deal with combining marks and the like, since even in a fixed-width Unicode encoding there can be more than one code point per glyph. (A sketch of the "deserializing" step follows this list.)

  • Mess about with some ad-hoc mixture of byte sequences and Unicode sequences, carefully tracking which is which. It is like (1), but usually harder, and hence, although it is potentially correct, in practice it can just as easily come out wrong.

  • (Special purposes only): use UTF-8 for everything. Sometimes this is good enough, for example if all you do is parse input based on ASCII punctuation and concatenate strings for output. It basically works for programs where you don't need to understand anything with the top bit set, just pass it through unchanged. It doesn't work so well if you actually need to render text, or otherwise do things with it that a human would consider "obvious" but that are actually complex. Like collation.
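As an illustration of option (1), here is a minimal sketch of the "deserializing" step, assuming the external encoding is UTF-8 and the internal fixed-width string is std::u32string. It rejects malformed continuation bytes, but for brevity skips the overlong-form and surrogate checks real code would also need:

    #include <cstddef>
    #include <stdexcept>
    #include <string>

    // Decode external UTF-8 bytes into a fixed-width internal string of code points.
    std::u32string decode_utf8(const std::string& bytes)
    {
        std::u32string out;
        for (std::size_t i = 0; i < bytes.size(); ) {
            unsigned char b = static_cast<unsigned char>(bytes[i]);
            char32_t cp;
            int len;
            if      (b < 0x80) { cp = b;        len = 1; }   // ASCII
            else if (b < 0xC0) throw std::runtime_error("stray continuation byte");
            else if (b < 0xE0) { cp = b & 0x1F; len = 2; }
            else if (b < 0xF0) { cp = b & 0x0F; len = 3; }
            else if (b < 0xF8) { cp = b & 0x07; len = 4; }
            else               throw std::runtime_error("invalid lead byte");
            if (i + len > bytes.size()) throw std::runtime_error("truncated sequence");
            for (int k = 1; k < len; ++k) {
                unsigned char cont = static_cast<unsigned char>(bytes[i + k]);
                if ((cont & 0xC0) != 0x80) throw std::runtime_error("bad continuation byte");
                cp = (cp << 6) | (cont & 0x3F);
            }
            out.push_back(cp);
            i += len;
        }
        return out;
    }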

+5

One comment up front: the old C functions such as isspace took an int for a reason: they support EOF as input, so they must be able to represent one more value than fits in a char. The "naive" decision was allowing char to be signed, but making it unsigned would have had severe performance consequences on the PDP-11.

Now for your questions:

1) Sign extension

The C++ functions don't have this problem. In C++, the "correct" way to test things like whether a character is a space is to grab the std::ctype facet from whatever locale you want, and use it. Of course, the C++ localization machinery in <locale> has been carefully designed to be as hard to use as possible, but if you're doing any significant text processing, you'll soon come up with your own convenience wrappers: a functional object that takes a locale and a mask specifying which characteristic you want to test isn't difficult. Making it a template on the mask, and giving its locale argument a default of the global locale, isn't rocket science either. Throw in a few typedefs and you can pass things like IsSpace() to std::find. The only subtlety is managing the lifetime of the std::ctype object you're dealing with. Something like the following should work, however:

 template<std::ctype_base::mask mask> class Is // Must find a better name. { std::locale myLocale; //< Needed to ensure no premature destruction of facet std::ctype<char> const* myCType; public: Is( std::locale const& l = std::locale() ) : myLocale( l ) , myCType( std::use_facet<std::ctype<char> >( l ) ) { } bool operator()( char ch ) const { return myCType->is( mask, ch ); } }; typedef Is<std::ctype_base::space> IsSpace; // ... 

(Given the influence of the STL, it's somewhat surprising that the standard didn't define something like the above.)

2) Problems with characters of variable width.

There's no real answer. It all depends on what you need. For some applications, just searching for a few specific single-byte characters is enough, and keeping everything in UTF-8 and ignoring the multibyte issues is a viable (and simple) solution. Beyond that, it's often useful to convert to UTF-32 (or, depending on the kind of text you're dealing with, UTF-16) and treat each element as a single code point. For full text handling, on the other hand, you have to deal with multi-code-point characters even if you use UTF-32: the sequence \u006D\u0302 is a single character (a small m with a circumflex above it).
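To make that concrete, a small illustration (nothing here is library-specific, only C++11 literals):

    #include <cassert>
    #include <string>

    int main()
    {
        // U+006D (m) followed by U+0302 (combining circumflex):
        // two code points, one logical character.
        std::u32string s = U"\u006D\u0302";
        assert( s.size() == 2 );  // counts code points, not characters
    }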

+3

I haven't tested the internationalization capabilities of the Qt library, but from what I know, QString fully supports Unicode and uses QChar, which are Unicode characters. I don't know their internal implementation, but I expect this implies that QChars are variable-size characters.

It would be strange, though, to tie yourself to such a large framework as Qt just to handle strings.

0

You seem to be confusing a function defined on 7-bit ASCII with a universal whitespace-recognition function. Character functions in standard C use int not to handle different encodings, but to allow EOF to be an out-of-band indicator. There is no sign-extension problem, because the numbers these functions are defined over don't have the 8th bit set. Feeding them a byte that does is a mistake on your part.

Plan 9 attempts to solve this with a UTF library, plus the assumption that all input is UTF-8. This allows some measure of backwards compatibility with ASCII, so non-compliant programs don't all die, while new programs can be written correctly.

The common notion in C that a char* is an array of letters should instead be thought of as a block of input. To get the letters out of this stream, you use chartorune(). Each Rune is a representation of a letter (/symbol/code point), so you can finally define the function isspacerune(), which can at last tell you which letters are spaces.

Work with arrays of Rune the way you would with arrays of char to do your string manipulation, then call runetochar() to re-encode your letters into UTF-8 before you write them out.
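A sketch in that style, assuming the Plan 9 libutf interface (chartorune, isspacerune, runetochar), which is available on other systems via plan9port:

    #include <utf.h>   // Plan 9 libutf

    // Count the whitespace letters in a UTF-8 byte stream, Rune by Rune.
    int count_spaces(char* s)
    {
        Rune r;
        int n = 0;
        while (*s) {
            s += chartorune(&r, s);   // decode one letter; returns bytes consumed (always >= 1)
            if (isspacerune(r))
                n++;
        }
        return n;
    }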

0

Your preamble argument is somewhat inaccurate, and arguably unfair: it simply was never in the library's design to support Unicode encodings, certainly not multiple Unicode encodings.

The C and C++ languages, and most of their libraries, pre-date the development of Unicode. Also, as system-level languages they require a data type that corresponds to the smallest addressable word size of the execution environment. Unfortunately, perhaps, the char type has become overloaded to represent both the character set of the execution environment and the minimum addressable word. History has shown this to be flawed, perhaps, but changing the language definition, and indeed the library, would break a large amount of legacy code, so such things are left to newer languages such as C#, which has an 8-bit byte and a distinct char type.

In addition, Unicode's variable-width representation encodings make it unsuitable for a built-in data type as such. You are obviously aware of this, since you suggest that Unicode character operations should be performed on strings rather than on machine word types. This would require library support, and as you have noted, the standard library does not provide it. There are a number of reasons for that, but primarily it is not within the domain of the standard library, just as there is no standard library support for networking or graphics.


" , .. , :

1) sign extension, and

2) variable-width character issues?

isspace() is defined over 8-bit values: passing anything that is not representable as an unsigned char, and is not EOF, is undefined behavior. Cast to unsigned char before the call and the sign-extension problem disappears.

After all, all commonly used Unicode encodings are variable-width: UTF-7, UTF-8, UTF-16, as well as older standards such as Shift-JIS

isspace() makes no claim to support Unicode; it classifies single bytes in the execution character set, nothing more. Since when was Unicode classification part of the C library's job? If you need it, use a Unicode library.

0

Sign extension is easy to deal with. You can use:

  • isspace((unsigned char) ch)
  • isspace(ch & 0xFF)
  • a compiler option that makes plain char unsigned

Variable-width encodings (such as UTF-8), on the other hand, are a harder problem.

The ASCII whitespace characters (space plus \t \n \v \f \r) are single bytes in UTF-8, so isspace handles them; any non-ASCII whitespace character is encoded in UTF-8 as a multi-byte sequence.

Unicode adds further whitespace characters (\x85, \xa0, \u1680, \u180e, \u2000 through \u200a, \u2028, \u2029, \u202f, \u205f and \u3000) that you have to handle yourself.

    bool isspace_utf8(const char* pChar)
    {
        // decode_char reads the full (possibly multi-byte) UTF-8 sequence at pChar
        uint32_t codePoint = decode_char(pChar);
        return is_unicode_space(codePoint);
    }

Here decode_char converts a UTF-8 sequence to the corresponding code point, and is_unicode_space returns true for characters in category Z or for the Cc characters that are whitespace. iswspace may or may not cover these, depending on how well your C++ implementation supports Unicode; check its Unicode support before relying on it.
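A sketch of that is_unicode_space helper, hard-coding the list above (with the caveat that the Unicode database moves faster than any hard-coded table, so a real implementation belongs in a Unicode library):

    #include <cstdint>

    // True for the whitespace code points enumerated above
    // (category Z plus the whitespace Cc controls).
    bool is_unicode_space(uint32_t cp)
    {
        switch (cp) {
        case 0x0009: case 0x000A: case 0x000B: case 0x000C: case 0x000D:
        case 0x0020: case 0x0085: case 0x00A0: case 0x1680: case 0x180E:
        case 0x2000: case 0x2001: case 0x2002: case 0x2003: case 0x2004:
        case 0x2005: case 0x2006: case 0x2007: case 0x2008: case 0x2009:
        case 0x200A: case 0x2028: case 0x2029: case 0x202F: case 0x205F:
        case 0x3000:
            return true;
        default:
            return false;
        }
    }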

After all, most character encodings are variable-width: UTF-7, UTF-8, UTF-16, Shift-JIS, etc.

UTF-7 and Shift-JIS are legacy encodings; don't use them for new data. Stick with UTF-8, -16 or -32 internally.

0
