Why is Dchar not a standard character type in D?

Just from looking at the digitalmars.D.learn forum and the D-related StackOverflow questions, it seems to me that the main stumbling block for beginner D programmers (including me) is the difference in use and capabilities between char, wchar, dchar and the related string types, which leads to a number of problems.

I know this is probably for backward compatibility and for developers coming from C++ or C, but I think a fairly convincing counter-argument is that any such gain is offset by the problems those same developers hit when they try to do something non-trivial with a char or string and expect it to behave the way it would in C/C++, only to find that it fails in ways that are hard to debug.

To fix many of these problems, I have seen experienced D community members occasionally tell inexperienced developers to use dchar, which raises the question: why isn't char a 32-bit Unicode code point by default, with an 8-bit ASCII type named something like achar to be reached for only when necessary?

+7
3 answers

Personally, I wish that char did not exist, and that instead of char , wchar and dchar we had something like utf8 , utf16 and utf32 . Then everyone would immediately be forced to realize that char is not something to be used for individual characters, but that is not how it went. I would say it is almost certain that char was simply taken from C/C++, and the others were then added to improve Unicode support. After all, there is nothing fundamentally wrong with char itself. It is just that many programmers mistakenly assume that char is always a character (which is not necessarily true even in C/C++). But Walter Bright understands Unicode very well and seems to think everyone else should too, so he tends to make Unicode-related decisions that work very well if you understand Unicode but not so well if you don't (and most programmers don't). D pretty much forces you to acquire at least a basic understanding of Unicode, which is not such a bad thing, but it does trip some people up.

But the reality is that while it makes sense to use dchar for individual characters, it usually does not make sense to use it for strings. Sometimes that is what you need, but UTF-32 requires far more space than UTF-8. That can affect performance and definitely increases the memory footprint of your programs. And a lot of string processing does not require random access at all. So having UTF-8 strings by default makes much more sense than defaulting to UTF-32 strings.
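To make the size difference concrete, here is a small sketch (the literal is my own example) comparing the same text stored as a D string (UTF-8) and as a dstring (UTF-32):

```d
import std.stdio;

void main()
{
    string  s8  = "résumé"; // UTF-8:  immutable(char)[]
    dstring s32 = "résumé"; // UTF-32: immutable(dchar)[]

    // .length counts code units, not letters:
    writeln(s8.length);  // 8 — two of the six letters ('é') need 2 bytes in UTF-8
    writeln(s32.length); // 6 — one code unit per code point

    // Payload size in bytes:
    writeln(s8.length * char.sizeof);   // 8 bytes
    writeln(s32.length * dchar.sizeof); // 24 bytes
}
```

For mostly-ASCII text the UTF-32 version is roughly four times larger, which is the space cost the answer is referring to.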

The way D handles strings generally works very well. It is just that the name char carries the wrong connotation for many people, and the language unfortunately prefers char over dchar for character literals in many cases.

"I think a pretty convincing argument might be that this possible gain is offset by the problems experienced by those same developers when they try something non-trivial with a char or string and expect it to work the way it would in C/C++, only to find that it fails in ways that are hard to debug."

The reality is that strings in C/C++ work the same way as in D; they just do not protect you from ignorance or carelessness the way D does. char in C/C++ is always 8 bits, and it is usually treated by the OS as UTF-8 (at least in *nix land; Windows does strange things with the encoding of char and generally requires wchar_t for Unicode). Certainly, any Unicode strings you have in C/C++ are in UTF-8 unless you explicitly use a string type with a different encoding. std::string and C strings operate on code units, not code points. But the average C/C++ programmer treats them as if each element were a whole character, which is only correct if you restrict yourself to ASCII, and in this day and age that is often a very bad assumption.
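The code-unit/code-point distinction the answer describes is easy to demonstrate in D itself (a sketch; the literal is my own). walkLength from std.range counts elements of a range, and iterating a string auto-decodes it into dchar code points:

```d
import std.stdio;
import std.range : walkLength;

void main()
{
    string s = "über";

    writeln(s.length);     // 5 — code units: 'ü' takes 2 bytes in UTF-8
    writeln(s.walkLength); // 4 — code points, thanks to auto-decoding

    // Indexing gives you a raw byte, not a letter:
    writeln(cast(ubyte) s[0]); // 195 — the first byte of the 2-byte encoding of 'ü'
}
```

This is exactly the trap the C/C++ programmer falls into: s[0] is not the first character unless the text happens to be pure ASCII.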

D takes the path of actually building proper Unicode support into the language and its standard library. It forces you to acquire at least a basic understanding of Unicode and often makes it hard to get things wrong, while giving those who really understand Unicode extremely powerful tools for handling strings not only correctly but efficiently. C/C++ largely sidesteps the problem and lets programmers step on Unicode land mines.

+12

I understood the question as "Why isn't dchar used by default?"

dchar is a UTF-32 code unit. You rarely want to deal with arrays of UTF-32 code units because you waste too much space, especially if you are dealing mostly with ASCII strings.

Using UTF-8 code units (the corresponding type in D is char) is much more economical.

The D string type is immutable(char)[] , that is, an array of UTF-8 code units.

Yes, dealing with UTF-32 code units can speed up your application if you constantly do random access on strings. But if you know you are going to do that with a specific piece of text, use the dstring type in that case. With that said, you should now understand why D treats strings as ranges of dchar.
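A short sketch of both options mentioned above (the literal is my own example): iterating a string by dchar decodes on the fly with no random access needed, and converting to dstring once gives O(1) indexing by code point:

```d
import std.stdio;
import std.conv : to;

void main()
{
    string s = "日本語"; // 3 code points, 9 bytes of UTF-8

    // A string is consumed as a range of dchar — each step decodes one code point:
    foreach (dchar c; s)
        write(c, ' ');
    writeln();

    // If the algorithm really needs indexing by code point, convert once:
    dstring d = s.to!dstring;
    writeln(d[1]); // 本
}
```

The conversion pays the 4-bytes-per-code-point cost once, which is worthwhile only when the subsequent random access dominates.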

+1

Because of combining characters, even dchar cannot really hold all Unicode "characters" (in the way people intuitively think of them), and strings cannot be indexed directly (see the end of this post for an example).
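A sketch of the combining-character point (the literal is my own): "é" written as 'e' plus U+0301 COMBINING ACUTE ACCENT is one user-perceived character but two code points, so even a dstring cannot index it as one element. std.uni.byGrapheme groups code points into graphemes:

```d
import std.stdio;
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT renders as "é":
    dstring s = "e\u0301";

    writeln(s.length);                // 2 — two code points, even in UTF-32
    writeln(s.byGrapheme.walkLength); // 1 — one user-perceived character
}
```

So a dchar holds a code point, not a grapheme, and "one index, one character" breaks down at this level no matter which encoding you pick.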

0
