Can we switch between ASCII and Unicode?

I came across the statement that "the char variable is in Unicode format, but it also accommodates / displays ASCII well." Why does that need to be mentioned? Of course, ASCII is 1 byte and Unicode is 2, and Unicode itself contains the ASCII codes (they are the default part of the standard). So are there languages in which the char type supports Unicode but not ASCII?

Also, the character format (Unicode / ASCII) is determined by the platform we use (UNIX, Linux, Windows, etc.), right? So suppose my platform uses ASCII: is it then impossible to switch to Unicode, or vice versa?

2 answers

Java uses Unicode internally. Always. It actually uses UTF-16 most of the time, but that is more detail than we need here.

It cannot use ASCII internally (for String, for example). Any String that can be represented in ASCII can also be represented in Unicode, so this is not a problem.
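For instance, here is a minimal sketch (the class name and sample strings are just placeholders) showing that ASCII text is simply the low end of the same Unicode String, with no separate ASCII mode:

```java
public class UnicodeDemo {
    public static void main(String[] args) {
        // A Java String is always a sequence of UTF-16 code units, regardless
        // of the platform; ASCII text is just the subset whose code points
        // happen to be below 128.
        String ascii = "Hello";
        String mixed = "Hello, \u00e9\u4e16\u754c"; // "Hello, é世界"

        // For ASCII characters the char value equals the ASCII code.
        System.out.println((int) ascii.charAt(0));  // 72 ('H')

        // Non-ASCII characters live in the same String; no mode switch needed.
        System.out.println((int) mixed.charAt(7));  // 233 ('é', U+00E9)
    }
}
```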

The only place the platform comes into play is when Java has to choose an encoding because you did not specify one. For example, when you create a FileWriter to write String values to a file: at that point Java must use an encoding to decide how each character maps to bytes. If you do not specify one, the platform's default encoding is used. That default is almost never ASCII. Most Linux platforms use UTF-8, Windows often uses some ISO-8859-* derivative (or another culture-specific 8-bit encoding), but no current OS uses plain ASCII (simply because ASCII cannot represent a lot of important characters).
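A small sketch of that difference (the file names and class name are made up for illustration; a plain FileWriter behaves like the first branch):

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class EncodingChoice {
    public static void main(String[] args) throws IOException {
        // The platform default only matters when no charset is specified.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // Relying on the platform default (what a plain FileWriter does):
        try (Writer w = Files.newBufferedWriter(Paths.get("default.txt"),
                Charset.defaultCharset())) {
            w.write("caf\u00e9"); // the bytes written depend on the platform
        }

        // Naming the encoding explicitly removes the platform dependency.
        try (Writer w = Files.newBufferedWriter(Paths.get("utf8.txt"),
                StandardCharsets.UTF_8)) {
            w.write("caf\u00e9"); // always 5 bytes in UTF-8: 63 61 66 c3 a9
        }
    }
}
```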

In fact, pure ASCII is almost irrelevant these days: nobody uses it. ASCII matters only as the common subset of most 8-bit encodings (including UTF-8): the lowest 128 Unicode code points map 1:1 to the numeric values 0-127 in many, many encodings. But pure ASCII as such (where the byte values 128-255 are undefined) is no longer used.
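That common-subset property is easy to check; a short sketch (class name and sample text chosen for illustration):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AsciiSubset {
    public static void main(String[] args) {
        String text = "plain ASCII"; // all code points below 128

        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        byte[] utf8  = text.getBytes(StandardCharsets.UTF_8);
        byte[] latin = text.getBytes(StandardCharsets.ISO_8859_1);

        // For code points 0-127, these encodings produce identical bytes.
        System.out.println(Arrays.equals(ascii, utf8));  // true
        System.out.println(Arrays.equals(ascii, latin)); // true
    }
}
```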


Unicode is a strict superset of ASCII (and of Latin-1, for that matter), at least as far as the character set is concerned; the actual byte-level encodings are a different story. So there cannot be a language / environment that supports Unicode but not ASCII. What the quoted sentence means is that if you are dealing only with ASCII text, everything works fine because, as noted, Unicode is a superset of ASCII.

In addition, to clarify some of your misconceptions:

  • "ASCII is 1 byte, and Unicode is 2" - ASCII is a 7-bit code that uses 1 byte for each character. Therefore, bytes and characters are the same in ASCII (which is unsuccessful, because ideally bytes are just data, and text is in characters, but I'm distracted). Unicode is a 21-bit code that defines the mapping of code points (numbers) to characters. How these numbers are represented depends on the encoding. There is UTF-32, which is a fixed-width encoding, where each Unicode code point is represented as a 32-bit code. UTF-16 is what Java uses, which uses two or four bytes (one or two blocks of code) for the code point. But it is 16 bits per unit of code, not per code point or actual character (in the sense of Unicode). Then there is UTF-8, which uses 8-bit code units and represents code points as one, two, three or four code blocks.

  • For Java, at least, the platform has no say in whether it supports only ASCII or Unicode. Java always uses Unicode, and char is a UTF-16 code unit (which can be half a character), not a code point (which would be a character), so the name is a bit of a misnomer. What you probably have in mind is the Unix tradition of conflating language, country and preferred system encoding into a couple of environment variables (the locale). That is, you may have a system whose preferred encoding is a legacy encoding, and applications that blindly use it may run into problems. That does not mean you cannot build an application that supports Unicode on such a system; after all, iconv has to work somehow.
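To make the code-unit / code-point distinction concrete, here is a small sketch (the chosen code points and class name are just examples; UTF-32 is not in StandardCharsets but is available in standard JDKs):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CodeUnitsDemo {
    public static void main(String[] args) {
        // U+1F600 lies outside the Basic Multilingual Plane, so in UTF-16 it
        // needs two code units (a surrogate pair) -- i.e. two Java chars.
        String emoji = new String(Character.toChars(0x1F600));
        System.out.println(emoji.length());                          // 2 (UTF-16 code units)
        System.out.println(emoji.codePointCount(0, emoji.length())); // 1 (one code point)

        // The same text needs a different number of bytes per encoding.
        String a = "A"; // U+0041
        System.out.println(a.getBytes(StandardCharsets.UTF_8).length);      // 1
        System.out.println(a.getBytes(StandardCharsets.UTF_16BE).length);   // 2
        System.out.println(a.getBytes(Charset.forName("UTF-32BE")).length); // 4

        // The platform's preferred encoding (from the locale/environment)
        // only affects conversions where no charset is given explicitly.
        System.out.println(Charset.defaultCharset());
    }
}
```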

