Is char encoding the same in all programming languages?

A very simple (and rather elegant) way to convert a lowercase char to int is as follows:

 int convertLowercaseCharLettertoInt(char letter) {
     return letter - 'a';
 }

However, this code assumes that the character encoding places the lowercase letters at consecutive code points in alphabetical order. Or, more generally, it assumes that char uses ASCII (or an ASCII-compatible) encoding.

  • I know that Java char is UTF-16 and C char is ASCII. Although UTF-16 is not backward compatible with ASCII at the byte level, the ordering of the first 128 code points is the same in both. So, is the order of the first 128 characters the same in all major languages such as C, C++, Java, C#, JavaScript and Python?
  • Is the method above safe in general (assuming the input is sanitized, etc.)? Or is it better to use a hash map or a long switch statement? I believe the hash map approach is the most elegant way to solve this problem for non-English alphabets, for instance the Czech alphabet: a, á, b, c, č, d, ď, e, é, ě, f, g, h, ch, i, í, j, k, l, m, n, ň, o, ó, p, q, r, ř, s, š, t, ť, u, ú, ů, v, w, x, y, ý, z, ž (a rough sketch of that approach follows below).
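
For reference, here is a rough sketch of the table-driven approach for an extended alphabet such as Czech. The names czech_alphabet and czechLetterToInt are made up for illustration, and it assumes letters are handled as UTF-8 strings (the source file being UTF-8), since "ch" and the accented letters do not fit in a single char:

 #include <string.h>

 /* Czech alphabet as UTF-8 strings; "ch" is a digraph treated as one letter. */
 static const char *czech_alphabet[] = {
     "a", "á", "b", "c", "č", "d", "ď", "e", "é", "ě", "f", "g", "h", "ch",
     "i", "í", "j", "k", "l", "m", "n", "ň", "o", "ó", "p", "q", "r", "ř",
     "s", "š", "t", "ť", "u", "ú", "ů", "v", "w", "x", "y", "ý", "z", "ž"
 };

 /* Returns the index of the given letter (passed as a UTF-8 string), or -1. */
 int czechLetterToInt(const char *letter)
 {
     size_t n = sizeof czech_alphabet / sizeof czech_alphabet[0];
     for (size_t i = 0; i < n; i++)
         if (strcmp(czech_alphabet[i], letter) == 0)
             return (int)i;
     return -1;
 }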
+5
4 answers

This has less to do with the programming language and more with the underlying character set of the system. ASCII and all Unicode variants behave as you expect: 'a' through 'z' are 26 consecutive code points. EBCDIC does not, so your trick will not work on an IBM/360 in most languages.

Java (and Python, and possibly other languages) define a Unicode encoding for char regardless of the underlying platform, so your trick will work there too, assuming you can find a suitable Java implementation for your IBM mainframe.
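
If you would rather verify the assumption at runtime than trust the platform, a minimal sketch in C (the function name lowercaseIsContiguous is mine) could look like this:

 #include <stdbool.h>

 /* Returns true if 'a'..'z' occupy consecutive code points in the execution
  * character set (true for ASCII and Unicode, false for EBCDIC, which has
  * gaps after 'i' and 'r'). */
 bool lowercaseIsContiguous(void)
 {
     const char *letters = "abcdefghijklmnopqrstuvwxyz";
     for (int i = 0; letters[i + 1] != '\0'; i++)
         if (letters[i] + 1 != letters[i + 1])
             return false;
     return true;
 }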

+3

As for C, you cannot rely on the execution character set being ASCII; the standard mandates only a minimum set of characters that must belong to it. The execution character set may be ASCII, it may be EBCDIC, it may be UTF-8, etc.

Your method is "safe" in the sense that it should not cause a segfault or open a security hole, but it is not guaranteed to return the expected result.

For the Latin alphabet, you are better off creating your own string and indexing into it:

 #include <ctype.h>
 #include <string.h>

 int convertLowercaseCharLettertoInt(char letter)
 {
     char mycharset[] = "abcdefghijklmnopqrstuvwxyz";

     if ( isalpha( letter )) // thanks chux.
     {
         char *pos = strchr( mycharset, tolower( letter ) );
         if ( pos )
             return (int) (pos - mycharset);
         else
             return -1; // letter not found
     }
     return -1; // bad input
 }
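
A quick usage sketch, assuming the snippet above is placed in the convertLowercaseCharLettertoInt function from the question as shown:

 #include <stdio.h>

 int main(void)
 {
     printf("%d\n", convertLowercaseCharLettertoInt('a'));  /* 0  */
     printf("%d\n", convertLowercaseCharLettertoInt('Z'));  /* 25 (tolower is applied) */
     printf("%d\n", convertLowercaseCharLettertoInt('?'));  /* -1 (not a letter) */
     return 0;
 }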

For extended alphabets - I don't know.

+1

In C, the compiler can detect the problem at compile time:

 #include <assert.h>
 #include <string.h>

 #if 'a'+1=='b' && 'b'+1=='c' && 'c'+1=='d' && 'd'+1=='e' && 'e'+1=='f' \
  && 'f'+1=='g' && 'g'+1=='h' && 'h'+1=='i' && 'i'+1=='j' && 'j'+1=='k' \
  && 'k'+1=='l' && 'l'+1=='m' && 'm'+1=='n' && 'n'+1=='o' && 'o'+1=='p' \
  && 'p'+1=='q' && 'q'+1=='r' && 'r'+1=='s' && 's'+1=='t' && 't'+1=='u' \
  && 'u'+1=='v' && 'v'+1=='w' && 'w'+1=='x' && 'x'+1=='y' && 'y'+1=='z'

 int convertLowercaseCharLettertoInt(char letter) {
     return letter - 'a';
 }

 #else

 int convertLowercaseCharLettertoInt(char letter) {
     static const char lowercase[] = "abcdefghijklmnopqrstuvwxyz";
     const char *occurrence = strchr(lowercase, letter);
     assert(letter && occurrence);
     return occurrence - lowercase;
 }

 #endif

See also @John Bode's code above.


Note: the following works with all C character encodings:

 #include <stdlib.h>

 int convertLowercaseOrUppercaseCharLettertoInt(char letter) {
     char s[2] = { letter, '\0' };
     return strtol(s, 0, 36) - 10;
 }
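
How it works: strtol with base 36 assigns '0'-'9' the values 0-9 and 'a'/'A' through 'z'/'Z' the values 10-35 by definition, independently of where those characters sit in the encoding, so subtracting 10 gives 0-25 for any letter. A quick demonstration (a sketch that assumes the function above is in the same file):

 #include <stdio.h>

 int main(void)
 {
     printf("%d\n", convertLowercaseOrUppercaseCharLettertoInt('a')); /* 0  */
     printf("%d\n", convertLowercaseOrUppercaseCharLettertoInt('Z')); /* 25 */
     printf("%d\n", convertLowercaseOrUppercaseCharLettertoInt('5')); /* -5: digits land below zero */
     return 0;
 }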
+1

Why convert letters to numbers in your own way? There are, of course, standards that describe this precisely, for example UTF-16, ASCII, UTF-8, Latin-1, Latin-2, and so on. If you are asking whether there is a standard implemented in all languages, the answer is probably yes. But if you are asking whether the characters of every alphabet have one consistent representation across all of these encodings, I doubt it.

If you want to compare values that follow different standards, there are libraries that convert text from one encoding to another.
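
For instance, a minimal sketch using POSIX iconv (assuming a system that provides it; on some platforms you link with -liconv), converting the Czech letter č from ISO-8859-2 (Latin-2) to UTF-8:

 #include <iconv.h>
 #include <stdio.h>
 #include <string.h>

 int main(void)
 {
     char input[] = "\xE8";          /* "č" in ISO-8859-2 is the single byte 0xE8 */
     char output[8] = { 0 };

     char *inptr = input;
     char *outptr = output;
     size_t inleft = strlen(input);
     size_t outleft = sizeof output - 1;

     iconv_t cd = iconv_open("UTF-8", "ISO-8859-2");
     if (cd == (iconv_t)-1) {
         perror("iconv_open");
         return 1;
     }
     if (iconv(cd, &inptr, &inleft, &outptr, &outleft) == (size_t)-1)
         perror("iconv");
     iconv_close(cd);

     printf("UTF-8 bytes:");
     for (char *p = output; *p; ++p)
         printf(" %02X", (unsigned char)*p);
     printf("\n");                   /* expected: C4 8D, i.e. U+010D */
     return 0;
 }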

0
