Is char encoding the same in all programming languages?

A very simple (and rather elegant) way to convert a lowercase char to int is as follows:

 int convertLowercaseCharLettertoInt(char letter) {
     return letter - 'a';
 }

However, this code assumes that the character encoding places the lowercase letters at consecutive code points in alphabetical order. Or, more generally, it assumes that char uses ASCII (or an ASCII-compatible) encoding.

  • I know that Java char is UTF-16 and C char is ASCII. Although UTF-16 is not backward compatible with ASCII at the byte level, the ordering of the first 128 code points is the same in both. So, is the order of the first 128 characters the same in all major languages such as C, C++, Java, C#, JavaScript and Python?
  • Is the method above safe in general (assuming the input is sanitized, etc.)? Or is it better to use a hash map or a long switch statement? I believe the hash map approach is the most elegant way to solve this problem for non-English alphabets, for instance the Czech alphabet: a, á, b, c, č, d, ď, e, é, ě, f, g, h, ch, i, í, j, k, l, m, n, ň, o, ó, p, q, r, ř, s, š, t, ť, u, ú, ů, v, w, x, y, ý, z, ž (a rough sketch of that approach follows below).
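
For reference, here is a rough sketch of the table-driven approach for an extended alphabet such as Czech. The names czech_alphabet and czechLetterToInt are made up for illustration, and it assumes letters are handled as UTF-8 strings (the source file being UTF-8), since "ch" and the accented letters do not fit in a single char:

 #include <string.h>

 /* Czech alphabet as UTF-8 strings; "ch" is a digraph treated as one letter. */
 static const char *czech_alphabet[] = {
     "a", "á", "b", "c", "č", "d", "ď", "e", "é", "ě", "f", "g", "h", "ch",
     "i", "í", "j", "k", "l", "m", "n", "ň", "o", "ó", "p", "q", "r", "ř",
     "s", "š", "t", "ť", "u", "ú", "ů", "v", "w", "x", "y", "ý", "z", "ž"
 };

 /* Returns the index of the given letter (passed as a UTF-8 string), or -1. */
 int czechLetterToInt(const char *letter)
 {
     size_t n = sizeof czech_alphabet / sizeof czech_alphabet[0];
     for (size_t i = 0; i < n; i++)
         if (strcmp(czech_alphabet[i], letter) == 0)
             return (int)i;
     return -1;
 }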
+5
4 answers

This has less to do with the programming language and more with the underlying character set of the system. ASCII and all Unicode variants behave as you expect: 'a' through 'z' are 26 consecutive code points. EBCDIC does not, so your trick will not work on an IBM/360 in most languages.

Java (and Python, and possibly other languages) define a Unicode encoding for char regardless of the underlying platform, so your trick will work there too, assuming you can find a suitable Java implementation for your IBM mainframe.
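
If you would rather verify the assumption at runtime than trust the platform, a minimal sketch in C (the function name lowercaseIsContiguous is mine) could look like this:

 #include <stdbool.h>

 /* Returns true if 'a'..'z' occupy consecutive code points in the execution
  * character set (true for ASCII and Unicode, false for EBCDIC, which has
  * gaps after 'i' and 'r'). */
 bool lowercaseIsContiguous(void)
 {
     const char *letters = "abcdefghijklmnopqrstuvwxyz";
     for (int i = 0; letters[i + 1] != '\0'; i++)
         if (letters[i] + 1 != letters[i + 1])
             return false;
     return true;
 }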

+3

As for C, you cannot rely on the execution character set being ASCII; the standard mandates only a minimum set of characters that must belong to it. The execution character set may be ASCII, it may be EBCDIC, it may be UTF-8, etc.

Your method is "safe" in the sense that it should not cause a segfault or open a security hole, but it is not guaranteed to return the expected result.

For the Latin alphabet, you are better off creating your own string and indexing into it:

 #include <ctype.h>
 #include <string.h>

 int convertLowercaseCharLettertoInt(char letter)
 {
     char mycharset[] = "abcdefghijklmnopqrstuvwxyz";

     if ( isalpha( letter )) // thanks chux.
     {
         char *pos = strchr( mycharset, tolower( letter ) );
         if ( pos )
             return (int) (pos - mycharset);
         else
             return -1; // letter not found
     }
     return -1; // bad input
 }
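
A quick usage sketch, assuming the snippet above is placed in the convertLowercaseCharLettertoInt function from the question as shown:

 #include <stdio.h>

 int main(void)
 {
     printf("%d\n", convertLowercaseCharLettertoInt('a'));  /* 0  */
     printf("%d\n", convertLowercaseCharLettertoInt('Z'));  /* 25 (tolower is applied) */
     printf("%d\n", convertLowercaseCharLettertoInt('?'));  /* -1 (not a letter) */
     return 0;
 }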

For extended alphabets - I don't know.

+1

In C, the compiler can detect the problem at compile time:

 #include <assert.h>
 #include <string.h>

 #if 'a'+1=='b' && 'b'+1=='c' && 'c'+1=='d' && 'd'+1=='e' && 'e'+1=='f' \
  && 'f'+1=='g' && 'g'+1=='h' && 'h'+1=='i' && 'i'+1=='j' && 'j'+1=='k' \
  && 'k'+1=='l' && 'l'+1=='m' && 'm'+1=='n' && 'n'+1=='o' && 'o'+1=='p' \
  && 'p'+1=='q' && 'q'+1=='r' && 'r'+1=='s' && 's'+1=='t' && 't'+1=='u' \
  && 'u'+1=='v' && 'v'+1=='w' && 'w'+1=='x' && 'x'+1=='y' && 'y'+1=='z'

 int convertLowercaseCharLettertoInt(char letter) {
     return letter - 'a';
 }

 #else

 int convertLowercaseCharLettertoInt(char letter) {
     static const char lowercase[] = "abcdefghijklmnopqrstuvwxyz";
     const char *occurrence = strchr(lowercase, letter);
     assert(letter && occurrence);
     return occurrence - lowercase;
 }

 #endif

See also @John Bode's code above.


Note: the following works with all C character encodings:

 #include <stdlib.h>

 int convertLowercaseOrUppercaseCharLettertoInt(char letter) {
     char s[2] = { letter, '\0' };
     return strtol(s, 0, 36) - 10;
 }
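
How it works: strtol with base 36 assigns '0'-'9' the values 0-9 and 'a'/'A' through 'z'/'Z' the values 10-35 by definition, independently of where those characters sit in the encoding, so subtracting 10 gives 0-25 for any letter. A quick demonstration (a sketch that assumes the function above is in the same file):

 #include <stdio.h>

 int main(void)
 {
     printf("%d\n", convertLowercaseOrUppercaseCharLettertoInt('a')); /* 0  */
     printf("%d\n", convertLowercaseOrUppercaseCharLettertoInt('Z')); /* 25 */
     printf("%d\n", convertLowercaseOrUppercaseCharLettertoInt('5')); /* -5: digits land below zero */
     return 0;
 }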
+1

Why convert letters to numbers in your own way? There are, of course, standards that describe this precisely, for example UTF-16, ASCII, UTF-8, Latin-1, Latin-2, and so on. If you are asking whether there is a standard implemented in all languages, the answer is probably yes. But if you are asking whether the characters of every alphabet have one consistent representation across all of these encodings, I doubt it.

If you want to compare values that follow different standards, there are libraries that convert text from one encoding to another.
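
For instance, a minimal sketch using POSIX iconv (assuming a system that provides it; on some platforms you link with -liconv), converting the Czech letter č from ISO-8859-2 (Latin-2) to UTF-8:

 #include <iconv.h>
 #include <stdio.h>
 #include <string.h>

 int main(void)
 {
     char input[] = "\xE8";          /* "č" in ISO-8859-2 is the single byte 0xE8 */
     char output[8] = { 0 };

     char *inptr = input;
     char *outptr = output;
     size_t inleft = strlen(input);
     size_t outleft = sizeof output - 1;

     iconv_t cd = iconv_open("UTF-8", "ISO-8859-2");
     if (cd == (iconv_t)-1) {
         perror("iconv_open");
         return 1;
     }
     if (iconv(cd, &inptr, &inleft, &outptr, &outleft) == (size_t)-1)
         perror("iconv");
     iconv_close(cd);

     printf("UTF-8 bytes:");
     for (char *p = output; *p; ++p)
         printf(" %02X", (unsigned char)*p);
     printf("\n");                   /* expected: C4 8D, i.e. U+010D */
     return 0;
 }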

0
