Original answer
One version uses an array initialized with the corresponding values, one byte per character in the code set (plus 1 to enable EOF, which can also be passed to classification functions):
static const char bits[257] = { ...initialization... }; int isupper(int ch) { assert(ch == EOF || (ch >= 0 && ch <= 255)); return((bits+1)[ch] & UPPER_MASK); }
Note that βbitsβ can be used by all of the various functions, such as isupper (), islower (), isalpha (), etc. with appropriate values ββfor the mask. And if you make the "bit" array change at run time, you can adapt to different (single-byte) sets of codes.
It takes a space - an array.
In another version, assumptions are made about the adjacency of uppercase characters, as well as a limited set of valid uppercase characters (great for ASCII, which is not very good for ISO 8859-1 or its relatives):
int isupper(int ch) { return (ch >= 'A' && ch <= 'Z');
This can be (almost) implemented in a macro; it is difficult to avoid evaluating a character twice, which is not actually permitted in the standard. Using non-standard (GNU) extensions, it can be implemented as a macro that evaluates a character argument only once. To extend this value to ISO 8859-1, a second condition is required in accordance with:
int isupper(int ch) { return ((ch >= 'A' && ch <= 'Z')) || (ch >= 0xC0 && ch <= 0xDD)); }
Repeat this as a macro very often, and "space saving" quickly becomes a cost, since the bit-masking has a fixed size.
Given the requirements of modern code sets, the display version is almost always used in practice; it can adapt at runtime to the current set of code, etc., which cannot use range-based versions.
Extended answer
I still canβt understand how UPPER_MASK works. Could you explain this more specifically?
Ignoring namespace problems for characters in headings, you have a series of twelve classification macros:
isalpha()isupper()islower()isalnum()isgraph()isprint()iscntrl()isdigit()isblank()isspace()ispunct()isxdigit()
Difference between isspace() and isblank() :
isspace() - space ( ' ' ), form feed ( '\f' ), new line ( '\n' ), carriage return ( '\r' ), horizontal tab ( '\t' ) and vertical tab ( '\v' ).isblank() - space ( ' ' ) and horizontal tab ( '\t' ).
There are definitions for these character sets in the C standard and recommendations for the C language standard.
For example, (in the C locale) for "islower ()" or isupper() true if isalpha() true.
I think the necessary bits are:
DIGIT_MASKXDIGT_MASKALPHA_MASKLOWER_MASKUPPER_MASKPUNCT_MASKSPACE_MASKPRINT_MASKCNTRL_MASKBLANK_MASK
From these ten masks, you can create two others:
- ALNUM_MASK = ALPHA_MASK | DIGIT_MASK
- GRAPH_MASK = ALNUM_MASK | PUNCT_MASK
You can also use ALPHA_MASK = UPPER_MASK | LOWER_MASK ALPHA_MASK = UPPER_MASK | LOWER_MASK , but some locales have alphabetic characters that are neither uppercase nor lowercase.
So, we can define masks as follows:
enum CTYPE_MASK { DIGIT_MASK = 0x0001, XDIGT_MASK = 0x0002, LOWER_MASK = 0x0004, UPPER_MASK = 0x0008, ALPHA_MASK = 0x0010, PUNCT_MASK = 0x0020, SPACE_MASK = 0x0040, PRINT_MASK = 0x0080, CNTRL_MASK = 0x0100, BLANK_MASK = 0x0200, ALNUM_MASK = ALPHA_MASK | DIGIT_MASK, GRAPH_MASK = ALNUM_MASK | PUNCT_MASK }; extern unsigned short ctype_bits[];
Character set data; the data is shown for the first half of ISO 8859-1, but the same for the first half of all 8859-x code sets. I use the designated C99 initializers as documentary help, even if the entries are okay:
unsigned short ctype_bits[] = { [EOF +1] = 0, ['\0' +1] = CNTRL_MASK, ['\1' +1] = CNTRL_MASK, ['\2' +1] = CNTRL_MASK, ['\3' +1] = CNTRL_MASK, ['\4' +1] = CNTRL_MASK, ['\5' +1] = CNTRL_MASK, ['\6' +1] = CNTRL_MASK, ['\a' +1] = CNTRL_MASK, ['\b' +1] = CNTRL_MASK, ['\t' +1] = CNTRL_MASK|SPACE_MASK|BLANK_MASK, ['\n' +1] = CNTRL_MASK|SPACE_MASK, ['\v' +1] = CNTRL_MASK|SPACE_MASK, ['\f' +1] = CNTRL_MASK|SPACE_MASK, ['\r' +1] = CNTRL_MASK|SPACE_MASK, ['\x0E'+1] = CNTRL_MASK, ['\x0F'+1] = CNTRL_MASK, ['\x10'+1] = CNTRL_MASK, ['\x11'+1] = CNTRL_MASK, ['\x12'+1] = CNTRL_MASK, ['\x13'+1] = CNTRL_MASK, ['\x14'+1] = CNTRL_MASK, ['\x15'+1] = CNTRL_MASK, ['\x16'+1] = CNTRL_MASK, ['\x17'+1] = CNTRL_MASK, ['\x18'+1] = CNTRL_MASK, ['\x19'+1] = CNTRL_MASK, ['\x1A'+1] = CNTRL_MASK, ['\x1B'+1] = CNTRL_MASK, ['\x1C'+1] = CNTRL_MASK, ['\x1D'+1] = CNTRL_MASK, ['\x1E'+1] = CNTRL_MASK, ['\x1F'+1] = CNTRL_MASK, [' ' +1] = SPACE_MASK|PRINT_MASK|BLANK_MASK, ['!' +1] = PUNCT_MASK|PRINT_MASK, ['"' +1] = PUNCT_MASK|PRINT_MASK, ['#' +1] = PUNCT_MASK|PRINT_MASK, ['$' +1] = PUNCT_MASK|PRINT_MASK, ['%' +1] = PUNCT_MASK|PRINT_MASK, ['&' +1] = PUNCT_MASK|PRINT_MASK, ['\'' +1] = PUNCT_MASK|PRINT_MASK, ['(' +1] = PUNCT_MASK|PRINT_MASK, [')' +1] = PUNCT_MASK|PRINT_MASK, ['*' +1] = PUNCT_MASK|PRINT_MASK, ['+' +1] = PUNCT_MASK|PRINT_MASK, [',' +1] = PUNCT_MASK|PRINT_MASK, ['-' +1] = PUNCT_MASK|PRINT_MASK, ['.' +1] = PUNCT_MASK|PRINT_MASK, ['/' +1] = PUNCT_MASK|PRINT_MASK, ['0' +1] = DIGIT_MASK|PRINT_MASK|XDIGT_MASK, ['1' +1] = DIGIT_MASK|PRINT_MASK|XDIGT_MASK, ['2' +1] = DIGIT_MASK|PRINT_MASK|XDIGT_MASK, ['3' +1] = DIGIT_MASK|PRINT_MASK|XDIGT_MASK, ['4' +1] = DIGIT_MASK|PRINT_MASK|XDIGT_MASK, ['5' +1] = DIGIT_MASK|PRINT_MASK|XDIGT_MASK, ['6' +1] = DIGIT_MASK|PRINT_MASK|XDIGT_MASK, ['7' +1] = DIGIT_MASK|PRINT_MASK|XDIGT_MASK, ['8' +1] = DIGIT_MASK|PRINT_MASK|XDIGT_MASK, ['9' +1] = DIGIT_MASK|PRINT_MASK|XDIGT_MASK, [':' +1] = PUNCT_MASK|PRINT_MASK, [';' +1] = PUNCT_MASK|PRINT_MASK, ['<' +1] = PUNCT_MASK|PRINT_MASK, ['=' +1] = PUNCT_MASK|PRINT_MASK, ['>' +1] = PUNCT_MASK|PRINT_MASK, ['?' +1] = PUNCT_MASK|PRINT_MASK, ['@' +1] = PUNCT_MASK|PRINT_MASK, ['A' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK|XDIGT_MASK, ['B' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK|XDIGT_MASK, ['C' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK|XDIGT_MASK, ['D' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK|XDIGT_MASK, ['E' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK|XDIGT_MASK, ['F' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK|XDIGT_MASK, ['G' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['H' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['I' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['J' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['K' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['L' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['M' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['N' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['O' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['P' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['Q' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['R' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['S' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['T' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['U' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['V' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['W' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['X' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['Y' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['Z' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['[' +1] = PUNCT_MASK|PRINT_MASK, ['\\' +1] = PUNCT_MASK|PRINT_MASK, [']' +1] = PUNCT_MASK|PRINT_MASK, ['^' +1] = PUNCT_MASK|PRINT_MASK, ['_' +1] = PUNCT_MASK|PRINT_MASK, ['`' +1] = PUNCT_MASK|PRINT_MASK, ['a' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK|XDIGT_MASK, ['b' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK|XDIGT_MASK, ['c' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK|XDIGT_MASK, ['d' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK|XDIGT_MASK, ['e' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK|XDIGT_MASK, ['f' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK|XDIGT_MASK, ['g' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['h' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['i' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['j' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['k' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['l' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['m' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['n' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['o' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['p' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['q' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['r' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['s' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['t' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['u' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['v' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['w' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['x' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['y' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['z' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['{' +1] = PUNCT_MASK|PRINT_MASK, ['|' +1] = PUNCT_MASK|PRINT_MASK, ['}' +1] = PUNCT_MASK|PRINT_MASK, ['~' +1] = PUNCT_MASK|PRINT_MASK, ['\x7F'+1] = CNTRL_MASK, ...continue for second half of 8859-x character set... }; #define isalpha(c) ((ctype_bits+1)[c] & ALPHA_MASK) #define isupper(c) ((ctype_bits+1)[c] & UPPER_MASK) #define islower(c) ((ctype_bits+1)[c] & LOWER_MASK) #define isalnum(c) ((ctype_bits+1)[c] & ALNUM_MASK) #define isgraph(c) ((ctype_bits+1)[c] & GRAPH_MASK) #define isprint(c) ((ctype_bits+1)[c] & PRINT_MASK) #define iscntrl(c) ((ctype_bits+1)[c] & CNTRL_MASK) #define isdigit(c) ((ctype_bits+1)[c] & DIGIT_MASK) #define isblank(c) ((ctype_bits+1)[c] & BLANK_MASK) #define isspace(c) ((ctype_bits+1)[c] & SPACE_MASK) #define ispunct(c) ((ctype_bits+1)[c] & PUNCT_MASK) #define isxdigit(c) ((ctype_bits+1)[c] & XDIGT_MASK)
As already noted, the names here are actually in the namespace reserved for users, so if you looked at the <ctype.h> heading, you will find more cryptic names, and they will probably start with one or two underscores.