C function isupper ()

I am currently reading "2nd Edition C-Programming Language," and I do not understand this exercise:

Functions such as isupper can be implemented to save space or to save time. Explore both possibilities.

  • How to implement this function?
  • How to write two versions, one to save time and one to save space (some pseudocode will be nice)?

I would appreciate some advice on this.

+6
c
source share
5 answers

Original answer

One version uses an array initialized with the corresponding values, one byte per character in the code set (plus 1 to enable EOF, which can also be passed to classification functions):

static const char bits[257] = { ...initialization... }; int isupper(int ch) { assert(ch == EOF || (ch >= 0 && ch <= 255)); return((bits+1)[ch] & UPPER_MASK); } 

Note that β€œbits” can be used by all of the various functions, such as isupper (), islower (), isalpha (), etc. with appropriate values ​​for the mask. And if you make the "bit" array change at run time, you can adapt to different (single-byte) sets of codes.

It takes a space - an array.

In another version, assumptions are made about the adjacency of uppercase characters, as well as a limited set of valid uppercase characters (great for ASCII, which is not very good for ISO 8859-1 or its relatives):

 int isupper(int ch) { return (ch >= 'A' && ch <= 'Z'); // ASCII only - not a good implementation! } 

This can be (almost) implemented in a macro; it is difficult to avoid evaluating a character twice, which is not actually permitted in the standard. Using non-standard (GNU) extensions, it can be implemented as a macro that evaluates a character argument only once. To extend this value to ISO 8859-1, a second condition is required in accordance with:

 int isupper(int ch) { return ((ch >= 'A' && ch <= 'Z')) || (ch >= 0xC0 && ch <= 0xDD)); } 

Repeat this as a macro very often, and "space saving" quickly becomes a cost, since the bit-masking has a fixed size.

Given the requirements of modern code sets, the display version is almost always used in practice; it can adapt at runtime to the current set of code, etc., which cannot use range-based versions.


Extended answer

I still can’t understand how UPPER_MASK works. Could you explain this more specifically?

Ignoring namespace problems for characters in headings, you have a series of twelve classification macros:

  • isalpha()
  • isupper()
  • islower()
  • isalnum()
  • isgraph()
  • isprint()
  • iscntrl()
  • isdigit()
  • isblank()
  • isspace()
  • ispunct()
  • isxdigit()

Difference between isspace() and isblank() :

  • isspace() - space ( ' ' ), form feed ( '\f' ), new line ( '\n' ), carriage return ( '\r' ), horizontal tab ( '\t' ) and vertical tab ( '\v' ).
  • isblank() - space ( ' ' ) and horizontal tab ( '\t' ).

There are definitions for these character sets in the C standard and recommendations for the C language standard.

For example, (in the C locale) for "islower ()" or isupper() true if isalpha() true.

I think the necessary bits are:

  • DIGIT_MASK
  • XDIGT_MASK
  • ALPHA_MASK
  • LOWER_MASK
  • UPPER_MASK
  • PUNCT_MASK
  • SPACE_MASK
  • PRINT_MASK
  • CNTRL_MASK
  • BLANK_MASK

From these ten masks, you can create two others:

  • ALNUM_MASK = ALPHA_MASK | DIGIT_MASK
  • GRAPH_MASK = ALNUM_MASK | PUNCT_MASK

You can also use ALPHA_MASK = UPPER_MASK | LOWER_MASK ALPHA_MASK = UPPER_MASK | LOWER_MASK , but some locales have alphabetic characters that are neither uppercase nor lowercase.

So, we can define masks as follows:

 enum CTYPE_MASK { DIGIT_MASK = 0x0001, XDIGT_MASK = 0x0002, LOWER_MASK = 0x0004, UPPER_MASK = 0x0008, ALPHA_MASK = 0x0010, PUNCT_MASK = 0x0020, SPACE_MASK = 0x0040, PRINT_MASK = 0x0080, CNTRL_MASK = 0x0100, BLANK_MASK = 0x0200, ALNUM_MASK = ALPHA_MASK | DIGIT_MASK, GRAPH_MASK = ALNUM_MASK | PUNCT_MASK }; extern unsigned short ctype_bits[]; 

Character set data; the data is shown for the first half of ISO 8859-1, but the same for the first half of all 8859-x code sets. I use the designated C99 initializers as documentary help, even if the entries are okay:

 unsigned short ctype_bits[] = { [EOF +1] = 0, ['\0' +1] = CNTRL_MASK, ['\1' +1] = CNTRL_MASK, ['\2' +1] = CNTRL_MASK, ['\3' +1] = CNTRL_MASK, ['\4' +1] = CNTRL_MASK, ['\5' +1] = CNTRL_MASK, ['\6' +1] = CNTRL_MASK, ['\a' +1] = CNTRL_MASK, ['\b' +1] = CNTRL_MASK, ['\t' +1] = CNTRL_MASK|SPACE_MASK|BLANK_MASK, ['\n' +1] = CNTRL_MASK|SPACE_MASK, ['\v' +1] = CNTRL_MASK|SPACE_MASK, ['\f' +1] = CNTRL_MASK|SPACE_MASK, ['\r' +1] = CNTRL_MASK|SPACE_MASK, ['\x0E'+1] = CNTRL_MASK, ['\x0F'+1] = CNTRL_MASK, ['\x10'+1] = CNTRL_MASK, ['\x11'+1] = CNTRL_MASK, ['\x12'+1] = CNTRL_MASK, ['\x13'+1] = CNTRL_MASK, ['\x14'+1] = CNTRL_MASK, ['\x15'+1] = CNTRL_MASK, ['\x16'+1] = CNTRL_MASK, ['\x17'+1] = CNTRL_MASK, ['\x18'+1] = CNTRL_MASK, ['\x19'+1] = CNTRL_MASK, ['\x1A'+1] = CNTRL_MASK, ['\x1B'+1] = CNTRL_MASK, ['\x1C'+1] = CNTRL_MASK, ['\x1D'+1] = CNTRL_MASK, ['\x1E'+1] = CNTRL_MASK, ['\x1F'+1] = CNTRL_MASK, [' ' +1] = SPACE_MASK|PRINT_MASK|BLANK_MASK, ['!' +1] = PUNCT_MASK|PRINT_MASK, ['"' +1] = PUNCT_MASK|PRINT_MASK, ['#' +1] = PUNCT_MASK|PRINT_MASK, ['$' +1] = PUNCT_MASK|PRINT_MASK, ['%' +1] = PUNCT_MASK|PRINT_MASK, ['&' +1] = PUNCT_MASK|PRINT_MASK, ['\'' +1] = PUNCT_MASK|PRINT_MASK, ['(' +1] = PUNCT_MASK|PRINT_MASK, [')' +1] = PUNCT_MASK|PRINT_MASK, ['*' +1] = PUNCT_MASK|PRINT_MASK, ['+' +1] = PUNCT_MASK|PRINT_MASK, [',' +1] = PUNCT_MASK|PRINT_MASK, ['-' +1] = PUNCT_MASK|PRINT_MASK, ['.' +1] = PUNCT_MASK|PRINT_MASK, ['/' +1] = PUNCT_MASK|PRINT_MASK, ['0' +1] = DIGIT_MASK|PRINT_MASK|XDIGT_MASK, ['1' +1] = DIGIT_MASK|PRINT_MASK|XDIGT_MASK, ['2' +1] = DIGIT_MASK|PRINT_MASK|XDIGT_MASK, ['3' +1] = DIGIT_MASK|PRINT_MASK|XDIGT_MASK, ['4' +1] = DIGIT_MASK|PRINT_MASK|XDIGT_MASK, ['5' +1] = DIGIT_MASK|PRINT_MASK|XDIGT_MASK, ['6' +1] = DIGIT_MASK|PRINT_MASK|XDIGT_MASK, ['7' +1] = DIGIT_MASK|PRINT_MASK|XDIGT_MASK, ['8' +1] = DIGIT_MASK|PRINT_MASK|XDIGT_MASK, ['9' +1] = DIGIT_MASK|PRINT_MASK|XDIGT_MASK, [':' +1] = PUNCT_MASK|PRINT_MASK, [';' +1] = PUNCT_MASK|PRINT_MASK, ['<' +1] = PUNCT_MASK|PRINT_MASK, ['=' +1] = PUNCT_MASK|PRINT_MASK, ['>' +1] = PUNCT_MASK|PRINT_MASK, ['?' +1] = PUNCT_MASK|PRINT_MASK, ['@' +1] = PUNCT_MASK|PRINT_MASK, ['A' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK|XDIGT_MASK, ['B' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK|XDIGT_MASK, ['C' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK|XDIGT_MASK, ['D' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK|XDIGT_MASK, ['E' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK|XDIGT_MASK, ['F' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK|XDIGT_MASK, ['G' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['H' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['I' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['J' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['K' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['L' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['M' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['N' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['O' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['P' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['Q' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['R' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['S' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['T' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['U' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['V' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['W' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['X' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['Y' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['Z' +1] = ALPHA_MASK|UPPER_MASK|PRINT_MASK, ['[' +1] = PUNCT_MASK|PRINT_MASK, ['\\' +1] = PUNCT_MASK|PRINT_MASK, [']' +1] = PUNCT_MASK|PRINT_MASK, ['^' +1] = PUNCT_MASK|PRINT_MASK, ['_' +1] = PUNCT_MASK|PRINT_MASK, ['`' +1] = PUNCT_MASK|PRINT_MASK, ['a' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK|XDIGT_MASK, ['b' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK|XDIGT_MASK, ['c' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK|XDIGT_MASK, ['d' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK|XDIGT_MASK, ['e' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK|XDIGT_MASK, ['f' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK|XDIGT_MASK, ['g' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['h' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['i' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['j' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['k' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['l' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['m' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['n' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['o' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['p' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['q' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['r' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['s' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['t' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['u' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['v' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['w' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['x' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['y' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['z' +1] = ALPHA_MASK|LOWER_MASK|PRINT_MASK, ['{' +1] = PUNCT_MASK|PRINT_MASK, ['|' +1] = PUNCT_MASK|PRINT_MASK, ['}' +1] = PUNCT_MASK|PRINT_MASK, ['~' +1] = PUNCT_MASK|PRINT_MASK, ['\x7F'+1] = CNTRL_MASK, ...continue for second half of 8859-x character set... }; #define isalpha(c) ((ctype_bits+1)[c] & ALPHA_MASK) #define isupper(c) ((ctype_bits+1)[c] & UPPER_MASK) #define islower(c) ((ctype_bits+1)[c] & LOWER_MASK) #define isalnum(c) ((ctype_bits+1)[c] & ALNUM_MASK) #define isgraph(c) ((ctype_bits+1)[c] & GRAPH_MASK) #define isprint(c) ((ctype_bits+1)[c] & PRINT_MASK) #define iscntrl(c) ((ctype_bits+1)[c] & CNTRL_MASK) #define isdigit(c) ((ctype_bits+1)[c] & DIGIT_MASK) #define isblank(c) ((ctype_bits+1)[c] & BLANK_MASK) #define isspace(c) ((ctype_bits+1)[c] & SPACE_MASK) #define ispunct(c) ((ctype_bits+1)[c] & PUNCT_MASK) #define isxdigit(c) ((ctype_bits+1)[c] & XDIGT_MASK) 

As already noted, the names here are actually in the namespace reserved for users, so if you looked at the <ctype.h> heading, you will find more cryptic names, and they will probably start with one or two underscores.

+10
source share

The classic compromise is speed versus memory: either calculate the result or look for it in the table.

It is easy to understand how this will look for the isupper() function.

Several things make it possibly unexpectedly complex on today's main processor: s, though:

ASCII support table requires 128 bits or 256 bits if you don't want to mask the topmost bit yourself, assuming an 8-bit char . It is only 32 bytes, but it is probably even more than code that uses the sequential character of ASCII mapping. A large code size is generally inefficient for performance, as it affects cache efficiency and, as a rule, demonstrates a large difference in throughput between today's CPUs: s and their memory subsystems.

Code with explicit comparison to calculate the result without using sequential matching will be quite large, larger than the corresponding lookup table. This is not typical; it is easier to see the difference in the ratio of speed and memory for cases when the code for calculating the value is more compact than the lookup table.

+3
source share

One method can perform some comparisons with 'A', 'Z'.

Another method may use a pre-computed lookup table.

This is the first sentence mentioned on the Wikipedia page for the tradeoff between time and time .

+1
source share

In fact, you can save both space and time. Uppercase characters are adjacent in the ASCII table, so you just need to do this (this, of course, does not process locales, but I doubt that this is the point of your exercise):

 BOOL isUpper(char c) { return (c >= 'A' && c <= 'Z'); } 
0
source share

Uppercase characters are ASCII hex 0x41 - 0x5A. This means that bit 0x40 is always set. If you are sure that the argument is an ASCII character, you can simply return:

(c and 0x40);

0
source share

All Articles