Is it possible to confuse EOF with a regular byte value when using fgetc?

Question


We often use fgetc as follows:

 int c;
 while ((c = fgetc(file)) != EOF) {
     // do stuff
 }

Theoretically, if a byte in the file has the same value as EOF, this code is buggy: it will break out of the loop early and fail to process the entire file. Is this situation possible?

As far as I understand, fgetc internally converts the byte read from the file to unsigned char, then to int, and returns it. This works if the range of int is greater than the range of unsigned char.

What happens if it is not (for example, if sizeof(int) == 1)?

Will fgetc sometimes read valid data equal to EOF from a file?
Will it modify the data it reads from the file to avoid the single value EOF?
Will fgetc be an unimplemented function?
Will EOF have another type, such as long?

I could make my code foolproof with an extra check:

 int c;
 for (;;) {
     c = fgetc(file);
     if (feof(file))
         break;
     // do stuff
 }

Is this necessary if I want maximum portability?

+4

c language-lawyer binaryfiles fgetc

anatolyg Sep 17 '15 at 23:11


3 answers

The C specification states that an int must be able to hold values from -32767 to 32767 at a minimum. Any platform with a smaller int is non-standard.

The C specification also states that EOF is a negative integer constant of type int and that fgetc returns "an unsigned char converted to an int" on a successful read. Since an unsigned char cannot have a negative value, EOF can be distinguished from any value read from the stream. ^*

^* See below for a loophole case in which this fails.

Relevant standard text (from C99):

§5.2.4.2.1 Sizes of integer types <limits.h>:
[The] implementation-defined values shall be equal or greater in magnitude (absolute value) to those shown, with the same sign.
[...]
- minimum value for an object of type int
  INT_MIN -32767
- maximum value for an object of type int
  INT_MAX +32767
§7.19.1 <stdio.h> - Introduction
EOF ... expands to an integer constant expression, with type int and a negative value, that is returned by several functions to indicate end-of-file, that is, no more input from a stream
§7.19.7.1 The fgetc function
If the end-of-file indicator for the input stream pointed to by stream is not set and a next character is present, the fgetc function obtains that character as an unsigned char converted to an int and advances the associated file position indicator for the stream (if defined)

If UCHAR_MAX ≤ INT_MAX, there is no problem: all unsigned char values are converted to non-negative integers, so they are distinct from EOF.

Now there is a funny loophole: if the system has UCHAR_MAX > INT_MAX, then it is legally allowed to convert values greater than INT_MAX into negative integers (per §6.3.1.3, the result of converting a value to a signed type that cannot represent it is implementation-defined), so a character read from the stream could be converted to EOF.

There are systems with CHAR_BIT > 8 (e.g. the TI C4x DSP, which apparently uses 32-bit bytes), although I'm not sure whether they are broken with respect to EOF and the stream functions.
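
The UCHAR_MAX versus INT_MAX condition described in this answer can be tested at compile time. The following is only a minimal sketch: the macro EOF_MAY_COLLIDE and the function eof_compare_is_sufficient are hypothetical names made up for illustration, not part of any standard header.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Hypothetical macro: 1 if some unsigned char value does not fit in int,
   which is the only case where a data character could convert to EOF. */
#if UCHAR_MAX > INT_MAX
#define EOF_MAY_COLLIDE 1
#else
#define EOF_MAY_COLLIDE 0
#endif

/* Hypothetical helper: returns 1 if comparing fgetc's result with EOF
   can never mistake a data character for end-of-file on this platform. */
static int eof_compare_is_sufficient(void)
{
    assert(EOF < 0); /* the standard requires EOF to be negative */
    return !EOF_MAY_COLLIDE;
}
```

On mainstream platforms with 8-bit bytes the helper is expected to return 1, so the plain comparison with EOF should be enough there.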

+5

nneonneo Sep 17 '15 at 23:17


NOTE. chux's answer is correct in the most general case. I am leaving this answer up because I believe that both the answer and the discussion in the comments are valuable for understanding the (rare) situations in which chux's approach is needed.

EOF is guaranteed to have a negative value (C99 7.19.1), and, as you mentioned, fgetc reads its input as an unsigned char before converting it to int. Taken together, these guarantee that EOF cannot be read from a file.

Regarding your specific questions:

fgetc cannot read valid data equal to EOF. There is no such thing as signed or unsigned in the file; it contains only bit sequences. It is C that interprets 1000 1111 differently depending on whether it is treated as signed or unsigned. fgetc is required to treat it as unsigned, so negative numbers (other than EOF) cannot be returned.
Addendum: it cannot read EOF as far as the unsigned char part is concerned, but when it converts the unsigned char to int, if int cannot represent all unsigned char values, then the behavior is implementation-defined (6.3.1.3).
fgetc is required by the standard for hosted implementations, but freestanding implementations are allowed to omit most of the standard library functions (some of them seem to be required, but I could not find the list).
EOF will not be a long, since fgetc must be able to return it, and fgetc returns int.
As for modifying the data, it cannot exactly change the value, but since fgetc is specified to read "characters" from a file as opposed to chars, it could potentially read 8 bits at a time even if the system otherwise sets CHAR_BIT to 16 (which is the minimum value at which sizeof(int) == 1 is possible, since INT_MIN <= -32767 and INT_MAX >= 32767 are required by 5.2.4.2). In that case, the input character would be converted to an unsigned char whose high bits are always 0, and the conversion to int could then be done without losing precision. (In practice, this simply won't come up, since machines usually do not have 16-bit bytes.)
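
The point about fgetc treating data as unsigned can be checked with a small experiment. This is only a sketch, and read_back_ff is a hypothetical helper name: it writes a single 0xFF byte to a temporary file and reads it back with fgetc.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes one 0xFF byte to a temporary file, reads it
   back with fgetc, and stores the int that fgetc returned in *out.
   Returns 0 on success, -1 if the temporary file could not be used. */
static int read_back_ff(int *out)
{
    assert(out != NULL);

    FILE *f = tmpfile();
    if (f == NULL)
        return -1;

    int status = -1;
    if (fputc(0xFF, f) != EOF) {
        rewind(f);          /* switch from writing to reading */
        *out = fgetc(f);    /* value is an unsigned char converted to int */
        status = 0;
    }
    fclose(f);
    return status;
}
```

Where int can represent every unsigned char value, the stored result should be the non-negative value 255 rather than anything equal to EOF, even though the same bit pattern would be negative if viewed as a signed 8-bit char.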

0

Ray Sep 17 '15 at 23:40


chux · Accepted Answer · 2015-09-18T01:06:27+0000

Yes, c = fgetc(file); if (feof(file)) works for maximum portability. It works in general, and also when unsigned char and int have the same number of unique values. This happens on rare platforms where char, signed char, unsigned char, short, unsigned short, int, and unsigned all use the same bit width and width of range.

Please note that feof(file) alone is not enough. The code should also check ferror(file).

 int c;
 for (;;) {
     c = fgetc(file);
     if (c == EOF) {
         if (feof(file)) break;
         if (ferror(file)) break;
     }
     // do stuff
 }
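
As a rough illustration of how the loop above might be wrapped into a complete function, here is a sketch that counts the characters in a stream. The name count_chars and its return convention are assumptions made for this example, not something from the answer.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: counts the characters readable from file.
   Returns 0 when end-of-file is reached cleanly, -1 on a read error. */
static int count_chars(FILE *file, unsigned long *count)
{
    assert(file != NULL && count != NULL);

    *count = 0;
    for (;;) {
        int c = fgetc(file);
        if (c == EOF) {
            if (feof(file))
                return 0;   /* genuine end of input */
            if (ferror(file))
                return -1;  /* read error */
            /* Neither indicator is set: c is a data character that happens
               to equal EOF, possible only when UCHAR_MAX > INT_MAX. */
        }
        ++*count;           /* "do stuff" with c */
    }
}
```

Checking feof and ferror only after fgetc returns EOF keeps the common path to a single comparison, while still distinguishing end-of-file, a read error, and the rare data value that equals EOF.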
