C: Reading a text file (with variable-length lines) line by line using fread() / fgets() instead of fgetc() (block I/O vs. character I/O)

Is there a getline function that uses fread (block I/O) instead of fgetc (character I/O)?

There is a performance penalty to reading a file character by character via fgetc. We think that, to improve performance, we can use block reads via fread in the inner loop of getline. However, this introduces the potentially undesirable effect of reading past the end of a line. At the very least, the getline implementation would have to keep track of the "unread" part of the file, which requires an abstraction beyond the ANSI C FILE semantics. This is not something we want to implement ourselves!

We have profiled our application, and the slow performance is isolated to the fact that we consume large files character by character via fgetc. The rest of the overhead is trivial by comparison. We always read every line of the file sequentially, from start to finish, and we can lock the entire file for the duration of the read. This probably makes an fread-based getline easier to implement.

So, does a getline function that uses fread (block I/O) instead of fgetc (character I/O) exist? We are fairly sure one does, but if not, how should we implement it?

Update: Found a helpful article, Handling User Input in C by Paul Hsieh. It takes an fgetc-based approach, but it has an interesting discussion of the alternatives (starting with how bad gets is, and then discussing fgets):

On the other hand, the common retort from C programmers (even those considered experienced) is that fgets() should be used as an alternative. Of course, fgets() by itself does not really handle user input as such. Besides having a strange line termination condition (on \n or EOF, but not \0), the mechanism chosen for termination when the buffer has reached capacity is to simply halt the fgets() operation abruptly and \0-terminate it. So if user input exceeds the length of the preallocated buffer, fgets() returns a partial result. To deal with this, programmers have a couple of choices: 1) simply deal with truncated user input (there is no way to tell the user that the input has been truncated while they are providing it); 2) simulate a growable character array and fill it with successive calls to fgets(). The first solution is almost always a very poor one for variable-length user input, because the buffer will inevitably be too large most of the time, since it tries to capture too many ordinary cases, and too small for unusual cases. The second solution is fine, except that it can be complicated to implement correctly. Neither deals with fgets' odd behavior with respect to '\0'.

Exercise for the reader: to determine how many bytes were actually read by a call to fgets(), one might try scanning for a '\n', skipping over any '\0', while not exceeding the size passed to fgets(). Explain why this is insufficient for the very last line of a stream. What weakness of ftell() prevents it from solving this problem completely?

Exercise for the reader: solve the problem of determining the length of the data consumed by fgets() by overwriting the entire buffer with a nonzero value between each call to fgets().

So with fgets() we are left with the choice of writing a lot of code and living with a line termination condition that is inconsistent with the rest of the C library, or having an arbitrary cut-off. If this is not good enough, then what are we left with? scanf() mixes parsing with reading in a way that cannot be separated, and fread() will read past the end of the line. In short, the C library leaves us with nothing. We are forced to roll our own on top of fgetc(). So let's give it a shot.

So, is there a getline function based on fgets (one that does not truncate the input)?

+6
c file-io fgets fread
2 answers

Do not use fread. Use fgets. I assume this is a homework / class problem, so I am not giving a complete answer, but if you tell me it is not, I will give more advice. You can definitely provide 100% of the GNU getline-style semantics, including embedded null bytes, using purely fgets, but it requires some clever thinking.

OK, update, as this is not homework:

  • memset your buffer to '\n' .
  • Use fgets .
  • Use memchr to find the first '\n' .
  • If no '\n' is found, the line is longer than your buffer. Enlarge the buffer, fill the new portion with '\n' , and fgets into the new part, repeating as necessary.
  • If the character following the '\n' is '\0' , then fgets stopped because it reached the end of a line.
  • Otherwise, fgets stopped because it reached EOF: the '\n' is left over from your memset , the character before it is the terminating null that fgets wrote, and the character before that is the last character of the actual data read.
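
The steps above can be sketched roughly as follows. This is only an illustration, not a drop-in getline replacement: `read_line` is a hypothetical name, the initial capacity of 16 and the doubling growth are arbitrary choices, a read error is treated like EOF, and lines longer than INT_MAX bytes are not handled.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* read_line is a hypothetical helper, not a standard function.
 * It returns a malloc'd, null-terminated line (keeping the '\n' if present)
 * and stores its byte length in *len; NULL means EOF or allocation failure. */
static char *read_line(FILE *f, size_t *len)
{
    size_t cap = 16, start = 0;
    char *buf = malloc(cap);
    if (!buf) return NULL;

    for (;;) {
        size_t room = cap - start;
        memset(buf + start, '\n', room);            /* sentinel fill */
        if (!fgets(buf + start, (int)room, f)) {    /* nothing more to read */
            if (start == 0) { free(buf); return NULL; }
            buf[start] = '\0';
            *len = start;
            return buf;
        }
        char *nl = memchr(buf + start, '\n', room);
        if (nl) {
            size_t i = (size_t)(nl - buf);
            if (i + 1 < cap && nl[1] == '\0')
                *len = i + 1;   /* real newline: fgets wrote '\0' right after it */
            else
                *len = i - 1;   /* sentinel: the '\0' from fgets sits just before it */
            return buf;
        }
        /* No '\n' anywhere: this chunk filled the buffer. Grow it and continue
         * reading over the '\0' that fgets placed in the last byte. */
        start = cap - 1;
        char *tmp = realloc(buf, cap * 2);
        if (!tmp) { free(buf); return NULL; }
        buf = tmp;
        cap *= 2;
    }
}
```

The caller owns the returned buffer and must free it. Because the length comes from the sentinel position rather than from strlen, a '\0' inside the line is counted as ordinary data.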

You can drop the memset and use strlen instead of memchr if you do not care about supporting lines with embedded nulls (either way, the null does not terminate the read; it simply becomes part of the line you read in).
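
A minimal sketch of that simpler strlen-based variant, assuming the input contains no embedded nulls (if it does, lengths are miscounted and lines may be merged). `read_line_simple` is a hypothetical name and the starting capacity of 16 is arbitrary.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* read_line_simple is a hypothetical helper, not a standard function.
 * It returns a malloc'd, null-terminated line (keeping the '\n' if present),
 * or NULL at EOF or on allocation failure. */
static char *read_line_simple(FILE *f)
{
    size_t cap = 16, len = 0;
    char *buf = malloc(cap);
    if (!buf) return NULL;
    buf[0] = '\0';

    while (fgets(buf + len, (int)(cap - len), f)) {
        len += strlen(buf + len);
        if (len > 0 && buf[len - 1] == '\n')
            return buf;                     /* complete line */
        if (len + 1 == cap) {               /* chunk filled the buffer: grow */
            char *tmp = realloc(buf, cap * 2);
            if (!tmp) { free(buf); return NULL; }
            buf = tmp;
            cap *= 2;
        }
    }
    if (len == 0) { free(buf); return NULL; }   /* EOF before any data */
    buf[len] = '\0';
    return buf;                                 /* last line had no '\n' */
}
```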

There is also a way to do the same thing with fscanf and the specifier "%123[^\n]" (where 123 is your buffer limit), which lets you stop on characters other than newline (à la GNU getdelim ). However, it is probably slow unless your system has a very fancy scanf implementation.
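
As a rough illustration of the fscanf approach, the sketch below reads one chunk of at most 15 characters at a time; the limit of 15 and the name `read_chunk` are assumptions made for the example. Note that the scanset conversion fails on an empty line, so the delimiter has to be consumed separately.

```c
#include <stdio.h>

/* read_chunk is a hypothetical helper, not a standard function.
 * It reads up to 15 non-newline characters into buf (at least 16 bytes).
 * Returns 1 if the line is finished (its '\n' was consumed or EOF followed),
 * 0 if more of the same line remains, and EOF if there was no input left. */
static int read_chunk(FILE *f, char *buf)
{
    int c;

    buf[0] = '\0';                      /* %[ stores nothing on an empty line */
    if (fscanf(f, "%15[^\n]", buf) == EOF)
        return EOF;

    c = fgetc(f);                       /* look at the delimiter */
    if (c == '\n' || c == EOF)
        return 1;
    ungetc(c, f);                       /* line continues: put the byte back */
    return 0;
}
```

To stop on a different delimiter, the character inside the scanset and the one compared after fgetc would both change.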

+5

There is not much performance difference between fgets and fgetc / setvbuf. Try:

    int c;
    FILE *f = fopen("blah.txt", "r");
    setvbuf(f, NULL, _IOLBF, 4096); /* !!! check other values for last parameter in your OS */
    while ((c = fgetc(f)) != EOF) {
        if (c == '\n') ...
        else ...
    }
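
For reference, a self-contained version of the same idea might look like the sketch below. It is not the snippet above verbatim: it swaps in _IOFBF (full buffering) in place of _IOLBF, picks a 64 KiB buffer arbitrarily, and wraps the loop in a hypothetical `count_lines` helper that merely counts newlines.

```c
#include <stdio.h>

/* count_lines is a hypothetical helper, not a standard function.
 * Returns the number of '\n' bytes in the file, or -1 if it cannot be opened. */
static long count_lines(const char *path)
{
    FILE *f = fopen(path, "r");
    long lines = 0;
    int c;

    if (!f) return -1;

    /* setvbuf must be called before any other operation on the stream.
     * Passing NULL lets the library allocate the buffer itself. */
    setvbuf(f, NULL, _IOFBF, 1 << 16);

    while ((c = fgetc(f)) != EOF) {
        if (c == '\n')
            lines++;
    }
    fclose(f);
    return lines;
}
```

Whether a larger buffer actually helps depends on the platform's stdio implementation, so it is worth measuring with a few sizes.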
+1