Which is more efficient in C: reading a word at a time from a file, or reading a line at a time and splitting it into words?

I want to develop a C application that needs to process a file on disk word by word. I was told that reading a line from the file and then splitting it into words is more efficient, since fewer file accesses are required. Is that true?

+6
6 answers

If you know you are going to need the entire file, you may as well read it in chunks as large as you can (at the extreme end, you would memory-map the entire file in one go). You are right that this is because fewer file accesses are needed.

But if your program is not slow, write it in whatever way is fastest and easiest for you to write. Premature optimization is a grave sin.

+8

Not entirely true, assuming you are going to use scanf() and your definition of a "word" matches what scanf() treats as a word.

The standard I/O library will buffer the actual disk reads, so reading a line or a word at a time has essentially the same I/O cost in terms of disk access. If you read large chunks of the file with fread(), you may get some benefit, at a cost in complexity.

But for reading words, scanf() with a protective string format specifier such as %99s (if your array is char word[100];) will likely work fine and is probably simpler to code.
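
A minimal sketch of that approach, assuming whitespace-delimited words; the file name is purely illustrative:

    #include <stdio.h>

    int main(void)
    {
        FILE *fp = fopen("input.txt", "r");   /* illustrative file name */
        if (fp == NULL)
        {
            perror("fopen");
            return 1;
        }

        char word[100];
        /* %99s skips leading whitespace and leaves room for the '\0' */
        while (fscanf(fp, "%99s", word) == 1)
            printf("%s\n", word);             /* process one word */

        fclose(fp);
        return 0;
    }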

If your definition of a word is more complex than the one supported by scanf(), then reading lines and splitting them is probably easier.

+5

As far as the splitting is concerned, there is no performance difference: you split on spaces in one case and on newlines in the other.

However, reading word by word would require you to allocate buffers M times, whereas with lines it would be N times, where M > N. So if you take the word-splitting approach, first try to compute the total memory requirement, allocate that whole block at once (so you do not end up with M fragmented allocations), and then carve the M buffers out of it. Note that the same approach can be used with line splitting, but the difference will be less noticeable.
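
A minimal sketch of the single-allocation idea; the sizes are purely illustrative (in a real program M and the maximum word length would come from the counting step):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        const size_t nwords = 1000;  /* illustrative: the computed word count M */
        const size_t maxlen = 32;    /* illustrative: longest word + '\0' */

        char *pool = malloc(nwords * maxlen);          /* one big block */
        char **words = malloc(nwords * sizeof *words); /* M buffer pointers */
        if (pool == NULL || words == NULL)
        {
            free(pool);
            free(words);
            return 1;
        }

        for (size_t i = 0; i < nwords; i++)
            words[i] = pool + i * maxlen;   /* carve M buffers from the block */

        strcpy(words[0], "example");        /* each slot works like a buffer */
        printf("%s\n", words[0]);

        free(words);
        free(pool);                         /* one free releases all M buffers */
        return 0;
    }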

+2

That's right: you should read the file into a buffer and then split it into whatever you define as "words". The only time that would not be worthwhile is if you can get fscanf() to correctly grab what you consider a word (doubtful).
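
A minimal sketch of the read-a-line-then-split approach, assuming words are separated by spaces, tabs, or newlines (the file name is illustrative):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        FILE *fp = fopen("input.txt", "r");   /* illustrative file name */
        if (fp == NULL)
        {
            perror("fopen");
            return 1;
        }

        char line[1024];
        while (fgets(line, sizeof line, fp) != NULL)
        {
            /* strtok() splits the line in place on the given delimiters */
            for (char *word = strtok(line, " \t\n"); word != NULL;
                 word = strtok(NULL, " \t\n"))
                printf("%s\n", word);
        }

        fclose(fp);
        return 0;
    }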

+1

The main performance bottlenecks will be:

  • Any call to a stdio file I/O function. The fewer calls, the less overhead.
  • Dynamic memory allocation. It should be done as rarely as possible; among other things, many malloc calls will cause heap fragmentation.

So this boils down to the classic programming trade-off: you can have fast execution or low memory usage, but not both. You can, however, find a middle ground that is reasonably effective in terms of both execution time and memory consumption.

At one extreme, the fastest execution is obtained by reading the entire file as one big chunk into dynamically allocated memory. At the other extreme, you can read it byte by byte and evaluate it as you read, which may make the program slower but uses no dynamic memory at all.

Optimizing such code requires fundamental knowledge of various CPU-specific and OS-specific details. Issues such as alignment, cache behavior, the efficiency of the underlying API functions, and so on, all make a difference.

Why not try a few different ways and compare them?

0

This doesn't really answer your exact question (words versus lines), but if you need all the words in memory anyway, then the most efficient approach is the following (sketched in code after the list):

  • determine the file size
  • allocate a buffer for the entire file plus one byte
  • read the entire file into the buffer and put '\0' in the extra byte
  • make a pass over the buffer and count how many words it contains
  • allocate an index array of char* (pointers to words) or int (indexes into the buffer) whose size matches the word count
  • make a second pass over the buffer, store the address or index of the first letter of each word in the index array, and overwrite the delimiter bytes in the buffer with '\0' (end-of-string markers)
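
A minimal sketch of those steps, assuming a "word" is any run of non-whitespace; the file name is illustrative and error handling is kept short:

    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        FILE *fp = fopen("input.txt", "rb");   /* illustrative file name */
        if (fp == NULL)
            return 1;

        fseek(fp, 0, SEEK_END);                /* determine the file size */
        long size = ftell(fp);
        rewind(fp);

        char *buf = malloc(size + 1);          /* entire file plus one byte */
        if (buf == NULL)
            return 1;
        fread(buf, 1, size, fp);
        fclose(fp);
        buf[size] = '\0';                      /* '\0' in the extra byte */

        size_t nwords = 0;                     /* first pass: count words */
        int inword = 0;
        for (long i = 0; i < size; i++)
        {
            if (isspace((unsigned char)buf[i]))
                inword = 0;
            else if (!inword)
            {
                nwords++;
                inword = 1;
            }
        }

        char **words = malloc(nwords * sizeof *words);  /* index array */
        if (words == NULL)
            return 1;

        size_t w = 0;                 /* second pass: index the words */
        inword = 0;
        for (long i = 0; i < size; i++)
        {
            if (isspace((unsigned char)buf[i]))
            {
                buf[i] = '\0';        /* end-of-string marker */
                inword = 0;
            }
            else if (!inword)
            {
                words[w++] = &buf[i]; /* address of the first letter */
                inword = 1;
            }
        }

        for (size_t j = 0; j < nwords; j++)
            puts(words[j]);

        free(words);
        free(buf);
        return 0;
    }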

If you have memory to spare, it is probably slightly faster to skip the counting pass and just assume the worst case for the number of words: (filesize + 1) / 2 (one-letter words with single spaces between them, and an odd number of bytes in the file). Alternatively, taking the Java ArrayList or Qt QVector approach for the index array and using realloc() to double its capacity whenever the word count exceeds it is also quite efficient (because doubling means exponential growth, reallocation will not happen many times).
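
A minimal sketch of the doubling approach, splitting an illustrative in-memory string with strtok(); the helper name and initial capacity are hypothetical:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical helper: append a word pointer to the index array,
       doubling its capacity with realloc() when it is full. */
    static char **append_word(char **words, size_t *count, size_t *cap, char *start)
    {
        if (*count == *cap)
        {
            size_t newcap = *cap ? *cap * 2 : 16;   /* exponential growth */
            char **tmp = realloc(words, newcap * sizeof *tmp);
            if (tmp == NULL)
            {
                free(words);
                exit(1);              /* keep the sketch simple */
            }
            words = tmp;
            *cap = newcap;
        }
        words[(*count)++] = start;
        return words;
    }

    int main(void)
    {
        static char text[] = "a few sample words";  /* illustrative input */
        char **words = NULL;
        size_t count = 0, cap = 0;

        for (char *tok = strtok(text, " "); tok; tok = strtok(NULL, " "))
            words = append_word(words, &count, &cap, tok);

        printf("%zu words indexed\n", count);       /* prints: 4 words indexed */
        free(words);
        return 0;
    }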

0
