How can I index a file efficiently?

I am dealing with an application that must randomly read individual lines of text from a set of potentially large text files (~3+ GB).

Lines can have different lengths.

To reduce GC pressure and avoid creating unnecessary strings, I use the solution provided at: Is there a better way to determine the number of lines in a large txt file (1-2 GB)? to detect each newline and save its position in a single pass, thus building a lineNo => position index, that is:

    // maps each line to its corresponding fileStream.Position in the file
    List<int> _lineNumberToFileStreamPositionMapping = new List<int>();
  • scan through the whole file
  • whenever a newline is detected, increment lineCount and add fileStream.Position to _lineNumberToFileStreamPositionMapping
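The two steps above can be sketched as follows (a minimal illustration only; the class name and 64 KB buffer are made up here, and it assumes '\n' line endings). One change from the question: positions are stored as `long` rather than `int`, since Int32 overflows for files beyond ~2.1 GB:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

class LineIndexer
{
    // One-pass scan: record the offset where each line starts. Positions are
    // long, not int, because files over ~2.1 GB would overflow Int32.
    public static List<long> BuildLineIndex(Stream stream)
    {
        var index = new List<long> { 0 };     // line 0 starts at offset 0
        var buffer = new byte[64 * 1024];     // buffer size is arbitrary
        long filePosition = 0;
        int bytesRead;

        while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < bytesRead; i++)
            {
                if (buffer[i] == (byte)'\n')
                    index.Add(filePosition + i + 1); // next line starts after '\n'
            }
            filePosition += bytesRead;
        }
        // Note: if the file ends with '\n', there is one extra end-of-file entry.
        return index;
    }

    static void Main()
    {
        var idx = BuildLineIndex(new MemoryStream(Encoding.UTF8.GetBytes("a\nbb\nccc\n")));
        Console.WriteLine(string.Join(",", idx)); // 0,2,5,9
    }
}
```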

Then we use an API similar to:

    public void ReadLine(int lineNumber)
    {
        var getStreamPosition = _lineNumberToFileStreamPositionMapping[lineNumber];
        // ... set the stream position, read the byte array, convert to string, etc.
    }
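The elided "set the stream position, read, convert to string" step might look like this (a sketch only, assuming UTF-8 text and '\n' line endings; the class and method names are illustrative, and the position would come from the index above):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

class IndexedReader
{
    // Seek to the indexed position, read raw bytes up to the next '\n',
    // and decode them as UTF-8. Collecting bytes (not chars) keeps
    // multi-byte UTF-8 sequences intact.
    public static string ReadLineAt(Stream stream, long position)
    {
        stream.Position = position;
        var bytes = new List<byte>();
        int b;
        while ((b = stream.ReadByte()) != -1 && b != '\n')
        {
            if (b != '\r') bytes.Add((byte)b); // tolerate CRLF endings
        }
        return Encoding.UTF8.GetString(bytes.ToArray());
    }

    static void Main()
    {
        var ms = new MemoryStream(Encoding.UTF8.GetBytes("first\nsecond\nthird\n"));
        Console.WriteLine(ReadLineAt(ms, 6)); // second
    }
}
```

In a real implementation you would read a buffer at a time rather than byte-by-byte, but the structure is the same.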

This solution currently provides good performance, but there are two things I don't like about it:

  • Since I don’t know the total number of lines in the file in advance, I cannot pre-allocate an array, so I have to use a List<int>, which has the potential inefficiency of resizing: its capacity doubles, potentially to twice what I really need;
  • Memory usage: for a sample text file of ~1 GB with ~5 million lines of text, the index takes ~150 MB. I would really like to reduce this as much as possible.
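The doubling behavior behind the first concern can be observed directly (a small demo; the exact growth sequence is an implementation detail of .NET's List<T>, not a documented contract):

```csharp
using System;
using System.Collections.Generic;

class ListGrowthDemo
{
    static void Main()
    {
        // List<T> doubles its backing array when full, so when the final
        // count is unknown, up to ~2x the needed memory can be allocated.
        var list = new List<int>();
        int lastCapacity = -1;
        for (int i = 0; i < 5_000; i++)
        {
            list.Add(i);
            if (list.Capacity != lastCapacity)
            {
                Console.WriteLine(list.Capacity); // prints each growth step
                lastCapacity = list.Capacity;
            }
        }
    }
}
```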

Any ideas are greatly appreciated.

1 answer
  • Use List.Capacity to manually increase capacity, perhaps every 1000 lines or so.

  • If you want to trade performance for memory, you can do this: instead of storing the position of every line, store only the position of every 100th (or so) line. Then, when for example line 253 is requested, jump to the stored position of line 200 and read forward 53 lines.
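That second suggestion might be sketched like this (illustrative names; it assumes '\n' line endings and, for brevity, single-byte characters when collecting the requested line):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

class SparseIndexReader
{
    // Jump to the nearest indexed line at or before lineNumber, then skip
    // forward the remaining lines: trades extra reading for an index that
    // is 1/stride the size of a full per-line index.
    public static string ReadLine(Stream stream, List<long> sparseIndex,
                                  int stride, int lineNumber)
    {
        stream.Position = sparseIndex[lineNumber / stride];
        int linesToSkip = lineNumber % stride;
        var sb = new StringBuilder();
        int b;
        while ((b = stream.ReadByte()) != -1)
        {
            if (b == '\n')
            {
                if (linesToSkip == 0) break; // end of the requested line
                linesToSkip--;
            }
            else if (linesToSkip == 0 && b != '\r')
            {
                sb.Append((char)b); // single-byte assumption for brevity
            }
        }
        return sb.ToString();
    }

    static void Main()
    {
        // Five lines; with stride 2, only lines 0, 2 and 4 are indexed.
        var ms = new MemoryStream(Encoding.UTF8.GetBytes("a\nbb\nccc\ndddd\neeeee\n"));
        var sparseIndex = new List<long> { 0, 5, 14 };
        Console.WriteLine(ReadLine(ms, sparseIndex, 2, 3)); // dddd
    }
}
```

With a stride of 100 on the ~5 million line sample, the index shrinks from ~5 million entries to ~50,000, at the cost of reading up to 99 extra lines per lookup.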
