I am dealing with an application that needs to randomly read entire lines of text from a series of potentially large text files (~3+ GB).
Lines can have different lengths.
To reduce GC pressure and avoid creating unnecessary strings, I use the solution suggested in: Is there a better way to determine the number of lines in a large txt file (1-2 GB)? to detect each new line and record its position in a single pass, thereby building a lineNo => position index, that is:
```csharp
// maps each line to its corresponding fileStream.Position in the file
List<int> _lineNumberToFileStreamPositionMapping = new List<int>();
```
- scan the whole file;
- when a new line is detected, increment lineCount and add fileStream.Position to _lineNumberToFileStreamPositionMapping.
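The one-pass build described above might be sketched like this (names and structure are my assumptions, not the exact code from the question). Note that I store positions as `long` here: a `List<int>` cannot represent offsets past 2 GB, which matters for 3+ GB files.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class LineIndexer
{
    // Scans the stream once and records the position at which each line starts:
    // index[n] is the stream position of the first byte of line n.
    public static List<long> BuildIndex(Stream stream)
    {
        var index = new List<long> { 0 }; // line 0 starts at position 0
        int b;
        while ((b = stream.ReadByte()) != -1)
        {
            if (b == '\n')
                index.Add(stream.Position); // the next line starts right after '\n'
        }
        return index;
    }
}
```

A real implementation would read through a buffer rather than byte by byte; `ReadByte` is used here only to keep the sketch short.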
The lines are then read through an API similar to:

```csharp
public string ReadLine(int lineNumber)
{
    var streamPosition = _lineNumberToFileStreamPositionMapping[lineNumber];
    // ...seek to streamPosition and read the line from there
}
```
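Filled out, such an API could look roughly like the sketch below. The class name, constructor, and `leaveOpen` reader setup are my assumptions; the question only shows the index lookup.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

class IndexedLineReader
{
    private readonly Stream _fileStream;
    private readonly List<long> _lineNumberToFileStreamPositionMapping;

    public IndexedLineReader(Stream stream, List<long> index)
    {
        _fileStream = stream;
        _lineNumberToFileStreamPositionMapping = index;
    }

    public string ReadLine(int lineNumber)
    {
        long position = _lineNumberToFileStreamPositionMapping[lineNumber];
        _fileStream.Seek(position, SeekOrigin.Begin);
        // A fresh reader after each Seek avoids serving stale buffered data;
        // leaveOpen keeps the underlying stream alive for the next call.
        using var reader = new StreamReader(_fileStream, Encoding.UTF8,
            detectEncodingFromByteOrderMarks: false, bufferSize: 4096, leaveOpen: true);
        return reader.ReadLine();
    }
}
```

For many random reads per second, pooling or resetting the `StreamReader` instead of re-creating it would cut allocations further.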
This solution currently provides good performance, but there are two things I don't like about it:
- Since I don't know the total number of lines in the file in advance, I cannot pre-allocate an array, so I have to use a List<int>, which has the potential inefficiency of resizing to double what I really need;
- Memory usage: for a sample text file of ~1 GB with ~5 million lines of text, the index takes ~150 MB. I would really like to reduce this as much as possible.
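The resizing concern can be seen directly: `List<T>` doubles its capacity each time it fills up, so right after a growth step nearly half the backing array is unused slack. A small illustrative sketch (the method name is mine):

```csharp
using System;
using System.Collections.Generic;

static class CapacityDemo
{
    // Fills a List<int> and prints each capacity jump, showing the
    // doubling growth pattern that can waste up to ~2x the needed memory.
    public static List<int> Fill(int count)
    {
        var list = new List<int>();
        int lastCapacity = -1;
        for (int i = 0; i < count; i++)
        {
            list.Add(i);
            if (list.Capacity != lastCapacity)
            {
                Console.WriteLine($"Count={list.Count,9}  Capacity={list.Capacity,9}");
                lastCapacity = list.Capacity;
            }
        }
        return list;
    }
}
```

Calling `list.TrimExcess()` once the scan is finished reallocates the backing array down to `Count` elements (it is a no-op only when the slack is already under 10%), which removes the post-scan waste even though the peak during growth remains.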
Any ideas are greatly appreciated.
MaYaN