Read a random line from a large text file

I have a file with 5000 + lines. I want to find the most efficient way to select one of these lines every time I run my program. I originally intended to use a random method to select one (this was before I realized that there were 5,000 rows). Thought it might be inefficient, so I thought I'd look at reading the first line, and then remove it from the top and add it to the bottom. But it seems to me that I should read the whole file and create a new file for deletion from above.

What is the most efficient way: a random method or a new file method?

The program will run every 5 minutes and I use C # 4.5

+6
source share
4 answers

In .NET 4. *, you can directly access a single line in a file. For example, to get the string X:

string line = File.ReadLines(FileName).Skip(X).First(); 

Full example:

 var fileName = @"C:\text.txt" var file = File.ReadLines(fileName).ToList(); int count = file.Count(); Random rnd = new Random(); int skip = rnd.Next(0, count); string line = file.Skip(skip).First(); Console.WriteLine(line); 
+2
source

Suppose the file is so large that you cannot afford to put it in RAM. Then you want to use Reservoir Sampling - an algorithm designed to randomly select randomness from lists of unknown arbitrary length that may not fit into memory:

 Random r = new Random(); int currentLine = 1; string pick = null; foreach (string line in File.ReadLines(filename)) { if (r.Next(currentLine) == 0) { pick = line; } ++currentLine; } return pick; 

At a high level, the tank is sampled according to the basic rule: each subsequent line has a 1 / N chance to replace all previous lines.

This algorithm is a bit unintuitive. At a high level, it works if line N has a 1 / N chance of replacing the currently selected row. Thus, line 1 has a 100% chance of being selected, but the probability of 50% will later be replaced by line 2.

I found the understanding of this algorithm most easily in the form of proof of correctness. So, a simple proof by induction:

1) Base register: for verification, the algorithm works if there is 1 line.
2) If the algorithm works for lines N-1, processing N lines works, because:
3) After processing N-1 iterations of a file of line N, all lines of N-1 are equally likely (probability 1 / (N-1)).
4) The next iteration ensures that line N has a probability of 1 / N (because what the algorithm explicitly assigns to it is the last iteration), reducing the probability of all previous lines:

 1/(N-1) * (1-(1/N)) 1/(N-1) * (N/N-(1/N)) 1/(N-1) * (N-1)/N (1*(N-1)) / (N*(N-1)) 1/N 

If you know how many lines are in a file in advance, this algorithm is more expensive than necessary, since it always reads the entire file.

+1
source

I assume the goal is to randomly select one line from a file of 5000 lines.

Try the following:

  • Get the number of rows using File.ReadLines (file) .Count ().
  • Create a random number using the string counter as the upper limit.
  • Do a lazy reading of the file with File.ReadLines (file).
  • Select a row from this array using a random number.

EDIT: as indicated, executing File.ReadLines (file) .toArray () is pretty inefficient.

0
source

Here's a quick implementation of @LucasTrzesniewski suggested the method in the comments on the question:

 // open the file using(FileStream stream = File.OpenRead("yourfile.dat")) { // 1. index all offsets that are the beginning of a line List<Long> lineOffsets = new List<Long>(); lineOffsets.Add(stream.Position); //the very first offset is a beginning of a line! int ch; while((ch = stream.ReadByte()) != -1) // "-1" denotes the end of the file { if(ch == '\n') lineOffsets.Add(stream.Position); } // 2. read a random line stream.Seek(0, SeekOrigin.Begin); // go back to the beginning of the file // set the position of the stream to one the previously saved offsets stream.Position = lineOffsets[new Random().Next(lineOffsets.Count)]; // read the whole line from the specified offset using(StreamReader reader = new StreamReader(stream)) { Console.WriteLine(reader.ReadLine()); } } 

At the moment, I do not have VS near me, so this is not verified.

0
source

All Articles