Choosing a random line from a large file without duplicates

I am trying to select a random line from a large file (millions of lines) without selecting any duplicates. If a duplicate comes up, I want to keep picking until a line that has not been chosen before is found.

What I have so far:

    @already_picked = []

    def random_line
      chosen_line = nil
      chosen_line_number = nil
      # Single pass: line number N replaces the current choice with probability 1/(N+1)
      File.foreach("OSPD4.txt").each_with_index do |line, number|
        if rand < 1.0 / (number + 1)
          chosen_line_number = number
          chosen_line = line
        end
      end
      chosen_line
      if @already_picked.include?(chosen_line_number)
        # what here?
      else
        @already_picked << chosen_line_number
      end
    end

    100.times do |t|
      random_line
    end

I'm not sure what to do in the if clause.

+4
4 answers

A million lines is not very many. At 100 bytes per line, that is about 100 MB in memory. So do the simple thing and move on:

 File.readlines("file").sample(100) 
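
A small sanity check of this approach, as a sketch only: the throwaway file built with Tempfile and the sample size of 5 are assumptions for illustration, not part of the question. Array#sample(n) draws n distinct positions from the array, so no line position comes back twice:

```ruby
require 'tempfile'

# Build a throwaway 20-line file so the example is self-contained.
file = Tempfile.new('lines')
(1..20).each { |i| file.puts("line #{i}") }
file.flush

# Read everything into memory, then draw 5 lines without replacement.
picked = File.readlines(file.path).sample(5)
puts picked

file.close
file.unlink
```

If the file has fewer than n lines, sample(n) simply returns all of them in random order.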

If the file grows beyond what fits comfortably in memory, the next step is to make one pass through the file to record the byte offset at which each line starts, and then seek to randomly chosen offsets to pull out the samples.

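    class RandomLine
      def initialize(fn)
        @file = File.open(fn, 'r')
        # Byte offset of the start of each line; the final entry is EOF, so drop it.
        offsets = @file.each_line.inject([0]) { |m, l| m << m.last + l.bytesize }
        offsets.pop
        @positions = offsets.shuffle
      end

      # Returns a line not returned before, or nil once every line has been used.
      def pick
        pos = @positions.pop
        return nil if pos.nil?
        @file.seek(pos)
        @file.gets
      end
    end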
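
The same "record offsets, then seek" idea can also be sketched as two plain functions. The helper names line_offsets and sample_lines, and the Tempfile demo data, are hypothetical and only serve the illustration; the sketch assumes the file does not change between the indexing pass and the reads:

```ruby
require 'tempfile'

# Hypothetical helper: byte offset at which each line of the file starts.
def line_offsets(path)
  offsets = []
  pos = 0
  File.foreach(path) do |line|
    offsets << pos
    pos += line.bytesize
  end
  offsets
end

# Hypothetical helper: k distinct random lines, read by seeking to sampled offsets.
def sample_lines(path, k)
  File.open(path) do |f|
    line_offsets(path).sample(k).map do |off|
      f.seek(off)
      f.gets
    end
  end
end

# Demo on a throwaway 50-line file.
tmp = Tempfile.new('offsets')
(1..50).each { |i| tmp.puts("word#{i}") }
tmp.flush

lines = sample_lines(tmp.path, 10)
puts lines

tmp.close
tmp.unlink
```

Only the integer offsets are held in memory, one per line, rather than the text of the file.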
+2

Your method reads the whole file every time you request a random line. A better approach is to read the file once and build a table of the offsets at which each line begins (so that you do not have to keep all the data in memory). Assuming the file does not change, you can then pick a random entry in this table, seek to it, and read one line, which is much faster. One possible implementation:

    class RandomLine
      def initialize(filename)
        @file = File.open(filename)
        @table = [0]
        @picked = []
        File.foreach(filename) do |line|
          @table << @table.last + line.bytesize
        end
        @table.pop                     # the last offset is EOF, not a line start
      end

      def pick
        return nil if @table.empty?    # no more lines left
        i = rand(@table.size)          # random index into the table
        @file.seek(@table[i])          # go to the start of that line
        @table.delete_at(i)            # remove it so it is not picked again
        line = @file.readline
        if @picked.include?(line)
          pick                         # same text already returned: pick another line
        else
          @picked << line
          line
        end
      end
    end

Usage:

    random_line = RandomLine.new("OSPD4.txt")
    100.times do
      puts random_line.pick
    end
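
For comparison, the loop in the question is reservoir sampling with a reservoir of size one. A different technique from the offset table above is to widen that reservoir to k slots, which collects k distinct line positions in a single pass while holding only k lines in memory. A minimal sketch, assuming a hypothetical helper name reservoir_sample and made-up demo data:

```ruby
require 'tempfile'

# Hypothetical helper: reservoir sampling (Algorithm R) of k lines in one pass.
def reservoir_sample(path, k)
  reservoir = []
  File.foreach(path).with_index do |line, i|
    if i < k
      reservoir << line              # fill the reservoir with the first k lines
    else
      j = rand(i + 1)                # uniform integer in 0..i
      reservoir[j] = line if j < k   # keep the new line with probability k/(i+1)
    end
  end
  reservoir
end

# Demo on a throwaway 1000-line file.
tmp = Tempfile.new('reservoir')
(1..1000).each { |i| tmp.puts("entry #{i}") }
tmp.flush

sampled = reservoir_sample(tmp.path, 100)
puts sampled.first(5)

tmp.close
tmp.unlink
```

Each line position can occupy at most one slot, so no position is returned twice. The chosen set is uniformly random, but the order inside the reservoir is not; shuffle the result if the order matters.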
+1

While it is tempting to go to such lengths to avoid reading the file into memory, a million lines is not that many. An alternative is to try the simple solution first and only move to something more complex if it actually turns out to be slow in practice.

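    class RandomLine
      def initialize(fn)
        # Read every line once and shuffle; @i tracks the last line handed out.
        File.open(fn, 'r') { |f| @i, @lines = -1, f.readlines.shuffle }
      end

      # Returns the next shuffled line, or nil when all lines have been returned.
      def pick
        @lines[@i += 1]
      end
    end

    q = o = RandomLine.new('/etc/hosts')
    puts q while q = o.pick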
+1

Since File.readlines returns an array of strings, you can just use the Array#sample method.

 File.readlines("OSPD4.txt").sample(100).map{|line| line.chomp } # using chomp to get rid of EOL 
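
Note that sample draws distinct positions, not distinct text, so a file containing repeated lines can still yield the same string twice. If "duplicates" in the question also covers repeated text (an assumption), one option is to deduplicate before sampling. A minimal sketch with made-up data; the chomp: true keyword needs Ruby 2.4 or later:

```ruby
require 'tempfile'

# Throwaway file with repeated lines.
tmp = Tempfile.new('dupes')
tmp.puts(%w[apple banana apple cherry banana apple])
tmp.flush

# Strip line endings while reading, drop repeated text, then sample.
words = File.readlines(tmp.path, chomp: true).uniq.sample(3)
p words

tmp.close
tmp.unlink
```

Here only three distinct words exist, so all three come back, in random order.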
+1
