How can I get exactly n random lines from a file with Perl?

Following up on this question, I need to get exactly n lines at random from a file (or stdin). The output will look like head or tail, except that I want some of the lines to come from the middle.

Now, setting aside looping through the whole file as in the solutions to the related question, what's the best way to get exactly n lines in one pass?

For reference, I tried this:

 #!/usr/bin/perl -w
 use strict;

 my $ratio = shift;
 print $ratio, "\n";

 while (<>) {
     print if ((int rand $ratio) == 1);
 }

where $ratio is roughly one-in-how-many lines I want. For example, if I want 1 out of every 10 lines:

 random_select 10 a.list 

However, this does not give me the exact amount:

 aaa> foreach i ( 0 1 2 3 4 5 6 7 8 9 )
 foreach? random_select 10 a.list | wc -l
 foreach? end
 4739
 4865
 4739
 4889
 4934
 4809
 4712
 4842
 4814
 4817

Another thought I had was to read the whole input file into an array and then select n elements from it at random, but that is a problem if I have a really large file.

Any ideas?

Edit: This is an exact duplicate of this question.

+6
perl random-sample
7 answers

Here is a good one-pass algorithm that I just came up with, with O(N) time complexity and O(M) space complexity, for selecting M random lines from an N-line file.

Suppose that M <= N.

  • Let S be the set of selected lines. Initialize S to the first M lines of the file. If the order of the final result matters, shuffle S now.
  • Read the next line l. Suppose we have now read n lines in total (n = M + 1 the first time through). The probability that we want to keep l as one of our final lines is therefore M/n.
  • Accept l with probability M/n, using the RNG to decide whether to accept or reject it.
  • If l is accepted, pick one of the lines in S uniformly at random and replace it with l.
  • Repeat steps 2-4 until the file runs out of lines, incrementing n for each new line.
  • Return the set S of selected lines.
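The steps above map directly onto a short Perl subroutine. This is a sketch in my own words; the sub name `reservoir_sample` and its list-based interface are mine, not from the answer:

```perl
use strict;
use warnings;

# Keep exactly $m of @lines, each with equal probability, in one pass.
sub reservoir_sample {
    my ($m, @lines) = @_;
    my @sample;                 # S: the currently selected lines
    my $n = 0;                  # lines read so far
    for my $line (@lines) {
        $n++;
        if (@sample < $m) {
            push @sample, $line;             # step 1: keep the first M lines
        }
        elsif (rand($n) < $m) {              # steps 2-3: accept with probability M/n
            $sample[ int rand $m ] = $line;  # step 4: overwrite a random kept line
        }
    }
    return @sample;                          # step 6: the final selection
}
```

Reading from a filehandle instead of an array is a one-line change (`while (my $line = <$fh>)`); the point is that memory use stays at M lines no matter how big the input is.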
+4

This takes a single command-line argument: the number of lines you want, N. The first N lines are kept, since there may be no more to see. After that, you randomly decide whether to take the current line; and if you do, you randomly decide which line in the current N-line list to overwrite.

 #!/usr/bin/perl
 my $bufsize = shift;
 my @list = ();

 srand();
 while (<>) {
     push(@list, $_), next if (@list < $bufsize);
     $list[ rand(@list) ] = $_ if (rand($. / $bufsize) < 1);
 }
 print foreach @list;
+2

Possible Solution:

  • scan once to count the number of lines
  • pick n line numbers at random
  • scan again, keeping only the selected lines
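A minimal sketch of those three steps, assuming the input is a regular (seekable) file and n is at most the line count; the helper name `two_pass_sample` is my own:

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Pass 1: count the lines and pick $n line numbers at random.
# Pass 2: rescan the file and keep only the chosen lines.
sub two_pass_sample {
    my ($file, $n) = @_;
    open my $fh, '<', $file or die "Can't open $file: $!";

    my $total = 0;
    $total++ while <$fh>;                     # pass 1: count

    my %want = map { $_ => 1 } (shuffle 1 .. $total)[0 .. $n - 1];

    seek $fh, 0, 0 or die "Can't rewind: $!";
    $. = 0;                                   # reset the line counter
    my @picked;
    while (<$fh>) {                           # pass 2: collect
        push @picked, $_ if $want{$.};
    }
    close $fh;
    return @picked;
}
```

This reads the file twice but never holds more than n lines in memory, and unlike the seek-to-a-random-offset trick it gives every line the same probability regardless of line length.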
+1
 $n = shift;       # number of lines to keep
 @result = ();
 $k = 0;
 while (<>) {
     $k++;
     if (scalar @result < $n) {
         push @result, $_;
     } else {
         if (rand() <= $n/$k) {
             $result[int rand $n] = $_;
         }
     }
 }
 print for @result;
+1

There is no need to know the actual number of lines in the file. Just seek to a random byte position and keep the next line. (The current line will most likely be a partial one.)

This approach should be very fast for large files, but it will not work for STDIN. Heck, short of caching the entire file in memory, nothing will work for STDIN. So if you must support STDIN, I don't see how you can be fast/cheap for large files.

You could detect STDIN and switch to a cached approach, and otherwise stay fast.

 #!/usr/bin/perl
 use strict;

 my $file = 'file.txt';
 my $count = shift || 10;
 my $size = -s $file;

 open(FILE, $file) || die "Can't open $file\n";

 while ($count--) {
     seek(FILE, int(rand($size)), 0);
     $_ = readline(FILE);                       # ignore the (probably partial) line
     redo unless defined($_ = readline(FILE));  # catch EOF
     print $_;
 }
+1

In pseudo code:

 use List::Util qw[shuffle];

 # read and shuffle the whole file
 @list = shuffle(<>);

 # take the first 'n' from the list
 splice(@list, ...);

This is the most trivial implementation, but it requires reading the entire file first, which means you need enough memory to hold it all.
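Filled out into a runnable form; the splice arguments below are my guess at what the pseudocode intended (drop everything after the first n):

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Shuffle all input lines, then keep the first $n of them.
# Simple, but holds the entire input in memory at once.
sub shuffled_head {
    my ($n, @lines) = @_;
    my @list = shuffle(@lines);       # shuffle the whole input
    splice(@list, $n) if @list > $n;  # take only the first 'n' from the list
    return @list;
}
```

Called as a script, the body would just be `my @list = shuffle(<>); splice(@list, $n) if @list > $n; print @list;` with $n taken from @ARGV.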

0

Here is some verbose Perl code that should work with large files.

The heart of this code is that it does not keep the entire file in memory, only the offset of each line.

Use tell to record the offsets, then seek back to the chosen positions to retrieve the lines.

Proper specification of the target file and the number of lines to get is left as an exercise for those less lazy than me; those problems have been solved plenty of times already.

 #!/usr/bin/perl
 use strict;
 use warnings;
 use List::Util qw(shuffle);

 my $GET_LINES = 10;
 my @line_starts;

 open( my $fh, '<', 'big_text_file' ) or die "Oh, fudge: $!\n";

 do { push @line_starts, tell $fh } while ( <$fh> );
 pop @line_starts;    # the last tell is the EOF offset, not a line start

 my $count = @line_starts;
 print "Got $count lines\n";

 my @shuffled_starts = (shuffle @line_starts)[0..$GET_LINES-1];

 for my $start ( @shuffled_starts ) {
     seek $fh, $start, 0 or die "Unable to seek to line - $!\n";
     print scalar <$fh>;
 }
0
