Tips for reading a 50 GB file (and splitting it into 15K files)

I have a huge file (almost 50 GB; it is just an ASCII matrix of 360K lines, each with 15K numbers) that I need to transpose. To avoid reading everything into memory, I wrote a Perl script that opens 15K output files (one per matrix column), then reads the input file line by line and appends each number to the end of its corresponding file (the first number of each line goes to column0.txt, the second to column1.txt, and so on).

Everything looked promising: the code uses only a constant 178 MB of RAM, and initial tests on just part of the input file worked fine: it processed 3,600 lines in about a minute, so I hoped the whole job would be done in roughly two hours. But when I run it on the real file, the code keeps stalling. For example, at the beginning it processed ~4,600 lines very quickly, then paused for quite a long time (maybe 5-10 minutes) before continuing. Right now, after ~10 hours of computation, it has processed 131K lines, and it pauses for two to three minutes after every 300-400 lines.

I have never worked with input files this large or with this many open files, so I am not sure whether the problem lies with the input or with the number of file descriptors. Any advice on how to diagnose (and hopefully solve) the speed problem? I include the relevant part of the program below.

thanks

====================================

    for ($i = 0; $i < $columnas; $i++) {
        $column[$i] = IO::File->new(">column$i.txt") or die $!;
    }

    while (<DATA>) {
        chomp;
        $cols = split;
        for ($col = 0; $col < $cols; $col++) {
            print { $column[$col] } "$_[$col] ";
        }
    }

    close(DATA) or die $!;
3 answers

Check /proc/sys/fs/file-max to see the maximum number of open files.
You may need to read the files using seek so that you can control the number of files open at any one time.
Your best bet would be to cache x lines and then append them to all the files in one go; a rough sketch of that idea is below.
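
For illustration, a minimal, untested sketch of that caching approach (the 1,000-line batch size is arbitrary, and the DATA handle and column$i.txt filenames are assumed to match the original script):

    use strict;
    use warnings;

    my $batch_size = 1000;   # append to disk every 1000 input lines
    my @cache;               # $cache[$col] accumulates text for column $col
    my $cached = 0;

    while (my $line = <DATA>) {
        my @fields = split ' ', $line;
        $cache[$_] .= "$fields[$_] " for 0 .. $#fields;

        if (++$cached == $batch_size) {
            append_cache(\@cache);
            @cache  = ();
            $cached = 0;
        }
    }
    append_cache(\@cache) if $cached;   # write out whatever is left

    # Open each column file only while appending to it, so at most
    # one output file is open at a time.
    sub append_cache {
        my ($cache) = @_;
        for my $col (0 .. $#$cache) {
            open my $fh, '>>', "column$col.txt" or die $!;
            print $fh $cache->[$col];
            close $fh or die $!;
        }
    }

Whether this is actually faster depends on how well the filesystem copes with 15K files being appended to in turn, so it is worth timing on a small batch first.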


Some thoughts

1. Implicit split to @_

 $cols = split; 

This gives the warning:

 Use of implicit split to @_ is deprecated 

If you have not already done so, you should add

 use warnings; use strict; 

to your script. (And heed these warnings.)

Consider changing $cols to @cols and using $#cols in the for loop instead. For example:

    @cols = split;
    for (my $col = 0; $col <= $#cols; $col++)

2. Is chomp required?

From the split() entry in perlfunc:

If PATTERN is also omitted, it splits on whitespace (after skipping any leading whitespace).

This means that your newline is also stripped, since it counts as whitespace.

Therefore, chomp() is not required.
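
A small, illustrative test (not from the perlfunc page) makes the point:

    my $line = "1 2 3\n";
    my @cols = split ' ', $line;          # equivalent to the implicit whitespace split
    print scalar(@cols), "\n";            # prints 3; the trailing newline adds no field
    print "last field: '$cols[-1]'\n";    # prints: last field: '3' (no newline attached)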

3. The number of open files

I believe Perl's open() is fairly fast, so it may be worth caching your data as weismat suggests. While you are at it, you can share a single filehandle for all the files and open each one only when printing the cache. For example:

    for ($i = 0; $i <= $#column; $i++) {
        open OUT, ">> column$i.txt" or die $!;
        print OUT $column[$i];
    }

ETA: here @column holds the columns transposed from DATA, so instead of printing directly, use:

 $column[$col] .= $cols[$col] . " "; 

Given that you are seeing odd behaviour, it might be a good idea to check that your prints actually succeed:

 print { $column[$col] } "$_[$col] " or die "Error printing column $col: $! "; 

It could also be worth flushing every 500 lines or so. Add use IO::Handle; and, after the print:

    if ( $. % 500 == 0 ) {
        $column[$col]->flush() or die "Flush of column $col failed: $! ";
    }
