Reordering data from multiple data files

I have 40,000 data files. Each file contains 1445 rows of floating-point numbers in a single column. Now I need to rearrange the data into a different order.

The first number from each data file must be collected and dumped into a new file (say, abc1.dat). This particular file (abc1.dat) will contain 40,000 numbers.

And the second number from each data file should be extracted and dumped into another new file (say, abc2.dat). This new file will also contain 40,000 numbers; at this point only two numbers have been taken from each data file.

At the end of this operation, I assume I will have 1445 files (abc1.dat, abc2.dat, ..., abc1445.dat), each containing 40,000 numbers.

How can this be achieved? (Using Linux Ubuntu 11.10 - 64 bit)

Appreciate any help. Thanks.

+4
8 answers

40,000 * 1445 is only about 58 million values, which should fit into memory. So in Perl (untested):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @nums;

    # Reading: one row per input file, one entry per line.
    for my $file (0 .. 39_999) {
        open my $IN, '<', "file-$file" or die $!;
        while (<$IN>) {
            chomp;
            $nums[$file][$. - 1] = $_;
        }
    }

    # Writing: one output file per original line number.
    for my $line (0 .. 1444) {
        open my $OUT, '>', "abc$line.dat" or die $!;
        for my $file (0 .. 39_999) {
            print $OUT $nums[$file][$line], "\n";
        }
    }
+5

If you can open all 1445 output files at once, it's pretty simple:

    paths = ['abc{}.dat'.format(i) for i in range(1445)]
    files = [open(path, 'w') for path in paths]
    for inpath in ('input{}.dat'.format(i) for i in range(40000)):
        with open(inpath, 'r') as infile:
            for linenum, line in enumerate(infile):
                files[linenum].write(line)
    for f in files:
        f.close()

If you can put everything in memory (it looks like it should be about 0.5-5.0 GB of data, which may be fine on a 64-bit machine with 8 GB of RAM...), you can do it like this:

    data = [[] for _ in range(1445)]
    for inpath in ('input{}.dat'.format(i) for i in range(40000)):
        with open(inpath, 'r') as infile:
            for linenum, line in enumerate(infile):
                data[linenum].append(line)
    for i, contents in enumerate(data):
        with open('abc{}.dat'.format(i), 'w') as outfile:
            outfile.write(''.join(contents))

If neither fits, you may need some kind of hybrid. For example, if you can only open 250 output files at once, do 6 batches, skipping the first batchnum * 250 lines of each infile on every pass, as sketched below.
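A minimal sketch of that batch idea, reusing the hypothetical input0.dat ... input39999.dat names from the snippets above (itertools.islice does the skipping):

    import itertools

    BATCH = 250  # how many output files we can keep open at once
    for batchnum in range(6):
        start = batchnum * BATCH
        stop = min(start + BATCH, 1445)
        files = [open('abc{}.dat'.format(i), 'w') for i in range(start, stop)]
        for inpath in ('input{}.dat'.format(i) for i in range(40000)):
            with open(inpath, 'r') as infile:
                # skip the lines that belong to other batches
                for outfile, line in zip(files, itertools.islice(infile, start, stop)):
                    outfile.write(line)
        for f in files:
            f.close()

The obvious cost is that every input file gets re-read once per batch, i.e. six times in total.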

If the batch solution is too slow, record infile.tell() for each input file at the end of every batch, and when you come back to that file, use infile.seek() to jump straight there. Something like this:

    seekpoints = [0 for _ in range(40000)]
    for batch in range(6):
        start = batch * 250
        stop = min(start + 250, 1445)
        paths = ['abc{}.dat'.format(i) for i in range(start, stop)]
        files = [open(path, 'w') for path in paths]
        for infilenum, inpath in enumerate('input{}.dat'.format(i) for i in range(40000)):
            with open(inpath, 'r') as infile:
                infile.seek(seekpoints[infilenum])
                # read exactly this batch's share of lines, so tell() stays accurate
                for outfile in files:
                    outfile.write(infile.readline())
                seekpoints[infilenum] = infile.tell()
        for f in files:
            f.close()
+3

You can get away with a one-liner, as follows:

 perl -nwe 'open my $fh, ">>", "abc${.}.dat" or die $!; print $fh $_; close ARGV if eof;' input*.dat 

It opens an output file in append mode for every line of input, named after the current line number of the input file. At the end of each input file we explicitly close the ARGV handle so that the line-number variable $. is reset for the next file.

You can control the order of the input files through your shell glob or within perl if you want. I went with a plain glob, since you did not indicate that the numbers have to be in a specific order.

As for efficiency, I do not think opening a new file for every line will take very long, since perl handles file operations quite quickly.

Note that you do not need to close the output file handle, as it is closed automatically when it goes out of scope. Also note that your input files are not modified.

+2

bash:

 cat file1 file2 ... file40000 | split -nr/1445 -d - outputprefix 

Assuming all files have exactly 1445 lines, the output is written to outputprefix0000, outputprefix0001, ..., outputprefix1444.
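If you then want the abcN.dat names from the question, you can rename the pieces afterwards; a hypothetical sketch in Python, assuming the four-digit suffixes shown above:

    import os

    # map split's round-robin pieces onto the abc1.dat ... abc1445.dat naming
    for i in range(1445):
        os.rename('outputprefix%04d' % i, 'abc%d.dat' % (i + 1))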

A bit slow but it works :)

+2

Once the test files were created, this took about 4 minutes to run and used 3.6 GB of RAM on my laptop. If your computer has 8 GB of RAM, that should not be a problem.

    #!/usr/bin/env python2.7
    import random

    NUMFILES = 40000
    NUMLINES = 1445

    # create test files
    for i in range(1, NUMFILES + 1):
        with open('abc%s.dat' % i, 'w') as f:
            for j in range(NUMLINES):
                f.write('%f\n' % random.random())

    data = []

    # load all data into memory
    for i in range(1, NUMFILES + 1):
        print i
        with open('abc%s.dat' % i) as f:
            lines = f.readlines()
            data.append(lines)

    # write it back out
    for j in range(len(data[0])):
        with open('new_abc%s.dat' % (j + 1), 'w') as f:
            for i in range(len(data)):
                f.write(data[i][j])

I've kept everything as strings to avoid any loss of precision from parsing the floating-point numbers and re-serializing them.


Do you need something faster and less resource-intensive that you can run regularly, or is this a one-time conversion?

+1

Just for completeness, since the question carries the [fortran] tag, a late example in Fortran. It opens the files one at a time and keeps all the data in memory.

    program copy
      implicit none
      character(1024) :: filename
      integer :: i, unit, infiles, outfiles
      parameter (infiles = 40000, outfiles = 1445)
      real :: data(infiles, outfiles)

      do i = 1, infiles
        write(filename, '("path/to/file", I0, ".dat")') i
        open(newunit = unit, file = filename, action = 'read')
        read(unit, *) data(i,:)
        close(unit)
      enddo

      do i = 1, outfiles
        write(filename, '("path/to/abc", I0, ".dat")') i
        open(newunit = unit, file = filename, action = 'write')
        write(unit, '(G0)') data(:,i)
        close(unit)
      enddo
    end program

Note: it will probably be rather slow.

+1

In awk, this is very simple:

    awk '{ print >> ("abc" FNR ".dat") }' files*

I'm not sure every awk will be able to handle 1445 open output file descriptors at once, though.

0

The following works on Solaris:

 nawk '{x="abc"FNR".txt";print $1>x}' file1 file2 

You can also do:

 nawk '{x="abc"FNR".txt";print $1>x}' file* 

to refer to all 40k files.

0
