Converting a set of matrix files to a coordinate index in awk

I have time series files 0000.vx.dat, 0000.vy.dat, 0000.vz.dat; ...; 0077.vx.dat, 0077.vy.dat, 0077.vz.dat ... Each file is a space-delimited 2D matrix. I would like to take each triplet of files and combine them into a coordinate-based data format, that is:

[timestep + 1] [i] [j] [vx(i,j)] [vy(i,j)] [vz(i,j)]

Each file number corresponds to a particular timestep. Given the amount of data I have in this time series (~4 GB), bash just wasn't cutting it, so it seemed to be time to try awk ... specifically mawk. It was pretty foolish to try this in bash, but here is my ill-fated attempt:

 for x in $(seq 1 78)
 do
   tfx=${tf[$x]} # an array of zero-padded numbers
   for y in $(seq 1 1568)
   do
     for z in $(seq 1 1344)
     do
       echo $x $y $z \
         $(awk -vi=$z -vj=$y "FNR == i {print j}" $tfx.vx.dat) \
         $(awk -vi=$z -vj=$y "FNR == i {print j}" $tfx.vy.dat) \
         $(awk -vi=$z -vj=$y "FNR == i {print j}" $tfx.vz.dat) >> $file
     done
   done
 done

edit: Thanks, ruakh, for pointing out that I had written j in shell-variable form, with a $ in front! This is only a fragment of the original script, but I think you get the gist.

Suffice it to say that this would take about six months to finish, given all the overhead in bash from the O(M×N) looping, the subshells and pipes, and so on. I have spent the better part of the day searching for a better approach. Each file is only about 18 MB, so memory should not be a problem. I would be happy to do this one timestep at a time in awk, provided I get one output file per timestep; I could then just loop over them all without much trouble. It is important, however, that the timestep number be the first item on each coordinate line. I can achieve that with awk's -v argument (see above) from a bash loop. What I don't know is how to look up particular matrix elements in three separate files and combine them into one output. That is the main obstacle I would like to overcome. I was hoping mawk could provide a good balance between coding effort and computational speed. If this seems too big for an awk script, I could go to something lower-level, and would appreciate any answerer letting me know that I should just go to C.
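To show where I'm stuck, here is the closest I've gotten: a toy sketch (with made-up temp file names and tiny 2×2 matrices, and I'm not at all sure it's idiomatic) that keys each cell by (row, column) and lets the values from the three files accumulate in input order, relying on FNR resetting for each input file:

```shell
# Toy 2x2 input matrices (throwaway temp files, not my real data)
printf '1 2\n3 4\n'    > vx.tmp
printf '5 6\n7 8\n'    > vy.tmp
printf '9 10\n11 12\n' > vz.tmp

result=$(awk -v t=1 '
    # FNR restarts at 1 for each file, so (FNR, i) names the same cell
    # in all three matrices; values append in the order the files appear
    { for (i = 1; i <= NF; i++) cell[FNR, i] = cell[FNR, i] " " $i }
    FNR > maxr { maxr = FNR }
    NF  > maxc { maxc = NF }
    END {
        for (r = 1; r <= maxr; r++)
            for (c = 1; c <= maxc; c++)
                print t, r, c cell[r, c]
    }
' vx.tmp vy.tmp vz.tmp)

echo "$result"
# -> 1 1 1 1 5 9
#    1 1 2 2 6 10
#    1 2 1 3 7 11
#    1 2 2 4 8 12
rm -f vx.tmp vy.tmp vz.tmp
```

This gets the column layout I want for one timestep, but it keeps the whole matrix in memory and I don't know whether that is the right way to do it at scale.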

Thank you in advance! I really like awk, but I'm afraid I'm new to it.

Three example files, 0000.vx.dat, 0000.vy.dat and 0000.vz.dat, would read as follows (except, of course, much larger and with real data):

0000.vx.dat:

 1 2 3
 4 5 6
 7 8 9

0000.vy.dat:

 10 11 12
 13 14 15
 16 17 18

0000.vz.dat:

 19 20 21
 22 23 24
 25 26 27

I would like to be able to enter:

 awk -vt=1 -f stackoverflow.awk 0000.vx.dat 0000.vy.dat 0000.vz.dat 

and get the following output:

 1 1 1 1 10 19
 1 1 2 2 11 20
 1 1 3 3 12 21
 1 2 1 4 13 22
 1 2 2 5 14 23
 1 2 3 6 15 24
 1 3 1 7 16 25
 1 3 2 8 17 26
 1 3 3 9 18 27

edit: Thanks, shellter, for suggesting that I show clear sample input and output!

1 answer

Personally, I use gawk to process most of my text files, but since you asked for a mawk solution, here is one way to solve your problem. Run this from the directory containing your data files:

 for i in *.vx.dat; do
     mawk -f script.awk "$i" "${i%%.*}.vy.dat" "${i%%.*}.vz.dat"
 done
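A note on the file-name juggling: `${i%%.*}` is standard shell parameter expansion that removes the longest suffix matching `.*`, i.e. everything from the first dot onward, leaving only the zero-padded timestep prefix. A quick illustration (with a made-up file name):

```shell
i=0042.vx.dat
echo "${i%%.*}"   # longest-match strip from the first dot -> 0042
echo "${i%.*}"    # for contrast, shortest-match strips only ".dat" -> 0042.vx
```

So `${i%%.*}.vy.dat` turns `0042.vx.dat` into `0042.vy.dat`, pairing up the three files for one timestep.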

The contents of script.awk:

 # On each new file: coerce the file name to its numeric prefix and add 1
 # (giving timestep + 1), and reset the cell counter
 FNR==1 {
     FILENAME++
     c=0
 }
 # First file seeds each cell with its "t i j" coordinates;
 # every file appends its field value to the cell
 {
     for (i=1; i<=NF; i++) {
         c++
         a[c] = (a[c] ? a[c] : FILENAME FS NR FS i) FS $i
     }
 }
 END {
     for (j=1; j<=c; j++) {
         print a[j] > sprintf("%04d.dat", FILENAME)
     }
 }
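Since the `FILENAME++` line looks odd at first: awk converts a string to a number by taking its leading numeric prefix, so a name like 0003.vx.dat becomes 3, and the increment then yields timestep + 1. A quick sanity check (plain awk here; mawk behaves the same):

```shell
# String-to-number coercion: "0003.vx.dat" -> 3, then ++ -> 4
echo | awk '{ s = "0003.vx.dat"; s++; print s }'   # prints 4
```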

When you run the above, the result should be one output file for each set of three input files, containing your coordinates. The output file names have the form [timestep + 1].dat. I decided to zero-pad these names to four digits for your convenience, but you can change that to any format you like. Here are the results I get from the sample data you posted. Contents of 0001.dat:

 1 1 1 1 10 19
 1 1 2 2 11 20
 1 1 3 3 12 21
 1 2 1 4 13 22
 1 2 2 5 14 23
 1 2 3 6 15 24
 1 3 1 7 16 25
 1 3 2 8 17 26
 1 3 3 9 18 27