Effective way to transpose a file in Bash

I have a huge tab-delimited file formatted like this:

    X       column1 column2 column3
    row1    0       1       2
    row2    3       4       5
    row3    6       7       8
    row4    9       10      11

I would like to efficiently transpose it using only bash commands (I could write a ten-or-so-line Perl script to do it, but it should be slower to execute than native bash functions). So the result should look like this:

    X       row1 row2 row3 row4
    column1 0    3    6    9
    column2 1    4    7    10
    column3 2    5    8    11

I came up with a solution like this:

    cols=`head -n 1 input | wc -w`
    for (( i=1; i <= $cols; i++))
    do
        cut -f $i input | tr $'\n' $'\t' | sed -e "s/\t$/\n/g" >> output
    done

But it is slow and does not seem to be the most efficient solution. I saw a solution for vi in this post, but it is still too slow. Any thoughts/suggestions/brilliant ideas? :-)

+99
unix bash parsing transpose
Nov 13 '09 at 15:13
25 answers
    awk '
    {
        for (i=1; i<=NF; i++) {
            a[NR,i] = $i
        }
    }
    NF>p { p = NF }
    END {
        for(j=1; j<=p; j++) {
            str=a[1,j]
            for(i=2; i<=NR; i++){
                str=str" "a[i,j];
            }
            print str
        }
    }' file

Output

    $ more file
    0 1 2
    3 4 5
    6 7 8
    9 10 11

    $ ./shell.sh
    0 3 6 9
    1 4 7 10
    2 5 8 11

Performance against Jonathan's Perl solution on a 10,000-line file:

    $ head -5 file
    1 0 1 2
    2 3 4 5
    3 6 7 8
    4 9 10 11
    1 0 1 2
    $ wc -l < file
    10000
    $ time perl test.pl file >/dev/null

    real    0m0.480s
    user    0m0.442s
    sys     0m0.026s

    $ time awk -f test.awk file >/dev/null

    real    0m0.382s
    user    0m0.367s
    sys     0m0.011s

    $ time perl test.pl file >/dev/null

    real    0m0.481s
    user    0m0.431s
    sys     0m0.022s

    $ time awk -f test.awk file >/dev/null

    real    0m0.390s
    user    0m0.370s
    sys     0m0.010s

EDIT by Ed Morton (@ghostdog74, feel free to delete if you disapprove).

Perhaps this version with some more explicit variable names will help answer some of the questions below and generally clarify what the script is doing. It also uses tabs as the separator, which the OP originally asked for, so it handles empty fields, and it coincidentally pretties up the output a bit for this particular case.

    $ cat tst.awk
    BEGIN { FS=OFS="\t" }
    {
        for (rowNr=1;rowNr<=NF;rowNr++) {
            cell[rowNr,NR] = $rowNr
        }
        maxRows = (NF > maxRows ? NF : maxRows)
        maxCols = NR
    }
    END {
        for (rowNr=1;rowNr<=maxRows;rowNr++) {
            for (colNr=1;colNr<=maxCols;colNr++) {
                printf "%s%s", cell[rowNr,colNr], (colNr < maxCols ? OFS : ORS)
            }
        }
    }

    $ awk -f tst.awk file
    X       row1    row2    row3    row4
    column1 0       3       6       9
    column2 1       4       7       10
    column3 2       5       8       11

The above solutions will work in any awk (except old, broken awk, of course; there YMMV).

The above solutions do read the whole file into memory; if the input files are too large for that, you can do this:

    $ cat tst.awk
    BEGIN { FS=OFS="\t" }
    { printf "%s%s", (FNR>1 ? OFS : ""), $ARGIND }
    ENDFILE {
        print ""
        if (ARGIND < NF) {
            ARGV[ARGC] = FILENAME
            ARGC++
        }
    }

    $ awk -f tst.awk file
    X       row1    row2    row3    row4
    column1 0       3       6       9
    column2 1       4       7       10
    column3 2       5       8       11

which uses almost no memory but reads the input file once per number of fields on a line, so it will be much slower than the version that reads the whole file into memory. It also assumes the number of fields is the same on each line, and it uses GNU awk for ENDFILE and ARGIND, but any awk can do the same with tests on FNR==1 and END.
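For reference, here is one possible reading of that suggestion: a minimal sketch in plain bash plus POSIX awk (no ENDFILE or ARGIND), making one pass over the file per field. The pass variable and the tab-separated file name are assumptions carried over from the script above, not part of the original answer.

    # sketch: one awk pass per column, assuming tab-separated input in "file"
    cols=$(head -n 1 file | awk -F'\t' '{print NF}')   # number of fields on the header line
    for ((pass=1; pass<=cols; pass++)); do
        awk -F'\t' -v OFS='\t' -v pass="$pass" '
            { printf "%s%s", (FNR>1 ? OFS : ""), $pass }   # field "pass" of every line, on one output row
            END { print "" }                               # terminate the output row
        ' file
    done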

+99
Nov 13 '09 at 15:34

Another option is to use rs :

 rs -c' ' -C' ' -T 

-c changes the input column delimiter, -C changes the output column delimiter, and -T transposes rows and columns. Do not use -t instead of -T, because it uses an automatically calculated number of rows and columns, which is usually incorrect. rs, which is named after the reshape function in APL, ships with BSDs and Mac OS X, but it should be available from package managers on other platforms.
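As a quick sketch, the invocation on the OP's space-separated sample might look like this (rs reads standard input when no file is given; the input/output file names are assumptions):

    rs -c' ' -C' ' -T < input > output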

The second option is to use Ruby:

 ruby -e'puts readlines.map(&:split).transpose.map{|x|x*" "}' 

The third option is to use jq :

 jq -R .|jq -sr 'map(./" ")|transpose|map(join(" "))[]' 

jq -R . prints each input line as a JSON string literal, -s ( --slurp ) creates an array of the input lines after parsing each one as JSON, and -r ( --raw-output ) outputs the contents of the strings instead of JSON string literals. The / operator is overloaded to split strings.
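For instance, a quick illustration of that string-splitting overload (jq pretty-prints the resulting array by default):

    $ jq -n '"0 1 2" / " "'
    [
      "0",
      "1",
      "2"
    ]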

+40
May 11 '15 at 17:28

Python solution:

 python -c "import sys; print('\n'.join(' '.join(c) for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip()))))" < input > output 

The above is based on the following:

    import sys

    for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip())):
        print(' '.join(c))

This code assumes that each row has the same number of columns (no padding is performed).
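If the rows can be ragged, one hedged variant (assuming Python 3 and itertools.zip_longest, which is not part of the original answer) pads short rows with empty strings instead of silently truncating:

    python3 -c "import sys, itertools; print('\n'.join(' '.join(c) for c in itertools.zip_longest(*(l.split() for l in sys.stdin if l.strip()), fillvalue='')))" < input > output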

+29
Nov 13 '09 at 17:21

The transpose project on SourceForge is a coreutils-like C program for exactly that.

    gcc transpose.c -o transpose
    ./transpose -t input > output    # works with stdin, too.
+20
Feb 08 '13

Pure bash, no additional process. A nice exercise:

    declare -a array=( )          # we build a 1-D-array
    read -a line < "$1"           # read the headline
    COLS=${#line[@]}              # save number of columns

    index=0
    while read -a line ; do
        for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
            array[$index]=${line[$COUNTER]}
            ((index++))
        done
    done < "$1"

    for (( ROW = 0; ROW < COLS; ROW++ )); do
        for (( COUNTER = ROW; COUNTER < ${#array[@]}; COUNTER += COLS )); do
            printf "%s\t" ${array[$COUNTER]}
        done
        printf "\n"
    done
+15
Nov 19 '09 at 15:11

Check out GNU datamash, which can be used like datamash transpose. A future version will also support crosstabulation (pivot tables).
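A minimal usage sketch on the OP's whitespace-separated sample (the -W flag, which treats runs of whitespace as the field delimiter, appears in another answer below; tab-separated input needs no flag):

    datamash -W transpose < input > output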

+11
Jan 07 '16 at 9:08

Here is a moderately solid Perl script to do the job. There are many structural analogies with @ghostdog74's awk solution.

    #!/bin/perl -w
    #
    # SO 1729824

    use strict;

    my(%data);          # main storage
    my($maxcol) = 0;
    my($rownum) = 0;
    while (<>)
    {
        my(@row) = split /\s+/;
        my($colnum) = 0;
        foreach my $val (@row)
        {
            $data{$rownum}{$colnum++} = $val;
        }
        $rownum++;
        $maxcol = $colnum if $colnum > $maxcol;
    }

    my $maxrow = $rownum;
    for (my $col = 0; $col < $maxcol; $col++)
    {
        for (my $row = 0; $row < $maxrow; $row++)
        {
            printf "%s%s", ($row == 0) ? "" : "\t",
                   defined $data{$row}{$col} ? $data{$row}{$col} : "";
        }
        print "\n";
    }

At the size of the sample data, the performance difference between Perl and awk was negligible (1 millisecond out of 7 total). With a larger data set (100x100 matrix, entries 6-8 characters each), Perl slightly outperformed awk: 0.026s versus 0.042s. Neither is likely to be a problem.




Typical timings for Perl 5.10.1 (32-bit) vs awk (version 20040207 when given '-V') vs gawk 3.1.7 (32-bit), on MacOS X 10.5.8, on a file containing 10,000 lines with 5 columns per line:

    Osiris JL: time gawk -f tr.awk xxx > /dev/null

    real    0m0.367s
    user    0m0.279s
    sys     0m0.085s

    Osiris JL: time perl -f transpose.pl xxx > /dev/null

    real    0m0.138s
    user    0m0.128s
    sys     0m0.008s

    Osiris JL: time awk -f tr.awk xxx > /dev/null

    real    0m1.891s
    user    0m0.924s
    sys     0m0.961s

    Osiris-2 JL:

Note that gawk is vastly faster than awk on this machine, but still slower than Perl. Clearly, your mileage will vary.

+9
Nov 14 '09 at 19:54

If you have sc installed, you can do:

 psc -r < inputfile | sc -W% - > outputfile 
+6
Nov 13 '09 at 16:54

There is a special utility for this,

GNU datamash utility

    apt install datamash

    datamash transpose < yourfile

Taken from this site, https://www.gnu.org/software/datamash/ and http://www.thelinuxrain.com/articles/transposing-rows-and-columns-3-methods

+6
Apr 7 '17 at 9:00

Assuming all your lines have the same number of fields, this awk program solves the problem:

    {for (f=1;f<=NF;f++) col[f] = col[f]":"$f}
    END {for (f=1;f<=NF;f++) print col[f]}

In words, as you loop over the rows, for every field f grow a ':'-delimited string col[f] containing the elements of that field. After you are done with all the rows, print each of those strings on a separate line. You can then substitute ':' for the separator you want (say, a space) by piping the output through tr ':' ' '.

Example:

 $ echo "1 2 3\n4 5 6" 1 2 3 4 5 6 $ echo "1 2 3\n4 5 6" | awk '{for (f=1;f<=NF;f++) col[f] = col[f]":"$f} END {for (f=1;f<=NF;f++) print col[f]}' | tr ':' ' ' 1 4 2 5 3 6 
+5
Jun 10 '15 at 17:57

A hackish Perl solution can look like this. It is nice because it does not load the whole file into memory: it prints intermediate temp files and then uses the all-wonderful paste:

    #!/usr/bin/perl
    use warnings;
    use strict;

    my $counter;
    open INPUT, "<$ARGV[0]" or die ("Unable to open input file!");
    while (my $line = <INPUT>) {
        chomp $line;
        my @array = split ("\t",$line);
        open OUTPUT, ">temp$." or die ("unable to open output file!");
        print OUTPUT join ("\n",@array);
        close OUTPUT;
        $counter=$.;
    }
    close INPUT;

    # paste files together
    my $execute = "paste ";
    foreach (1..$counter) {
        $execute .= "temp$_ ";    # one temp file per input line ($_, not $counter)
    }
    $execute .= "> $ARGV[1]";
    system $execute;
+3
Nov 13 '09 at 15:49

The only improvement I can see in your own example is using awk, which will reduce the number of processes that are run and the amount of data that is piped between them:

    /bin/rm output 2> /dev/null

    cols=`head -n 1 input | wc -w`
    for (( i=1; i <= $cols; i++))
    do
        awk '{printf ("%s%s", tab, $'$i'); tab="\t"} END {print ""}' input
    done >> output
+3
Nov 13 '09 at 16:08

I usually use this little awk snippet for this requirement:

    awk '{for (i=1; i<=NF; i++) a[i,NR]=$i
          max=(max<NF?NF:max)}
        END {for (i=1; i<=max; i++)
              {for (j=1; j<=NR; j++)
                  printf "%s%s", a[i,j], (j==NR?RS:FS)
              }
        }' file

It just loads all the data into a two-dimensional array a[line,column] and then prints it back as a[column,line], so that it transposes the given input.

To do this, it needs to keep track of the maximum number of columns the initial file has, which is then used as the number of rows to print back.

+3
May 12 '15 at 7:48

The GNU datamash is ideal for this task, with just one line of code and potentially arbitrarily large file size!

 datamash -W transpose infile > outfile 
+3
07 Sep '17 at 15:00

I used fgm's solution (thanks fgm!), but I needed to eliminate the tab characters at the end of each row, so I modified the script like this:

    #!/bin/bash

    declare -a array=( )          # we build a 1-D-array
    read -a line < "$1"           # read the headline
    COLS=${#line[@]}              # save number of columns

    index=0
    while read -a line; do
        for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
            array[$index]=${line[$COUNTER]}
            ((index++))
        done
    done < "$1"

    for (( ROW = 0; ROW < COLS; ROW++ )); do
        for (( COUNTER = ROW; COUNTER < ${#array[@]}; COUNTER += COLS )); do
            printf "%s" ${array[$COUNTER]}
            if [ $COUNTER -lt $(( ${#array[@]} - $COLS )) ]
            then
                printf "\t"
            fi
        done
        printf "\n"
    done
+2
Mar 21 '10 at 22:39

I was just looking for a similar bash transpose, but with support for padding. Here is the script I wrote based on fgm's solution, which seems to work. If it can be of help...

    #!/bin/bash

    declare -a array=( )          # we build a 1-D-array
    declare -a ncols=( )          # we build a 1-D-array containing number of elements of each row

    SEPARATOR="\t";
    PADDING="";
    MAXROWS=0;
    index=0
    indexCol=0
    while read -a line; do
        ncols[$indexCol]=${#line[@]};
        ((indexCol++))
        if [ ${#line[@]} -gt ${MAXROWS} ]
        then
            MAXROWS=${#line[@]}
        fi
        for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
            array[$index]=${line[$COUNTER]}
            ((index++))
        done
    done < "$1"

    for (( ROW = 0; ROW < MAXROWS; ROW++ )); do
        COUNTER=$ROW;
        for (( indexCol=0; indexCol < ${#ncols[@]}; indexCol++ )); do
            if [ $ROW -ge ${ncols[indexCol]} ]
            then
                printf $PADDING
            else
                printf "%s" ${array[$COUNTER]}
            fi
            if [ $((indexCol+1)) -lt ${#ncols[@]} ]
            then
                printf $SEPARATOR
            fi
            COUNTER=$(( COUNTER + ncols[indexCol] ))
        done
        printf "\n"
    done
+2
Jan 30 '14 at 5:27

Not very elegant, but this "one-line" command quickly solves the problem:

    cols=4; for((i=1;i<=$cols;i++)); do \
        awk '{print $'$i'}' input | tr '\n' ' '; echo; \
    done

Here cols is the number of columns; you can replace 4 with $(head -n 1 input | wc -w).
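Put together, a sketch of the fully dynamic form (passing the column index with -v instead of splicing it into the quotes is a stylistic assumption, not part of the original answer):

    cols=$(head -n 1 input | wc -w)    # column count taken from the header line
    for ((i=1; i<=cols; i++)); do
        awk -v i="$i" '{print $i}' input | tr '\n' ' '
        echo
    done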

+2
May 6 '14 at 21:41

I was looking for a solution to transpose any kind of matrix (nxn or mxn) with any kind of data (numbers or text) and got the following solution:

    Row2Trans=number1
    Col2Trans=number2

    for ((i=1; $i <= Row2Trans; i++));do
        for ((j=1; $j <= Col2Trans; j++));do
            awk -v var1="$i" -v var2="$j" 'BEGIN { FS = "," } ; NR==var1 {print $((var2)) }' $ARCHIVO >> Column_$i
        done
    done

    paste -d',' `ls -mv Column_* | sed 's/,//g'` >> $ARCHIVO
+2
Aug 6 '14 at 1:10

If you only want to grab a single (comma-delimited) line $N out of a file and turn it into a column:

 head -$N file | tail -1 | tr ',' '\n' 
+2
Nov 06

Another awk solution, limited only by the amount of memory you have.

    awk '{ for (i=1; i<=NF; i++) RtoC[i]= (RtoC[i]? RtoC[i] FS $i: $i) }
        END{ for (i in RtoC) print RtoC[i] }' infile

This joins every field at the same position together, and in END prints the result, which would be the first row in the first column, the second row in the second column, etc. It will output:

    X row1 row2 row3 row4
    column1 0 3 6 9
    column2 1 4 7 10
    column3 2 5 8 11
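One caveat: for (i in RtoC) does not guarantee numeric traversal order in every awk, so some implementations may emit the rows shuffled. A minimal sketch that tracks the widest line and loops numerically instead (an adjustment, not part of the original answer):

    awk '{ for (i=1; i<=NF; i++) RtoC[i] = (RtoC[i] ? RtoC[i] FS $i : $i)   # append field i to output row i
           max = (NF > max ? NF : max) }                                    # remember the widest line
        END{ for (i=1; i<=max; i++) print RtoC[i] }' infile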
+2
Sep 19 '18 at 16:51
    #!/bin/bash

    aline="$(head -n 1 file.txt)"
    set -- $aline
    colNum=$#

    #set -x
    while read line; do
        set -- $line
        for i in $(seq $colNum); do
            eval col$i="\"\$col$i \$$i\""
        done
    done < file.txt

    for i in $(seq $colNum); do
        eval echo \${col$i}
    done

Another version, with set and eval.

+1
Aug 19 '15 at 7:43

Some *nix standard utility one-liners; no temp files needed. NB: the OP wanted an efficient fix (i.e. faster), and the top answers are usually faster than this one. These one-liners are for those who like *nix software tools, for whatever reason. In rare cases (e.g. scarce I/O and memory), these snippets can actually be faster than some of the top answers.

Name the input file foo.

  1. If we know that foo has four columns:

     for f in 1 2 3 4 ; do cut -d ' ' -f $f foo | xargs echo ; done 
  2. If we do not know how many columns foo has:

     n=$(head -n 1 foo | wc -w)
     for f in $(seq 1 $n) ; do cut -d ' ' -f $f foo | xargs echo ; done

    xargs has a size limit and would therefore do incomplete work with a long file. Which size limit applies is system dependent, e.g.:

     { timeout '.01' xargs --show-limits ; } 2>&1 | grep Max 

    Maximum length of command we could actually use: 2088944

  3. tr & echo :

     for f in 1 2 3 4 ; do cut -d ' ' -f $f foo | tr '\n' ' ' ; echo ; done

    ... or if the number of columns is unknown:

     n=$(head -n 1 foo | wc -w)
     for f in $(seq 1 $n); do cut -d ' ' -f $f foo | tr '\n' ' ' ; echo ; done
  4. Using set , which, like xargs , has similar command line size limits:

     for f in 1 2 3 4 ; do set - $(cut -d ' ' -f $f foo) ; echo $@ ; done 
+1
Apr 10 '16 at 10:52

Here is a Haskell solution. When compiled with -O2, it runs slightly faster than ghostdog's awk and slightly slower than Stephan's thinly-wrapped Python on my machine, for repeated "Hello world" input lines. Unfortunately, GHC's support for passing command-line code is non-existent as far as I can tell, so you will have to write it to a file yourself. It will truncate the rows to the length of the shortest row.

    transpose :: [[a]] -> [[a]]
    transpose = foldr (zipWith (:)) (repeat [])

    main :: IO ()
    main = interact $ unlines . map unwords . transpose . map words . lines
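A minimal build-and-run sketch, assuming the code is saved as transpose.hs (the file name is an assumption):

    ghc -O2 transpose.hs -o transpose   # compile; GHC cannot take the program on the command line
    ./transpose < input > output        # reads stdin, writes the transposed table to stdout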
0
Aug 26 '14 at 3:03

The following converts each line to a column and paste-s them together:

    echo '' > tmp1
    cat m.txt | while read l ; do
        paste tmp1 <(echo $l | tr -s ' ' \\n) > tmp2
        cp tmp2 tmp1
    done
    cat tmp1

m.txt:

    0 1 2
    4 5 6
    7 8 9
    10 11 12
  • creates the file tmp1 so that it is not empty.

  • reads each row and converts it to a column using tr

  • pastes the new column into the tmp1 file

  • copies the result back into tmp1.

PS: I really wanted to use io descriptors, but couldn't make them work.

0
Dec 07 '14 at 3:08

An awk solution that stores the whole array in memory:

    awk '$0!~/^$/{ i++;
            split($0,arr,FS);
            for (j in arr) {
                out[i,j]=arr[j];
                if (maxr<j){ maxr=j }     # max number of output rows.
            }
        }
        END {
            maxc=i                        # max number of output columns.
            for (j=1; j<=maxr; j++) {
                for (i=1; i<=maxc; i++) {
                    printf( "%s:", out[i,j])
                }
                printf( "%s\n","" )
            }
        }' infile

But we can "walk" the file as many times as output rows are needed:

    #!/bin/bash

    maxf="$(awk '{ if (mf<NF) mf=NF } END { print mf }' infile)"
    rowcount=$maxf

    for (( i=1; i<=rowcount; i++ )); do
        awk -v i="$i" -F " " '{printf("%s\t ", $i)}' infile
        echo
    done

This is faster than the previous code when the number of output rows is small.

0
Jan 28 '16 at 22:46