Effective way to transpose a file in Bash

I have a huge tab-delimited file formatted like this:

    X       column1 column2 column3
    row1    0       1       2
    row2    3       4       5
    row3    6       7       8
    row4    9       10      11

I would like to efficiently transpose it using only bash commands (I could write a ten-or-so-line Perl script to do it, but it should be slower to execute than native bash functions). So the result should look like this:

    X       row1 row2 row3 row4
    column1 0    3    6    9
    column2 1    4    7    10
    column3 2    5    8    11

I came up with a solution like this:

    cols=`head -n 1 input | wc -w`
    for (( i=1; i <= $cols; i++))
    do
        cut -f $i input | tr $'\n' $'\t' | sed -e "s/\t$/\n/g" >> output
    done

But it is slow and does not seem to be the most efficient solution. I saw a solution for vi in this post, but it is still too slow. Any thoughts/suggestions/brilliant ideas? :-)

+99
unix bash parsing transpose
Nov 13 '09 at 15:13
25 answers
    awk '
    {
        for (i=1; i<=NF; i++) {
            a[NR,i] = $i
        }
    }
    NF>p { p = NF }
    END {
        for(j=1; j<=p; j++) {
            str=a[1,j]
            for(i=2; i<=NR; i++){
                str=str" "a[i,j];
            }
            print str
        }
    }' file

Output

    $ more file
    0 1 2
    3 4 5
    6 7 8
    9 10 11

    $ ./shell.sh
    0 3 6 9
    1 4 7 10
    2 5 8 11

Performance against Jonathan's Perl solution on a 10,000-line file:

    $ head -5 file
    1 0 1 2
    2 3 4 5
    3 6 7 8
    4 9 10 11
    1 0 1 2
    $ wc -l < file
    10000
    $ time perl test.pl file >/dev/null

    real    0m0.480s
    user    0m0.442s
    sys     0m0.026s

    $ time awk -f test.awk file >/dev/null

    real    0m0.382s
    user    0m0.367s
    sys     0m0.011s

    $ time perl test.pl file >/dev/null

    real    0m0.481s
    user    0m0.431s
    sys     0m0.022s

    $ time awk -f test.awk file >/dev/null

    real    0m0.390s
    user    0m0.370s
    sys     0m0.010s

EDIT by Ed Morton (@ghostdog74, feel free to delete if you disapprove).

Perhaps this version with some more explicit variable names will help answer some of the questions below and generally clarify what the script is doing. It also uses tabs as the separator, which the OP originally asked for, so it handles empty fields, and it coincidentally pretties up the output a bit for this particular case.

    $ cat tst.awk
    BEGIN { FS=OFS="\t" }
    {
        for (rowNr=1;rowNr<=NF;rowNr++) {
            cell[rowNr,NR] = $rowNr
        }
        maxRows = (NF > maxRows ? NF : maxRows)
        maxCols = NR
    }
    END {
        for (rowNr=1;rowNr<=maxRows;rowNr++) {
            for (colNr=1;colNr<=maxCols;colNr++) {
                printf "%s%s", cell[rowNr,colNr], (colNr < maxCols ? OFS : ORS)
            }
        }
    }

    $ awk -f tst.awk file
    X       row1    row2    row3    row4
    column1 0       3       6       9
    column2 1       4       7       10
    column3 2       5       8       11

The above solutions will work in any awk (except old, broken awk, of course; there YMMV).

The above solutions do read the whole file into memory; if the input files are too large for that, you can do this:

    $ cat tst.awk
    BEGIN { FS=OFS="\t" }
    { printf "%s%s", (FNR>1 ? OFS : ""), $ARGIND }
    ENDFILE {
        print ""
        if (ARGIND < NF) {
            ARGV[ARGC] = FILENAME
            ARGC++
        }
    }

    $ awk -f tst.awk file
    X       row1    row2    row3    row4
    column1 0       3       6       9
    column2 1       4       7       10
    column3 2       5       8       11

which uses almost no memory but reads the input file once per number of fields on a line, so it will be much slower than the version that reads the whole file into memory. It also assumes the number of fields is the same on each line, and it uses GNU awk for ENDFILE and ARGIND, but any awk can do the same with tests on FNR==1 and END.
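For reference, here is one possible reading of that suggestion: a minimal sketch in plain bash plus POSIX awk (no ENDFILE or ARGIND), making one pass over the file per field. The pass variable and the tab-separated file name are assumptions carried over from the script above, not part of the original answer.

    # sketch: one awk pass per column, assuming tab-separated input in "file"
    cols=$(head -n 1 file | awk -F'\t' '{print NF}')   # number of fields on the header line
    for ((pass=1; pass<=cols; pass++)); do
        awk -F'\t' -v OFS='\t' -v pass="$pass" '
            { printf "%s%s", (FNR>1 ? OFS : ""), $pass }   # field "pass" of every line, on one output row
            END { print "" }                               # terminate the output row
        ' file
    done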

+99
Nov 13 '09 at 15:34

Another option is to use rs :

 rs -c' ' -C' ' -T 

-c changes the input column delimiter, -C changes the output column delimiter, and -T transposes rows and columns. Do not use -t instead of -T, because it uses an automatically calculated number of rows and columns, which is usually incorrect. rs, which is named after the reshape function in APL, ships with BSDs and Mac OS X, but it should be available from package managers on other platforms.
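As a quick sketch, the invocation on the OP's space-separated sample might look like this (rs reads standard input when no file is given; the input/output file names are assumptions):

    rs -c' ' -C' ' -T < input > output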

The second option is to use Ruby:

 ruby -e'puts readlines.map(&:split).transpose.map{|x|x*" "}' 

The third option is to use jq :

 jq -R .|jq -sr 'map(./" ")|transpose|map(join(" "))[]' 

jq -R . prints each input line as a JSON string literal, -s ( --slurp ) creates an array of the input lines after parsing each one as JSON, and -r ( --raw-output ) outputs the contents of the strings instead of JSON string literals. The / operator is overloaded to split strings.
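For instance, a quick illustration of that string-splitting overload (jq pretty-prints the resulting array by default):

    $ jq -n '"0 1 2" / " "'
    [
      "0",
      "1",
      "2"
    ]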

+40
May 11 '15 at 17:28

Python solution:

 python -c "import sys; print('\n'.join(' '.join(c) for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip()))))" < input > output 

The above is based on the following:

    import sys

    for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip())):
        print(' '.join(c))

This code assumes that each row has the same number of columns (no padding is performed).
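If the rows can be ragged, one hedged variant (assuming Python 3 and itertools.zip_longest, which is not part of the original answer) pads short rows with empty strings instead of silently truncating:

    python3 -c "import sys, itertools; print('\n'.join(' '.join(c) for c in itertools.zip_longest(*(l.split() for l in sys.stdin if l.strip()), fillvalue='')))" < input > output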

+29
Nov 13 '09 at 17:21

The transpose project on SourceForge is a coreutils-like C program for exactly that.

    gcc transpose.c -o transpose
    ./transpose -t input > output    # works with stdin, too.
+20
Feb 08 '13

Pure bash, no additional process. A nice exercise:

    declare -a array=( )          # we build a 1-D-array
    read -a line < "$1"           # read the headline
    COLS=${#line[@]}              # save number of columns

    index=0
    while read -a line ; do
        for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
            array[$index]=${line[$COUNTER]}
            ((index++))
        done
    done < "$1"

    for (( ROW = 0; ROW < COLS; ROW++ )); do
        for (( COUNTER = ROW; COUNTER < ${#array[@]}; COUNTER += COLS )); do
            printf "%s\t" ${array[$COUNTER]}
        done
        printf "\n"
    done
+15
Nov 19 '09 at 15:11

Check out GNU datamash, which can be used like datamash transpose. A future version will also support crosstabulation (pivot tables).
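A minimal usage sketch on the OP's whitespace-separated sample (the -W flag, which treats runs of whitespace as the field delimiter, appears in another answer below; tab-separated input needs no flag):

    datamash -W transpose < input > output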

+11
Jan 07 '16 at 9:08

Here is a moderately solid Perl script to do the job. There are many structural analogies with @ghostdog74's awk solution.

    #!/bin/perl -w
    #
    # SO 1729824

    use strict;

    my(%data);          # main storage
    my($maxcol) = 0;
    my($rownum) = 0;
    while (<>)
    {
        my(@row) = split /\s+/;
        my($colnum) = 0;
        foreach my $val (@row)
        {
            $data{$rownum}{$colnum++} = $val;
        }
        $rownum++;
        $maxcol = $colnum if $colnum > $maxcol;
    }

    my $maxrow = $rownum;
    for (my $col = 0; $col < $maxcol; $col++)
    {
        for (my $row = 0; $row < $maxrow; $row++)
        {
            printf "%s%s", ($row == 0) ? "" : "\t",
                   defined $data{$row}{$col} ? $data{$row}{$col} : "";
        }
        print "\n";
    }

At the size of the sample data, the performance difference between Perl and awk was negligible (1 millisecond out of 7 total). With a larger data set (100x100 matrix, entries 6-8 characters each), Perl slightly outperformed awk: 0.026s versus 0.042s. Neither is likely to be a problem.




Typical timings for Perl 5.10.1 (32-bit) vs awk (version 20040207 when given '-V') vs gawk 3.1.7 (32-bit), on MacOS X 10.5.8, on a file containing 10,000 lines with 5 columns per line:

    Osiris JL: time gawk -f tr.awk xxx > /dev/null

    real    0m0.367s
    user    0m0.279s
    sys     0m0.085s

    Osiris JL: time perl -f transpose.pl xxx > /dev/null

    real    0m0.138s
    user    0m0.128s
    sys     0m0.008s

    Osiris JL: time awk -f tr.awk xxx > /dev/null

    real    0m1.891s
    user    0m0.924s
    sys     0m0.961s

    Osiris-2 JL:

Note that gawk is vastly faster than awk on this machine, but still slower than Perl. Clearly, your mileage will vary.

+9
Nov 14 '09 at 19:54

If you have sc installed, you can do:

 psc -r < inputfile | sc -W% - > outputfile 
+6
Nov 13 '09 at 16:54

There is a special utility for this,

GNU datamash utility

    apt install datamash

    datamash transpose < yourfile

Taken from this site, https://www.gnu.org/software/datamash/ and http://www.thelinuxrain.com/articles/transposing-rows-and-columns-3-methods

+6
Apr 7 '17 at 9:00

Assuming all your lines have the same number of fields, this awk program solves the problem:

    {for (f=1;f<=NF;f++) col[f] = col[f]":"$f}
    END {for (f=1;f<=NF;f++) print col[f]}

In words, as you loop over the rows, for every field f grow a ':'-delimited string col[f] containing the elements of that field. After you are done with all the rows, print each of those strings on a separate line. You can then substitute ':' for the separator you want (say, a space) by piping the output through tr ':' ' '.

Example:

 $ echo "1 2 3\n4 5 6" 1 2 3 4 5 6 $ echo "1 2 3\n4 5 6" | awk '{for (f=1;f<=NF;f++) col[f] = col[f]":"$f} END {for (f=1;f<=NF;f++) print col[f]}' | tr ':' ' ' 1 4 2 5 3 6 
+5
Jun 10 '15 at 17:57

A hackish Perl solution can look like this. It is nice because it does not load the whole file into memory: it prints intermediate temp files and then uses the all-wonderful paste:

    #!/usr/bin/perl
    use warnings;
    use strict;

    my $counter;
    open INPUT, "<$ARGV[0]" or die ("Unable to open input file!");
    while (my $line = <INPUT>) {
        chomp $line;
        my @array = split ("\t",$line);
        open OUTPUT, ">temp$." or die ("unable to open output file!");
        print OUTPUT join ("\n",@array);
        close OUTPUT;
        $counter=$.;
    }
    close INPUT;

    # paste files together
    my $execute = "paste ";
    foreach (1..$counter) {
        $execute .= "temp$_ ";    # one temp file per input line ($_, not $counter)
    }
    $execute .= "> $ARGV[1]";
    system $execute;
+3
Nov 13 '09 at 15:49

The only improvement I can see in your own example is using awk, which will reduce the number of processes that are run and the amount of data that is piped between them:

    /bin/rm output 2> /dev/null

    cols=`head -n 1 input | wc -w`
    for (( i=1; i <= $cols; i++))
    do
        awk '{printf ("%s%s", tab, $'$i'); tab="\t"} END {print ""}' input
    done >> output
+3
Nov 13 '09 at 16:08

I usually use this little awk snippet for this requirement:

    awk '{for (i=1; i<=NF; i++) a[i,NR]=$i
          max=(max<NF?NF:max)}
        END {for (i=1; i<=max; i++)
              {for (j=1; j<=NR; j++)
                  printf "%s%s", a[i,j], (j==NR?RS:FS)
              }
        }' file

It just loads all the data into a two-dimensional array a[line,column] and then prints it back as a[column,line], so that it transposes the given input.

To do this, it needs to keep track of the maximum number of columns the initial file has, which is then used as the number of rows to print back.

+3
May 12 '15 at 7:48

The GNU datamash is ideal for this task, with just one line of code and potentially arbitrarily large file size!

 datamash -W transpose infile > outfile 
+3
07 Sep '17 at 15:00

I used fgm's solution (thanks fgm!), but I needed to eliminate the tab characters at the end of each row, so I modified the script like this:

    #!/bin/bash

    declare -a array=( )          # we build a 1-D-array
    read -a line < "$1"           # read the headline
    COLS=${#line[@]}              # save number of columns

    index=0
    while read -a line; do
        for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
            array[$index]=${line[$COUNTER]}
            ((index++))
        done
    done < "$1"

    for (( ROW = 0; ROW < COLS; ROW++ )); do
        for (( COUNTER = ROW; COUNTER < ${#array[@]}; COUNTER += COLS )); do
            printf "%s" ${array[$COUNTER]}
            if [ $COUNTER -lt $(( ${#array[@]} - $COLS )) ]
            then
                printf "\t"
            fi
        done
        printf "\n"
    done
+2
Mar 21 '10 at 22:39

I was just looking for a similar bash transpose, but with support for padding. Here is the script I wrote based on fgm's solution, which seems to work. If it can be of help...

    #!/bin/bash

    declare -a array=( )          # we build a 1-D-array
    declare -a ncols=( )          # we build a 1-D-array containing number of elements of each row

    SEPARATOR="\t";
    PADDING="";
    MAXROWS=0;
    index=0
    indexCol=0
    while read -a line; do
        ncols[$indexCol]=${#line[@]};
        ((indexCol++))
        if [ ${#line[@]} -gt ${MAXROWS} ]
        then
            MAXROWS=${#line[@]}
        fi
        for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
            array[$index]=${line[$COUNTER]}
            ((index++))
        done
    done < "$1"

    for (( ROW = 0; ROW < MAXROWS; ROW++ )); do
        COUNTER=$ROW;
        for (( indexCol=0; indexCol < ${#ncols[@]}; indexCol++ )); do
            if [ $ROW -ge ${ncols[indexCol]} ]
            then
                printf $PADDING
            else
                printf "%s" ${array[$COUNTER]}
            fi
            if [ $((indexCol+1)) -lt ${#ncols[@]} ]
            then
                printf $SEPARATOR
            fi
            COUNTER=$(( COUNTER + ncols[indexCol] ))
        done
        printf "\n"
    done
+2
Jan 30 '14 at 5:27

Not very elegant, but this "one-line" command quickly solves the problem:

    cols=4; for((i=1;i<=$cols;i++)); do \
        awk '{print $'$i'}' input | tr '\n' ' '; echo; \
    done

Here cols is the number of columns; you can replace 4 with $(head -n 1 input | wc -w).
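Put together, a sketch of the fully dynamic form (passing the column index with -v instead of splicing it into the quotes is a stylistic assumption, not part of the original answer):

    cols=$(head -n 1 input | wc -w)    # column count taken from the header line
    for ((i=1; i<=cols; i++)); do
        awk -v i="$i" '{print $i}' input | tr '\n' ' '
        echo
    done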

+2
May 6 '14 at 21:41

I was looking for a solution to transpose any kind of matrix (nxn or mxn) with any kind of data (numbers or text) and got the following solution:

    Row2Trans=number1
    Col2Trans=number2

    for ((i=1; $i <= Row2Trans; i++));do
        for ((j=1; $j <= Col2Trans; j++));do
            awk -v var1="$i" -v var2="$j" 'BEGIN { FS = "," } ; NR==var1 {print $((var2)) }' $ARCHIVO >> Column_$i
        done
    done

    paste -d',' `ls -mv Column_* | sed 's/,//g'` >> $ARCHIVO
+2
Aug 6 '14 at 1:10

If you only want to grab a single (comma-delimited) line $N out of a file and turn it into a column:

 head -$N file | tail -1 | tr ',' '\n' 
+2
Nov 06

Another awk solution, limited only by the amount of memory you have.

    awk '{ for (i=1; i<=NF; i++) RtoC[i]= (RtoC[i]? RtoC[i] FS $i: $i) }
        END{ for (i in RtoC) print RtoC[i] }' infile

This joins every field at the same position together, and in END prints the result, which would be the first row in the first column, the second row in the second column, etc. It will output:

    X row1 row2 row3 row4
    column1 0 3 6 9
    column2 1 4 7 10
    column3 2 5 8 11
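One caveat: for (i in RtoC) does not guarantee numeric traversal order in every awk, so some implementations may emit the rows shuffled. A minimal sketch that tracks the widest line and loops numerically instead (an adjustment, not part of the original answer):

    awk '{ for (i=1; i<=NF; i++) RtoC[i] = (RtoC[i] ? RtoC[i] FS $i : $i)   # append field i to output row i
           max = (NF > max ? NF : max) }                                    # remember the widest line
        END{ for (i=1; i<=max; i++) print RtoC[i] }' infile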
+2
Sep 19 '18 at 16:51
    #!/bin/bash

    aline="$(head -n 1 file.txt)"
    set -- $aline
    colNum=$#

    #set -x
    while read line; do
        set -- $line
        for i in $(seq $colNum); do
            eval col$i="\"\$col$i \$$i\""
        done
    done < file.txt

    for i in $(seq $colNum); do
        eval echo \${col$i}
    done

Another version, with set and eval.

+1
Aug 19 '15 at 7:43

Some *nix standard utility one-liners; no temp files needed. NB: the OP wanted an efficient fix (i.e. faster), and the top answers are usually faster than this one. These one-liners are for those who like *nix software tools, for whatever reason. In rare cases (e.g. scarce I/O and memory), these snippets can actually be faster than some of the top answers.

Name the input file foo.

  1. If we know that foo has four columns:

     for f in 1 2 3 4 ; do cut -d ' ' -f $f foo | xargs echo ; done 
  2. If we do not know how many columns foo has:

     n=$(head -n 1 foo | wc -w)
     for f in $(seq 1 $n) ; do cut -d ' ' -f $f foo | xargs echo ; done

    xargs has a size limit and would therefore do incomplete work with a long file. Which size limit applies is system dependent, e.g.:

     { timeout '.01' xargs --show-limits ; } 2>&1 | grep Max 

    Maximum length of command we could actually use: 2088944

  3. tr & echo :

     for f in 1 2 3 4 ; do cut -d ' ' -f $f foo | tr '\n' ' ' ; echo ; done

    ... or if the number of columns is unknown:

     n=$(head -n 1 foo | wc -w)
     for f in $(seq 1 $n); do cut -d ' ' -f $f foo | tr '\n' ' ' ; echo ; done
  4. Using set , which, like xargs , has similar command line size limits:

     for f in 1 2 3 4 ; do set - $(cut -d ' ' -f $f foo) ; echo $@ ; done 
+1
Apr 10 '16 at 10:52

Here is a Haskell solution. When compiled with -O2, it runs slightly faster than ghostdog's awk and slightly slower than Stephan's thinly-wrapped Python on my machine, for repeated "Hello world" input lines. Unfortunately, GHC's support for passing command-line code is non-existent as far as I can tell, so you will have to write it to a file yourself. It will truncate the rows to the length of the shortest row.

    transpose :: [[a]] -> [[a]]
    transpose = foldr (zipWith (:)) (repeat [])

    main :: IO ()
    main = interact $ unlines . map unwords . transpose . map words . lines
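A minimal build-and-run sketch, assuming the code is saved as transpose.hs (the file name is an assumption):

    ghc -O2 transpose.hs -o transpose   # compile; GHC cannot take the program on the command line
    ./transpose < input > output        # reads stdin, writes the transposed table to stdout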
0
Aug 26 '14 at 3:03

The following converts each line to a column and paste-s them together:

    echo '' > tmp1
    cat m.txt | while read l ; do
        paste tmp1 <(echo $l | tr -s ' ' \\n) > tmp2
        cp tmp2 tmp1
    done
    cat tmp1

m.txt:

    0 1 2
    4 5 6
    7 8 9
    10 11 12
  • creates the file tmp1 so that it is not empty.

  • reads each row and converts it to a column using tr

  • pastes the new column into the tmp1 file

  • copies the result back into tmp1.

PS: I really wanted to use io descriptors, but couldn't make them work.

0
Dec 07 '14 at 3:08

An awk solution that stores the whole array in memory:

    awk '$0!~/^$/{ i++;
            split($0,arr,FS);
            for (j in arr) {
                out[i,j]=arr[j];
                if (maxr<j){ maxr=j }     # max number of output rows.
            }
        }
        END {
            maxc=i                        # max number of output columns.
            for (j=1; j<=maxr; j++) {
                for (i=1; i<=maxc; i++) {
                    printf( "%s:", out[i,j])
                }
                printf( "%s\n","" )
            }
        }' infile

But we can "walk" the file as many times as output rows are needed:

    #!/bin/bash

    maxf="$(awk '{ if (mf<NF) mf=NF } END { print mf }' infile)"
    rowcount=$maxf

    for (( i=1; i<=rowcount; i++ )); do
        awk -v i="$i" -F " " '{printf("%s\t ", $i)}' infile
        echo
    done

This is faster than the previous code when the number of output rows is small.

0
Jan 28 '16 at 22:46