Bash - swap values in a column

I have some CSV / tabular data in a file, for example:

    1,7,3,2
    8,3,8,0
    4,9,5,3
    8,5,7,3
    5,6,1,9

(They won't always be numbers, just arbitrary comma-separated values; single-digit numbers simply make the example easier to follow.)

I want to shuffle a random 40% of the values within any one column, say the third. So perhaps the 3 and the 1 get swapped with each other. The third column is now:

    1   << came from the last position
    8
    5
    7
    3   << came from the first position

I am trying to do this in place on the file from a bash script I am working on, and I am not having much luck. I keep wandering down fairly crazy and fruitless rabbit holes, which leaves me feeling that I am going about it all wrong (the constant failure is what is getting me down).

I have tagged this question with a number of tools because I am not entirely sure which one(s) I should be using for this.

Edit: I will probably end up accepting Rubens's answer, however crazy it may be, because it incorporates the actual notion of swapping (which, in fairness, I could have emphasized more in the original question) and it lets me specify the percentage of the column to swap. It also works, which is always a plus.

For anyone who doesn't need the swapping and just wants a plain shuffle, Jim Harrison's answer also works (I tested it).

One word of warning about Rubens's solution, though. I took this:

    for (i = 1; i <= NF; ++i) {
        delim = (i != NF) ? "," : "";
        ...
    }
    printf "\n";

deleted the printf "\n"; and moved the newline into the ternary, like so:

    for (i = 1; i <= NF; ++i) {
        delim = (i != NF) ? "," : "\n";
        ...
    }

because leaving "" in the else case made awk write a broken character at the end of every line (a NUL byte, \00). At one point it even managed to fill my entire file with Chinese characters, although, to be fair, that was probably down to something extra dumb I was doing on top of this problem.
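
The underlying culprit appears to be the %c conversion: when handed an empty string, some awk implementations emit a NUL byte rather than nothing. A minimal alternative fix, assuming the same loop, is to print the delimiter with %s, which genuinely outputs nothing for an empty string:

    for (i = 1; i <= NF; ++i) {
        delim = (i != NF) ? "," : "";
        # %s prints "" as nothing; %c may turn "" into a literal NUL byte
        printf "%s%s", $i, delim;
    }
    printf "\n";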

3 answers

Algorithm

  • create a vector of pairs (n, value), where n runs from 1 to the number of lines and value is that line's entry in the selected column, and then sort the vector randomly;
  • find how many lines should be randomized: num_random = percentage * num_lines / 100;
  • select the first num_random entries from your randomized vector;
  • you may sort the selected entries again if you like, but they are already in random order;
  • print the output:

    i = 0
    for num_line, value in column; do
        if num_line not in random_vector:
            print value;            # printing non-randomized value
        else:
            print random_vector[i]; # randomized entry
            i++;
    done
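
To make the first three steps concrete for the 5-line sample and a 40% swap: num_random = 40 * 5 / 100 = 2, and the pair vector can be built with the same pipeline the implementation below uses (the two surviving pairs will vary from run to run, since the sort is random):

    $ paste -d ',' <(seq 1 5) <(cut -d ',' -f 3 infile) | sort -R | head -n 2
    5,1
    1,3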

Implementation

    #!/bin/bash

    infile=$1
    col=$2
    n_lines=$(wc -l < ${infile})
    prob=$(bc <<< "$3 * ${n_lines} / 100")

    # Selected lines
    tmp=$(tempfile)
    paste -d ',' <(seq 1 ${n_lines}) <(cut -d ',' -f ${col} ${infile}) \
        | sort -R | head -n ${prob} > ${tmp}

    # Rewriting file
    awk -v "col=$col" -F "," '
    (NR == FNR) {id[$1] = $2; next}
    (FNR == 1) {
        i = c = 1;
        for (v in id) {value[i] = id[v]; ++i;}
    }
    {
        for (i = 1; i <= NF; ++i) {
            delim = (i != NF) ? "," : "";
            if (i != col) {printf "%s%c", $i, delim; continue;}
            if (FNR in id) {printf "%s%c", value[c], delim; c++;}
            else {printf "%s%c", $i, delim;}
        }
        printf "\n";
    }
    ' ${tmp} ${infile}

    rm ${tmp}

If you need something close to in-place editing, you can pipe the output back into the input file using sponge (from moreutils).
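
For example (sponge soaks up all of its input before opening the output file, which is what makes reading and rewriting the same file safe):

    $ ./script.sh infile 3 40 | sponge infile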

Execution

To execute, simply use:

    $ ./script.sh <inpath> <column> <percentage>

As in:

    $ ./script.sh infile 3 40
    1,7,3,2
    8,3,8,0
    4,9,1,3
    8,5,7,3
    5,6,5,9

Conclusion

This lets you pick a column, randomly shuffle a given percentage of the entries in that column, and substitute the new column back into the original file.

If nothing else, this script stands as proof both that shell scripts are extremely interesting and that there are times when they definitely should not be used. :(


This is hard-coded for one particular column, but it should be enough to point you in the right direction. It works in modern bash shells, including Cygwin:

    paste -d, <(cut -d, -f1-2 test.dat) <(cut -d, -f3 test.dat|shuf) <(cut -d, -f4- test.dat)

The operative feature here: process substitution.
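
A quick illustration of what process substitution does (the exact /dev/fd path will vary): bash replaces each <(command) with the name of a file descriptor from which that command's output can be read, so paste sees three ordinary file arguments:

    $ echo <(true) <(true)
    /dev/fd/63 /dev/fd/62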

The paste command joins files horizontally; the three pieces are sliced out of the source file with cut, and the middle piece (the column to be randomized) is run through shuf to reorder its lines. Here is the output from a couple of runs:

    $ cat test.dat
    1,7,3,2
    8,3,8,0
    4,9,5,3
    8,5,7,3
    5,6,1,9
    $ paste -d, <(cut -d, -f1-2 test.dat) <(cut -d, -f3 test.dat|shuf) <(cut -d, -f4- test.dat)
    1,7,1,2
    8,3,8,0
    4,9,7,3
    8,5,3,3
    5,6,5,9
    $ paste -d, <(cut -d, -f1-2 test.dat) <(cut -d, -f3 test.dat|shuf) <(cut -d, -f4- test.dat)
    1,7,8,2
    8,3,1,0
    4,9,3,3
    8,5,7,3
    5,6,5,9

I would use a two-pass approach: first count the lines and read the file into an array, then use the awk rand() function to generate random numbers identifying the lines you will change, use rand() again to decide which pairs of those lines to swap, and finally swap the array elements before printing. Something like this PSEUDO-CODE, rough algorithm:

    awk -F, -v pct=40 -v col=3 '
    NR == FNR {
        array[++totNumLines] = $0
        next
    }
    FNR == 1 {
        pctNumLines = totNumLines * pct / 100
        srand()
        for (i=1; i<=(pctNumLines / 2); i++) {
            oldLineNr = rand() * some factor to produce a line number that
                        is in the 1 to totNumLines range but is not already
                        recorded as processed in the "swapped" array.
            newLineNr = ditto plus must not equal oldLineNr
            swap field $col between array[oldLineNr] and array[newLineNr]
            swapped[oldLineNr]
            swapped[newLineNr]
        }
        next
    }
    { print array[FNR] }
    ' "$file" "$file" > tmp && mv tmp "$file"
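
A runnable sketch of that pseudocode might look like the following. This is my fleshing-out of the outline above, not tested production code: the do/while rejection loops assume pct is well under 100 so that unswapped line numbers stay easy to find, and rejoin() is a helper name introduced here to glue the fields back together:

    awk -F, -v pct=40 -v col=3 '
    function rejoin(f, n,   i, s) {      # glue n fields back together with commas
        s = f[1]
        for (i = 2; i <= n; i++)
            s = s "," f[i]
        return s
    }
    NR == FNR {                          # pass 1: buffer the whole file
        array[++totNumLines] = $0
        next
    }
    FNR == 1 {                           # pass 2, first line: do all the swaps
        pctNumLines = totNumLines * pct / 100
        srand()
        for (i = 1; i <= pctNumLines / 2; i++) {
            do {                         # a random line not already swapped
                oldLineNr = int(rand() * totNumLines) + 1
            } while (oldLineNr in swapped)
            swapped[oldLineNr] = 1
            do {                         # a second, distinct unswapped line
                newLineNr = int(rand() * totNumLines) + 1
            } while (newLineNr in swapped)
            swapped[newLineNr] = 1
            nOld = split(array[oldLineNr], oldF, ",")
            nNew = split(array[newLineNr], newF, ",")
            tmp = oldF[col]; oldF[col] = newF[col]; newF[col] = tmp
            array[oldLineNr] = rejoin(oldF, nOld)
            array[newLineNr] = rejoin(newF, nNew)
        }
        # no "next" here, so this first line falls through and is printed too
    }
    { print array[FNR] }
    ' "$file" "$file" > tmp && mv tmp "$file"

One detail worth noting: dropping the next at the end of the FNR == 1 block matters, because with it the first buffered line would never reach the print rule.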
