Awk sampling without replacement

I have many text files that look like this:

>ALGKAHOLAGGATACCATAGATGGCACGCCCT
>BLGKAHOLAGGATACCATAGATGGCACGCCCT
>HLGKAHOLAGGATACCATAGATGGCACGCCCT
>DLGKAHOLAGGATACCATAGATGGCACGCCCT
>ELGKAHOLAGGATACCATAGATGGCACGCCCT
>FLGKAHOLAGGATACCATAGATGGCACGCCCT
>JGGKAHOLAGGATACCATAGATGGCACGCCCT
>POGKAHOLAGGATACCATAGATGGCACGCCCT

Is there a way to sample without replacement using awk?

For example, I have these 8 lines, and I want to randomly select 4 of them into a new file, without replacement. The result should look something like this:

>FLGKAHOLAGGATACCATAGATGGCACGCCCT
>POGKAHOLAGGATACCATAGATGGCACGCCCT    
>ALGKAHOLAGGATACCATAGATGGCACGCCCT
>BLGKAHOLAGGATACCATAGATGGCACGCCCT

Thank you in advance

4 answers

How about this to randomly sample 10% of your rows?

awk 'rand()>0.9' yourfile1 yourfile2 anotherfile

I'm not sure what you mean by "replacement" ... there is no replacement here, just a random choice.

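Basically, awk looks at each line of each file exactly once and generates a random number between 0 and 1. If the number is greater than 0.9, the line is printed. In effect it rolls a 10-sided die for each line and prints the line only when the die comes up 10. There is no chance of printing the same line twice, unless of course it occurs twice in your files.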

For added randomness (!) you can call srand() at the start, as suggested by @klashxx:

awk 'BEGIN{srand()} rand()>0.9' yourfile(s)
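
One caveat worth sketching: srand() with no argument seeds from the time of day, which in common awk implementations has one-second resolution, so two runs started within the same second can print the same sample. A minimal sketch that passes the seed in explicitly is below. The function name sample_fraction and the use of bash's $RANDOM as a seed are assumptions for illustration, not part of the original answer.

```shell
# sample_fraction SEED FILE...
# Print roughly 10% of the input lines, seeding awk's rand() explicitly.
# (sample_fraction is a hypothetical helper name.)
sample_fraction() {
  seed=$1
  shift
  awk -v seed="$seed" 'BEGIN { srand(seed) } rand() > 0.9' "$@"
}

# Example call with a different seed on each run (bash):
#   sample_fraction "$RANDOM" yourfile
```

The same seed always reproduces the same sample, which can also be handy when a run needs to be repeated.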

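To get an exact number of lines, shuffle the file first. Use shuf or sort -R (neither is in POSIX) to put the lines in random order, then take the first n lines with head.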
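
A minimal sketch of the shuffle-then-head approach, assuming GNU coreutils is installed (pick_lines is a hypothetical helper name):

```shell
# pick_lines N FILE
# Print N distinct lines of FILE in random order.
# shuf is part of GNU coreutils and is not specified by POSIX.
pick_lines() {
  shuf "$2" | head -n "$1"
}

# Example: write 4 of the 8 lines to a new file
#   pick_lines 4 yourfile > sample.txt
```

With GNU shuf, `shuf -n 4 yourfile` should give the same result in one step. `sort -R yourfile | head -n 4` also works on GNU systems, but GNU sort -R orders lines by a random hash of the key, so identical lines end up next to each other.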

If you want to do it in awk, you have to use rand, as Mark Setchell shows.
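
For an exact count in awk itself, one possible sketch uses rand with selection sampling: count the lines in a first pass, then in a second pass keep each line with probability (samples still needed) / (lines still left). This is a swapped-in technique, not something from the answers here. The name awk_sample is hypothetical, the file is read twice, and N is assumed to be no larger than the number of lines.

```shell
# awk_sample N FILE
# Print exactly N lines of FILE, each at most once, in original order.
# (awk_sample is a hypothetical helper name; assumes N <= line count.)
awk_sample() {
  awk -v n="$1" '
    BEGIN     { srand() }
    NR == FNR { total++; next }   # first pass: count the lines
    # second pass: keep this line with probability needed / remaining
    rand() < (n - picked) / (total - FNR + 1) {
      print
      picked++
    }
  ' "$2" "$2"
}

# Example:
#   awk_sample 4 yourfile > sample.txt
```

Because each line is visited once in the second pass, no line can be printed twice, and the probability reaches 1 when the remaining lines are all needed, so the output always has N lines.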


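Sampling without replacement means that once a line has been picked it cannot be picked again, so the sampler has to remember which lines it has already used. For example, when picking 10 lines out of 100, each line may appear in the output at most once.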

Here is a bash script that takes NUM (the number of samples) and FILE:

#!/usr/bin/env bash
# random-samples.sh NUM FILE
# extract NUM random (without replacement) lines from FILE

num=$(( 10#${1:?'Missing sample size'} ))
file="${2:?'Missing file to sample'}"

lines=$(wc -l <"$file")   # number of lines in the file

# sampling without replacement cannot return more lines than the file has
if (( num > lines )); then
  echo "Cannot take $num samples from $lines lines" >&2
  exit 1
fi

# get_sample MAX
#
# get a random number between 1 .. MAX
# (see the bash man page on RANDOM: it yields 0 .. 32767, so dividing
# by 32768 keeps the result within 1 .. MAX; files longer than 32768
# lines are not sampled uniformly)

get_sample() {
  local max="$1"
  local rand=$(( ((max * RANDOM) / 32768) + 1 ))
  echo "$rand"
}

# select_line LINE FILE
#
# select line LINE from FILE

select_line() {
  head -n "$1" "$2" | tail -n 1
}

declare -A samples     # keep track of samples

for ((i=1; i<=num; i++)) ; do
  sample=
  while [[ -z "$sample" ]]; do
    sample=$(get_sample "$lines")            # get a new sample
    if [[ -n "${samples[$sample]}" ]]; then  # already used?
      sample=                                # yes, go again
    else
      samples[$sample]=1                     # new sample, track it
    fi
  done
  line=$(select_line "$sample" "$file")      # fetch the sampled line
  printf "%2d: %s\n" "$i" "$line"
done
exit

Example runs:

./random-samples.sh 10 poetry-samples.txt
 1: 11. Because I could not stop for death/He kindly stopped for me 2,360,000 Emily Dickinson
 2: 25. Hope springs eternal in the human breast 1,080,000 Alexander Pope
 3: 43. The moving finger writes; and, having writ,/Moves on571,000 Edward Fitzgerald
 4: 5. And miles to go before I sleep 5,350,000 Robert Frost
 5: 6. Not with a bang but a whimper 5,280,000 T.S. Eliot
 6: 40. In Xanadu did Kubla Khan 594,000 Coleridge
 7: 41. The quality of mercy is not strained 589,000 Shakespeare
 8: 7. Tread softly because you tread on my dreams 4,860,000 W.B. Yeats
 9: 42. They also serve who only stand and wait 584,000 Milton
10: 48. If you can keep your head when all about you 447,000Kipling

./random-samples.sh 10 poetry-samples.txt
 1: 38. Shall I compare thee to a summers day 638,000 Shakespeare
 2: 34. Busy old fool, unruly sun 675,000 John Donne
 3: 14. Candy/Is dandy/But liquor/Is quicker 2,150,000 Ogden Nash
 4: 45. We few, we happy few, we band of brothers 521,000Shakespeare
 5: 9. Look on my works, ye mighty, and despair 3,080,000 Shelley
 6: 11. Because I could not stop for death/He kindly stopped for me 2,360,000 Emily Dickinson
 7: 46. If music be the food of love, play on 507,000 Shakespeare
 8: 44. What is this life if, full of care,/We have no time to stand and stare 528,000 W.H. Davies
 9: 35. Do not go gentle into that good night 665,000 Dylan Thomas
10: 15. But at my back I always hear 2,010,000 Marvell

./random-samples.sh 10 poetry-samples.txt
 1: 26. I think that I shall never see/A poem lovely as a tree. 1,080,000 Joyce Kilmer
 2: 32. Human kind/Cannot bear very much reality 891,000 T.S. Eliot
 3: 14. Candy/Is dandy/But liquor/Is quicker 2,150,000 Ogden Nash
 4: 13. My mistress’ eyes are nothing like the sun 2,230,000Shakespeare
 5: 42. They also serve who only stand and wait 584,000 Milton
 6: 24. When in disgrace with fortune and men eyes 1,100,000Shakespeare
 7: 21. A narrow fellow in the grass 1,310,000 Emily Dickinson
 8: 9. Look on my works, ye mighty, and despair 3,080,000 Shelley
 9: 10. Tis better to have loved and lost/Than never to have loved at all 2,400,000 Tennyson
10: 31. O Romeo, Romeo; wherefore art thou Romeo 912,000Shakespeare

It might be better to sample the file using a fixed scheme, for example taking one record every 10 lines. You can do this with this awk one-liner:

awk '0==NR%10' filename

If you want to sample a given number or percentage of the total, you can first compute the step that the awk one-liner should use, so that the number of printed records matches that number or percentage.
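
As a rough sketch of that calculation, the step can be derived from the line count in a first pass over the file. The function name every_kth is hypothetical, WANT is assumed to be at least 1, and the file is read twice.

```shell
# every_kth WANT FILE
# Print WANT evenly spaced lines of FILE (fixed scheme, not random).
# (every_kth is a hypothetical helper name; assumes WANT >= 1.)
every_kth() {
  awk -v want="$1" '
    NR == FNR { total++; next }   # first pass: count the lines
    FNR == 1  { step = int(total / want); if (step < 1) step = 1 }
    FNR % step == 0 && printed < want { print; printed++ }
  ' "$2" "$2"
}

# Example: for an 8-line file, every_kth 4 yourfile prints lines 2, 4, 6 and 8
```

For a percentage, WANT can be computed beforehand as the line count times the percentage divided by 100.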

Hope this helps!
