How to split a file by percentages of its number of lines?

Say I want to split a file into 3 parts (60% / 20% / 20%). I could do it manually, -_-:

 $ wc -l brown.txt
 57339 brown.txt
 $ bc <<< "57339 / 10 * 6"
 34398
 $ bc <<< "57339 / 10 * 2"
 11466
 $ bc <<< "34398 + 11466"
 45864
 $ bc <<< "34398 + 11466 + 11475"
 57339
 $ head -n 34398 brown.txt > part1.txt
 $ sed -n 34399,45864p brown.txt > part2.txt
 $ sed -n 45865,57339p brown.txt > part3.txt
 $ wc -l part*.txt
 34398 part1.txt
 11466 part2.txt
 11475 part3.txt
 57339 total

But I'm sure there's a better way!

+8
split file bash awk sed
6 answers

There is a utility that takes as arguments the line numbers that should become the first line of each new file: csplit. Below is a wrapper around the POSIX version:

 #!/bin/bash

 usage () {
     printf '%s\n' "${0##*/} [-ks] [-f prefix] [-n number] file arg1..." >&2
 }

 # Collect csplit options
 while getopts "ksf:n:" opt; do
     case "$opt" in
         k|s) args+=(-"$opt") ;;           # k: no remove on error, s: silent
         f|n) args+=(-"$opt" "$OPTARG") ;; # f: filename prefix, n: digits in number
         *) usage; exit 1 ;;
     esac
 done
 shift $(( OPTIND - 1 ))

 fname=$1
 shift
 ratios=("$@")

 len=$(wc -l < "$fname")

 # Sum of ratios and array of cumulative ratios
 for ratio in "${ratios[@]}"; do
     (( total += ratio ))
     cumsums+=("$total")
 done

 # Don't need the last element
 unset cumsums[-1]

 # Array of numbers of first line in each split file
 for sum in "${cumsums[@]}"; do
     linenums+=( $(( sum * len / total + 1 )) )
 done

 csplit "${args[@]}" "$fname" "${linenums[@]}"

After the name of the file to be split, it takes the ratios of the sizes of the split files relative to their total, i.e.

 percsplit brown.txt 60 20 20
 percsplit brown.txt 6 2 2
 percsplit brown.txt 3 1 1

are equivalent.

A usage similar to the case in the question is as follows:

 $ percsplit -s -f part -n 1 brown.txt 60 20 20
 $ wc -l part*
 34403 part0
 11468 part1
 11468 part2
 57339 total

Numbering starts from zero, and there is no .txt extension. The GNU version supports the --suffix-format option, which would allow a .txt extension; it could be added to the accepted arguments, but that would require something more complicated than getopts to parse them.
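For illustration, a minimal sketch of that GNU-only option on a made-up demo file (the demo file and prefix names are invented here, and this requires GNU csplit): -b/--suffix-format takes a printf-style format, so the pieces can carry a .txt extension:

```shell
# GNU csplit only: give the pieces a .txt extension via a printf-style
# suffix format. Split a 10-line demo file at lines 4 and 8, producing
# part0.txt (3 lines), part1.txt (4 lines), part2.txt (3 lines).
seq 10 > demo.txt
csplit -s -f part -b '%d.txt' demo.txt 4 8
wc -l part*.txt
```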

This solution handles even very short files well (splitting a two-line file into two parts works), and csplit does the heavy lifting.
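For reference, csplit's raw interface can be sketched on a toy file (the file and prefix names are invented here): each line-number argument begins the next piece:

```shell
# Split a 10-line file at lines 4 and 8: pieces of 3, 4 and 3 lines.
# -s is silent, -f sets the output prefix (default xx), so the pieces
# are named piece00, piece01, piece02.
seq 10 > demo.txt
csplit -s -f piece demo.txt 4 8
wc -l piece*
```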

+8
 $ cat file
 a
 b
 c
 d
 e
 $ cat tst.awk
 BEGIN {
     split(pcts, p)
     nrs[1]
     for (i=1; i in p; i++) {
         pct += p[i]
         nrs[int(size * pct / 100) + 1]
     }
 }
 NR in nrs { close(out); out = "part" ++fileNr ".txt" }
 { print $0 " > " out }
 $ awk -v size=$(wc -l < file) -v pcts="60 20 20" -f tst.awk file
 a > part1.txt
 b > part1.txt
 c > part1.txt
 d > part2.txt
 e > part3.txt

Change " > " to > only to the actual write to the output files.

+9
 BEGIN {
     split(w, weight)
     total = 0
     for (i in weight) {
         weight[i] += total
         total = weight[i]
     }
 }
 FNR == 1 {
     if (NR != 1) {
         write_partitioned_files(weight, a)
         split("", a, ":")   # empty a portably
     }
     name = FILENAME
 }
 { a[FNR] = $0 }
 END { write_partitioned_files(weight, a) }

 function write_partitioned_files(weight, a) {
     split("", threshold, ":")
     size = length(a)
     for (i in weight) {
         threshold[length(threshold)] = int((size * weight[i] / total) + 0.5) + 1
     }
     l = 1
     part = 0
     for (i in threshold) {
         close(out)
         out = name ".part" ++part
         for (; l < threshold[i]; l++) {
             print a[l] " > " out
         }
     }
 }

Call as:

 awk -vw="60 20 20" -f above_script.awk file_to_split1 file_to_split2 ... 

Replace " > " with > in the script to actually write the partitioned files.

The variable w expects whitespace-separated numbers. The files are divided in that proportion. For example, "2 1 1 3" will split each file into four parts with line counts in the ratio 2:1:1:3. Any sequence of numbers adding up to 100 can be used as percentages.

For large files, the array a may consume too much memory. If that is a problem, here is an alternative awk script:

 BEGIN {
     split(w, weight)
     for (i in weight) {
         total += weight[i]
         weight[i] = total   # cumulative sum
     }
 }
 FNR == 1 {
     # Get the number of lines; take care of single quotes in the filename.
     name = gensub("'", "'\"'\"'", "g", FILENAME)
     "wc -l '" name "'" | getline size
     split("", threshold, ":")
     for (i in weight) {
         threshold[length(threshold)+1] = int((size * weight[i] / total) + 0.5) + 1
     }
     part = 1; close(out); out = FILENAME ".part" part
 }
 {
     if (FNR >= threshold[part]) {
         close(out)
         out = FILENAME ".part" ++part
     }
     print $0 " > " out
 }

This goes through each file twice: once to count the lines (via wc -l) and again while writing the partitioned files. Invocation and effect are similar to the first method.

+1

The following bash script allows you to specify the percentages, for example

 ./split.sh brown.txt 60 20 20 

You can also use a placeholder ., which fills the remaining percentage up to 100%.

 ./split.sh brown.txt 60 20 . 

The split files are written to

 part1-brown.txt part2-brown.txt part3-brown.txt 

The script always generates as many part files as percentages are given. If the percentages add up to 100, cat part* will always reproduce the source file (no duplicated or missing lines).
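That lossless property is easy to verify with cmp; here is a self-contained sketch (the sample file and part names are invented for the demonstration, mirroring the partN-file naming used above):

```shell
# If the percentages sum to 100, concatenating the parts in order must
# reproduce the original byte for byte. Split a 100-line sample 60/20/20
# by hand, then compare the concatenation against the original.
seq 100 > sample.txt
head -n 60 sample.txt > part1-sample.txt
sed -n '61,80p'  sample.txt > part2-sample.txt
sed -n '81,100p' sample.txt > part3-sample.txt
cat part1-sample.txt part2-sample.txt part3-sample.txt | cmp -s - sample.txt && echo lossless
```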

Bash Script: split.sh

 #! /bin/bash

 file="$1"
 fileLength=$(wc -l < "$file")
 shift

 part=1
 percentSum=0
 currentLine=1
 for percent in "$@"; do
     [ "$percent" == "." ] && ((percent = 100 - percentSum))
     ((percentSum += percent))
     if ((percent < 0 || percentSum > 100)); then
         echo "invalid percentage" 1>&2
         exit 1
     fi
     ((nextLine = fileLength * percentSum / 100))
     if ((nextLine < currentLine)); then
         printf ""   # create empty file
     else
         sed -n "$currentLine,$nextLine"p "$file"
     fi > "part$part-$file"
     ((currentLine = nextLine + 1))
     ((part++))
 done
+1

I like Benjamin W.'s csplit solution, but it got so long... Here is a shorter take:

 #!/bin/bash
 # usage: ./splitpercs.sh file 60 20 20
 n=`wc -l <"$1"` || exit 1
 echo $* | tr ' ' '\n' | tail -n+2 | head -n`expr $# - 1` |
   awk -vn=$n 'BEGIN{r=1} {r += n*$0/100; if (r > 1 && r < n) { printf "%d\n", r }}' |
   uniq | xargs csplit -sfpart "$1"

(The if (r > 1 && r < n) and uniq bits are there to prevent creating empty files or strange behavior for small percentages, files with few lines, or percentages that add up to more than 100.)

+1

I just followed your lead and did in a script what you did manually. It may not be the fastest or the "best", but if you understand what you are doing now and can just "fix" it later, you may be better off when you need to maintain it.

 #!/bin/bash
 # thisScript.sh yourfile.txt 20 50 10 20
 YOURFILE=$1
 shift

 # changed to cat | wc so I dont have to remove the filename which comes
 # from wc -l
 LINES=$(cat "$YOURFILE" | wc -l)

 startpct=0
 PART=1
 for pct in "$@"; do
     # I am assuming that each parameter is on top of the last,
     # so 10 30 10 would become 10, 10+30 = 40, 10+30+10 = 50, ...
     endpct=$(echo "$startpct + $pct" | bc)
     # your math, but in parts of 100 instead of parts of 10;
     # changed bc <<< to echo "..." | bc so that one can capture
     # the output into a bash variable
     FIRSTLINE=$(echo "$LINES * $startpct / 100 + 1" | bc)
     LASTLINE=$(echo "$LINES * $endpct / 100" | bc)
     # use sed every time because the special case for head
     # doesn't really help performance
     sed -n "$FIRSTLINE,${LASTLINE}p" "$YOURFILE" > "part${PART}.txt"
     ((PART++))
     startpct=$endpct
 done

 # get the rest if the percentages don't add up to 100%
 if [[ $(echo "$endpct < 100" | bc) -gt 0 ]]; then
     FIRSTLINE=$((LASTLINE + 1))
     sed -n "$FIRSTLINE,\$p" "$YOURFILE" > "part${PART}.txt"
 fi

 wc -l part*.txt
+1
