BEGIN {
    n = split(w, weight)
    total = 0
    for (i = 1; i <= n; i++) {      # turn the weights into cumulative sums, in index order
        weight[i] += total
        total = weight[i]
    }
}
FNR == 1 {
    if (NR != 1) {                  # a new input file starts: flush the buffered one
        write_partitioned_files(weight, a)
        split("", a, ":")           # clear the line buffer
    }
    fname = FILENAME                # remember which file the buffered lines belong to
}
{ a[FNR] = $0 }                     # buffer every line of the current file in memory
END { write_partitioned_files(weight, a) }

function write_partitioned_files(weight, a,    size, i, part, out) {
    size = length(a)
    for (i = 1; i <= size; i++) {
        # open the next part once i passes the scaled cumulative weight
        if (part == 0 || (part < n && i > int(size * weight[part] / total + 0.5))) {
            close(out); out = fname ".part" ++part
        }
        print a[i] " > " out        # dry run; see the note below the invocation
    }
}
Call as:
awk -vw="60 20 20" -f above_script.awk file_to_split1 file_to_split2 ...
As given, the script only performs a dry run: it prints each line followed by " > " and the name of the part file the line would go to. Replace the string " > " with the redirection operator > in the print statement to actually write the partitioned files.
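Concretely, that one change in the function above looks like this (and likewise print $0 > out in the second script further down):

print a[i] " > " out    # dry run: shows where each line would go
print a[i] > out        # real run: writes the line into its part file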
The variable w expects whitespace-separated numbers, and each input file is divided in that proportion. For example, "2 1 1 3" splits a file into four parts whose line counts are in the ratio 2:1:1:3. Any sequence of numbers adding up to 100 can be read directly as percentages.
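To make the proportions concrete, here is a hypothetical run on a 70-line file named data.txt; the part sizes follow from rounding the cumulative weights exactly as the script does:

awk -vw="2 1 1 3" -f above_script.awk data.txt
# data.txt.part1: lines  1-20  (2/7 of 70)
# data.txt.part2: lines 21-30  (1/7)
# data.txt.part3: lines 31-40  (1/7)
# data.txt.part4: lines 41-70  (3/7)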
For large files, the array a may consume too much memory. If that is a problem, here is an alternative awk script:
BEGIN {
    n = split(w, weight)
    for (i = 1; i <= n; i++) {       # cumulative sums, in index order
        total += weight[i]
        weight[i] = total
    }
}
FNR == 1 {
    # get the number of lines; take care of single quotes in the filename
    name = gensub("'", "'\"'\"'", "g", FILENAME)
    cmd = "wc -l '" name "'"
    cmd | getline size
    close(cmd)
    split("", threshold, ":")        # clear thresholds left over from the previous file
    for (i = 1; i <= n; i++)
        threshold[i] = int(size * weight[i] / total + 0.5) + 1   # first line of part i+1
    part = 1; close(out); out = FILENAME ".part" part
}
{
    if (FNR >= threshold[part]) {    # crossed into the next part
        close(out)
        out = FILENAME ".part" ++part
    }
    print $0 " > " out               # dry run; replace " > " with > to really write
}
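The gensub() call is only there to make the filename safe inside the single-quoted shell command. A small illustration, assuming a hypothetical filename containing a single quote:

# FILENAME:            it's.txt
# after gensub:        it'"'"'s.txt
# shell command run:   wc -l 'it'"'"'s.txt'
# which the shell parses back into the single argument: it's.txt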
This reads each file twice: once to count its lines (via wc -l) and once more while writing the partitioned files, so it needs no large in-memory buffer. It is invoked the same way and produces the same output as the first script. Note that both scripts rely on GNU awk extensions such as gensub() and length() on an array, so run them with gawk.