Split a large compressed file into multiple outputs using AWK and BASH

I have a large (3 GB), gzipped file containing two fields: NAME and STRING. I want to split this file into smaller files: if the first field is john_smith, I want the line to be placed in john_smith.gz. NOTE: the STRING field may contain special characters.

I can do this easily with a for loop over the names in BASH, but I would prefer the efficiency of reading the file only once using AWK.

I tried using the system() function in awk, with escaped single quotes around the string:

 zcat large_file.gz | awk '{system("echo -e '"'"'"$1"\t"$2"'"'"' | gzip >> "$1".gz");}'

and it works fine on most lines; however, some of them print errors to STDERR saying the shell cannot execute a command (the shell treats part of the line as a command). Special characters seem to break it.
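To illustrate with a made-up line (not from my actual data): if the input contained john_smith, a tab, and the string it's here, the system() call would hand the shell something like

 # hypothetical input line:   john_smith<TAB>it's here
 # command that the awk program builds and passes to /bin/sh:
 echo -e 'john_smith\tit's here' | gzip >> john_smith.gz
 # the quote in "it's" closes the single-quoted string early, so the shell
 # tries to interpret the rest of the line as more commands/arguments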

Any thoughts on how to fix this, or any alternative implementations that will help?

Thanks!

-Sean

+4
4 answers

This little Perl script does a great job of:

  • keeping all destination files open, for performance
  • performing rudimentary error handling
  • Edit: it now also pipes the output through gzip

There is a bit of a kludge with $fh, because apparently printing to the hash element directly does not work.

 #!/usr/bin/perl
 use strict;
 use warnings;

 my $suffix = ".txt.gz";

 my %pipes;
 while (my ($id, $line) = split /\t/, (<>), 2) {
     # open one gzip pipe per id the first time we see it
     exists $pipes{$id}
         or open ($pipes{$id}, "|gzip -9 > '$id$suffix'")
         or die "can't open/create $id$suffix, or cannot spawn gzip";
     my $fh = $pipes{$id};   # kludge: copy the handle into a lexical before printing
     print $fh $line;
 }
 print STDERR "Created: " . join(', ', map { "$_$suffix" } keys %pipes) . "\n";

Oh, use it like:

 zcat input.gz | ./myscript.pl 
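(Editorial sketch, not part of the original answer: the same keep-one-pipe-open-per-name idea can be written directly in awk, assuming the NAME field is safe to embed in a file name. Because each line is written straight down the pipe rather than passed through echo, special characters in the STRING field never reach the shell.)

 zcat large_file.gz | awk -F'\t' '{
     # build one gzip command per distinct name; \047 is a single quote
     cmd = "gzip -9 >> \047" $1 ".gz\047"
     print $0 | cmd        # awk keeps the pipe to this command open between lines
     pipes[cmd] = 1        # remember it so it can be closed at the end
 }
 END {
     for (c in pipes) close(c)
 }'
 # note: some awk implementations limit how many pipes can be open at once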
0

You are facing a big trade-off between time and disk space. I assume you are trying to save space by appending records to the end of your ${name}.gz files. @sehe's comments and code are definitely worth considering.

In any case, your time is more valuable than 3 GB of disk space. Why not try

 zcat large_file.gz \
 | awk -F'\t' '{
     name = $1
     string = $2
     outFile = name ".txt"
     print name "\t" string >> outFile
     # close(outFile)
   }'

 echo *.txt | xargs gzip -9

You may need to uncomment the # close(outFile) line. xargs is included because I assume you will have more than 1000 file names; even if you do not, it does not hurt to use this technique.
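In case it helps, here is a rough sketch of what the uncommented variant might look like (my illustration, not the answer's own code); closing after every write is slower but avoids running out of open file descriptors when there are many distinct names:

 zcat large_file.gz \
 | awk -F'\t' '{
     outFile = $1 ".txt"
     print $1 "\t" $2 >> outFile
     close(outFile)    # reopen in append mode on the next line for this name
   }'

 echo *.txt | xargs gzip -9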

Note that this code assumes tab-delimited data; change the -F argument value and the "\t" in the print statement as necessary to match your field separator.

I have not had time to test this. If you like the idea and get stuck, post some sample data, the expected output, and the error messages you receive.

Hope this helps.

+2

Create this program as, say, largesplitter.c, and use the command

 zcat large_file.gz | largesplitter 

The bare-bones program:

 #include <errno.h>
 #include <stdio.h>
 #include <string.h>

 int main (void)
 {
     char buf [32000];   // todo: resize this if the second field is larger than this
     char cmd [120];
     long linenum = 0;

     while (fgets (buf, sizeof buf, stdin)) {
         ++linenum;
         char *cp = strchr (buf, '\t');   // identify first field delimited by tab
         if (!cp) {
             fprintf (stderr, "line %ld missing delimiter\n", linenum);
             continue;
         }
         *cp = '\000';                    // split line
         FILE *out = fopen (buf, "w");
         if (!out) {
             fprintf (stderr, "error creating '%s': %s\n", buf, strerror(errno));
             continue;
         }
         fprintf (out, "%s", cp+1);
         fclose (out);
         snprintf (cmd, sizeof cmd, "gzip %s", buf);
         system (cmd);
     }
     return 0;
 }

This compiles without errors on my system, but I have not tested its functionality.
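A possible compile-and-run sequence (assuming a POSIX cc compiler and that the binary sits in the current directory; adjust names as needed):

 cc -o largesplitter largesplitter.c
 zcat large_file.gz | ./largesplitter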

0

Maybe try something along the lines of:

zcat large_file.gz | echo $("awk '{system("echo -e '"'"'"$1"\t"$2"'"'"' | gzip >> "$1".gz");}'")

I have not tried it myself, since I do not have any large files to play with.

0
