Split a large compressed file into multiple outputs using AWK and BASH

I have a large (3 GB), gzipped file containing two fields: NAME and STRING. I want to split this file into smaller files: if the first field is john_smith, I want the line to be placed in john_smith.gz. NOTE: the STRING field may contain special characters.

I can do this easily with a for loop over the names in BASH, but I would prefer the efficiency of reading the file only once using AWK.

I tried using the system() function in awk, with escaped single quotes around the string:

 zcat large_file.gz | awk '{system("echo -e '"'"'"$1"\t"$2"'"'"' | gzip >> "$1".gz");}'

and it works fine on most lines; however, some of them print errors to STDERR saying the shell cannot execute a command (the shell treats part of the line as a command). Special characters seem to break it.
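To illustrate with a made-up line (not from my actual data): if the input contained john_smith, a tab, and the string it's here, the system() call would hand the shell something like

 # hypothetical input line:   john_smith<TAB>it's here
 # command that the awk program builds and passes to /bin/sh:
 echo -e 'john_smith\tit's here' | gzip >> john_smith.gz
 # the quote in "it's" closes the single-quoted string early, so the shell
 # tries to interpret the rest of the line as more commands/arguments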

Any thoughts on how to fix this, or any alternative implementations that will help?

Thanks!

-Sean

+4
4 answers

This little Perl script does a great job of:

  • keeping all destination files open, for performance
  • performing rudimentary error handling
  • Edit: it now also pipes the output through gzip

There is a bit of a kludge with $fh, because apparently printing to the hash element directly does not work.

 #!/usr/bin/perl
 use strict;
 use warnings;

 my $suffix = ".txt.gz";

 my %pipes;
 while (my ($id, $line) = split /\t/, (<>), 2) {
     # open one gzip pipe per id the first time we see it
     exists $pipes{$id}
         or open ($pipes{$id}, "|gzip -9 > '$id$suffix'")
         or die "can't open/create $id$suffix, or cannot spawn gzip";
     my $fh = $pipes{$id};   # kludge: copy the handle into a lexical before printing
     print $fh $line;
 }
 print STDERR "Created: " . join(', ', map { "$_$suffix" } keys %pipes) . "\n";

Oh, use it like:

 zcat input.gz | ./myscript.pl 
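(Editorial sketch, not part of the original answer: the same keep-one-pipe-open-per-name idea can be written directly in awk, assuming the NAME field is safe to embed in a file name. Because each line is written straight down the pipe rather than passed through echo, special characters in the STRING field never reach the shell.)

 zcat large_file.gz | awk -F'\t' '{
     # build one gzip command per distinct name; \047 is a single quote
     cmd = "gzip -9 >> \047" $1 ".gz\047"
     print $0 | cmd        # awk keeps the pipe to this command open between lines
     pipes[cmd] = 1        # remember it so it can be closed at the end
 }
 END {
     for (c in pipes) close(c)
 }'
 # note: some awk implementations limit how many pipes can be open at once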
0

You are facing a big trade-off between time and disk space. I assume you are trying to save space by appending records to the end of your ${name}.gz files. @sehe's comments and code are definitely worth considering.

In any case, your time is more valuable than 3 GB of disk space. Why not try

 zcat large_file.gz \
 | awk -F'\t' '{
     name = $1
     string = $2
     outFile = name ".txt"
     print name "\t" string >> outFile
     # close(outFile)
   }'

 echo *.txt | xargs gzip -9

You may need to uncomment the # close(outFile) line. xargs is included because I assume you will have more than 1000 file names; even if you do not, it does not hurt to use this technique.
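In case it helps, here is a rough sketch of what the uncommented variant might look like (my illustration, not the answer's own code); closing after every write is slower but avoids running out of open file descriptors when there are many distinct names:

 zcat large_file.gz \
 | awk -F'\t' '{
     outFile = $1 ".txt"
     print $1 "\t" $2 >> outFile
     close(outFile)    # reopen in append mode on the next line for this name
   }'

 echo *.txt | xargs gzip -9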

Note that this code assumes tab-delimited data; change the -F argument value and the "\t" in the print statement as necessary to match your field separator.

I have not had time to test this. If you like the idea and get stuck, post some sample data, the expected output, and the error messages you receive.

Hope this helps.

+2

Create this program as, say, largesplitter.c, and use the command

 zcat large_file.gz | largesplitter 

The bare-bones program:

 #include <errno.h>
 #include <stdio.h>
 #include <string.h>

 int main (void)
 {
     char buf [32000];   // todo: resize this if the second field is larger than this
     char cmd [120];
     long linenum = 0;

     while (fgets (buf, sizeof buf, stdin)) {
         ++linenum;
         char *cp = strchr (buf, '\t');   // identify first field delimited by tab
         if (!cp) {
             fprintf (stderr, "line %ld missing delimiter\n", linenum);
             continue;
         }
         *cp = '\000';                    // split line
         FILE *out = fopen (buf, "w");
         if (!out) {
             fprintf (stderr, "error creating '%s': %s\n", buf, strerror(errno));
             continue;
         }
         fprintf (out, "%s", cp+1);
         fclose (out);
         snprintf (cmd, sizeof cmd, "gzip %s", buf);
         system (cmd);
     }
     return 0;
 }

This compiles without errors on my system, but I have not tested its functionality.
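A possible compile-and-run sequence (assuming a POSIX cc compiler and that the binary sits in the current directory; adjust names as needed):

 cc -o largesplitter largesplitter.c
 zcat large_file.gz | ./largesplitter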

0

Maybe try something along the lines of:

zcat large_file.gz | echo $("awk '{system("echo -e '"'"'"$1"\t"$2"'"'"' | gzip >> "$1".gz");}'")

I have not tried it myself, since I do not have any large files to play with.

0
