I found out that if you sort the list of files by extension rather than alphabetically before putting them into a tar archive, you can significantly improve the compression ratio (especially for large source trees, where you probably have lots of .c, .o and .h files).
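(If you want to measure the effect on your own tree, a quick before/after comparison might look like the following; xz here is just an example, any compressor will do:)

# archive size in bytes: plain find order vs. rev|sort|rev extension grouping
find . -type f | tar --no-recursion -T - -cf - | xz -c | wc -c
find . -type f | rev | sort | rev | tar --no-recursion -T - -cf - | xz -c | wc -c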
I could not find an easy way to sort files from the shell that works the way I expect in every case. An easy solution like find | rev | sort | rev does the job, but the files come out in an odd order that is not the best grouping for compression. Other tools, such as ls -X, do not work with find output, and sort -t. -k 2,2 -k 1,1 gets confused when a file name contains more than one period (for example, version-1.5.tar). Another quick-and-dirty option uses sed to replace the last period with / (which never appears inside a file name), sorts splitting on /, and then restores the name:
sed 's/\(\.[^.]*\)$/\/\1/' | sort -t/ -k 2,2 -k 1,1 | sed 's/\/\([^/]*\)$/\1/'
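On bare file names this behaves as intended, for example:

printf '%s\n' 'main.c' 'version-1.5.tar' 'main.h' | sed 's/\(\.[^.]*\)$/\/\1/' | sort -t/ -k 2,2 -k 1,1 | sed 's/\/\([^/]*\)$/\1/'
# prints: main.c, main.h, version-1.5.tar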
Again, however, this does not work on find output, which does have / in its names; and any other character (except NUL) is allowed in file names on *nix.
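A two-line demonstration of the breakage (the directory component shifts the sort fields, so src/main.c is compared on main rather than on .c):

printf '%s\n' 'src/main.c' 'main.h' | sed 's/\(\.[^.]*\)$/\/\1/' | sort -t/ -k 2,2 -k 1,1 | sed 's/\/\([^/]*\)$/\1/'
# prints main.h before src/main.c; the .c extension never took part in the comparison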
I found that in Perl you can write your own comparison routine that returns the same values as cmp (similar to strcmp in C) and pass it to Perl's sort function, and such a comparison was easy to write with Perl regular expressions. That is exactly what I did: I now have a Perl script that does
@lines = <STDIN>;
print sort myComparisonFunction @lines;
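which is then used as a filter: find -type f | perl exsort.pl (exsort.pl being the name I saved it under, used again below).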
However, Perl is not as portable as bash, so I want this to work as a shell script. Also, find does not append a trailing slash to directory names, so the script treats directories as files without an extension. Ideally, I would like tar to get all directories first, then regular files (sorted), and then symbolic links, which I can achieve with
cat <(find -type d) <(find -type f | perl exsort.pl) <(find -not -type d -and -not -type f) | tar --no-recursion -T - -cvf myfile.tar
but I am still stuck with the problem that either I have to type this monster every time, or I keep both a shell script for this long line and a Perl script for the sorting; and since Perl is not available everywhere, squashing it all into a single Perl script is not a great solution either. (I am mostly concerned about older machines; all current Linux distributions and OS X ship with a fairly modern version of Perl.)
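(For completeness, a minimal sketch of that two-file workaround, assuming the wrapper is saved as tarsort.sh next to exsort.pl; tarsort.sh is just a placeholder name, and a grouped command replaces the bash-only process substitution:)

#!/bin/sh
# tarsort.sh -- hypothetical wrapper for the long pipeline above;
# writes the archive named by $1 and expects exsort.pl in the same directory
dir=$(dirname "$0")
{
  find . -type d
  find . -type f | perl "$dir/exsort.pl"
  find . ! -type d ! -type f
} | tar --no-recursion -T - -cvf "$1"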
I would like to combine everything into a single shell script, but I do not know how to pass a user-defined comparison function to GNU sort. Am I out of luck and stuck with the Perl script, or can this be done in a single shell script?
EDIT: Thanks for the Schwartzian transform idea. I ended up using a slightly different approach based on sed. My final sorting pipeline is as follows:
sed 's_^\(\([^/]*/\)*\)\(.*\)\(\.[^\./]*\)$_\4/\3/\1_' | sed 's_^\(\([^/]*/\)*\)\([^\./]\+\)$_/\3/\1_' | sort -t/ -k1,1 -k2,2 -k3,3 | sed 's_^\([^/]*\)/\([^/]*\)/\(.*\)$_\3\2\1_'
This handles special characters (such as *) in file names and puts files without an extension first, since those are often text files (Makefile, COPYING, README, configure, etc.).
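A quick sanity check of the pipeline on a few sample names:

printf '%s\n' 'src/main.c' 'Makefile' 'src/util.h' 'version-1.5.tar' |
sed 's_^\(\([^/]*/\)*\)\(.*\)\(\.[^\./]*\)$_\4/\3/\1_' |
sed 's_^\(\([^/]*/\)*\)\([^\./]\+\)$_/\3/\1_' |
sort -t/ -k1,1 -k2,2 -k3,3 |
sed 's_^\([^/]*\)/\([^/]*\)/\(.*\)$_\3\2\1_'
# prints Makefile first, then src/main.c, src/util.h, version-1.5.tar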
P.S. If anyone wants my original comparison function, or thinks it could be improved, here it is:
sub comparison {
    my $first  = $a;
    my $second = $b;
    # split each path into directory part and file name
    my $fdir  = $first  =~ s/^(([^\/]*\/)*)([^\/]*)$/$1/r;
    my $sdir  = $second =~ s/^(([^\/]*\/)*)([^\/]*)$/$1/r;
    my $fname = $first  =~ s/^([^\/]*\/)*([^\/]*)$/$2/r;
    my $sname = $second =~ s/^([^\/]*\/)*([^\/]*)$/$2/r;
    # split each file name into base name and extension
    my $fbase = $fname =~ s/^(([^\.]*\.)*)([^\.]*)$/$1/r;
    my $sbase = $sname =~ s/^(([^\.]*\.)*)([^\.]*)$/$1/r;
    my $fext  = $fname =~ s/^([^\.]*\.)*([^\.]*)$/$2/r;
    my $sext  = $sname =~ s/^([^\.]*\.)*([^\.]*)$/$2/r;
    # files without an extension (empty base) sort first
    if ($fbase eq "" && $sbase ne "") { return -1; }
    if ($sbase eq "" && $fbase ne "") { return 1; }
    # otherwise compare by extension, then base name, then directory
    (($fext cmp $sext) or ($fbase cmp $sbase)) or ($fdir cmp $sdir);
}