I found out that if you sort the list of files by extension rather than alphabetically before putting them into a tar archive, you can significantly improve the compression ratio (especially for large source trees, where you probably have lots of .c, .o and .h files).
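(If you want to measure the effect on your own tree, a quick before/after comparison might look like the following; xz here is just an example, any compressor will do:)

# archive size in bytes: plain find order vs. rev|sort|rev extension grouping
find . -type f | tar --no-recursion -T - -cf - | xz -c | wc -c
find . -type f | rev | sort | rev | tar --no-recursion -T - -cf - | xz -c | wc -c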
I could not find an easy way to sort files from the shell that works the way I expect in every case. An easy solution like find | rev | sort | rev does the job, but the files come out in an odd order that is not the best grouping for compression. Other tools, such as ls -X, do not work with find output, and sort -t. -k 2,2 -k 1,1 gets confused when a file name contains more than one period (for example, version-1.5.tar). Another quick-and-dirty option uses sed to replace the last period with / (which never appears inside a file name), sorts splitting on /, and then restores the name:
sed 's/\(\.[^.]*\)$/\/\1/' | sort -t/ -k 2,2 -k 1,1 | sed 's/\/\([^/]*\)$/\1/'
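On bare file names this behaves as intended, for example:

printf '%s\n' 'main.c' 'version-1.5.tar' 'main.h' | sed 's/\(\.[^.]*\)$/\/\1/' | sort -t/ -k 2,2 -k 1,1 | sed 's/\/\([^/]*\)$/\1/'
# prints: main.c, main.h, version-1.5.tar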
Again, however, this does not work on find output, which does have / in its names; and any other character (except NUL) is allowed in file names on *nix.
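A two-line demonstration of the breakage (the directory component shifts the sort fields, so src/main.c is compared on main rather than on .c):

printf '%s\n' 'src/main.c' 'main.h' | sed 's/\(\.[^.]*\)$/\/\1/' | sort -t/ -k 2,2 -k 1,1 | sed 's/\/\([^/]*\)$/\1/'
# prints main.h before src/main.c; the .c extension never took part in the comparison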
I found that in Perl you can write your own comparison routine that returns the same values as cmp (similar to strcmp in C) and pass it to Perl's sort function, and such a comparison was easy to write with Perl regular expressions. That is exactly what I did: I now have a Perl script that does
@lines = <STDIN>;
print sort myComparisonFunction @lines;
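which is then used as a filter: find -type f | perl exsort.pl (exsort.pl being the name I saved it under, used again below).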
However, Perl is not as portable as bash, so I want this to work as a shell script. Also, find does not append a trailing slash to directory names, so the script treats directories as files without an extension. Ideally, I would like tar to get all directories first, then regular files (sorted), and then symbolic links, which I can achieve with
cat <(find -type d) <(find -type f | perl exsort.pl) <(find -not -type d -and -not -type f) | tar --no-recursion -T - -cvf myfile.tar
but I am still stuck with the problem that either I have to type this monster every time, or I keep both a shell script for this long line and a Perl script for the sorting; and since Perl is not available everywhere, squashing it all into a single Perl script is not a great solution either. (I am mostly concerned about older machines; all current Linux distributions and OS X ship with a fairly modern version of Perl.)
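(For completeness, a minimal sketch of that two-file workaround, assuming the wrapper is saved as tarsort.sh next to exsort.pl; tarsort.sh is just a placeholder name, and a grouped command replaces the bash-only process substitution:)

#!/bin/sh
# tarsort.sh -- hypothetical wrapper for the long pipeline above;
# writes the archive named by $1 and expects exsort.pl in the same directory
dir=$(dirname "$0")
{
  find . -type d
  find . -type f | perl "$dir/exsort.pl"
  find . ! -type d ! -type f
} | tar --no-recursion -T - -cvf "$1"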
I would like to combine everything into a single shell script, but I do not know how to pass a user-defined comparison function to GNU sort. Am I out of luck and stuck with the Perl script, or can this be done in a single shell script?
EDIT: Thanks for the Schwartzian transform idea. I ended up using a slightly different approach based on sed. My final sorting pipeline is as follows:
sed 's_^\(\([^/]*/\)*\)\(.*\)\(\.[^\./]*\)$_\4/\3/\1_' | sed 's_^\(\([^/]*/\)*\)\([^\./]\+\)$_/\3/\1_' | sort -t/ -k1,1 -k2,2 -k3,3 | sed 's_^\([^/]*\)/\([^/]*\)/\(.*\)$_\3\2\1_'
This handles special characters (such as *) in file names and puts files without an extension first, since those are often text files (Makefile, COPYING, README, configure, etc.).
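A quick sanity check of the pipeline on a few sample names:

printf '%s\n' 'src/main.c' 'Makefile' 'src/util.h' 'version-1.5.tar' |
sed 's_^\(\([^/]*/\)*\)\(.*\)\(\.[^\./]*\)$_\4/\3/\1_' |
sed 's_^\(\([^/]*/\)*\)\([^\./]\+\)$_/\3/\1_' |
sort -t/ -k1,1 -k2,2 -k3,3 |
sed 's_^\([^/]*\)/\([^/]*\)/\(.*\)$_\3\2\1_'
# prints Makefile first, then src/main.c, src/util.h, version-1.5.tar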
P.S. If anyone wants my original comparison function, or thinks it could be improved, here it is:
sub comparison {
    my $first  = $a;
    my $second = $b;
    # split each path into directory part and file name
    my $fdir  = $first  =~ s/^(([^\/]*\/)*)([^\/]*)$/$1/r;
    my $sdir  = $second =~ s/^(([^\/]*\/)*)([^\/]*)$/$1/r;
    my $fname = $first  =~ s/^([^\/]*\/)*([^\/]*)$/$2/r;
    my $sname = $second =~ s/^([^\/]*\/)*([^\/]*)$/$2/r;
    # split each file name into base name and extension
    my $fbase = $fname =~ s/^(([^\.]*\.)*)([^\.]*)$/$1/r;
    my $sbase = $sname =~ s/^(([^\.]*\.)*)([^\.]*)$/$1/r;
    my $fext  = $fname =~ s/^([^\.]*\.)*([^\.]*)$/$2/r;
    my $sext  = $sname =~ s/^([^\.]*\.)*([^\.]*)$/$2/r;
    # files without an extension (empty base) sort first
    if ($fbase eq "" && $sbase ne "") { return -1; }
    if ($sbase eq "" && $fbase ne "") { return 1; }
    # otherwise compare by extension, then base name, then directory
    (($fext cmp $sext) or ($fbase cmp $sbase)) or ($fdir cmp $sdir);
}