Can I sort a huge text file using the Linux sort command by the number at the end of each line?

I am trying to sort a text file where the lines are in the following format:

! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 6 

and I want to sort it numerically, in decreasing order, by the number at the end (6 in this example). Lines do not have a predictable number of columns when split on spaces, but when split on ||| there are always 5 columns, and the last column always contains 3 space-separated numbers, the last of which is the sort key. The text file is about 15 GB. I have a Perl script that I wrote for this, but it only worked on my old laptop, which had 32 GB of RAM, because the script loads the whole file at once. Now I'm stuck with 8 GB of RAM, and it just thrashes the swap file for days. I have heard that the standard Linux sort command handles huge files more gracefully, but I cannot find a way to get it to use the number at the end.

+7
sorting linux perl
4 answers

It may be a little complicated, but this combination of commands can do this:

 awk '$1=$NF" "$1' file | sort -n | cut -d' ' -f2- 

The main idea is to print the file with the last value prepended to each line, then sort, and finally remove that value from the output.

  • awk '$1=$NF" "$1' file: since the value you want to sort on is the last field, also print it as the first field.
  • sort -n: then pipe the result to sort -n, which sorts numerically in increasing order (use sort -rn for the decreasing order asked for in the question).
  • cut -d' ' -f2-: finally, remove the value that we temporarily added.
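
The same decorate, sort, undecorate steps can be sketched so that they give decreasing order and leave the original spacing of each line untouched. This is only a sketch: the function name sort_by_last_desc is made up for illustration, and it assumes the last field is a plain number (sort -n does not parse exponent notation such as 1e+07; GNU sort has -g for that). A tab is used as the separator between the temporary key and the line, so awk never rebuilds the line.

```shell
# Hypothetical helper: print the lines of file "$1" ordered by their
# last whitespace-separated field, largest number first.
sort_by_last_desc() {
    tab=$(printf '\t')
    # 1. awk:  prepend the last field and a tab; $0 itself is not modified.
    # 2. sort: use only the first tab-separated field as key, numeric, reversed.
    # 3. cut:  drop the temporary key (everything up to the first tab).
    awk '{ print $NF "\t" $0 }' "$1" |
        sort -t "$tab" -k1,1nr |
        cut -f2-
}
```

It would be called as sort_by_last_desc file > sorted.txt. GNU sort spills to temporary files on its own when the input does not fit in memory; its -S and -T options set the memory cap and the temporary directory.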

Test

 $ cat a
 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 6
 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 79
 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 19
 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 8
 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 89
 $ awk '$1=$NF" "$1' a | sort -n | cut -d' ' -f2-
 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 6
 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 8
 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 19
 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 79
 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 89

Display of each step:

 $ awk '$1=$NF" "$1' a
 6 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 6
 79 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 79
 19 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 19
 8 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 8
 89 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 89

 $ awk '$1=$NF" "$1' a | sort -n
 6 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 6
 8 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 8
 19 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 19
 79 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 79
 89 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 89

 $ awk '$1=$NF" "$1' a | sort -n | cut -d' ' -f2-
 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 6
 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 8
 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 19
 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 79
 ! ! ! ! ! ||| ! ||| 1.25846e-05 0.248369 3.02708e-07 0.662955 2.718 ||| 0-0 1-0 2-0 3-0 4-0 ||| 476773 1.98211e+07 89
+4

It seems that you want to order the file according to the last number, right?

So you can duplicate the last field at the beginning of the line with awk (no -F option is needed, because the fields are separated by spaces, not commas):

 awk '{ print $NF, $0 }' prova 

then sort the file numerically on that first field (add -r for decreasing order):

 sort -n -k1 

and finally remove the fake first field:

 sed 's/^[0-9][0-9]* //' 

Here is the whole pipeline:

 awk '{ print $NF, $0 }' prova | sort -n -k1 | sed 's/^[0-9][0-9]* //' 
+1

Since the problem is RAM, perhaps you can reduce the memory required by using Tie::File. It lets you refer to each line of the file by its index in an array. You can extract the numbers to sort on and use a Schwartzian transform to get a sorted list of indexes, and then just rewrite the file at the end.

 use strict;
 use warnings;
 use Tie::File;

 my $file = shift;                            # your filename argument
 tie my @lines, 'Tie::File', $file or die $!;

 my @list = map $_->[0],                      # restore line number
     sort { $b->[1] <=> $a->[1] }             # sort on captured number
     map { [ $_, $lines[$_] =~ /(\d+)$/ ] } 0 .. $#lines;
     # store an array ref [ ... ] containing line number and number to
     # sort by

 @lines = @lines[@list];

The last operation writes the lines back to the file in sorted order. Please note that this is a permanent change to the file, so make backups. It is probably an expensive operation, and Tie::File has some performance issues. Another way to do this, probably less expensive, is to simply iterate over the list of indexes and print line by line into a new file:

 open my $fh, ">", "output.csv" or die $!;
 for my $num (@list) {
     print $fh $lines[$num], $/;
 }

Printing directly to the file avoids any buffering the shell would need in order to redirect the output.

0

Assuming you are allowed to modify the source file (make a copy otherwise), you can make sort work on the last column by passing over the file once and turning the last column into a predictable column number. I use the @ character as something that I assume does not appear in your data. Any other character can be substituted if that is a bad guess.

 sed -i 's/ /@/g; s/@\([^@]*\)$/ \1/;' in.txt
 # each line of the file now looks like "!@!@|||@whatever@||| 6"
 sort --buffer-size=1G -nk 2 in.txt | sed 's/@/ /g' > sorted.txt 
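
If modifying the source file is not acceptable, the same @ trick can be run as a stream instead. The following is a minimal sketch under assumptions, not a tested recipe for the 15 GB file: the function name sort_last_field_streaming and its optional arguments are made up for illustration, GNU sort is assumed for -S (memory cap) and -T (temporary directory), and @ is again assumed not to occur in the data. It strips trailing whitespace first, so that the key really is the last thing on the line, and it sorts in decreasing order as the question asks.

```shell
# Hypothetical helper: sort by the last field, largest first, without
# touching the input file. Assumes GNU sort and no "@" in the data.
#   $1  input file
#   $2  memory cap for sort (default 1G)
#   $3  directory for sort's temporary files (default $TMPDIR or /tmp)
sort_last_field_streaming() {
    infile=$1
    bufsize=${2:-1G}
    tmpdir=${3:-${TMPDIR:-/tmp}}
    # 1. sed:  drop trailing whitespace, turn every space into "@", then
    #          turn the last "@" back into a space, so the key is field 2.
    # 2. sort: numeric, reversed, on field 2 only.
    # 3. sed:  restore the spaces.
    sed 's/[[:space:]]*$//; s/ /@/g; s/@\([^@]*\)$/ \1/' "$infile" |
        sort -S "$bufsize" -T "$tmpdir" -k2,2nr |
        sed 's/@/ /g'
}
```

It would be called as sort_last_field_streaming in.txt 1G /path/with/space > sorted.txt, where the third argument should point at a disk with enough free room for sort's temporary files. Note that the output lines lose any trailing whitespace the input had.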
0