Sort across multiple files on Linux

I have several (many) files; each is very large:

file0.txt file1.txt file2.txt 

I do not want to join them into one file, because the resulting file would be 10+ GB. Each line in each file is 40 bytes long. The lines are already fairly well ordered (roughly 1 step in 10 is a decrease in value instead of an increase).

I would like the lines to be sorted across all the files (in place, if possible?). This means that some lines from the end of file0.txt will move to the beginning of file1.txt, and vice versa.

I am working on Linux and am quite new to this. I know about the sort command for a single file, but is there a way to sort across multiple files? Or maybe there is a way to make a pseudo-file out of the small files that Linux will treat as a single file?

What I know I can do: I can sort each file individually, then read through file1.txt to find the first value greater than the largest value in file0.txt (and similarly grab lines from the end of file0.txt), join and sort... but it's a pain, and it doesn't move values from file2.txt back to file0.txt (even if that's unlikely in my case).
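For what it's worth, the "single pseudo-file" idea can be approximated by simply streaming everything through one pipeline; a minimal sketch, assuming GNU sort (which spills to temporary files when the data does not fit in memory) and accepting that this rewrites all 10+ GB:

 # Naive one-stream approach: concatenate, sort the whole stream, split back into pieces.
 # GNU sort handles inputs larger than RAM by using temporary files on disk.
 cat file*.txt | sort | split -d -l 1000000 - sorted_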

Edit

To be clear, if the files look like this:

 f0.txt:
 DDD
 XXX
 AAA

 f1.txt:
 BBB
 FFF
 CCC

 f2.txt:
 EEE
 YYY
 ZZZ

I want this:

 f0.txt:
 AAA
 BBB
 CCC

 f1.txt:
 DDD
 EEE
 FFF

 f2.txt:
 XXX
 YYY
 ZZZ
5 answers

I don't know of a command that does in-place sorting across files, but I think a faster "merge sort" approach is possible:

 for file in *.txt; do
     sort -o "$file" "$file"
 done
 sort -m *.txt | split -d -l 1000000 - output
  • The sort in the for loop ensures that the contents of each input file are sorted. If you do not want to overwrite the originals, change the value given after the -o option. (If you expect the files to be sorted already, you could change the sort statement to a check-only one, sort -c $file || exit 1; see the combined sketch after this list.)
  • The second sort -m efficiently merges the already-sorted input files, keeping the output sorted.
  • The result is piped to split, which writes it out to numerically-suffixed output files (output00, output01, ...). Note the - character; it tells split to read from standard input (i.e. the pipe) instead of from a file.
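As a concrete variation on the first bullet, here is a minimal sketch that keeps the originals untouched by writing sorted copies instead (the sorted_ prefix is just an assumed name), with the check-only form shown as an alternative:

 # Write sorted copies instead of overwriting the originals (the sorted_ prefix is arbitrary).
 for file in file*.txt; do
     sort -o "sorted_$file" "$file"
 done

 # Alternatively, if the inputs are expected to be sorted already, just verify them:
 # for file in file*.txt; do sort -c "$file" || exit 1; done

 # Merge the sorted copies and split the result into 1,000,000-line pieces (output00, output01, ...).
 sort -m sorted_file*.txt | split -d -l 1000000 - output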

Here is a brief description of how the merge works:

  • sort reads a line from each file.
  • It orders these lines and selects the one that should come first. That line is sent to the output, and a new line is read from the file it came from.
  • It repeats step 2 until there are no more lines in any of the files.
  • At this point, the output should be a perfectly sorted file.
  • Profit!
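To make the merge concrete with the tiny example from the question (file names f0.txt, f1.txt, f2.txt as given there), the following sketch sorts each file and then merges them; the sorted_ prefix for the split output is an arbitrary choice:

 # Sort each small example file in place, then merge them.
 for f in f0.txt f1.txt f2.txt; do sort -o "$f" "$f"; done
 sort -m f0.txt f1.txt f2.txt
 # Merged order: AAA BBB CCC DDD EEE FFF XXX YYY ZZZ

 # Splitting the merged stream into 3-line pieces reproduces the desired layout
 # (sorted_00, sorted_01, sorted_02 with -d numeric suffixes).
 sort -m f0.txt f1.txt f2.txt | split -d -l 3 - sorted_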

This is not exactly what you requested, but the sort(1) utility may help here with its --merge option. Sort each file individually, then merge the resulting bunch of sorted files:

 for f in file*.txt ; do sort -o "$f" < "$f" ; done
 sort --merge file*.txt | split -l 100000 - sorted_file

(This is 100,000 lines per output file. Perhaps this is still too small.)
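If each output file should be roughly the same length as the originals, the split size can be taken from one of the inputs rather than hard-coded; a small sketch, assuming the input files all have about the same number of lines:

 # Use the line count of the first input as the split size (wc -l just counts lines).
 lines=$(wc -l < file0.txt)
 sort --merge file*.txt | split -l "$lines" - sorted_file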


I believe this is your best bet using standard Linux utilities:

  • sort each file separately, e.g. for f in file*.txt; do sort "$f" > "sorted_$f"; done

  • merge everything using sort -m sorted_file*.txt | split -d -l <lines> - <prefix>, where <lines> is the number of lines per output file and <prefix> is the output file name prefix (-d tells split to use numeric suffixes).

The -m option tells sort that the input files are already sorted, so it can simply merge them instead of doing a full sort.
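Filled in with made-up values, the two steps might look like the sketch below; chunk_ and the 1,000,000-line size are arbitrary choices, and with -d the output files come out as chunk_00, chunk_01, and so on:

 # Step 1: write a sorted copy of each input file.
 for f in file*.txt; do sort "$f" > "sorted_$f"; done

 # Step 2: merge the sorted copies and split into numbered 1,000,000-line pieces.
 sort -m sorted_file*.txt | split -d -l 1000000 - chunk_

 # The sorted_ copies are only intermediates and can be removed afterwards.
 rm sorted_file*.txt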


mmap() the 3 files; since all lines are 40 bytes long, you can easily sort them in place (SIP :-). Don't forget msync() at the end.


If the files are each sorted separately, you can use sort -m file*.txt to merge them: it reads the first line of each file, prints the smallest, and repeats.
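As a quick sanity check, the merged stream can be piped back through sort in check mode; a sketch, relying on the fact that sort -c reads standard input when given no file and exits non-zero if anything is out of order:

 # Merge and verify: sort -c prints nothing and exits 0 if the stream is ordered.
 sort -m file*.txt | sort -c && echo "merged output is sorted"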

