Comparing two large files

I need to write a program that writes the difference between two files to a result file. The program should loop over a 600 MB file with more than 13,464,448 lines, check with grep whether each line occurs in another file, and then write the result to a third file. I wrote a quick test with about 1,000,000 entries and it took more than an hour, so I estimate this approach could take 9 hours.

Do you have any recommendations on how to do this faster? Any specific language I should use? I planned to do this in bash or python.

Thank you very much in advance.

[EDIT 1]: Sorry, when I say the difference between the two files, I did not mean diff. The result file is in a different format.

The logic is a bit like this:

File A has 297,599 lines. File B has over 13 million lines.

I take the current line read from file A, grep for it in file B, and if the line is present in file B, I write it to the result file. Note that files A and B have different formats; the result file will be in the format of file A.

[EDIT 2]: I was asked to create the solution ideally in bash, so that we do not need to install Python on all the machines this has to run on.

This is my current implementation:

    #!/bin/bash

    LAST_TTP=`ls -ltr TTP_*.txt | tail -1 | awk '{ print $9 }'`
    LAST_EXP=`ls -ltr *.SSMT | tail -1 | awk '{ print $9 }'`

    while read -r line; do
        MATCH="$(grep $line $LAST_EXP)"
        echo "line: $line, match: $MATCH"

        # if not empty
        if [ ! -z "$MATCH" ]
        then
            echo $MATCH >> result
        fi
    done < $LAST_TTP

This bash approach takes more than 10 hours. Do you have any suggestions on how to make it more efficient in bash?

Many thanks!

+4
7 answers

You are probably looking lines up in a list instead of a set, which leads to O(n²) performance. Try:

    with open('b') as b:
        blines = set(b)

    with open('a') as a:
        with open('result', 'w') as result:
            for line in a:
                if line not in blines:
                    result.write(line)

Assuming lines of roughly uniform (and not excessive) length, the performance of this implementation is in O(|A| + |B|) (amortized, since lookups in a Python set are extremely fast). The memory requirement is in O(|B|), but with a constant factor well above 1.

If the order of the lines in the output does not matter, you can also sort both files and then compare them line by line. This has a performance on the order of O(|A| log |A| + |B| log |B|). The memory requirement is in O(|A| + |B|), more precisely |A| + |B|.

+9

Sort each input file. Now read one line from each. If one line compares smaller than the other, output it as a difference and read the next line from that file. If both lines compare equal, read the next line from both files. When you reach the end of one file, all remaining lines in the other file are differences.

This is an O(n log n) algorithm, unlike the O(n²) algorithm you started with.
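
Since the question asks for bash, here is a minimal bash sketch of that merge loop. It assumes both files have already been sorted with the same collation (e.g. LC_ALL=C sort) into sorted_a.txt and sorted_b.txt, which are placeholder names, and it only prints the A-side differences:

    #!/bin/bash
    # Minimal sketch: merge-compare two pre-sorted files, print lines only in A.
    export LC_ALL=C                       # use the same collation the files were sorted with

    exec 3< sorted_a.txt 4< sorted_b.txt  # placeholder file names

    IFS= read -r a <&3; more_a=$?
    IFS= read -r b <&4; more_b=$?

    while [ $more_a -eq 0 ] && [ $more_b -eq 0 ]; do
        if [[ "$a" == "$b" ]]; then       # same line in both files: not a difference
            IFS= read -r a <&3; more_a=$?
            IFS= read -r b <&4; more_b=$?
        elif [[ "$a" < "$b" ]]; then      # line occurs only in A: report it
            printf '%s\n' "$a"
            IFS= read -r a <&3; more_a=$?
        else                              # line occurs only in B: skip it
            IFS= read -r b <&4; more_b=$?
        fi
    done

    # anything left in A once B is exhausted is also a difference
    while [ $more_a -eq 0 ]; do
        printf '%s\n' "$a"
        IFS= read -r a <&3; more_a=$?
    done

    exec 3<&- 4<&-

In practice, the sort/comm/join answers below implement the same idea without the hand-written loop.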

+4

Your whole implementation amounts to:

 grep --fixed-strings --file=file_B file_A > result_file 

But the answers by @phihag and @Mark Ransom are the better solutions.

In addition, if you want to bring out the heavy guns, your problem is a good candidate for map-reduce frameworks such as Hadoop.

+2

The join command also does what you want. join requires both files to be sorted beforehand. The -v option prints a line for each unpairable line.

So you need something like

join -v 1 sortedfile1 sortedfile2

(you will need to set the join fields according to your file format, see the join man page)

In the example below, the files test1.txt and test2.txt are joined using the second and the first field, respectively. The files are assumed to have been sorted in advance with the sort command. The -v 1 option prints only the lines of test1.txt that cannot be joined.

  > cat test1.txt
 a 1 2
 b 2 3

 > cat test2.txt
 1 x
 4 x

 > join -v 1 -1 2 -2 1 test1.txt test2.txt
 2 b 3

 > join -v 1 -1 2 -2 1 -o 1.1,1.2,1.3 test1.txt test2.txt
 b 2 3

Sorting and joining both work fine with large files.
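
Putting the two steps together for the files in the question might look like the sketch below. The names file_A and file_B are placeholders, and the join field (-1 1 -2 1, i.e. the first field of each file) is an assumption you would adjust to the real formats; LC_ALL=C keeps sort and join using the same collation:

    # hypothetical sketch: sort both files once, then join on the chosen key fields
    LC_ALL=C sort -k1,1 file_A > file_A.sorted
    LC_ALL=C sort -k1,1 file_B > file_B.sorted

    # -v 1 prints the lines of file_A.sorted that have no partner in file_B.sorted
    LC_ALL=C join -v 1 -1 1 -2 1 file_A.sorted file_B.sorted > result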

+2

You can speed up the script a bit by stopping grep when it finds the first match, if that suits your needs.

If your grep supports it, use grep -m 1.
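
Applied to the loop in the question, that would mean changing the grep line to something like the sketch below. The -F and the quoting are additional suggestions, not in the original script: -F treats the line as a fixed string rather than a regex, and the quotes protect lines containing spaces.

    MATCH="$(grep -m 1 -F -- "$line" "$LAST_EXP")"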

Your real problem is that you spawn grep almost 300,000 times, and each invocation reads over 13,000,000 lines.

Stopping grep at the first match will help a little, but the overhead of all those execs is also a big factor. An implementation in Python would fix this problem.

Regarding the way your script selects the newest files, see BashFAQ/003 and Parsing ls.
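
For reference, the approach those pages recommend for picking the newest file, applied to the TTP_*.txt files from the question, is roughly this sketch:

    # pick the newest TTP_*.txt by modification time, without parsing ls output
    # (assumes at least one matching file exists)
    LAST_TTP=
    for f in TTP_*.txt; do
        [[ -z $LAST_TTP || $f -nt $LAST_TTP ]] && LAST_TTP=$f
    done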

+2

I would consider the comm command. It should be faster than grep because it works with sorted data, while grep always does a linear search.

 comm -2 -3 <(sort file1) <(sort file2) 
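
Note that comm -2 -3 prints only the lines unique to file1. If, as EDIT 1 suggests, you actually want the lines that occur in both files, selecting the common column instead is a small change; a sketch (this only works if matching lines are byte-identical in both files, which the differing formats of A and B may rule out):

    comm -12 <(sort file1) <(sort file2)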
0

You can also use awk:

awk 'NR==FNR { arrA[$0]; next } $0 in arrA' file_A file_B > result

The order of the files (... file_A file_B) on the command line is very important: NR==FNR is only true while the first file, file_A, is being read, so its lines are stored as keys of arrA; for every line of the second file, file_B, the line is printed if it is present in arrA.

0

Source: https://habr.com/ru/post/1414916/

