I need to write a program that writes the difference between two files to a third file. The program should loop over a 600 MB file with more than 13,464,448 lines, grep each line against another file, and write the result to an output file. I wrote a quick test with about 1,000,000 entries and it took more than an hour, so I guess this approach could take 9 hours.
Do you have any recommendations on how to do this faster? Any specific language I should use? I planned to do this in bash or python.
Thank you very much in advance.
[EDIT 1]: Sorry, when I say the difference between the two files, I did not mean diff. The result file is in a different format.
The logic is a bit like this:
File A has 297,599 lines; file B has over 13 million lines.
I take each line read from file A and grep for it in file B; if the line is present in file B, I write it to the result file. By the way, files A and B have different formats. The result file will be in file A's format.
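For what it's worth, if the match can be reduced to a shared key, a single awk pass over both files avoids spawning one grep process per line of file A. This is only a sketch: it assumes the key is the first whitespace-separated field of both files (which may not hold for your formats), and `fileA`/`fileB`/`result.txt` are placeholder names.

```shell
#!/bin/sh
# Sketch, assuming lines of A and B share a key in field 1.
# First pass (NR==FNR): store each line of the small file A under its key.
# Second pass: for each line of B whose key was seen in A, print the
# stored A line once (delete prevents duplicates), so the result stays
# in file A's format.
awk 'NR==FNR { a[$1] = $0; next }
     ($1 in a) { print a[$1]; delete a[$1] }' fileA fileB > result.txt
```

Only file A (the small file) is held in memory, and file B is read exactly once, so this scales roughly linearly with the size of B.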
[EDIT 2]: I was asked to ideally create a bash solution, so that we don't need to install Python on all the machines this will run on.
This is my current implementation:
```shell
#!/bin/bash
# newest TTP_*.txt file (file A) and newest *.SSMT file (file B)
LAST_TTP=$(ls -tr TTP_*.txt | tail -1)
LAST_EXP=$(ls -tr *.SSMT | tail -1)

# for every line of file A, search file B
while read -r line; do
    MATCH="$(grep "$line" "$LAST_EXP")"
    echo "line: $line, match: $MATCH"
done < "$LAST_TTP"
```
This bash approach takes more than 10 hours. Do you have any suggestions on how to make it more efficient in bash?
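One thing to try (a sketch, not tested against your real data): since file A is the small one, grep can take all of its lines as fixed-string patterns at once with `-f` and `-F`, scanning file B in a single pass instead of once per line. With `-o` it prints the matched text itself, which, assuming each line of A occurs verbatim as a substring somewhere in B, is exactly the A-format line you want:

```shell
#!/bin/sh
# Sketch: treat every line of file A as a fixed-string pattern (-F -f),
# scan file B once, and print the matched text (-o), i.e. the A-side
# line; sort -u removes duplicates from repeated matches.
# "$LAST_TTP" is file A and "$LAST_EXP" is file B, as in the script above.
grep -oF -f "$LAST_TTP" "$LAST_EXP" | sort -u > result.txt
```

`-F` (fixed strings, no regex engine) is usually the biggest win here; one caveat is that any empty line in file A would match everything, so it may be worth filtering those out first.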
Many thanks!