I process large text files (~20 MB) containing one data record per line. Most records are duplicated, and I want to remove the duplicates so that only one copy of each remains.
To complicate matters, some records are repeated with an extra bit of information appended. In that case, I need to keep the record containing the extra information and delete the older, shorter versions.
E.g., I need to go from this:
BOB 123 1DB
JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB
JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB EXTRA MONEY
to this:
JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB EXTRA MONEY
NB: the final order does not matter.
What is an effective way to do this?
I can use awk, Python, or any standard Linux command-line tool.
Thanks.
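To make the requirement concrete, here is a minimal Python sketch of one possible approach, not a definitive solution. It assumes that records are whitespace-separated fields, one record per line, and that a record counts as an "old version" when its fields are a leading prefix of another record's fields. The function name `dedupe_records` is hypothetical, and runs of whitespace are normalised to single spaces in the output.

```python
def dedupe_records(lines):
    """Return unique records, dropping any record that is a field-wise
    prefix of a longer record (assumption: extra info is only appended)."""
    # Collapse exact duplicates; each record is stored as a tuple of fields.
    records = set()
    for line in lines:
        fields = tuple(line.split())
        if fields:  # skip blank lines
            records.add(fields)

    # Every proper prefix of a record is considered superseded by it.
    superseded = set()
    for fields in records:
        for i in range(1, len(fields)):
            superseded.add(fields[:i])

    # Keep only records that no longer record extends.
    # The result order is arbitrary, which the question allows.
    return [" ".join(fields) for fields in records if fields not in superseded]
```

It could be applied to a file by passing the open file object, for example `dedupe_records(open("data.txt"))`, where `data.txt` is a placeholder name. Everything is held in memory, which should be acceptable for files of around 20 MB.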