Removing duplicate lines from a txt file

I process large text files (~20 MB each) containing records, one per line. Most records are duplicated, and I want to remove the duplicates so that only one copy of each remains.

To make the problem a little more complicated, some records are repeated with an extra bit of information appended. In this case I need to keep the record containing the extra information and delete the older versions.

e.g. I need to go from this:

BOB 123 1DB
JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB
JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB EXTRA BITS
to this:
JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB EXTRA BITS
NB. the final order does not matter.

What is an efficient way to do this?

I can use awk, Python, or any standard Linux command-line tool.

Thanks.

+5

A solution (in Python):

prev = None
for line in sorted(open('file')):
    line = line.strip()
    # After sorting, any record that starts with prev sorts immediately
    # after it, so prev is printed only when the next line does not
    # start with it.
    if prev is not None and not line.startswith(prev):
        print prev
    prev = line
if prev is not None:
    print prev

If sorting the whole file in memory is a problem, you can sort it beforehand with the Unix sort command (which copes with files larger than memory) and drop the sorted() call from the script.
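
For example, here is a sketch of that variant reading already-sorted lines from standard input, so it could be run as sort file | python dedupe.py (dedupe.py being a hypothetical name for the script):

import sys

# Assumes the input is already sorted (e.g. by Unix sort), so any record
# that extends the previous one appears immediately after it.
prev = None
for line in sys.stdin:
    line = line.rstrip('\n')
    if prev is not None and not line.startswith(prev):
        print(prev)
    prev = line
if prev is not None:
    print(prev)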

+12

awk '{x[$1 " " $2 " " $3] = $0} END {for (y in x) print x[y]}'

Or, parameterizing the number of key columns and keeping the longest record for each key:

awk -v ncols=3 '
  {
    key = "";
    for (i=1; i<=ncols; i++) {key = key FS $i}
    if (length($0) > length(x[key])) {x[key] = $0}
  }
  END {for (y in x) print y "\t" x[y]}
'
+3

In Python, using a dict keyed on the first three fields:

from pprint import pprint

finalData = {}
for line in open('file'):
    parts = line.split()
    key, extra = tuple(parts[:3]), parts[3:]
    # A record with extra fields always wins over the bare version.
    if key not in finalData or extra:
        finalData[key] = extra

pprint(finalData)

Output:

{('BOB', '123', '1DB'): ['EXTRA', 'BITS'],
 ('DAVE', '789', '1DB'): [],
 ('JIM', '456', '3DB'): ['AX']}
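
If you then want flat records back rather than a dict, a short loop over finalData rebuilds them, e.g.:

# Rebuild each line from the key fields plus any extra fields.
for key, extra in finalData.items():
    print(' '.join(key + tuple(extra)))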
+2

A variation of glenn jackman's answer:

awk '{idx = $1 " " $2 " " $3; if (length($0) > length(x[idx])) x[idx] = $0} END {for (idx in x) print x[idx]}' inputfile

awk -v ncols=3 '
  {
    key = "";
    for (i=1; i<=ncols; i++) {key = key FS $i}
    if (length($0) > length(x[key])) x[key] = $0
  }
  END {for (y in x) print x[y]}
' inputfile
+2

Write a function that splits each line into the important bits and the extra bits, then:

def split_extra(s):
    """Return a pair: the important bits and the extra bits."""
    # One possible implementation (an assumption): the first three fields
    # are the key; the rest of the line, newline included, is the extra part.
    fields = s.split(None, 3)
    return ' '.join(fields[:3]), (' ' + fields[3] if len(fields) > 3 else '\n')

data = {}
for line in open('file'):
    impt, extra = split_extra(line)
    existing = data.setdefault(impt, extra)
    if len(extra) > len(existing):
        data[impt] = extra

out = open('newfile', 'w')
for impt, extra in data.iteritems():
    out.write(impt + extra)
+1

Since you need the extra bits, the fastest way is to create a set of unique records (sort -u will do) and then compare each record against the others, e.g.:

if x.startswith(y) and not y.startswith(x)
then keep x and discard y.
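
A minimal sketch of that idea in Python, with an in-memory set standing in for sort -u ('file' is the input, as in the other answers):

unique = set(line.rstrip('\n') for line in open('file'))

# Keep only the records that are not a proper prefix of some other record.
kept = [y for y in unique
        if not any(x.startswith(y) and not y.startswith(x) for x in unique)]

for record in kept:
    print(record)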
+1

If you have Perl and want to keep only the last entry for each key:

cat file.txt | perl -ne 'BEGIN{%kws=()} @_ = split(/ /); $kw = shift(@_); $kws{$kw} = "@_"; END{ foreach (sort keys %kws) { print "$_ $kws{$_}"; } }' > file.new.txt
+1

The function find_unique_lines will work for a file object or a list of lines.

import itertools

def split_line(s):
    parts = s.strip().split(' ')
    return " ".join(parts[:3]), parts[3:], s

def find_unique_lines(f):
    result = {}
    for key, data, line in itertools.imap(split_line, f):
        # A line with extra data always overwrites the stored line;
        # a plain line is kept only if its key is new.
        if data or key not in result:
            result[key] = line
    return result.itervalues()

test = """BOB 123 1DB
JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB
JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB EXTRA BITS""".split('\n')

for line in find_unique_lines(test):
    print line

BOB 123 1DB EXTRA BITS
JIM 456 3DB AX
DAVE 789 1DB
+1
