Removing duplicate lines from a txt file

I process large text files (~20 MB each) containing records, one per line. Most records are duplicated, and I want to remove the duplicates so that only one copy of each remains.

To make the problem a little more complicated, some records are repeated with an extra bit of information appended. In this case I need to keep the record containing the extra information and delete the older versions.

e.g. I need to go from this:

BOB 123 1DB
JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB
JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB EXTRA BITS
to this:
JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB EXTRA BITS
NB. the final order does not matter.

What is an efficient way to do this?

I can use awk, Python, or any standard Linux command-line tool.

Thanks.

+5

A solution (in Python):

prev = None
for line in sorted(open('file')):
    line = line.strip()
    # After sorting, any record that starts with prev sorts immediately
    # after it, so prev is printed only when the next line does not
    # start with it.
    if prev is not None and not line.startswith(prev):
        print prev
    prev = line
if prev is not None:
    print prev

If sorting the whole file in memory is a problem, you can sort it beforehand with the Unix sort command (which copes with files larger than memory) and drop the sorted() call from the script.
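
For example, here is a sketch of that variant reading already-sorted lines from standard input, so it could be run as sort file | python dedupe.py (dedupe.py being a hypothetical name for the script):

import sys

# Assumes the input is already sorted (e.g. by Unix sort), so any record
# that extends the previous one appears immediately after it.
prev = None
for line in sys.stdin:
    line = line.rstrip('\n')
    if prev is not None and not line.startswith(prev):
        print(prev)
    prev = line
if prev is not None:
    print(prev)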

+12

awk '{x[$1 " " $2 " " $3] = $0} END {for (y in x) print x[y]}'

Or, parameterizing the number of key columns and keeping the longest record for each key:

awk -v ncols=3 '
  {
    key = "";
    for (i=1; i<=ncols; i++) {key = key FS $i}
    if (length($0) > length(x[key])) {x[key] = $0}
  }
  END {for (y in x) print y "\t" x[y]}
'
+3

In Python, using a dict keyed on the first three fields:

from pprint import pprint

finalData = {}
for line in open('file'):
    parts = line.split()
    key, extra = tuple(parts[:3]), parts[3:]
    # A record with extra fields always wins over the bare version.
    if key not in finalData or extra:
        finalData[key] = extra

pprint(finalData)

Output:

{('BOB', '123', '1DB'): ['EXTRA', 'BITS'],
 ('DAVE', '789', '1DB'): [],
 ('JIM', '456', '3DB'): ['AX']}
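
If you then want flat records back rather than a dict, a short loop over finalData rebuilds them, e.g.:

# Rebuild each line from the key fields plus any extra fields.
for key, extra in finalData.items():
    print(' '.join(key + tuple(extra)))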
+2

A variation of glenn jackman's answer:

awk '{idx = $1 " " $2 " " $3; if (length($0) > length(x[idx])) x[idx] = $0} END {for (idx in x) print x[idx]}' inputfile

awk -v ncols=3 '
  {
    key = "";
    for (i=1; i<=ncols; i++) {key = key FS $i}
    if (length($0) > length(x[key])) x[key] = $0
  }
  END {for (y in x) print x[y]}
' inputfile
+2

Write a function that splits each line into the important bits and the extra bits, then:

def split_extra(s):
    """Return a pair: the important bits and the extra bits."""
    # One possible implementation (an assumption): the first three fields
    # are the key; the rest of the line, newline included, is the extra part.
    fields = s.split(None, 3)
    return ' '.join(fields[:3]), (' ' + fields[3] if len(fields) > 3 else '\n')

data = {}
for line in open('file'):
    impt, extra = split_extra(line)
    existing = data.setdefault(impt, extra)
    if len(extra) > len(existing):
        data[impt] = extra

out = open('newfile', 'w')
for impt, extra in data.iteritems():
    out.write(impt + extra)
+1

Since you need the extra bits, the fastest way is to create a set of unique records (sort -u will do) and then compare each record against the others, e.g.:

if x.startswith(y) and not y.startswith(x)
then keep x and discard y.
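
A minimal sketch of that idea in Python, with an in-memory set standing in for sort -u ('file' is the input, as in the other answers):

unique = set(line.rstrip('\n') for line in open('file'))

# Keep only the records that are not a proper prefix of some other record.
kept = [y for y in unique
        if not any(x.startswith(y) and not y.startswith(x) for x in unique)]

for record in kept:
    print(record)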
+1

If you have Perl and want to keep only the last entry for each key:

cat file.txt | perl -ne 'BEGIN{%kws=()} @_ = split(/ /); $kw = shift(@_); $kws{$kw} = "@_"; END{ foreach (sort keys %kws) { print "$_ $kws{$_}"; } }' > file.new.txt
+1

The function find_unique_lines will work for a file object or a list of lines.

import itertools

def split_line(s):
    parts = s.strip().split(' ')
    return " ".join(parts[:3]), parts[3:], s

def find_unique_lines(f):
    result = {}
    for key, data, line in itertools.imap(split_line, f):
        # A line with extra data always overwrites the stored line;
        # a plain line is kept only if its key is new.
        if data or key not in result:
            result[key] = line
    return result.itervalues()

test = """BOB 123 1DB
JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB
JIM 456 3DB AX
DAVE 789 1DB
BOB 123 1DB EXTRA BITS""".split('\n')

for line in find_unique_lines(test):
    print line

BOB 123 1DB EXTRA BITS
JIM 456 3DB AX
DAVE 789 1DB
+1
