How to improve the performance of this counting program?

Given a file that looks like this:

1440927 1
1727557 3
1440927 2
9917156 4

The first field is an identifier in range(0, 200000000). The second field is a type in range(1, 5). Types 1 and 2 belong to the general category S1, and types 3 and 4 belong to S2. One identifier can have several records of different types. The file is about 200 MB.

The task is to count the number of identifiers that have a record of type 1 or 2, and the number of identifiers that have a record of type 3 or 4.

My code is:

    import bitarray

    def gen(path):
        # yield (identifier, type) pairs, one per input line
        for line in open(path):
            tmp = line.split()
            id = int(tmp[0])
            yield id, int(tmp[1])

    max_id = 200000000
    S1 = bitarray.bitarray(max_id)
    S2 = bitarray.bitarray(max_id)
    S1.setall(False)   # new bitarrays are uninitialized, so clear them first
    S2.setall(False)

    for id, type in gen(path):
        if type != 3 and type != 4:
            S1[id] = True
        else:
            S2[id] = True

    print S1.count(), S2.count()

Although it gives the right answer, I think it runs a little slowly. What can I do to make it run faster?

EDIT: There are duplicate entries in the file, and I only need to distinguish between S1 (types 1 and 2) and S2 (types 3 and 4). For example, 1440927 1 and 1440927 2 are counted only once, not twice, because they both belong to S1. That is why I have to keep track of the identifiers.
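
To make this concrete, here is a minimal reference sketch of what I am counting, using two plain sets (assuming one record per line and that path points at the data file); for the sample data above it prints 1 2:

    # reference sketch only, not optimized: one set of identifier strings per category
    S1_ids, S2_ids = set(), set()
    for line in open(path):
        id, type = line.split()
        if type in ("1", "2"):
            S1_ids.add(id)
        else:                       # type "3" or "4"
            S2_ids.add(id)
    print len(S1_ids), len(S2_ids)  # prints "1 2" for the sample data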

+7
3 answers

If there is enough memory, you can use dict instead of bitarray.bitarray. It could be faster:

    S1, S2 = {}, {}  # dicts are slightly faster than `set()`
    with open(path) as f:
        for i, line in enumerate(f, 1):
            # strip the trailing newline so the type comparisons below match
            id, sep, type = line.strip().partition(" ")
            if type == "1" or type == "2":
                S1[id] = True
            elif type == "3" or type == "4":
                S2[id] = True
            else:
                print "WARNING: unknown type: %r in line %d: %r" % (type, i, line)
    print len(S1), len(S2)

Or you can try to sort the lines first:

    import itertools

    def getcategory(line):
        # the last character on the line is the type:
        # "1" and "2" belong to S1, "3" and "4" to S2
        type = line[-1]
        assert type in "1234", ("unknown type", type, line)
        return "S1" if type in "12" else "S2"

    S1, S2 = 0, 0
    with open(path) as f:
        lines = f.read().splitlines()   # splitlines() drops the newlines
    lines.sort(key=getcategory)
    for category, group in itertools.groupby(lines, getcategory):
        # one set per category, so an id that has e.g. both type 1 and
        # type 2 records is counted only once
        ids = set(line.partition(" ")[0] for line in group)
        if category == "S1":
            S1 += len(ids)
        else:
            S2 += len(ids)
    print S1, S2

The asymptotic complexity of the second approach is worse because of the sort: O(n log n) instead of the single O(n) pass of the first one.

You can use line_profiler to find out where your bottleneck is.
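
For instance, a minimal sketch of how line_profiler is typically used might look like this (the script name count_ids.py and the count() wrapper are placeholders, not part of the code above; the @profile decorator is injected by kernprof at run time):

    # count_ids.py -- hypothetical file name, for illustration only

    @profile                        # provided by kernprof; running the script
    def count(path):                # without kernprof raises a NameError
        S1, S2 = set(), set()
        for line in open(path):
            id, type = line.split()
            if type in ("1", "2"):
                S1.add(id)
            else:
                S2.add(id)
        return len(S1), len(S2)

    if __name__ == "__main__":
        print count("filename.txt")

    # Run it with:
    #   kernprof -l -v count_ids.py
    # to get per-line timings for the decorated function.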

+2

You are using an iterator over the file, which means that only a few lines are buffered at a time. Every time the buffer is empty, the disk has to be read again and your program has to wait.

200 MB fits easily in memory, so reading all the lines up front speeds things up:

    def gen(path):
        # load all the lines at once
        lines = open(path).readlines()
        split = (line.split() for line in lines)
        return ((int(x), int(y)) for x, y in split)
+3

Are you attached to Python?

    egrep -e "[12]$" filename.txt | cut -d " " -f 1 | sort -u | wc -l
    egrep -e "[34]$" filename.txt | cut -d " " -f 1 | sort -u | wc -l

These two commands count how many distinct identifiers (the first field) appear on the lines of your filename.txt file that end in "1" or "2", and on the lines that end in "3" or "4"; the sort -u step discards the repeated identifiers.

Probably faster than Python ...

+2
