Given that the file is as follows:
1440927 1 1727557 3 1440927 2 9917156 4
The first field is an identifier in range(0, 200000000). The second field is a type in range(1, 5). Types 1 and 2 belong to the general category S1, and types 3 and 4 belong to S2. One identifier can have several records of various types. The file size is about 200 MB.
The task is to count the number of identifiers having a record of type 1 or 2, and the number of identifiers that have a record of type 3 or 4.
My code is:
import bitarray

def gen(path):
    for line in open(path):
        tmp = line.split()
        id = int(tmp[0])
        yield id, int(tmp[1])

max_id = 200000000
S1 = bitarray.bitarray(max_id)
S2 = bitarray.bitarray(max_id)
S1.setall(False)  # new bitarrays start with arbitrary bits, so clear them
S2.setall(False)

for id, type in gen(path):
    if type != 3 and type != 4:
        S1[id] = True
    else:
        S2[id] = True

print S1.count(), S2.count()
Although it produces the right answer, I think it runs a little slowly. What can I do to make it run faster?
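One direction I have been considering (only a sketch, and it assumes numpy is installed and that the file really is nothing but whitespace-separated id/type pairs as shown above) is to let numpy parse the whole file in one pass and flag identifiers with vectorized indexing instead of a Python-level loop:

import numpy as np

max_id = 200000000

# Values alternate id, type, id, type, ... so read them all at once.
data = np.fromfile(path, dtype=np.int64, sep=' ')
ids, types = data[0::2], data[1::2]

# One boolean flag per possible identifier (about 200 MB per array).
s1 = np.zeros(max_id, dtype=bool)
s2 = np.zeros(max_id, dtype=bool)
s1[ids[types <= 2]] = True   # types 1 and 2
s2[ids[types >= 3]] = True   # types 3 and 4

print s1.sum(), s2.sum()

I have not measured whether this is actually faster than the bitarray version; the point is just to move the parsing and the per-record branching out of pure Python.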
EDIT: There are duplicate entries in the file, and I only need to distinguish between S1 (types 1 and 2) and S2 (types 3 and 4). For example, 1440927 1 and 1440927 2 are counted only once, not twice, because they both belong to S1. That is why I have to keep track of the identifiers.
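To make the de-duplication rule concrete, the count I want is the same as this plain-set version (only an illustration using the gen() helper from my code above; two sets of Python ints would use far more memory than the bitarrays for a file this size):

s1_ids, s2_ids = set(), set()
for id, type in gen(path):
    if type in (1, 2):
        s1_ids.add(id)   # duplicate ids collapse automatically
    else:
        s2_ids.add(id)
print len(s1_ids), len(s2_ids)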
amazingjxq