Using Pandas, how do I deduplicate a file that is read in chunks?

I have a large fixed-width file that I read with pandas in chunks of 10,000 lines. This works great for everything except removing duplicates from the data, because the duplicates can obviously land in different chunks. The file is read in chunks because it is too large to fit into memory in its entirety.

My first attempt at deduplicating the file was to read in only the two columns needed for deduplication and build a list of rows to skip. Those two columns (out of about 500) fit easily into memory, and I was able to use the id column to find duplicates and the correspondence column to decide which of the two or three rows with the same id to keep. I then used the skiprows flag of the read_fwf() command to skip those rows.
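For illustration, a minimal sketch of that first attempt might look like the following; the file name, the colspecs byte ranges, and the column names ('id' and 'priority') are placeholders rather than details from the original post:

import pandas as pd

# Read only the two key columns; assumes the file has no header row.
keys = pd.read_fwf('data.fwf',
                   colspecs=[(0, 10), (10, 12)],
                   names=['id', 'priority'])

# For each id, keep the row with the largest 'priority' value and
# mark every other row carrying that id to be skipped.
keep = keys.sort_values('priority', ascending=False).drop_duplicates('id').index
skiprows = keys.index.difference(keep).tolist()

# The intended next step (see the problem below): skiprows together with iterator=True.
# reader = pd.read_fwf('data.fwf', skiprows=skiprows, iterator=True, chunksize=10000)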

The problem I ran into is that the pandas fixed-width reader does not work with skiprows=[list] and iterator=True at the same time.

So how do I deduplicate a file that is processed in chunks?

1 answer

In the end I worked out a way to do it. The idea is essentially the same as my first attempt: use only the columns needed to identify duplicates, decide which rows to keep, and then apply that decision to the full data. The difference is that the decision is stored as a boolean mask instead of a skiprows list, so the chunked reader can still be used.

, "id". , . DataFrame.duplicated() , , ~ . "".

dupemask = ~df.duplicated(subset=['id'])
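As a side note not from the original answer: if, as in the question, the row to keep among duplicates is chosen by another column rather than by file order, one option is to sort before calling duplicated() and then restore file order so the mask still lines up with the chunks (the column name 'priority' is a placeholder):

# Sort so the preferred row of each id comes first, flag the rest as duplicates,
# then sort the mask back into file order.
dupemask = ~(df.sort_values('priority', ascending=False)
               .duplicated(subset=['id'])
               .sort_index())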

, . . "dupemask", , , .

for i, df in enumerate(chunked_data_iterator):
    # give each chunk a global row index so it lines up with dupemask
    df.index = range(i*chunksize, i*chunksize + len(df.index))
    # keep only the rows whose mask entry is True
    df = df[dupemask]

Taken together, this does everything I needed: only the two key columns are ever held in memory in full, and each full-width chunk is trimmed down to its unique rows as it is read.
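Putting both steps together, a rough end-to-end sketch could look like this; the file name, column layout, and CSV output are assumptions, not details from the original answer:

import pandas as pd

chunksize = 10000

# Pass 1: read only the two key columns and build a keep-mask over every row.
keys = pd.read_fwf('data.fwf', colspecs=[(0, 10), (10, 12)],
                   names=['id', 'priority'])
dupemask = ~keys.duplicated(subset=['id'])

# Pass 2: stream the full-width file and filter each chunk with the mask.
reader = pd.read_fwf('data.fwf', colspecs='infer', header=None, chunksize=chunksize)
with open('deduped.csv', 'w') as out:
    for i, chunk in enumerate(reader):
        chunk.index = range(i * chunksize, i * chunksize + len(chunk.index))
        kept = chunk[dupemask.loc[chunk.index]]
        kept.to_csv(out, header=False, index=False)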
