How to filter specific lines from a huge CSV file using a Python script

Is there an efficient way in Python to load only certain lines from a huge CSV file (for further processing) without reading the whole file into memory?
For example: suppose I want to filter rows with a specific date from a file in the following format, where the file contains a huge number of records (dates are not ordered).

    Date        event_type  country
    2015/03/01  impression  US
    2015/03/01  impression  US
    2015/03/01  impression  CA
    2015/03/01  click       CA
    2015/03/02  impression  FR
    2015/03/02  click       FR
    2015/03/02  impression  US
    2015/03/02  click       US
3 answers

You still need to read each line in the file in order to test your condition. However, there is no need to load the entire file into memory, so you can stream it as follows:

    import csv

    with open('huge.csv', 'r', newline='') as csvfile:
        spamreader = csv.reader(csvfile, delimiter=' ', quotechar='"')
        for row in spamreader:
            if row[0] != '2015/03/01':
                continue  # skip rows for other dates
            # Process the matching row here

If you just need a list of the matching rows, a list comprehension is faster and even simpler:

    import csv

    with open('huge.csv', 'r', newline='') as csvfile:
        spamreader = csv.reader(csvfile, delimiter=' ', quotechar='"')
        rows = [row for row in spamreader if row[0] == '2015/03/01']

If matching dates can appear anywhere, you have to scan the whole file:

    import csv

    def get_rows(k, fle):
        with open(fle) as f:
            next(f)  # skip the header line
            for row in csv.reader(f, delimiter=" ", skipinitialspace=True):
                if row[0] == k:
                    yield row

    for row in get_rows("2015/03/02", "in.txt"):
        print(row)

You can use multiprocessing to speed up the parsing by dividing the data into chunks.
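A minimal sketch of that idea, assuming space-delimited rows as in the question; the file name, chunk size, and worker count are placeholder values. The file is streamed in chunks of lines, and each chunk is parsed and filtered in a worker process, so the whole file is never held in memory at once:

```python
import csv
import multiprocessing
from itertools import islice

def filter_chunk(args):
    """Parse one chunk of raw lines and keep rows matching the date."""
    lines, date = args
    return [row for row in csv.reader(lines, delimiter=' ')
            if row and row[0] == date]

def filter_file(fle, date, chunk_size=10000, workers=4):
    """Stream the file chunk by chunk and filter the chunks in parallel."""
    def chunks(f):
        while True:
            lines = list(islice(f, chunk_size))  # read up to chunk_size lines
            if not lines:
                return
            yield (lines, date)

    with open(fle) as f:
        next(f)  # skip the header line
        with multiprocessing.Pool(workers) as pool:
            # imap consumes the chunk generator lazily in the main process
            return [row
                    for chunk in pool.imap(filter_chunk, chunks(f))
                    for row in chunk]
```

For example, `filter_file('huge.csv', '2015/03/02')` returns the matching rows as lists of fields. Whether this actually beats the single-process version depends on how expensive the per-row work is relative to the cost of shipping chunks to workers.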

With csv.DictReader you can filter on a column by its header name; this example drops rows whose country is in a filter set:

    import csv

    filter_countries = {'US': 1}
    with open('data.tsv', 'r', newline='') as f_name:
        for line in csv.DictReader(f_name, delimiter='\t'):
            if line['country'] not in filter_countries:
                print(line)
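The same DictReader approach can filter on the question's Date column instead. A small sketch (the file path and target date are example values) that collects the matching rows rather than printing them:

```python
import csv

def rows_for_date(path, date):
    """Return all rows (as dicts) whose Date column equals the given date."""
    # 'Date' is the header name from the sample data in the question
    with open(path, 'r', newline='') as f:
        return [row for row in csv.DictReader(f, delimiter='\t')
                if row['Date'] == date]
```

For example, `rows_for_date('data.tsv', '2015/03/01')` returns a list of dicts, one per matching line.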
