How to quickly search through a CSV file in Python

I am reading a .csv file with six million lines in Python, and I want to be able to search this file for a specific record.

Are there any tricks to search the whole file efficiently? Do I have to read it all into a dictionary, or should I scan the file on every lookup? I tried loading it into a dictionary, but that took a long time, so right now I scan the whole file every time, which seems wasteful.

Can I exploit the fact that the list is in alphabetical order? (For example, if the search word starts with "b", I would only look from the first line whose first word starts with "b" to the last such line.)

I am using import csv.

(Side question: can I make csv jump to a specific line in the file? I want the program to start at a random line.)

Edit: I already have a copy of the list as a .sql file; how can I use it in Python?

+3
6 answers

If the csv file does not change, load it into a database, where searching is fast and easy. If you are not familiar with SQL, you will need to learn a little of it first.

Here is a rough example of inserting from a csv file into an sqlite table. The example csv is ';'-delimited and has two columns.

    import csv
    import sqlite3

    con = sqlite3.connect('newdb.sqlite')
    cur = con.cursor()
    cur.execute('CREATE TABLE "stuff" ("one" varchar(12), "two" varchar(12));')

    f = open('stuff.csv', newline='')
    csv_reader = csv.reader(f, delimiter=';')

    # executemany consumes the reader row by row, so the whole file
    # is inserted without being loaded into memory at once.
    cur.executemany('INSERT INTO stuff VALUES (?, ?)', csv_reader)

    cur.close()
    con.commit()
    con.close()
    f.close()
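Once the data is in sqlite, each lookup is a single query. Here is a minimal sketch of the search step, assuming the table and column names from the snippet above ('bacon' is just an illustrative value); adding an index keeps the lookup from scanning the whole table:

    import sqlite3

    con = sqlite3.connect('newdb.sqlite')
    cur = con.cursor()

    # An index turns the lookup into a B-tree search instead of a full scan.
    cur.execute('CREATE INDEX IF NOT EXISTS idx_one ON stuff (one);')

    # Parameterized query: finds every row whose first column matches.
    cur.execute('SELECT one, two FROM stuff WHERE one = ?', ('bacon',))
    for row in cur.fetchall():
        print(row)

    con.close()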
+6

You can use memory mapping for really large files:

    import mmap
    import os
    import re

    reportFile = open("big_file", "rb")
    length = os.fstat(reportFile.fileno()).st_size

    try:
        # POSIX: map the file copy-on-write and read-only.
        mapping = mmap.mmap(reportFile.fileno(), length,
                            mmap.MAP_PRIVATE, mmap.PROT_READ)
    except AttributeError:
        # Windows has no MAP_PRIVATE/PROT_READ; use the access argument instead.
        mapping = mmap.mmap(reportFile.fileno(), 0, None, mmap.ACCESS_READ)

    data = mapping.read(length)
    pat = re.compile(b"b.+", re.M | re.DOTALL)  # compile your pattern here
    print(pat.findall(data))
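If you are looking for one specific record rather than a pattern, mmap objects also support find(), which avoids the regex entirely. A small sketch, assuming the mapping object from the snippet above and an illustrative search term:

    # Byte offset of the first occurrence, or -1 if absent.
    pos = mapping.find(b"bacon")
    if pos != -1:
        mapping.seek(pos)
        print(mapping.readline())  # from the match to the end of its line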
+4

Well, if your word list is not too big (that is, it fits into memory), then here is an easy way to do it (I assume these are all words, and that the file is already in alphabetical order, as you said):

    from bisect import bisect_left

    f = open('myfile.csv')
    words = []
    for line in f:
        words.extend(line.strip().split(','))
    f.close()

    wordtofind = 'bacon'
    # Binary search on the sorted word list.
    ind = bisect_left(words, wordtofind)
    if ind < len(words) and words[ind] == wordtofind:
        print('%s was found!' % wordtofind)

It may take a minute to load all the values from the file, but the search itself is a binary search, so it is fast. In this case I was looking for bacon (who would not be looking for bacon?). If there are duplicate values, you can also use bisect_right to find the index one past the rightmost element equal to the value you are looking for. You can still use this if you have key:value pairs; just make each item in the words list a [key, value] list, as sketched below.
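A minimal sketch of that key:value variant, assuming the pairs are sorted by key (the data here is made up for illustration):

    from bisect import bisect_left

    # Pairs sorted by key; lists compare element-wise, so bisect works on them.
    pairs = [['apple', 3], ['bacon', 9], ['cheese', 5]]

    key = 'bacon'
    ind = bisect_left(pairs, [key])  # [key] sorts just before [key, anything]
    if ind < len(pairs) and pairs[ind][0] == key:
        print('%s -> %s' % (key, pairs[ind][1]))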

Side note

I do not think you can easily jump from line to line in a csv file. These files are basically just one long string with \n characters marking the line breaks.

+1

You cannot jump directly to a specific line in the file, because lines are of variable length, so the only way to know where line #n starts is to scan for the first n newline characters. And it is not enough to just look for '\n' characters, because CSV allows newlines inside quoted cells, so you really have to parse the file anyway. One full pass to build an index of record offsets gets around this; see the sketch below.
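A minimal sketch of that idea, which also answers the side question about starting at a random line: record each record's byte offset once, then seek straight to any record. The file name is illustrative, and this version indexes raw lines, so it assumes your cells contain no embedded newlines (a file with quoted multi-line cells would need a csv-aware indexing pass instead):

    import csv
    import random

    # One pass to record the byte offset of every record.
    offsets = []
    with open('myfile.csv', newline='') as f:
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            offsets.append(pos)

    # Jump straight to a random record in constant time.
    with open('myfile.csv', newline='') as f:
        f.seek(random.choice(offsets))
        row = next(csv.reader(f))
        print(row)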

+1

My idea is to use the Python ZODB module to store the data as a dictionary-like structure, do all of your operations on that data structure, and then write out a new csv file from it.
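A rough sketch of that idea, assuming the ZODB package (and its BTrees dependency) is installed; the file and key names are illustrative:

    import csv
    import transaction
    from ZODB import FileStorage, DB
    from BTrees.OOBTree import OOBTree

    # Open (or create) an object database backed by a single file.
    storage = FileStorage.FileStorage('words.fs')
    db = DB(storage)
    conn = db.open()
    root = conn.root()

    # A BTree behaves like a persistent, sorted dictionary.
    root['words'] = tree = OOBTree()
    with open('myfile.csv', newline='') as f:
        for row in csv.reader(f):
            tree[row[0]] = row[1:]
    transaction.commit()

    print(tree.get('bacon'))  # fast keyed lookup
    db.close()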

0

There is a fairly simple way to do this. Depending on how many columns you want Python to print, you may need to add or remove some of the print lines.

    import csv

    search = input('Enter string to search: ')
    with open('FileName.csv', newline='') as f:
        reader = csv.reader(f)
        for row in reader:
            for field in row:
                if field == search:
                    print('Record found! \n')
                    print(row[0])
                    print(row[1])
                    print(row[2])
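If the csv has a header row, a variant using csv.DictReader lets you refer to columns by name instead of by index. A small sketch under that assumption:

    import csv

    search = input('Enter string to search: ')
    with open('FileName.csv', newline='') as f:
        # DictReader uses the header row as the keys of each record.
        for row in csv.DictReader(f):
            if search in row.values():
                print('Record found!\n')
                print(row)  # or pick fields by header name, e.g. row['name']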

Hope this helps.

0
