CSV for Python Dictionary with all column names?

I'm still pretty new to using python for programs from scratch, so I would take a file that I process with SQL to duplicate functions with Python. It seems that I want to take the (compressed, zip) csv file and create a Dict from it (or maybe a dict of dicts?). When I use a dict reader, do I get the first row as a key, and not each column as my own key? For instance.

import csv, sys, zipfile sys.argv[0] = "/home/tom/Documents/REdata/AllListing1RES.zip" zip_file = zipfile.ZipFile(sys.argv[0]) items_file = zip_file.open('AllListing1RES.txt', 'rU') for row in csv.DictReader(items_file,dialect='excel'): pass 

Productivity:

 >>> for key in row: print 'key=%s, value=%s' % (key, row[key]) key=MLS_ACCT PARCEL_ID AREA COUNTY STREET_NUM STREET_NAME CITY ZIP STATUS PROP_TYPE LIST_PRICE LIST_DATE DOM DATE_MODIFIED BATHS_HALF BATHS_FULL BEDROOMS ACREAGE YEAR_BUILT YEAR_BUILT_DESC OWNER_NAME SOLD_DATE WITHDRAWN_DATE STATUS_DATE SUBDIVISION PENDING_DATE SOLD_PRICE, value=492859 28-15-3-009-001.0000 200 JEFF 3828 ORLEANS RD MOUNTAIN BROOK 35243 A SFR 324900 3/3/2011 2 3/4/2011 12:04:11 AM 0 2 3 0 1968 EXIST SPARKS 3/3/2011 11:54:56 PM KNOLLWOOD CITY ZIP STATUS PROP_TYPE LIST_PRICE LIST_DATE DOM DATE_MODIFIED BATHS_HALF BATHS_FULL BEDROOMS ACREAGE YEAR_BUILT YEAR_BUILT_DESC OWNER_NAME SOLD_DATE WITHDRAWN_DATE STATUS_DATE SUBDIVISION PENDING_DATE SOLD_PRICE, >>> for key in row: print 'key=%s, value=%s' % (key, row[key]) key=MLS_ACCT PARCEL_ID AREA COUNTY STREET_NUM STREET_NAME CITY ZIP STATUS PROP_TYPE LIST_PRICE LIST_DATE DOM DATE_MODIFIED BATHS_HALF BATHS_FULL BEDROOMS ACREAGE YEAR_BUILT YEAR_BUILT_DESC OWNER_NAME SOLD_DATE WITHDRAWN_DATE STATUS_DATE SUBDIVISION PENDING_DATE SOLD_PRICE, value=492859 28-15-3-009-001.0000 200 JEFF 3828 ORLEANS RD MOUNTAIN BROOK 35243 A SFR 324900 3/3/2011 2 3/4/2011 12:04:11 AM 0 2 3 0 1968 EXIST SPARKS 3/3/2011 11:54:56 PM KNOLLWOOD 

So, I'm looking for a column for MLS_ACCT and a separate one for PARCEL_ID , etc., so I can do things like average prices for all elements that contain KNOLLWOOD in the SUBDIVISION field with an additional subsection by date, sale date, etc.

I know well how to do this with SQL, but as I said, I am contacting to get some Python skills here. I read the last few days, but have not yet found very simple illustrations in this case. Pointers to the mentioned documents would be appreciated. I understand that I can use resident SQL-Lite, but again, my desire is to get an approach to Python. I read some of them about Numpy and Scipy and downloaded the sage, but still can not find useful illustrations, as these tools focus on arrays with numbers only as elements, and I have a lot of string mappings that I need to do as well date calculations and comparisons.

In the end, I will need to replace the values โ€‹โ€‹in the table (since I have dirty data), I am doing it now, having a "translation table" that contains all the dirty options, and provides a "clean" answer for final use.

+4
source share
3 answers

Are you sure this is a comma-delimited file? The lines seem to be tabbed separated.

If this is correct, specify the tab delimiter in the DictReader constructor.

 for row in csv.DictReader(items_file, dialect='excel', delimiter='\t'): for key in row: print 'key=%s, value=%s' % (key, row[key]) 

Source: http://docs.python.org/library/csv.html

+6
source

Writing an operation in pure Python is certainly possible, but then you have to choose your own algorithms. The output of the line you posted above looks very much like the parsing failed; in fact, it seems this is not a CSV at all, is it a TSV? Try passing delimiter='\t' or dialect=csv.excel_tab to DictReader.

Once the reading is done correctly, DictReader should work to get the strings as dictionaries, a typical string-oriented structure. Oddly enough, this is usually not an efficient way to handle requests like yours; having only column lists makes searching a lot easier. Line orientation means you need to repeat some search work for each line. Things like date matching require data that is certainly not available in the CSV, such as how dates are presented and which columns are dates.

An example of obtaining a column-oriented structure (however, including loading the entire file):

 import csv allrows=list(csv.reader(open('test.csv'))) # Extract the first row as keys for a columns dictionary columns=dict([(x[0],x[1:]) for x in zip(*allrows)]) 

The intermediate steps of moving to a list and storing in a variable are not needed. The key uses zip (or its cousin itertools.izip) to wrap the table.

Then retrieve the column two from all rows with a specific criterion in the column one :

 matchingrows=[rownum for (rownum,value) in enumerate(columns['one']) if value>2] print map(columns['two'].__getitem__, matchingrows) 

When you know the type of a column, it makes sense to parse it using appropriate functions such as datetime.datetime.strptime .

+1
source

At first glance it seems that your input may not be CSV, but perhaps a tab delimiter is used instead. Browse the docs in python.org , you can create a dialect and use it to change the delimiter.

 import csv csv.register_dialect('exceltab', delimiter='\t') for row in csv.DictReader(items_file,dialect='exceltab'): pass 
0
source

All Articles