Python in-memory table data structures for analysis (dict, list, combo)

I am trying to simulate some code that I have working with SQL, but using pure Python instead. With some help here (CSV to Python dictionary with all column names?), I can now read my zipped CSV file into a dict.

Only one row though, the last one. (How do I get a sample of rows, or the entire data file?)

When I'm finished, I hope to have a memory-resident table that I can manipulate much like SQL: clean the data by comparing it against another table of bad data and correcting entries, then summarize by type, average time period, etc. The total data file is about 500,000 lines. I'm not fussed about getting everything into memory, but I want to solve the general case as well as possible, so I know what can be done without resorting to SQL.

    import csv, sys, zipfile

    sys.argv[0] = "/home/tom/Documents/REdata/AllListing1RES.zip"
    zip_file = zipfile.ZipFile(sys.argv[0])
    items_file = zip_file.open('AllListing1RES.txt', 'rU')
    for row in csv.DictReader(items_file, dialect='excel', delimiter='\t'):
        pass

Then my result is:

    >>> for key in row:
    ...     print 'key=%s, value=%s' % (key, row[key])
    key=YEAR_BUILT_DESC, value=EXIST
    key=SUBDIVISION, value=KNOLLWOOD
    key=DOM, value=2
    key=STREET_NAME, value=ORLEANS RD
    key=BEDROOMS, value=3
    key=SOLD_PRICE, value=
    key=PROP_TYPE, value=SFR
    key=BATHS_FULL, value=2
    key=PENDING_DATE, value=
    key=STREET_NUM, value=3828
    key=SOLD_DATE, value=
    key=LIST_PRICE, value=324900
    key=AREA, value=200
    key=STATUS_DATE, value=3/3/2011 11:54:56 PM
    key=STATUS, value=A
    key=BATHS_HALF, value=0
    key=YEAR_BUILT, value=1968
    key=ZIP, value=35243
    key=COUNTY, value=JEFF
    key=MLS_ACCT, value=492859
    key=CITY, value=MOUNTAIN BROOK
    key=OWNER_NAME, value=SPARKS
    key=LIST_DATE, value=3/3/2011
    key=DATE_MODIFIED, value=3/4/2011 12:04:11 AM
    key=PARCEL_ID, value=28-15-3-009-001.0000
    key=ACREAGE, value=0
    key=WITHDRAWN_DATE, value=

I think I'm barking up a few wrong trees here. One is that I end up with only one line of my 500,000-line file. Secondly, it seems that a dict may not be the right structure, since I don't think I can just load all 500,000 lines and perform various operations on them, like summing by group and date. Plus, it seems that duplicate keys could cause problems, i.e. the non-unique descriptors such as county and subdivision.

I also don't know how to read a specific small subset of lines into memory (for example, 10 or 100 for testing, before loading everything, which I also haven't managed). I have read the Python docs and several reference books, but it just isn't clicking yet.

It seems that most of the answers I can find suggest various SQL solutions for this kind of thing, but I really want to learn the basics of achieving similar results with Python, since in some cases I think it will be easier and faster, as well as expanding my toolbox. But I'm having a hard time finding suitable examples.

One answer that hints at what I'm after:

Once the reading is done right, DictReader should work for getting rows as dictionaries, a typical row-oriented structure. Oddly enough, this is normally not an efficient way to handle queries like yours; having only column lists makes searches a lot easier. Row orientation means you have to redo some lookup work for every row. Things like date matching require data that is certainly not present in the CSV, such as how dates are represented and which columns are dates.

An example of getting a column-oriented structure (which, however, involves loading the whole file):

    import csv
    allrows = list(csv.reader(open('test.csv')))
    # Extract the first row as keys for a columns dictionary
    columns = dict([(x[0], x[1:]) for x in zip(*allrows)])

The intermediate steps of going to list and storing in a variable aren't necessary. The key is using zip (or its cousin itertools.izip) to transpose the table. Then extracting column two from all rows with a certain criterion in column one:

    matchingrows = [rownum for (rownum, value) in enumerate(columns['one']) if value > 2]
    print map(columns['two'].__getitem__, matchingrows)

When you do know the type of a column, it may make sense to parse it, using appropriate functions like datetime.datetime.strptime.

via Yann Vernier
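To make the strptime suggestion concrete (a sketch, not part of the quoted answer, assuming the columns dict from the example above and the '3/3/2011'-style LIST_DATE format in my sample row):

    import datetime

    # Parse the LIST_DATE column into datetime objects,
    # skipping blank entries.
    list_dates = [datetime.datetime.strptime(d, '%m/%d/%Y')
                  for d in columns['LIST_DATE'] if d]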

Surely there is some good reference on this general topic?

+4
3 answers

The csv reader only gives you one row at a time, but you can store them all in memory quite easily:

    rows = []
    for row in csv.DictReader(items_file, dialect='excel', delimiter='\t'):
        rows.append(row)

    # rows[0]
    # {'keyA': 13, 'keyB': 'dataB' ... }
    # rows[1]
    # {'keyA': 5, 'keyB': 'dataB' ... }

Then, to perform aggregations and calculations:

 sum(row['keyA'] for row in rows) 

You can convert the data as you read it into rows (DictReader gives you every value as a string), or switch to a friendlier data structure afterwards. Iterating over all 500,000 rows for every calculation could get quite inefficient.
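As a sketch of the "sum by group" the question asks about — assuming the rows list built above and the column names visible in the question's sample row (PROP_TYPE, LIST_PRICE) — a plain dict of accumulators goes a long way:

    from collections import defaultdict

    # Group by property type and average the list price;
    # skip rows whose LIST_PRICE is an empty string before converting.
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        if row['LIST_PRICE']:
            totals[row['PROP_TYPE']] += float(row['LIST_PRICE'])
            counts[row['PROP_TYPE']] += 1

    for prop_type in totals:
        print '%s: average list price %.0f' % (
            prop_type, totals[prop_type] / counts[prop_type])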

As a commenter noted, using an in-memory database could be really useful for you. Another question asks how to import CSV data into a sqlite database.

    import csv
    import sqlite3

    conn = sqlite3.connect(":memory:")
    c = conn.cursor()
    c.execute("create table t (col1 text, col2 float);")

    # csv.DictReader uses the first line in the file as column headings by default
    dr = csv.DictReader(open('data.csv'), delimiter=',')
    to_db = [(i['col1'], i['col2']) for i in dr]
    c.executemany("insert into t (col1, col2) values (?, ?);", to_db)
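From there, the SQL-style summaries the question wants work directly against the in-memory table, e.g. a grouped average (a sketch, continuing from the hypothetical t table above):

    # Group-and-aggregate in SQL, then fetch the results back into Python.
    c.execute("select col1, avg(col2) from t group by col1")
    for col1, avg_col2 in c.fetchall():
        print col1, avg_col2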
+4
source

You say "Now I can read my zipped-csv file into a dict. Only one row though, the last one. (How do I get a sample of rows, or the entire data file?)"

Your code does this:

    for row in csv.DictReader(items_file, dialect='excel', delimiter='\t'):
        pass

I can't imagine why you wrote that, but the effect is to read the entire input file row by row, ignoring each row (pass means "do nothing"). The end result is that row refers to the last row (unless, of course, the file is empty).

To "get" the entire file, change pass to do_something_useful_with(row).

If you want to read the entire file into memory, simply do this:

 rows = list(csv.DictReader(.....)) 

To get a sample, e.g. every Nth row (N > 0), starting from the Mth row (0 <= M < N), do something like this:

    for row_index, row in enumerate(csv.DictReader(.....)):
        if row_index % N != M:
            continue
        do_something_useful_with(row)
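For the other sampling case in the question — just the first 10 or 100 rows for testing — itertools.islice stops pulling from the reader as soon as it has enough, so the remaining lines are never parsed (a sketch, reusing items_file from the question's code):

    import csv
    import itertools

    # Take only the first 100 rows for testing.
    sample = list(itertools.islice(
        csv.DictReader(items_file, delimiter='\t'), 100))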

By the way, you don't need dialect='excel'; that's the default.

+1

Numpy (numerical python) is the best tool for working with and comparing arrays, and your table is basically a 2D array.
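For example, numpy.genfromtxt can load a delimited file into a column-addressable structured array (a sketch, assuming the tab-delimited AllListing1RES.txt from the question and that the numeric columns parse cleanly):

    import numpy as np

    # names=True takes the column names from the header row;
    # dtype=None lets genfromtxt guess each column's type.
    data = np.genfromtxt('AllListing1RES.txt', delimiter='\t',
                         names=True, dtype=None)

    # Columns are then addressable by name, and comparisons vectorize:
    print data['LIST_PRICE'][data['BEDROOMS'] >= 3]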

0
