I am trying to reproduce, in pure Python, work I would normally do with SQL. With some help I can now read a CSV into a Python dictionary keyed by the column names.
I can now read my zipped CSV file into a dict, but I only get one row, and it is the last one. (How do I get a sample of rows, or the entire data file?)
Eventually I want an in-memory table that I can manipulate much as I would in SQL: clean the data by comparing it against another table of known-bad entries and correcting them, then summarize by type, average time period, and so on. The full data file is about 500,000 lines. I am not worried about fitting everything in memory, but I want to solve the general case as well as possible, so I know what can be done without resorting to SQL.
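For the cleaning step, a plain dict can serve as the lookup table of known-bad values. A minimal sketch, where the `corrections` table and the county names are made up for illustration:

```python
# Hypothetical correction table: maps known-bad spellings to fixes
corrections = {"Shlby": "Shelby", "Davidsn": "Davidson"}

rows = [
    {"county": "Shlby", "price": "100"},
    {"county": "Davidson", "price": "300"},
]

# Replace any value found in the correction table; leave others alone
for row in rows:
    row["county"] = corrections.get(row["county"], row["county"])

print([r["county"] for r in rows])  # ['Shelby', 'Davidson']
```

Dict lookup is O(1) per row, so this stays fast even at 500,000 lines.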
import csv, zipfile

zip_path = "/home/tom/Documents/REdata/AllListing1RES.zip"
zip_file = zipfile.ZipFile(zip_path)
items_file = zip_file.open('AllListing1RES.txt', 'rU')
for row in csv.DictReader(items_file, dialect='excel', delimiter='\t'):
    pass
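The `pass` body discards each row as it is read, which is why only the final one survives the loop. A sketch of the same read that keeps a small sample of rows instead; it builds a tiny demo zip in memory so it runs on its own (the file name, column names, and values are placeholders, not the real data):

```python
import csv, io, zipfile
from itertools import islice

# Build a tiny tab-delimited file inside an in-memory zip, just so the
# sketch is self-contained; with real data, open the zip on disk instead.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("AllListing1RES.txt",
                "county\tprice\n"
                "Shelby\t100\n"
                "Shelby\t200\n"
                "Davidson\t300\n")
demo.seek(0)

with zipfile.ZipFile(demo) as zf:
    # ZipFile.open yields bytes; wrap it so csv gets text
    with io.TextIOWrapper(zf.open("AllListing1RES.txt"),
                          encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t")
        sample = list(islice(reader, 2))   # keep only the first 2 rows

print(sample)
```

`islice(reader, n)` is also the answer to testing on 10 or 100 rows before committing to the full file; drop the `islice` and write `list(reader)` to load everything.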
I think I am barking up some wrong trees here... One of them is that I end up with only one line of my 500,000-line file. Secondly, it seems a dict may not be the right structure, since I don't think I can just load all 500,000 lines and perform various operations on them, like summing by group and date. Plus it seems that duplicate keys can cause problems, i.e. with non-unique descriptors such as county and division.
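On the group-and-sum part: a list of row dicts plus `collections.defaultdict` gets surprisingly close to `GROUP BY`. A sketch with hypothetical field names:

```python
from collections import defaultdict

rows = [
    {"county": "Shelby", "days_on_market": 10},
    {"county": "Shelby", "days_on_market": 30},
    {"county": "Davidson", "days_on_market": 20},
]

# Collect values per group, then reduce: SELECT county, AVG(days_on_market)
groups = defaultdict(list)
for row in rows:
    groups[row["county"]].append(row["days_on_market"])

averages = {county: sum(v) / len(v) for county, v in groups.items()}
print(averages)  # {'Shelby': 20.0, 'Davidson': 20.0}
```

Duplicate keys are not a problem here because each row is its own dict; keys only need to be unique within one row, not across the table.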
I also don't know how to read a small subset of the lines into memory (say 10 or 100, for testing) before loading everything (which I also can't manage yet). I've read the Python docs and several references, but it just doesn't click.
Most of the answers I can find suggest various SQL solutions for this kind of thing, but I really want to learn the basics of achieving similar results with Python. In some cases I think it will be easier and faster, and it will expand my toolbox. But I'm having a hard time finding suitable examples.
One answer that hints at what I'm after:
Once the reading is fixed, DictReader should work for getting the rows as dictionaries, a typical row-oriented structure. Oddly enough, this is not normally an efficient way to handle queries like yours; having plain column lists makes searches a lot easier. Row orientation means you have to redo some of the search work for every row. Things like date matching require metadata that is certainly not in the CSV, such as how dates are represented and which columns are dates.
An example of building a column-oriented structure (which, however, loads the entire file):
import csv
allrows = list(csv.reader(open('test.csv')))
via Yanne Vernier
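Building on that, the row lists can be transposed into per-column tuples, so a query touches only the columns it needs. A small sketch with made-up data in place of the real file:

```python
import csv, io

# Stand-in for the real tab-delimited file
data = "county\tprice\nShelby\t100\nDavidson\t300\n"
reader = csv.reader(io.StringIO(data), delimiter="\t")

header = next(reader)                       # first row holds column names
columns = dict(zip(header, zip(*reader)))   # transpose rows -> columns

# A query now converts just the one column it needs
prices = [int(p) for p in columns["price"]]
print(columns["county"], prices)
```

`zip(*reader)` is the transpose: it regroups the remaining rows into one tuple per column, keyed by the header names.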
Surely there is a good reference for this general topic?