Python / bash SQL for tsv flatfiles (without sqlite)

Background:

sqlite is great for running SQL over data that has been loaded into a database, but in my work I often need to do selects, joins, and where clauses over files that never get loaded into a database and don't justify the load/initialization cost of importing them. On top of that, sqlite's random-access behavior often makes operations that touch every row slower.

Question:

Is there a set of SQL-like commands / functions (preferably python / bash) that doesn't require sqlite and works directly on plain tab-delimited flat files? For example, instead of table column names, selections would simply use column numbers.

Example

select col1,col2,col3 from fileName.tsv where col1[int] < 3 

Note: I understand that a lot of this can be achieved with awk, cut, bash join, etc.; I was wondering if there is anything more SQLesque?

+4
3 answers

After googling for a Python equivalent of DBD::CSV, I found KirbyBase. It looks like it fits the bill.

Since I don't use Python at all, I have never tried it.

Edited to add: Well, looking at the documentation, the query commands are not exactly SQL, but they are much more SQLesque than using awk.

+1

You can hack something together using the csv module and list comprehensions:

    import csv

    reader = csv.reader(open('data.csv', 'r'))
    rows = [row for row in reader]

    # select * from data where first column < 4
    # this won't actually work as-is! see the edit below
    [row for row in rows if row[0] < 4]

    # select * from data where second column >= third column
    [row for row in rows if row[1] >= row[2]]

    # select columns 1 and 3 from data where first column is "mykey"
    [[row[0], row[2]] for row in rows if row[0] == "mykey"]

You could probably do even fancier things with Python's functional tools, though if you're not already familiar with functional programming it may be more to learn than it's worth ;-)
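Since the question is specifically about TSV files, it's also worth noting that csv.reader and csv.DictReader accept a delimiter argument, and that if the file has a header row, DictReader lets you refer to columns by name, which reads even more SQL-like. A minimal sketch (the inline sample data and column names are made up for illustration; you'd normally pass a real file object instead of io.StringIO):

```python
import csv
import io

# Hypothetical stand-in for a tab-delimited file with a header row.
tsv_text = "name\tqty\tprice\napple\t3\t0.50\nbanana\t12\t0.25\n"

reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
rows = list(reader)

# select name, qty from data where qty >= 5
# (values come back as strings, so convert before comparing)
result = [(r["name"], int(r["qty"])) for r in rows if int(r["qty"]) >= 5]
print(result)  # [('banana', 12)]
```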


Edit: A few more tips:

  • If you are only going to run a single query per script, you can cut out the intermediate data store ( rows in my example):

    import csv

    reader = csv.reader(open('data.csv', 'r'))
    result = [row for row in reader if row[0] == "banana"]
  • The csv reader returns all of its output as text, so if you want to treat a column as, e.g., an integer, you have to convert it yourself. For example, if your second and third columns are integers:

    import csv

    reader = csv.reader(open('data.csv', 'r'))
    rows = [[row[0], int(row[1]), int(row[2])] for row in reader]
    # perform a "select" on rows now

    (This means that my first example above doesn't actually work as-is.) If all of your columns are integers, you can call the map function:

    import csv

    reader = csv.reader(open('data.csv', 'r'))
    rows = [list(map(int, row)) for row in reader]
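The question also mentions joins, and the same list-comprehension approach extends to those if you first index one file by its join column in a dict. A sketch under the same assumptions as above (the inline data stands in for two tab-delimited files, joined on their first column):

```python
import csv
import io

# Hypothetical stand-ins for two tab-delimited files.
left_text = "1\talice\n2\tbob\n3\tcarol\n"
right_text = "1\tred\n3\tblue\n"

left = list(csv.reader(io.StringIO(left_text), delimiter="\t"))
right = list(csv.reader(io.StringIO(right_text), delimiter="\t"))

# select l.col2, r.col2 from left l join right r on l.col1 = r.col1
index = {row[0]: row for row in right}  # hash the smaller file by join key
joined = [(l[1], index[l[0]][1]) for l in left if l[0] in index]
print(joined)  # [('alice', 'red'), ('carol', 'blue')]
```

This is an inner join; rows with no match in the indexed file (like "bob" here) are simply dropped.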
+2

I would highly recommend Microsoft Log Parser 2.2... except that I assume you are on Linux, and I'm pretty sure it won't run there. But I'll put the links here in case someone isn't using Linux.

http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=24659
http://www.codinghorror.com/blog/2005/08/microsoft-logparser.html

0
