SQL query engine for text files on Linux?

We use grep, cut, sort, uniq, and join on the command line all the time to parse data files. They work great, though there are shortcomings. For example, you have to give the column numbers to each tool. We often have wide files (many columns) with a header line that lists the column names. In fact, our files look a lot like SQL tables. Surely there is some driver (ODBC?) that can operate on delimited text files, and some query engine that can use that driver, so we could just run SQL queries against our text files. Since the analysis is usually ad hoc, it should need minimal setup to query new files (just "use the files in this directory") rather than declaring particular tables in some configuration.

Practically speaking, what is easiest? That is, which SQL engine and driver is easiest to set up and use against text files?

+7
Tags: command-line, sql, text
6 answers

Building on someone else's suggestion, here is a Python script that drives sqlite3. A bit verbose, but it works.

I don't like having to copy the whole file just to drop the header line, but I don't know any other way to convince sqlite3's .import to skip it. I could generate INSERT statements instead, but that seems just as bad, if not worse.

Sample call:

 $ sql.py --file foo --sql "select count(*) from data"

The code:

 #!/usr/bin/env python

 """Run a SQL statement on a text file"""

 import os
 import sys
 import getopt
 import tempfile

 class Usage(Exception):
     def __init__(self, msg):
         self.msg = msg

 def runCmd(cmd):
     if os.system(cmd):
         print "Error running " + cmd
         sys.exit(1)
         # TODO(dan): Return actual exit code

 def usage():
     print >>sys.stderr, "Usage: sql.py --file file --sql sql"

 def main(argv=None):
     if argv is None:
         argv = sys.argv

     try:
         try:
             opts, args = getopt.getopt(argv[1:], "h",
                                        ["help", "file=", "sql="])
         except getopt.error, msg:
             raise Usage(msg)
     except Usage, err:
         print >>sys.stderr, err.msg
         print >>sys.stderr, "for help use --help"
         return 2

     filename = None
     sql = None
     for o, a in opts:
         if o in ("-h", "--help"):
             usage()
             return 0
         elif o == "--file":
             filename = a
         elif o == "--sql":
             sql = a
         else:
             print "Found unexpected option " + o

     if not filename:
         print >>sys.stderr, "Must give --file"
         sys.exit(1)
     if not sql:
         print >>sys.stderr, "Must give --sql"
         sys.exit(1)

     # Get the first line of the file to make a CREATE statement.
     #
     # Copy the rest of the lines into a new file (datafile) so that
     # sqlite3 can import the data without the header.  If sqlite3 could
     # skip the first line with .import, this copy would be unnecessary.
     foo = open(filename)
     datafile = tempfile.NamedTemporaryFile()
     first = True
     for line in foo.readlines():
         if first:
             headers = line.rstrip().split()
             first = False
         else:
             print >>datafile, line,
     datafile.flush()
     #print datafile.name
     #runCmd("cat %s" % datafile.name)

     # Create columns with NUMERIC affinity so that if they are numbers,
     # SQL queries will treat them as such.
     create_statement = "CREATE TABLE data (" + ",".join(
         map(lambda x: "`%s` NUMERIC" % x, headers)) + ");"

     cmdfile = tempfile.NamedTemporaryFile()
     #print cmdfile.name
     print >>cmdfile, create_statement
     print >>cmdfile, ".separator ' '"
     print >>cmdfile, ".import '" + datafile.name + "' data"
     print >>cmdfile, sql + ";"
     cmdfile.flush()
     #runCmd("cat %s" % cmdfile.name)
     runCmd("cat %s | sqlite3" % cmdfile.name)

 if __name__ == "__main__":
     sys.exit(main())
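
For concreteness, here is what a run might look like on a small whitespace-delimited file (the file contents and query here are invented for illustration):

 $ cat foo
 name age
 alice 30
 bob 25
 $ sql.py --file foo --sql "select avg(age) from data"
 27.5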
+3

David Malcolm wrote a small tool called "squeal" (formerly "show") that lets you use SQL-like syntax on the command line to parse text files of various formats, including CSV.

Example on the squeal homepage:

$ squeal "count(*)", source from /var/log/messages* group by source order by "count(*)" desc
count(*)|source              |
--------+--------------------+
1633    |kernel              |
1324    |NetworkManager      |
98      |ntpd                |
70      |avahi-daemon        |
63      |dhclient            |
48      |setroubleshoot      |
39      |dnsmasq             |
29      |nm-system-settings  |
27      |bluetoothd          |
14      |/usr/sbin/gpm       |
13      |acpid               |
10      |init                |
9       |pcscd               |
9       |pulseaudio          |
6       |gnome-keyring-ask   |
6       |gnome-keyring-daemon|
6       |gnome-session       |
6       |rsyslogd            |
5       |rpc.statd           |
4       |vpnc                |
3       |gdm-session-worker  |
2       |auditd              |
2       |console-kit-daemon  |
2       |libvirtd            |
2       |rpcbind             |
1       |nm-dispatcher.action|
1       |restorecond         |
+5

Perhaps write a script that creates a SQLite database (possibly in-memory), imports your data from a file or stdin (in whatever format your data is in), runs your query, and then exits?

Depending on the amount of data, performance may be acceptable.
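
A minimal sketch of that idea in Python, using only the standard library (the delimiter, the table name "data", and the assumption that the first line holds column names are all illustrative choices, not a fixed interface):

 #!/usr/bin/env python3
 """Load a delimited text file into an in-memory SQLite database and query it."""

 import csv
 import sqlite3
 import sys

 def query_file(path, sql, delimiter=","):
     with open(path, newline="") as f:
         reader = csv.reader(f, delimiter=delimiter)
         headers = next(reader)  # assume the first line lists column names
         conn = sqlite3.connect(":memory:")
         cols = ", ".join('"%s"' % h for h in headers)
         conn.execute("CREATE TABLE data (%s)" % cols)
         marks = ", ".join("?" * len(headers))
         # Feed the remaining rows straight into the table; each row
         # is expected to have the same number of fields as the header.
         conn.executemany("INSERT INTO data VALUES (%s)" % marks, reader)
         return conn.execute(sql).fetchall()

 if __name__ == "__main__":
     # e.g.  ./sqlq.py data.csv "select count(*) from data"
     for row in query_file(sys.argv[1], sys.argv[2]):
         print(row)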

+3

MySQL has a CSV storage engine that can do what you need, if your files are CSV files.

Otherwise, you can use mysqlimport to import text files into MySQL. You could create a wrapper around mysqlimport that figures out the columns and creates the necessary table, as in the sketch below.
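
A hypothetical sketch of such a wrapper (assuming tab-delimited files with a header line, a local MySQL server, and the standard mysql/mysqlimport client tools; the database name, user, and TEXT column type are illustrative):

 #!/usr/bin/env python3
 """Create a MySQL table from a tab-delimited file's header, then bulk-load it."""

 import os
 import subprocess
 import sys

 def load_into_mysql(db, path, user="root"):
     # mysqlimport derives the table name from the file name
     table = os.path.splitext(os.path.basename(path))[0]
     with open(path) as f:
         headers = f.readline().rstrip("\n").split("\t")
     cols = ", ".join("`%s` TEXT" % h for h in headers)
     ddl = "CREATE TABLE IF NOT EXISTS `%s` (%s);" % (table, cols)
     subprocess.run(["mysql", "-u", user, db, "-e", ddl], check=True)
     subprocess.run(["mysqlimport", "-u", user, "--local",
                     "--ignore-lines=1",           # skip the header row
                     "--fields-terminated-by=\t",
                     db, path], check=True)

 if __name__ == "__main__":
     load_into_mysql(sys.argv[1], sys.argv[2])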

You could also use DBD::AnyData, a Perl module that lets you access text files as if they were a database.

That said, this sounds very much like a case where you should really be looking at using a proper database. Is it really easier to store tabular data in text files?

+2

q - Run SQL directly on CSV or TSV files:

https://github.com/harelba/q
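
From memory of q's documented flags (-H to treat the first line as a header, -d to set the delimiter; the file and column names here are invented), a typical invocation looks like:

 $ q -H -d "," "select source, count(*) from ./messages.csv group by source"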

+2

I have used Microsoft LogParser to query csv files several times... and it serves the purpose. It was surprising to see such a useful tool from M$ that is also free!
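
From memory, and purely illustrative (the file name is invented), an invocation looks roughly like:

 C:\> LogParser -i:CSV "SELECT COUNT(*) FROM input.csv"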

0
