How to split a large CSV data file into separate data files?

I have a CSV file whose first line contains the names of the variables, and the remaining lines contain the data. What is a good way to split it, in Python, into files each containing only one variable? Will such a solution be reliable, for instance if the input file is 100 GB in size? I am trying to follow a divide-and-conquer strategy, but I am new to Python. Thanks in advance for your help!

The input files look like

var1,var2,var3
1,2,hello
2,5,yay
...

I want to create 3 files (or as many files as there are variables) var1.csv, var2.csv, var3.csv, so that the files look like File1

var1
1
2
...

File2

var2
2
5
...

File3

var3
hello
yay
5 answers

One way is to keep one output file open per column and stream the input row by row, so the whole file never has to fit in memory (important if it's 100 GB ;-):

import csv

def splitit(inputfilename):
  with open(inputfilename, newline='') as inf:
    inrd = csv.reader(inf)
    names = next(inrd)
    outfiles = [open(n + '.csv', 'w', newline='') for n in names]
    ouwr = [csv.writer(w) for w in outfiles]
    # write each variable name as the header of its own file
    for w, n in zip(ouwr, names):
      w.writerow([n])
    # stream the data rows: one field goes to each output file
    for row in inrd:
      for w, r in zip(ouwr, row):
        w.writerow([r])
    for o in outfiles:
      o.close()

This keeps n output files open at once, one per column, and processes a single row at a time, so memory use stays constant regardless of input size. (But how fast will it be on a 100 GB input?)
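As a quick sanity check of the streaming approach above, here is a self-contained run on a toy input like the one in the question (the file names and temporary directory are illustrative, not from the original answer):

```python
import csv
import os
import tempfile

# Build a small sample input in a scratch directory.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, 'input.csv'), 'w', newline='') as f:
    csv.writer(f).writerows([
        ['var1', 'var2', 'var3'],
        ['1', '2', 'hello'],
        ['2', '5', 'yay'],
    ])

# Same technique: one open writer per column, streaming row by row.
with open(os.path.join(workdir, 'input.csv'), newline='') as inf:
    reader = csv.reader(inf)
    names = next(reader)
    outs = [open(os.path.join(workdir, n + '.csv'), 'w', newline='')
            for n in names]
    writers = [csv.writer(o) for o in outs]
    for w, n in zip(writers, names):
        w.writerow([n])          # header line per file
    for row in reader:
        for w, field in zip(writers, row):
            w.writerow([field])  # one field per file
    for o in outs:
        o.close()
```

After this runs, var1.csv holds the header plus the values 1 and 2, and var3.csv holds "hello" and "yay", matching the layout asked for in the question.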


If you don't have to do it in Python, awk can handle this in one line:

awk -F"," 'NR==1{for(i=1;i<=NF;i++)a[i]=$i}NR>1{for(i=1;i<=NF;i++){print $i>a[i]".txt"}}' file

If your file is 100 GB, then disk I/O will become your bottleneck. Consider using the gzip module to read a pre-compressed input file and to write compressed output, trading some CPU time for much less I/O.


Try the following:

http://ondra.zizka.cz/stranky/programovani/ruzne/querying-transforming-csv-using-sql.texy

crunch input.csv output.csv "SELECT AVG(duration) AS durAvg FROM (SELECT * FROM indata ORDER BY duration LIMIT 2 OFFSET 6)"
