Combining multiple CSV files without repeating headers (using Python)

I am a beginner with Python. I have several CSV files (over 10) and they all have the same number of columns. I would like to combine all of them into one CSV file, where I won’t have the headers repeated.

Essentially, I need the header line only once, at the top, followed by all the data lines from all the CSV files. How can I do it?

Here is what I have tried so far.

    import glob
    import csv

    with open('output.csv','wb') as fout:
        wout = csv.writer(fout,delimiter=',')
        interesting_files = glob.glob("*.csv")
        for filename in interesting_files:
            print 'Processing',filename
            # Open and process file
            h = True
            with open(filename,'rb') as fin:
                fin.next()#skip header
            for line in csv.reader(fin,delimiter=','):
                wout.writerow(line)
python csv
5 answers

Although I believe the @valentin method is the best answer, you can do this without using the csv module:

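    import glob

    output_file = 'output.csv'
    # Skip the output file itself in case it is left over from an earlier run
    interesting_files = [f for f in glob.glob("*.csv") if f != output_file]
    header_saved = False
    # Python 3: newline='' leaves line endings untouched (use 'wb' / 'rb' in Python 2)
    with open(output_file, 'w', newline='') as fout:
        for filename in interesting_files:
            with open(filename, newline='') as fin:
                header = next(fin)  # the first line of each file is its header
                if not header_saved:
                    fout.write(header)  # write the header only once
                    header_saved = True
                for line in fin:
                    fout.write(line)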

If you are on a Linux system:

    # write the header of one file to the final file
    head -n 1 directory/one_file.csv > output.csv
    # append every file's content starting with the second line;
    # -q suppresses the "==> file <==" banners that tail prints for multiple files
    tail -q -n +2 directory/*.csv >> output.csv

If you don't mind the overhead, you can use pandas, which ships with common Python distributions. If you plan to do more work with tabular data, I recommend using pandas rather than writing your own helpers.

    import glob

    import pandas as pd

    interesting_files = glob.glob("*.csv")
    df_list = []
    for filename in sorted(interesting_files):
        df_list.append(pd.read_csv(filename))
    full_df = pd.concat(df_list)
    # index=False keeps the row index out of the output file
    full_df.to_csv('output.csv', index=False)

A little more on pandas. Since it is designed to work with tabular data, it knows that the first row is the header. When reading a CSV, it separates the data from the header, which is stored as the column labels of the DataFrame, the standard data type in pandas. When you concatenate several DataFrames, pandas aligns them by column label rather than by position. Be aware that pd.concat does not raise an error by default when the headers differ: it keeps the union of all columns and fills the gaps with NaN. If your directory might be contaminated with CSV files from another source, compare each file's columns with those of the first file before concatenating.
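If stray CSV files are a concern, a header check can also be sketched with the standard library alone. In the sketch below, `merge_csv_strict` is a hypothetical helper name, not part of any library. It assumes every file uses the default comma-separated dialect, and it raises `ValueError` as soon as a file's header differs from the first file's.

```python
import csv


def merge_csv_strict(paths, output_path):
    """Merge CSV files, refusing any file whose header differs from the first.

    Hypothetical helper for illustration; assumes the default csv dialect.
    """
    expected_header = None
    with open(output_path, "w", newline="") as fout:
        writer = csv.writer(fout)
        for path in paths:
            with open(path, newline="") as fin:
                reader = csv.reader(fin)
                header = next(reader, None)
                if header is None:
                    continue  # empty file: nothing to merge
                if expected_header is None:
                    # The first non-empty file defines the header, written once
                    expected_header = header
                    writer.writerow(header)
                elif header != expected_header:
                    raise ValueError(
                        "{} has header {!r}, expected {!r}".format(
                            path, header, expected_header))
                writer.writerows(reader)
```

Calling it with a sorted file list, for example `merge_csv_strict(sorted(paths), 'output.csv')`, keeps the merge order predictable. Stopping at the first mismatch is a design choice; skipping the offending file instead would be just as easy.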

Another thing: I added sorted() around interesting_files. I assume that your files are named in order and that this order should be kept. glob.glob() returns its results in arbitrary order, as do the os directory-listing functions, so do not rely on them being sorted by name.
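One caveat with sorted(): it compares names character by character, so a name like file10.csv sorts before file2.csv. If your files are numbered without zero padding, a numeric-aware sort key may help. The sketch below is one possible approach; `natural_key` and the file names are made up for illustration.

```python
import re


def natural_key(name):
    """Split a name into text and number chunks so 'file2' sorts before 'file10'."""
    # re.split with a capturing group alternates text chunks (even positions)
    # and digit runs (odd positions)
    chunks = re.split(r"(\d+)", name)
    return [int(chunk) if i % 2 else chunk.lower()
            for i, chunk in enumerate(chunks)]


names = ["file10.csv", "file2.csv", "file1.csv"]
print(sorted(names))                   # ['file1.csv', 'file10.csv', 'file2.csv']
print(sorted(names, key=natural_key))  # ['file1.csv', 'file2.csv', 'file10.csv']
```

With zero-padded names such as file01.csv, plain sorted() already gives the expected order.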


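Your indentation is wrong: the loop over the rows must be inside the with block, while the input file is still open. You can also pass the csv.reader object directly to writer.writerows.

    import csv
    import glob

    output_file = 'output.csv'
    interesting_files = [f for f in glob.glob("*.csv") if f != output_file]
    header_written = False
    with open(output_file, 'w', newline='') as fout:  # 'wb' in Python 2
        wout = csv.writer(fout)
        for filename in interesting_files:
            print('Processing', filename)
            with open(filename, newline='') as fin:  # 'rb' in Python 2
                reader = csv.reader(fin)
                header = next(reader)  # skip header
                if not header_written:
                    wout.writerow(header)  # keep the header of the first file only
                    header_written = True
                wout.writerows(reader)  # still inside the with block, so fin is open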

Your attempt almost works, but the problems are:

  • You open the file for reading, but close it before writing its lines.
  • You never write the header; it has to be written exactly once.
  • You should also exclude output.csv from the glob results, otherwise the output is also part of the input!

Here's the fixed code, which passes the csv.reader object directly to the writer's writerows method for shorter and faster code. It also writes the header from the first file to the output file.

    import glob
    import csv

    output_file = 'output.csv'
    header_written = False

    with open(output_file, 'w', newline="") as fout:  # just "wb" in python 2
        wout = csv.writer(fout, delimiter=',')
        # filter out output
        interesting_files = [x for x in glob.glob("*.csv") if x != output_file]
        for filename in interesting_files:
            print('Processing {}'.format(filename))
            with open(filename, newline="") as fin:
                cr = csv.reader(fin, delimiter=",")
                header = next(cr)  # skip header
                if not header_written:
                    wout.writerow(header)
                    header_written = True
                wout.writerows(cr)

Please note that solutions using raw line-by-line processing miss an important point: if the header spans several lines (a quoted field containing a line break), they fail badly. Only the first physical line is skipped, so the rest of the header is written again for every file, which corrupts the output.

The csv module (and pandas too) handles these cases gracefully.
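The difference can be seen on a small in-memory example. The sample data below is made up for illustration: its header has a quoted field that contains a line break.

```python
import csv
import io

# A header whose second field contains a quoted line break
data = 'id,"full\nname"\n1,Alice\n'

# Raw line handling: next() consumes only the first physical line
raw = io.StringIO(data, newline="")
raw_header = next(raw)
raw_rest = list(raw)

# csv.reader: next() consumes the whole logical header row
reader = csv.reader(io.StringIO(data, newline=""))
csv_header = next(reader)
csv_rest = list(reader)

print(repr(raw_header))  # 'id,"full\n'
print(raw_rest)          # ['name"\n', '1,Alice\n']
print(csv_header)        # ['id', 'full\nname']
print(csv_rest)          # [['1', 'Alice']]
```

With raw lines, the leftover `name"` fragment would be copied into the output as if it were data; with csv.reader, the data rows start exactly where the header ends.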
