Extract columns from multiple text files using Python

I have a folder with 5 text files in it related to different sites -

The file names are formatted as follows:

Rockspring_18_SW.417712.WRFc36.ET.2000-2050.txt
Rockspring_18_SW.417712.WRFc36.RAIN.2000-2050.txt
WICA.399347.WRFc36.ET.2000-2050.txt
WICA.399347.WRFc36.RAIN.2000-2050.txt

So, basically, each file name follows the pattern (site name).(site number).WRFc36.(some variable).2000-2050.txt

Each of these text files has the same layout, with no header row: Year Month Day Value (~18,500 lines in each text file).

I want Python to find the files that share a site name and site number, take the first three columns of data (year, month, day) from one of those files, and then append the fourth (value) column from each of the site's variable files (rain, ET, etc.) in a specific order into a single new text file.

I know how to read the data with the csv module (defining a custom dialect for the space delimiter) from ALL the files and print it to a new text file, but I'm not sure how to automate creating a new file for each site name / site number, or how to make sure my variables end up in the correct order.
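For reference, the "custom dialect for the space delimiter" idea looks roughly like this; a minimal sketch (the sample data is made up, and I'm using an in-memory file instead of a real one):

```python
import csv
import io

# Register a dialect for space-delimited files; skipinitialspace
# collapses runs of spaces after each delimiter.
csv.register_dialect("space", delimiter=" ", skipinitialspace=True)

# Simulate one of the site files (Year Month Day Value).
sample = io.StringIO("2000 1 1 0.53\n2000 1 2 0.47\n")
rows = list(csv.reader(sample, dialect="space"))
print(rows[0])  # ['2000', '1', '1', '0.53']
```

With real files, the `io.StringIO` object would be replaced by `open(filename)`.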

The output I want is one text file per site (not 5) with the following format: year, month, day, variable1, variable2, variable3, variable4, variable5, for ~18,500 lines.

I'm sure I'm overlooking something very simple here... it's probably rather rudimentary... but any help would be greatly appreciated!

Update

========

I updated the code to reflect the comments below.
http://codepad.org/3mQEM75e

    from collections import defaultdict
    import glob
    import csv

    # Create a dictionary of lists -- [A] = [Afilename1, Afilename2, Afilename3...]
    #                                 [B] = [Bfilename1, Bfilename2, Bfilename3...]
    def get_site_files():
        sites = defaultdict(list)
        # To start, I have a bunch of files in this format:
        # "site name (unique)"."site num (unique)"."WRFc36"."Variable (5 per site)"."2000-2050"
        for fname in glob.glob("*.txt"):
            # split the name at every instance of "."
            parts = fname.split(".")
            # make sure I only use the proper files: 6 parts to the name, with WRFc36 as the 3rd part
            if len(parts) == 6 and parts[2] == 'WRFc36':
                # use the full unique identifier, the first and second "parts"
                sites[parts[0] + "." + parts[1]].append(fname)
        return sites

    # hardcode the variables for method 2, below
    Var = ["TAVE", "RAIN", "SMOIS_INST", "ET", "SFROFF"]

    def main():
        for site_name, files in get_site_files().iteritems():
            print "Working on *****" + site_name + "*****"
            #### Method 1 -- I'd like to not hardcode my variables (as in method 2),
            #### so I can use this script in other applications.
            for filename in files:
                reader = csv.reader(open(filename, "rb"))
                WriteFile = csv.writer(open("XX_" + site_name + "_combined.txt", "wb"))
                for row in reader:
                    row = reader.next()
            #### Method 2 works (mostly), but skips a LOT of random lines of the first
            #### file, and doesn't use the functionality built into my dictionary of lists...
    ##        reader0 = csv.reader(open(site_name+".WRFc36."+Var[0]+".2000-2050.txt", "rb"))  # I'd like to copy ALL columns from the first file
    ##        reader1 = csv.reader(open(site_name+".WRFc36."+Var[1]+".2000-2050.txt", "rb"))  # and just the fourth column from all the rest
    ##        reader2 = csv.reader(open(site_name+".WRFc36."+Var[2]+".2000-2050.txt", "rb"))  # (columns 1-3 are the same in all files)
    ##        reader3 = csv.reader(open(site_name+".WRFc36."+Var[3]+".2000-2050.txt", "rb"))
    ##        reader4 = csv.reader(open(site_name+".WRFc36."+Var[4]+".2000-2050.txt", "rb"))
    ##        WriteFile = csv.writer(open("XX_"+site_name+"_COMBINED.txt", "wb"))  # writer for the new combined text file
    ##
    ##        for row in reader0:
    ##            row = reader0.next()
    ##            row1 = reader1.next()
    ##            row2 = reader2.next()
    ##            row3 = reader3.next()
    ##            row4 = reader4.next()
    ##            WriteFile.writerow(row + row1 + row2 + row3 + row4)
        print "***finished with site***"

    if __name__ == "__main__":
        main()
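As an aside, the skipped lines in Method 2 come from mixing two iteration styles: `for row in reader0` already advances the reader, and the extra `reader0.next()` inside the loop advances it again, so every other row is thrown away. A minimal illustration with a plain iterator (Python 3 shown, where `next()` is a builtin):

```python
rows = iter(["row1", "row2", "row3", "row4"])

kept = []
for row in rows:      # the for loop advances the iterator once...
    row = next(rows)  # ...and this advances it a second time,
    kept.append(row)  # so only every second row is kept

print(kept)  # ['row2', 'row4']
```

Dropping the extra `next()` call and just using the `row` the loop provides avoids the skipping.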
2 answers

Here's an easier way to iterate over your files grouped by site.

    from collections import defaultdict
    import glob

    def get_site_files():
        sites = defaultdict(list)
        for fname in glob.glob('*.txt'):
            parts = fname.split('.')
            if len(parts) == 6 and parts[2] == 'WRFc36':
                sites[parts[0]].append(fname)
        return sites

    def main():
        for site, files in get_site_files().iteritems():
            # you need to better explain what you are trying to do here!
            print site, files

    if __name__ == "__main__":
        main()

I still don't understand your column cutting and pasting; you need to explain more clearly what you are trying to accomplish.
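If each site really does have one file per variable and the rows line up one-to-one by date, the merge described in the question can be sketched like this (Python 3 shown; the `VARS` list comes from the question, while `combine_site` and its `read_rows` callback are my own assumption to keep the sketch testable):

```python
# The five variables, in the output order the question asks for.
VARS = ["TAVE", "RAIN", "SMOIS_INST", "ET", "SFROFF"]

def combine_site(read_rows):
    """Merge one site's five per-variable files into combined rows.

    read_rows(variable) must return that variable's file as a list of
    rows like ['2000', '1', '1', '0.53'] (Year, Month, Day, Value).
    """
    per_var = [read_rows(v) for v in VARS]
    combined = []
    # zip walks all five files in step, row by row, and stops at the
    # shortest one, so the dates stay aligned across variables.
    for rows in zip(*per_var):
        year, month, day, _ = rows[0]
        values = [r[3] for r in rows]  # the 4th (value) column of each file
        combined.append([year, month, day] + values)
    return combined
```

In the real script, `read_rows` would open `site + ".WRFc36." + var + ".2000-2050.txt"` with a whitespace-aware reader, and each site's `combined` rows would be written out with `csv.writer` to one `XX_<site>_COMBINED.txt`.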


Regarding getting file names, I would use the following:

    import os

    # Get a list of all file names that end in .txt
    # On *nix:
    file_names = os.popen('ls *.txt').read().split('\n')
    # On Windows:
    file_names = os.popen('dir /b *.txt').read().split('\n')

Then, to get the elements separated by periods, use:

    # For some file_name in file_names
    file_name.split('.')

Then you can do your comparisons and extract the columns you need (using open(file_name, 'r') or your CSV parser).
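One caveat: shelling out to ls / dir is platform-specific. The glob module does the same listing portably, and the period-split can be wrapped in a small helper; a sketch (the `site_key` helper is my own naming, not from the answer):

```python
def site_key(fname):
    """Return 'site.number' for a WRFc36 file name, else None."""
    parts = fname.split(".")
    if len(parts) == 6 and parts[2] == "WRFc36":
        return parts[0] + "." + parts[1]
    return None

# Portable equivalent of `ls *.txt` / `dir /b *.txt`:
#   import glob
#   file_names = glob.glob("*.txt")
print(site_key("WICA.399347.WRFc36.ET.2000-2050.txt"))  # WICA.399347
```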

Michael G.

