How to convert a tab-delimited or pipe-delimited text file to CSV format in Python

I have a text file (.txt) that may be either tab-delimited or pipe-delimited, and I need to convert it to CSV format. I am using Python 2.6. Can someone suggest how to detect the delimiter in a text file, read the data, and then convert it to a comma-separated file?

Thank you in advance

5 answers

I'm afraid you cannot identify the delimiter without knowing what it is. The problem with CSV is that, quoting ESR:

The Microsoft version of CSV is a textbook example of how not to design a textual file format.

The delimiter must be escaped somehow if it can appear inside the fields. Without knowing how the escaping is done, it is difficult to identify the delimiter automatically. Escaping can be done the UNIX way, with a backslash '\', or the Microsoft way, with quotation marks, which must then be escaped themselves. This is not a trivial task.
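
To make the two escaping styles concrete, here is a minimal sketch using the standard `csv` module. It is written for Python 3 (the question targets Python 2.6, where the file handling would need adapting), and the sample row is made up for illustration. It writes the same row once with quoting and once with backslash escaping, then shows that a naive `split` miscounts the fields while a reader configured for the right style recovers them.

```python
import csv
import io

# Made-up row whose first field contains the delimiter itself.
row = ["a|b", "c", "d"]

# Quote style: a field containing the delimiter is wrapped in quotes.
quoted = io.StringIO()
csv.writer(quoted, delimiter="|", lineterminator="\n").writerow(row)
quoted_line = quoted.getvalue()

# Backslash style: no quoting, the delimiter is prefixed with a backslash.
escaped = io.StringIO()
csv.writer(escaped, delimiter="|", quoting=csv.QUOTE_NONE,
           escapechar="\\", lineterminator="\n").writerow(row)
escaped_line = escaped.getvalue()

# A naive split ignores the escaping and produces too many fields.
naive_fields = quoted_line.rstrip("\n").split("|")

# A reader that knows the escaping style recovers the original row.
quoted_back = next(csv.reader(io.StringIO(quoted_line), delimiter="|"))
escaped_back = next(csv.reader(io.StringIO(escaped_line), delimiter="|",
                               quoting=csv.QUOTE_NONE, escapechar="\\"))
```

The point is that the reader settings must match the style the producer used, which is why the file's documentation matters.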

So my suggestion is to get the full documentation from whoever generates the file you want to convert. Then you can use one of the approaches suggested in the other answers, or some variant of them.

Edit:

Python provides csv.Sniffer to help you determine the format of your DSV. If your input looks like this (note the quoted separator in the first field of the second line):

    a|b|c
    "a|b"|c|d
    foo|"bar|baz"|qux

You can do this:

    import csv

    csvfile = open("csvfile.csv")
    dialect = csv.Sniffer().sniff(csvfile.read(1024))
    csvfile.seek(0)
    reader = csv.DictReader(csvfile, dialect=dialect)
    for row in reader:
        print row,
    # => {'a': 'a|b', 'c': 'd', 'b': 'c'} {'a': 'foo', 'c': 'qux', 'b': 'bar|baz'}
    # write records using other dialect
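
The final "write records using other dialect" step could look roughly like the sketch below. It is written for Python 3 and uses in-memory `io.StringIO` objects in place of real files; the function name `convert_to_csv` and its `candidates` parameter are made up for illustration. Restricting the sniffer to tab and pipe via its `delimiters` argument narrows the guess to the two formats in the question. Note that `sniff` raises `csv.Error` when it cannot decide.

```python
import csv
import io

def convert_to_csv(source, destination, candidates="\t|"):
    """Sniff the delimiter of `source` and rewrite its rows as CSV.

    `source` and `destination` are open text-mode file objects.
    Returns the delimiter that was detected.
    """
    sample = source.read(1024)
    source.seek(0)
    # Only consider the candidate delimiters (tab and pipe by default).
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    writer = csv.writer(destination, lineterminator="\n")
    for row in csv.reader(source, dialect):
        writer.writerow(row)
    return dialect.delimiter

# Usage with the sample input from this answer.
source = io.StringIO('a|b|c\n"a|b"|c|d\nfoo|"bar|baz"|qux\n')
destination = io.StringIO()
delimiter = convert_to_csv(source, destination)
```

With real files, open them with `newline=""` as the `csv` documentation recommends for Python 3.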
+6
source

Your strategy may be as follows:

  • parse the file with BOTH a tab-delimited csv reader and a pipe-delimited csv reader
  • compute some statistics on the resulting rows to decide which result is the one you want to write. One idea is to count the total number of fields in the two sets of records (assuming tabs and pipes are not common inside the data, the correct delimiter yields more fields). Another, if your data is highly structured and you expect the same number of fields in every row, is to measure the standard deviation of the number of fields per row and accept the set of records with the smallest standard deviation.

The following example uses the simpler statistic (total number of fields):

    import csv

    piperows = []
    tabrows = []

    # parse with "|" as the delimiter
    f = open("file", "rb")
    readerpipe = csv.reader(f, delimiter="|")
    for row in readerpipe:
        piperows.append(row)
    f.close()

    # parse with TAB as the delimiter
    f = open("file", "rb")
    readertab = csv.reader(f, delimiter="\t")
    for row in readertab:
        tabrows.append(row)
    f.close()

    # In this example the total number of fields is the indicator
    # (not guaranteed to work; it depends on the nature of your data).
    totfieldspipe = sum(len(row) for row in piperows)
    totfieldstab = sum(len(row) for row in tabrows)

    if totfieldspipe > totfieldstab:
        yourrows = piperows
    else:
        yourrows = tabrows

    # yourrows now holds the parsed rows; write them out in any format you like
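
The standard-deviation variant mentioned above could be sketched as follows. This is Python 3 and only an illustration under stated assumptions: the names `field_counts` and `pick_delimiter` are made up, and the rule that skips a delimiter which never splits any row is an addition of this sketch. Without that rule, the wrong delimiter would yield exactly one field per row, a standard deviation of zero, and would win.

```python
import csv
import io
import statistics

def field_counts(text, delimiter):
    """Number of fields in each non-empty row when parsed with `delimiter`."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [len(row) for row in reader if row]

def pick_delimiter(text, candidates=("\t", "|")):
    """Pick the candidate whose rows have the most uniform field count.

    Returns None if no candidate splits any row.
    """
    best = None
    for delimiter in candidates:
        counts = field_counts(text, delimiter)
        if not counts or max(counts) < 2:
            continue  # this delimiter never split a row, so ignore it
        # Lower spread wins; ties go to the delimiter yielding more fields.
        score = (statistics.pstdev(counts), -sum(counts))
        if best is None or score < best[0]:
            best = (score, delimiter)
    return best[1] if best else None

# Usage on two small made-up samples.
tab_choice = pick_delimiter("a\tb\tc\n1\t2\t3\n")
pipe_choice = pick_delimiter("a|b\tc|d\n1|2|3\n")
```

As with the total-field count, this is a heuristic: it only helps when the rows really are expected to have the same number of fields.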
+1
source

Like this:

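    from __future__ import with_statement
    import csv
    import re

    # input_path and output_path are placeholders for your file names
    with open(input_path, "r") as source:
        with open(output_path, "wb") as destination:
            writer = csv.writer(destination)
            for line in source:
                # drop the line ending, then split on either a tab or a pipe
                writer.writerow(re.split('[\t|]', line.rstrip("\r\n")))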
0
source
    for line in open("file"):
        line = line.strip()
        if "|" in line:
            print ','.join(line.split("|"))
        else:
            print ','.join(line.split("\t"))
0
source

I would suggest taking some of the sample code from the existing answers, or perhaps better, using Python's csv module, and modifying it to parse the file first as tab-delimited and then as pipe-delimited, producing two comma-separated output files. You can then inspect both files visually and keep the one you want.

If you actually have many files, you need a way to determine which kind each file is. One of the examples contains the following:

 if "|" in line: 

This may be enough: if the first line of the file contains a pipe, then the entire file is probably pipe-delimited; otherwise, assume the file is tab-delimited.
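
A minimal sketch of that first-line check, in Python 3 and with a made-up function name (`guess_delimiter`), might look like this. It reads only the first line of an open text file and then rewinds, so the caller can parse the file from the start. It assumes a pipe never appears inside the data of a tab-delimited file.

```python
import io

def guess_delimiter(handle):
    """Return "|" if the first line contains a pipe, otherwise a tab.

    `handle` is an open, seekable text-mode file object; its position
    is restored before returning.
    """
    position = handle.tell()
    first_line = handle.readline()
    handle.seek(position)
    return "|" if "|" in first_line else "\t"

# Usage with in-memory stand-ins for two made-up files.
pipe_file = io.StringIO("id|name\n1|Ann\n")
tab_file = io.StringIO("id\tname\n1\tAnn\n")
pipe_guess = guess_delimiter(pipe_file)
tab_guess = guess_delimiter(tab_file)
```

The returned value can be passed straight to `csv.reader(handle, delimiter=...)`.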

Alternatively, change the files so that the first row contains a key field that is easy to identify, or, if the first row holds column headers, detect those.

0
source
