Numpy.loadtxt: how to ignore comma delimiters that appear inside quotes?

I have a csv file where a data line might look like this:

10, "Apple, Banana", 20, ...

When I load data in Python, an extra comma inside quotes changes all my column indexes, so my data is no longer a consistent structure. Although I could write a complex algorithm that iterates through each line and fixes the problem, I was hoping there was an elegant way to just pass an additional loadtxt parameter (or some other function) that properly ignores commas inside quotes and processes the whole quote as one value.

Please note that when loading the CSV file into Excel, Excel correctly treats the quoted text as a single value.

+4
4 answers

One way to do this with a single NumPy call is to use np.fromregex, which lets you use Python regular expression syntax to parse the contents of your text file. For instance:

np.fromregex('tmp.csv', r'(\d+),"(.+)",(\d+)', object)

gives you:

array([['10', 'Apple, Banana', '20'],
       ['30', 'Orange, Watermelon', '40']], dtype=object)

To break this regular expression down a bit, '(\d+)' will match one or more digits, and '"(.+)"' will match one or more characters inside double quotes. np.fromregex tries to match this expression against each line of your .csv file, and the parts inside the parentheses become separate elements in each row of the output array.

If you know the field types in your .csv file, you can specify dtypes to get a structured array:

np.fromregex('tmp.csv', r'(\d+),"(.+)",(\d+)', 'i8, S20, i8')

which gives:

array([(10, 'Apple, Banana', 20), (30, 'Orange, Watermelon', 40)], 
      dtype=[('f0', '<i8'), ('f1', 'S20'), ('f2', '<i8')])
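
To see what each capture group in the pattern above picks up, here is a minimal sketch that applies the same regular expression to a single line with the standard re module. The sample line is assumed from the output shown earlier; nothing here touches NumPy.

```python
import re

# Same pattern as passed to np.fromregex; the sample line is an assumption.
pattern = re.compile(r'(\d+),"(.+)",(\d+)')
line = '10,"Apple, Banana",20'

match = pattern.match(line)
groups = match.groups()  # one string per parenthesized group
print(groups)
```

Because .+ is greedy, a line with two quoted fields would be swallowed into a single group; "([^"]*)" is a stricter choice if that can happen in your data.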
+2

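loadtxt (and genfromtxt) do not handle quoted text by default; they split on every delimiter. The Python csv module is quote-aware. So is the pandas reader.

So preprocessing the lines before passing them to loadtxt is a reasonable approach. All it needs is an iterable that yields one line at a time: a file, a list of lines, or a generator.

A simple preprocessor can replace the commas inside quotes with another character, or replace the ones outside quotes with a delimiter of your choice. It does not need to be fancy.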
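
For readers on a recent NumPy: as far as I know, loadtxt accepts a quotechar parameter from version 1.23 onward, which covers the case in the question without any preprocessing. A minimal sketch, assuming NumPy 1.23 or later; the field names and the sample text are made up for illustration:

```python
import io
import numpy as np

# Assumed sample text, matching the layout in the question.
text = '10,"Apple, Banana",20\n30,"Orange, Watermelon",40\n'

# Hypothetical field names; U20 holds up to 20 characters.
dtype = np.dtype([('a', 'i8'), ('name', 'U20'), ('b', 'i8')])

# quotechar needs NumPy >= 1.23; older versions reject the keyword.
data = np.loadtxt(io.StringIO(text), delimiter=',', quotechar='"', dtype=dtype)
print(data['name'])
```

On older versions the preprocessing approaches in this thread still apply.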

Here is an example that uses numpy.genfromtxt on csv text with commas inside quotes:

import numpy as np

txt = """10,"Apple, Banana",20
30,"Pear, Orange",40
50,"Peach, Mango",60
"""

def foo(astr):
    # replace , outside quotes with ;
    # a bit crude and specialized: each unquoted piece must be a single field
    x = astr.split('"')
    return ';'.join([i.strip(',') for i in x])

txt1 = [foo(astr) for astr in txt.splitlines()]
# ['10;Apple, Banana;20', '30;Pear, Orange;40', '50;Peach, Mango;60']
txtgen = (foo(astr) for astr in txt.splitlines())  # or as generator
np.genfromtxt(txtgen, delimiter=';', dtype=None)

produces:

array([(10, 'Apple, Banana', 20), (30, 'Pear, Orange', 40),
       (50, 'Peach, Mango', 60)], 
      dtype=[('f0', '<i4'), ('f1', 'S13'), ('f2', '<i4')])

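np.fromregex is another option, and it is much simpler than genfromtxt. To use it with the sample txt, wrap the string in a file-like buffer: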

from io import StringIO

s = StringIO(txt)
np.fromregex(s, r'(\d+),"(.+)",(\d+)', dtype='i4,S20,i4')

Its core operation is:

import re

pat = re.compile(r'(\d+),"(.+)",(\d+)')
dt = np.dtype('i4,S20,i4')
np.array(pat.findall(txt), dtype=dt)

It reads the whole file (f.read()) and runs findall on it, which produces a list like:

[('10', 'Apple, Banana', '20'),
 ('30', 'Pear, Orange', '40'),
 ('50', 'Peach, Mango', '60')]

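A list of tuples is exactly what a structured array requires.

There is no fancy processing or error checking, just a pattern match followed by array construction.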


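Both my foo function and the fromregex pattern are specific to this data. csv.reader can serve as a more general preprocessor. I use join to put the lists that reader produces back together with a different delimiter, so that genfromtxt can split them again (with its own "split").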

from csv import reader
from io import StringIO

s = StringIO(txt)
np.genfromtxt((';'.join(x) for x in reader(s)), delimiter=';', dtype=None)

array([(10, 'Apple, Banana', 20), (30, 'Pear, Orange', 40),
       (50, 'Peach, Mango', 60)], 
      dtype=[('f0', '<i4'), ('f1', 'S13'), ('f2', '<i4')])

Or, as with fromregex, pass the reader output directly to np.array:

s.seek(0)  # rewind the buffer before reading it again
np.array([tuple(x) for x in reader(s)], dtype='i4,S20,i4')
+1

Here is a simple preprocessor that replaces the commas inside quotes before the lines reach loadtxt.

import numpy

def transformCommas(line):
    # Replace every comma that falls inside double quotes with '.'
    out = ''
    insideQuote = False
    for c in line:
        if c == '"':
            insideQuote = not insideQuote
        if insideQuote and c == ',':
            out += '.'
        else:
            out += c
    return out

# Open in text mode so that each line is a str, not bytes
with open("data/raw_data_all.csv", "r") as f:
    replaced = (transformCommas(line) for line in f)
    rawData = numpy.loadtxt(replaced, delimiter=',', skiprows=0, dtype=str)

Sample data:

1366x768,18,"5,237",73.38%,"3,843",79.55%,1.75,00:01:26,4.09%,214,$0.00
1366x768,22,"5,088",76.04%,"3,869",78.46%,1.82,00:01:20,3.93%,200,$0.00
1366x768,17,"4,887",74.34%,"3,633",78.37%,1.81,00:01:19,3.25%,159,$0.00
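
One thing to watch with this sample data: swapping the quoted comma for '.' turns "5,237" into "5.237", which reads like a decimal. A hedged alternative, assuming the quoted fields are integers with thousands separators, is to let the csv module split the line and then drop the commas. The helper name strip_thousands is hypothetical, and the sample line is copied from the data above.

```python
import csv
import io

sample = '1366x768,18,"5,237",73.38%,"3,843",79.55%,1.75,00:01:26,4.09%,214,$0.00\n'

def strip_thousands(field):
    # Hypothetical helper: remove commas only when what remains is all digits,
    # so fields such as '00:01:26' or '$0.00' pass through unchanged.
    candidate = field.replace(',', '')
    if ',' in field and candidate.isdigit():
        return candidate
    return field

rows = [[strip_thousands(f) for f in row]
        for row in csv.reader(io.StringIO(sample))]
print(rows[0])
```

The cleaned rows can then be converted with np.array, or rejoined with another delimiter and handed to loadtxt.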
+1

Use the csv module from the Python standard library: https://docs.python.org/2/library/csv.html

Sample csv:

10,"Apple,Banana",20
20,"Orange,Watermelon",30

Script:

from csv import reader

with open('data.csv', newline='') as f:
    for row in reader(f):
        print(row)

Output:

['10', 'Apple,Banana', '20']
['20', 'Orange,Watermelon', '30']

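loadtxt accepts any iterable of lines, but reader(f) yields lists of fields rather than lines. Rejoin each row with a delimiter that does not occur in the data and pass that instead:

with open('data.csv', newline='') as f:
    lines = (';'.join(row) for row in reader(f))
    data = loadtxt(lines, ...)  # include delimiter=';' among the arguments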
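
Rows from csv.reader are lists of fields, while loadtxt and genfromtxt expect lines of text, so the rows have to be joined back into strings first. Below is a minimal sketch of such a step using only the standard library. The function name rejoin_rows and the ';' default are assumptions for illustration, and the check guards against the new delimiter already occurring in a field.

```python
import csv
import io

def rejoin_rows(lines, out_delim=';'):
    """Yield one string per csv row, with fields joined by out_delim."""
    for row in csv.reader(lines):
        if any(out_delim in field for field in row):
            raise ValueError('output delimiter %r occurs in the data' % out_delim)
        yield out_delim.join(row)

# Assumed sample, matching the csv shown in this answer.
sample = io.StringIO('10,"Apple,Banana",20\n20,"Orange,Watermelon",30\n')
lines = list(rejoin_rows(sample))
print(lines)
```

The generator itself (without the list call) can be passed to loadtxt or genfromtxt together with delimiter=';'.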
0