Python pandas: read CSV with multiple tables, each preceded by a preamble

Is there a pythonic way to work out which lines in a CSV file contain headers and values and which lines are garbage, and then get the header/value lines into data frames?

I am relatively new to Python and am using it to read several CSV files exported from a scientific instrument. When working with CSVs so far I have defaulted to pandas, since I already use it for other tasks. However, the CSV data can vary depending on the number of "tests" recorded on each instrument.

The column headers and the data structure are the same across instruments, but each test is preceded by a "preamble" whose content and length can change. So I get files that look something like this (in this example there are two tests, but there could be any number):

blah blah here a test and  
here some information  
you don't care about  
even a little bit  
header1, header2, header3  
1, 2, 3  
4, 5, 6  

oh you have another test  
here some more garbage  
that different than the last one  
this should make  
life interesting  
header1, header2, header3  
7, 8, 9  
10, 11, 12  
13, 14, 15  

If the preamble were a fixed length every time, I would just use the skiprows parameter, but both the preamble length and the number of data rows in each test vary.
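For reference, if there really were only one test with a fixed four-line preamble, something as simple as the snippet below would do; the 4 is purely illustrative, since in reality the length varies.

import pandas as pd

# hypothetical fixed-preamble case: skip a known number of junk lines and
# treat the next line as the header (the 4 is illustrative only)
df = pd.read_csv('my_file.csv', skiprows=4, header=0, skipinitialspace=True)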

My ultimate goal is to combine all the tests and get something like:

header1, header2, header3  
1, 2, 3  
4, 5, 6  
7, 8, 9  
10, 11, 12  
13, 14, 15  

which I can then manipulate with pandas as usual.

I tried the following to find the first line with my expected headers:

import csv
import pandas as pd

with open('my_file.csv', 'rb') as input_file:
    for row_num, row in enumerate(csv.reader(input_file, delimiter=',')):
        # The csv module returns a blank list [] for empty lines, so check
        # len(row) > 0 first to avoid errors when searching for the header text
        if len(row) > 0:
            # There is probably a better way to find it, but I just convert
            # the list to a string and then search for the expected header
            if "['header1', 'header2', 'header3']" in str(row):
                header_row = row_num

    df = pd.read_csv('my_file.csv', skiprows=header_row, header=0)
    print df

This works in the sense that it produces a dataframe, but because header_row is overwritten on every match, only the last test is captured, and I end up with:

   header1  header2  header3
0        7        8        9
1       10       11       12
2       13       14       15

What I actually want, though, is every header/dataset pair, not just the last one.

Any suggestions on how to do this, whether with the csv module, pandas, or something else, would be appreciated.
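For illustration, here is a rough sketch of the kind of thing I am after (not a working solution, just a way to picture it): collect every block that starts at a header line and ends at the next blank line, read each block on its own, and concatenate the results. Only the file name and the header text come from the example above; the rest is assumed.

import pandas as pd
from io import StringIO   # Python 3; on Python 2 use StringIO.StringIO

HEADER = 'header1,header2,header3'

with open('my_file.csv') as f:
    lines = [line.rstrip() for line in f]

frames = []
block = None                      # lines of the block currently being collected
for line in lines:
    if line.replace(' ', '') == HEADER:
        block = [line]            # a header line starts a new block
    elif block is not None:
        if line.strip():
            block.append(line)    # data row
        else:
            # a blank line ends the block; parse it on its own
            frames.append(pd.read_csv(StringIO('\n'.join(block)),
                                      skipinitialspace=True))
            block = None
if block:                         # the last block may not end with a blank line
    frames.append(pd.read_csv(StringIO('\n'.join(block)),
                              skipinitialspace=True))

combined = pd.concat(frames, ignore_index=True)
print(combined)

The per-block pd.read_csv calls keep the actual parsing in pandas, so the loop only has to decide where each block starts and stops.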

2 Answers

I would do this in two steps: parse the file with csv.reader(), then filter out the preamble/garbage rows with a small generator before handing the remaining rows to pandas.

import pandas as pd
import csv


def ignore_comments(fp, start_fn, end_fn, keep_initial):
    # Small state machine: emit nothing until start_fn matches (unless
    # keep_initial is True), then yield rows until end_fn matches, then
    # drop rows until start_fn matches again (without re-yielding the
    # matching header row).
    state = 'keep' if keep_initial else 'start'
    for line in fp:
        if state == 'start' and start_fn(line):
            state = 'keep'
            yield line
        elif state == 'keep':
            if end_fn(line):
                state = 'drop'
            else:
                yield line
        elif state == 'drop':
            if start_fn(line):
                state = 'keep'
if __name__ == "__main__":

    df = open('x.in')
    df = csv.reader(df, skipinitialspace=True)
    # keep the first header row and the data rows of every test; a blank
    # row ends a test and the repeated header rows are not yielded again
    df = ignore_comments(
        df,
        lambda x: x == ['header1', 'header2', 'header3'],
        lambda x: x == [],
        False)

    df = pd.read_csv(df, engine='python')
    print df
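As a usage sketch only (not part of the original answer), the same generator can be exercised on the question's sample text and the frame built straight from the yielded rows. This assumes Python 3's io.StringIO and that ignore_comments from the block above is in scope; everything else is illustrative.

import csv
import io
import pandas as pd

sample = """junk before the first test
header1, header2, header3
1, 2, 3
4, 5, 6

junk before the second test
header1, header2, header3
7, 8, 9
"""

rows = csv.reader(io.StringIO(sample), skipinitialspace=True)
rows = ignore_comments(
    rows,
    lambda r: r == ['header1', 'header2', 'header3'],
    lambda r: r == [],
    False)

columns = next(rows)                       # the first yielded row is the header
df = pd.DataFrame(list(rows), columns=columns)
df = df.apply(pd.to_numeric)               # the values arrive as strings
print(df)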

Here is a more pythonic, pandas-only way (no csv module and no explicit loop over the file):

import pandas as pd
from StringIO import StringIO

#define an example to showcase the solution
st = """blah blah here a test and
here some information  
you don't care about  
even a little bit  
header1, header2, header3  
1, 2, 3  
4, 5, 6  

oh you have another test  
here some more garbage  
that different than the last one  
this should make  
life interesting  
header1, header2, header3  
7, 8, 9  
10, 11, 12  
13, 14, 15""" 

# 1- read the whole file with pd.read_csv
# 2- error_bad_lines=False silently skips any line that has too many fields
# 3- the header is not the first row of the file, so pass header=None and
#    define the column names manually with names=[...]
data = pd.read_csv(StringIO(st), delimiter=",", names=["header1", "header2", "header3"], error_bad_lines=False, header=None)

# lines with fewer than three fields are loaded with NaN padding, e.g.
# blah blah here a test and       NaN         NaN
# so drop those rows
data = data.dropna()

# remove the repeated header rows ("header1", "header2", "header3")
mask = data["header1"].str.contains("header")
data = data[~mask]
print data

The resulting DataFrame:

    header1  header2  header3
 5        1        2        3
 6        4        5        6
13        7        8        9
14       10       11       12
15       13       14       15
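One caveat, added here as an assumption rather than something the answer states: because the garbage rows were parsed into the same columns, the surviving values are still strings, so convert them before doing any arithmetic.

# the kept values are still strings such as ' 2'; pd.to_numeric handles
# the surrounding whitespace
data = data.apply(pd.to_numeric)
data = data.reset_index(drop=True)  # optional: renumber the rows 0..4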
