Is there a pythonic way to find out which lines in a CSV file contain headers and values, and which lines contain garbage, and then load the header/value lines into data frames?
I am relatively new to Python and am using it to read several CSV files exported from a scientific instrument; so far I have defaulted to pandas when working with CSV in other tasks. However, this CSV data can vary depending on the number of "tests" recorded on each instrument.
The column headings and the data structure are the same across instruments, but each test is preceded by a "preamble" that can change. So I get dumps that look something like this (this example has two tests, but there can potentially be any number of tests):
blah blah here a test and
here some information
you don't care about
even a little bit
header1, header2, header3
1, 2, 3
4, 5, 6
oh you have another test
here some more garbage
that different than the last one
this should make
life interesting
header1, header2, header3
7, 8, 9
10, 11, 12
13, 14, 15
If the preamble were a fixed length every time, I would just use the skiprows parameter, but both the preamble and the number of data lines in each test vary in length.
My ultimate goal is to combine all the tests and get something like:
header1, header2, header3
1, 2, 3
4, 5, 6
7, 8, 9
10, 11, 12
13, 14, 15
which I can then manipulate using pandas as usual.
I tried the following to find the first row with my expected headers:
import csv

import pandas as pd

with open('my_file.csv', 'r', newline='') as input_file:
    for row_num, row in enumerate(csv.reader(input_file, delimiter=',')):
        if len(row) > 0:
            if "['header1', 'header2', 'header3']" in str(row):
                header_row = row_num

df = pd.read_csv('my_file.csv', skiprows=header_row, header=0)
print(df)
But since header_row is overwritten on every match, skiprows=header_row skips straight to the last test, and I only get:
header1 header2 header3
0 7 8 9
1 10 11 12
2 13 14 15
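A minimal fix to my loop (a sketch, using a hypothetical inline sample in place of my_file.csv) would be to record every header row in a list instead of keeping only the last one:

```python
import csv
import io

# Hypothetical sample mirroring the layout described above.
raw = """blah blah here a test and
here some information
header1, header2, header3
1, 2, 3
4, 5, 6
oh you have another test
here some more garbage
header1, header2, header3
7, 8, 9
10, 11, 12
13, 14, 15
"""

# Record every line whose first cell is 'header1' instead of only the last.
header_rows = []
for row_num, row in enumerate(csv.reader(io.StringIO(raw))):
    if row and row[0].strip() == 'header1':
        header_rows.append(row_num)

print(header_rows)  # [2, 7] for this sample
```

That at least tells me where each test starts, though I would still need to work out where each one ends.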
Is there a pythonic way to identify each header/dataset pair and separate it from the garbage? I'm happy to use csv, pandas, or anything else that works.
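For reference, here is one approach I have been sketching (assumptions: the header text is known in advance, data rows are all integers, and I use a hypothetical inline sample in place of my_file.csv). It walks the file line by line, starts a block at each header, collects numeric rows, and concatenates the blocks with pandas:

```python
import pandas as pd

# Hypothetical sample in the same shape as the dumps described above.
raw = """blah blah here a test and
here some information
header1, header2, header3
1, 2, 3
4, 5, 6
oh you have another test
here some more garbage
header1, header2, header3
7, 8, 9
10, 11, 12
13, 14, 15
"""

expected = ['header1', 'header2', 'header3']

frames = []
block = []
in_block = False
for line in raw.splitlines():
    cells = [c.strip() for c in line.split(',')]
    if cells == expected:
        # A new test starts; begin collecting its data rows.
        in_block = True
        block = []
    elif in_block and cells and all(c.lstrip('-').isdigit() for c in cells):
        block.append([int(c) for c in cells])
    elif in_block:
        # Hit preamble garbage: flush the finished block.
        frames.append(pd.DataFrame(block, columns=expected))
        in_block = False

if in_block and block:  # flush the final block at end of file
    frames.append(pd.DataFrame(block, columns=expected))

combined = pd.concat(frames, ignore_index=True)
print(combined.values.tolist())
```

This gives the five combined rows from both tests, but it feels clunky, so I would welcome something more idiomatic.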