Is it possible to use read_csv to read only certain lines?

I have a csv file that looks like this:

    TEST
    2012-05-01 00:00:00.203 ON 1
    2012-05-01 00:00:11.203 OFF 0
    2012-05-01 00:00:22.203 ON 1
    2012-05-01 00:00:33.203 OFF 0
    2012-05-01 00:00:44.203 OFF 0
    TEST
    2012-05-02 00:00:00.203 OFF 0
    2012-05-02 00:00:11.203 OFF 0
    2012-05-02 00:00:22.203 OFF 0
    2012-05-02 00:00:33.203 OFF 0
    2012-05-02 00:00:44.203 ON 1
    2012-05-02 00:00:55.203 OFF 0

and I cannot get rid of the string "TEST".

Is it possible to check whether a line starts with a date and read only those that do?

4 answers
    from io import StringIO  # on Python 2: from cStringIO import StringIO
    import pandas

    s = StringIO()
    with open('file.csv') as f:
        for line in f:
            if not line.startswith('TEST'):
                s.write(line)
    s.seek(0)  # "rewind" to the beginning of the StringIO object

    pandas.read_csv(s)  # with further parameters…
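To answer the literal question (keep only the lines that start with a date), the same buffering approach can test each line against a pattern instead of a fixed prefix. This is a minimal sketch, assuming dates are written as YYYY-MM-DD at the very start of a line; `keep_date_lines` and `DATE_AT_START` are hypothetical names, not part of pandas.

```python
import re
from io import StringIO

# Assumption: a data line begins with a date written as YYYY-MM-DD.
DATE_AT_START = re.compile(r'\d{4}-\d{2}-\d{2}')


def keep_date_lines(lines):
    """Return a StringIO holding only the lines that start with a date."""
    buf = StringIO()
    for line in lines:
        if DATE_AT_START.match(line):  # match() only matches at the start
            buf.write(line)
    buf.seek(0)  # rewind so a reader starts at the beginning
    return buf


# A few lines in the shape of the question's file
sample = [
    "TEST\n",
    "2012-05-01 00:00:00.203 ON 1\n",
    "2012-05-01 00:00:11.203 OFF 0\n",
    "TEST\n",
    "2012-05-02 00:00:00.203 OFF 0\n",
]
filtered = keep_date_lines(sample)
```

With a real file, pass the open file object instead of `sample`; the returned buffer can be handed to `pandas.read_csv` in the same way as the `StringIO` object above.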

When you get a row from csv.reader and can be sure that its first element is a string, you can use:

    if not row[0].startswith('TEST'):
        process(row)
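For context, here is how that check might sit inside a complete loop. This is only a sketch: it assumes the fields are separated by single spaces (as in the question's sample), uses an in-memory `StringIO` as a stand-in for the open file, and collects rows in a list where `process(row)` would go.

```python
import csv
from io import StringIO

# Stand-in for an open file; assumes fields are separated by single spaces.
f = StringIO(
    "TEST\n"
    "2012-05-01 00:00:00.203 ON 1\n"
    "2012-05-01 00:00:11.203 OFF 0\n"
    "TEST\n"
    "2012-05-02 00:00:00.203 OFF 0\n"
)

rows = []
for row in csv.reader(f, delimiter=' '):
    # "row and ..." guards against blank lines, which csv.reader yields as []
    if row and not row[0].startswith('TEST'):
        rows.append(row)  # stands in for process(row)
```

The extra `row and` test is there because `row[0]` raises an `IndexError` on an empty row.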

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html?highlight=read_csv#pandas.io.parsers.read_csv

    skiprows : list-like or integer
        Line numbers to skip (0-indexed) or number of lines to skip (int)

Pass skiprows=[0, 6] to skip the lines containing "TEST".
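Applied to the sample from the question, that could look like the sketch below. The space separator, `header=None`, and the column names are assumptions made for illustration; the question does not say how the columns should be named.

```python
from io import StringIO

import pandas as pd

# The sample from the question, one record per line
data = """TEST
2012-05-01 00:00:00.203 ON 1
2012-05-01 00:00:11.203 OFF 0
2012-05-01 00:00:22.203 ON 1
2012-05-01 00:00:33.203 OFF 0
2012-05-01 00:00:44.203 OFF 0
TEST
2012-05-02 00:00:00.203 OFF 0
2012-05-02 00:00:11.203 OFF 0
2012-05-02 00:00:22.203 OFF 0
2012-05-02 00:00:33.203 OFF 0
2012-05-02 00:00:44.203 ON 1
2012-05-02 00:00:55.203 OFF 0
"""

df = pd.read_csv(
    StringIO(data),
    skiprows=[0, 6],  # 0-indexed positions of the two "TEST" lines
    sep=' ',          # assumption: single-space separated fields
    header=None,      # the file has no header row
    names=['date', 'time', 'state', 'value'],  # hypothetical column names
)
```

This only works when the positions of the unwanted lines are known in advance; for files where they vary, the indices have to be computed first.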


Another option, since I just ran into this problem:

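    import subprocess
    import pandas as pd

    filename = 'file.csv'  # path to the input file

    # "grep -n" prints "<line number>:<line>" for each line starting with TITLE
    # (use '^TEST' for the file in the question); universal_newlines=True
    # makes check_output return text instead of bytes on Python 3
    grep = subprocess.check_output(['grep', '-n', '^TITLE', filename],
                                   universal_newlines=True).splitlines()
    # grep counts lines from 1, skiprows expects 0-based indices
    bad_lines = [int(s[:s.index(':')]) - 1 for s in grep]
    df = pd.read_csv(filename, skiprows=bad_lines)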

It is less portable than @eumiro's answer (read: it may not work on Windows) and requires reading the file twice, but has the advantage that you do not need to hold the entire contents of the file in memory.

Of course, you could do the same thing as grep in Python, but it will probably be slower.
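For completeness, a pure-Python stand-in for the grep step might look like this. It is a sketch, not a benchmark: `lines_starting_with` is a hypothetical helper name, and the temporary file exists only to make the example self-contained.

```python
import os
import tempfile


def lines_starting_with(path, prefix):
    """Return the 0-based indices of lines in `path` that start with `prefix`."""
    with open(path) as f:
        return [i for i, line in enumerate(f) if line.startswith(prefix)]


# Small demonstration file with two unwanted lines
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write("TITLE\n1,2\n3,4\nTITLE\n5,6\n")
    filename = tmp.name

bad_lines = lines_starting_with(filename, 'TITLE')
os.remove(filename)  # clean up the demonstration file
```

`bad_lines` can then be passed as `skiprows=bad_lines` to `pd.read_csv`, as in the snippet above. Like the grep version, this reads the file twice but keeps only the list of indices in memory.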
