Conditionally reading rows of a csv in pandas

I have large csvs where I am only interested in a subset of the rows. In particular, I would like to read in all the rows that occur before a particular condition is met.

For example, if read_csv would yield the dataframe:

           A      B    C
    1     34   3.20  'b'
    2     24   9.21  'b'
    3     34   3.32  'c'
    4     24   24.3  'c'
    5     35   1.12  'a'
    ...
    1e9   42   2.15  'd'

Is there a way to read in all the rows of the csv until col B exceeds 10? In the above example, I would like to read in:

           A      B    C
    1     34   3.20  'b'
    2     24   9.21  'b'
    3     34   3.32  'c'
    4     24   24.3  'c'

I know how to throw these rows out once I have read the dataframe in, but by that point I have already spent all the computation on reading them. I do not have access to the index of the last row before reading the csv (no skipfooter, please).

2 answers

You can read the csv in chunks. Since pd.read_csv returns an iterator when the chunksize parameter is specified, you can use itertools.takewhile to read only as many chunks as you need, without reading the entire file.

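    import itertools as IT
    import pandas as pd

    chunksize = 10 ** 5
    # filename is the path to the csv; its header row supplies the column name 'B'
    chunks = pd.read_csv(filename, chunksize=chunksize)
    # keep taking chunks while the last B value in the chunk is still below 10;
    # note that takewhile drops the first chunk that fails this test
    chunks = IT.takewhile(lambda chunk: chunk['B'].iloc[-1] < 10, chunks)
    df = pd.concat(chunks)
    # remove any remaining rows where B is 10 or more
    mask = df['B'] < 10
    df = df.loc[mask]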
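
To see how the chunked read behaves, here is a minimal sketch run on an in-memory copy of the question's sample rows. It assumes a comma-separated file with a header row A,B,C; the SAMPLE string and the tiny chunksize of 2 are illustrative choices and not part of the original answer.

```python
import io
import itertools as IT

import pandas as pd

# Hypothetical in-memory stand-in for the large csv from the question.
SAMPLE = """A,B,C
34,3.20,b
24,9.21,b
34,3.32,c
24,24.3,c
35,1.12,a
"""

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file at once.
chunks = pd.read_csv(io.StringIO(SAMPLE), chunksize=2)

# takewhile stops at the first chunk whose last B value is >= 10;
# that chunk is dropped and later chunks are never requested.
kept = IT.takewhile(lambda chunk: chunk['B'].iloc[-1] < 10, chunks)
df_head = pd.concat(list(kept))
```

With this sample, df_head holds only the first two rows: the row with B = 3.32 shares a chunk with the 24.3 row and is discarded along with it.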

Or, to avoid using df.loc[mask] to remove unwanted rows from the last chunk, perhaps a cleaner solution would be to define a custom generator:

    import pandas as pd

    def valid(chunks):
        for chunk in chunks:
            mask = chunk['B'] < 10
            if mask.all():
                yield chunk
            else:
                # boundary chunk: keep only its rows with B < 10, then stop reading
                yield chunk.loc[mask]
                break

    chunksize = 10 ** 5
    # filename is the path to the csv; its header row supplies the column name 'B'
    chunks = pd.read_csv(filename, chunksize=chunksize)
    df = pd.concat(valid(chunks))
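
If rows that come after the first offending row should also be excluded, even when their own B value is below the limit, the generator can cut the boundary chunk at that row's position instead of masking it. The following is a rough sketch under that assumption; rows_before, column, limit and the SAMPLE data are hypothetical names introduced here for illustration. Whether the offending row itself should be kept is a judgment call; this version drops it, in line with the `< 10` test used above.

```python
import io

import pandas as pd


def rows_before(chunks, column, limit):
    """Yield rows up to, but not including, the first row where column >= limit."""
    for chunk in chunks:
        bad = (chunk[column] >= limit).to_numpy()
        if not bad.any():
            yield chunk
        else:
            # argmax on a boolean array gives the position of the first True,
            # so this slice keeps only the rows before the first offending one.
            yield chunk.iloc[:bad.argmax()]
            break


# Hypothetical in-memory sample standing in for the real file.
SAMPLE = "A,B,C\n34,3.20,b\n24,9.21,b\n34,3.32,c\n24,24.3,c\n35,1.12,a\n"
chunks = pd.read_csv(io.StringIO(SAMPLE), chunksize=3)
df_strict = pd.concat(list(rows_before(chunks, 'B', 10)))
```

On this sample, df_strict contains the first three rows only; the final row (B = 1.12) is left out because it comes after the row where B reaches 24.3.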

I would go with the simple approach described here:

http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing

 df[df['B'] > 10] 

Source: https://habr.com/ru/post/1212282/

