Pandas: Split data frame based on empty strings

I have the following data frame.

  id        A        B  C
   1    34353   917998  x
   2    34973   980340  x
   3    87365   498097  x
   4    98309   486547  x
   5    87699   475132
   6    52734  4298894
   7  8749267  4918066  x
   8    89872    18103
   9   589892  4818086  y
  10      765     4063  y
  11    32369   418165  y
  12      206  2918137
  13      554  3918072
  14     1029  1918051  x
  15  2349243  4918064

For each contiguous run of rows where column C is blank (for example rows 5 and 6), I want to create a new data frame, so several data frames have to be created, as shown below:

  id        A        B
   5    87699   475132
   6    52734  4298894

  id      A      B
   8  89872  18103

  id    A        B
  12  206  2918137
  13  554  3918072

  id        A        B
  15  2349243  4918064
 isnull = df.C.isnull()
 partitions = (isnull != isnull.shift()).cumsum()
 gb = df[isnull].groupby(partitions)

At this stage we have already achieved the goal: there is a separate group for each contiguous run of NaN in df . Each sub-frame is available via the gb.get_group() method, one for each key in gb.groups .
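As a self-contained sketch of this approach (using a trimmed version of the question's data, built directly as a DataFrame rather than via the reader below):

```python
import numpy as np
import pandas as pd

# A trimmed version of the question's frame; NaN in C marks the "blank" rows
df = pd.DataFrame({
    'id': [1, 5, 6, 7, 8],
    'A':  [34353, 87699, 52734, 8749267, 89872],
    'B':  [917998, 475132, 4298894, 4918066, 18103],
    'C':  ['x', np.nan, np.nan, 'x', np.nan],
})

# Each contiguous run of equal isnull values gets its own partition number
isnull = df.C.isnull()
partitions = (isnull != isnull.shift()).cumsum()
gb = df[isnull].groupby(partitions)

# One sub-frame per NaN run, retrievable by key
for key in gb.groups:
    print(gb.get_group(key), '\n')
```

Here the two NaN runs (ids 5-6 and id 8) come back as two separate groups.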

To check, concatenate the groups back together, labelled with their group keys:

 keys = gb.groups.keys()
 dfs = pd.concat([gb.get_group(g) for g in keys], keys=keys)
 dfs

(output image: the concatenated sub-frames, stacked under an outer index level per group key)
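For reference, the keys= argument of pd.concat labels each piece with an outer index level; a minimal sketch (the frames and keys here are made up for illustration):

```python
import pandas as pd

a = pd.DataFrame({'id': [5, 6]})
b = pd.DataFrame({'id': [8]})

# keys= puts each concatenated piece under its own outer index label
out = pd.concat([a, b], keys=[2, 4])
print(out)
```

The outer level of the resulting MultiIndex tells you which group each row came from.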

Setup for df

I used the reader from @Alberto Garcia-Rabosho:

 import io
 import pandas as pd

 # Create your sample dataframe
 data = io.StringIO("""\
 id A B C
 1 34353 917998 x
 2 34973 980340 x
 3 87365 498097 x
 4 98309 486547 x
 5 87699 475132
 6 52734 4298894
 7 8749267 4918066 x
 8 89872 18103
 9 589892 4818086 y
 10 765 4063 y
 11 32369 418165 y
 12 206 2918137
 13 554 3918072
 14 1029 1918051 x
 15 2349243 4918064
 """)
 df = pd.read_csv(data, delim_whitespace=True)

Try the following:

 import numpy as np

 x = df[pd.isnull(df.C)]
 splitter = x.reset_index()[(x['id'].diff().fillna(0) > 1).reset_index(drop=True)].index
 dfs = np.split(x, splitter)
 for x in dfs:
     print(x, '\n')

Output:

 In [264]: for x in dfs:
    .....:     print(x, '\n')
    .....:
    id      A        B   C
 4   5  87699   475132 NaN
 5   6  52734  4298894 NaN

    id      A      B   C
 7   8  89872  18103 NaN

     id    A        B   C
 11  12  206  2918137 NaN
 12  13  554  3918072 NaN

     id        A        B   C
 14  15  2349243  4918064 NaN

Explanation:

 In [267]: x = df[pd.isnull(df.C)]

 In [268]: x
 Out[268]:
     id        A        B   C
 4    5    87699   475132 NaN
 5    6    52734  4298894 NaN
 7    8    89872    18103 NaN
 11  12      206  2918137 NaN
 12  13      554  3918072 NaN
 14  15  2349243  4918064 NaN

 In [269]: x.loc[pd.isnull(df.C), 'id']
 Out[269]:
 4      5
 5      6
 7      8
 11    12
 12    13
 14    15
 Name: id, dtype: int64

 In [270]: x['id'].diff().fillna(0)
 Out[270]:
 4     0.0
 5     1.0
 7     2.0
 11    4.0
 12    1.0
 14    2.0
 Name: id, dtype: float64

 In [271]: x['id'].diff().fillna(0) > 1
 Out[271]:
 4     False
 5     False
 7      True
 11     True
 12    False
 14     True
 Name: id, dtype: bool

 In [272]: (x['id'].diff().fillna(0) > 1).reset_index(drop=True)
 Out[272]:
 0    False
 1    False
 2     True
 3     True
 4    False
 5     True
 Name: id, dtype: bool

 In [273]: x.reset_index()[(x['id'].diff().fillna(0) > 1).reset_index(drop=True)]
 Out[273]:
    index  id        A        B   C
 2      7   8    89872    18103 NaN
 3     11  12      206  2918137 NaN
 5     14  15  2349243  4918064 NaN

 In [274]: x.reset_index()[(x['id'].diff().fillna(0) > 1).reset_index(drop=True)].index
 Out[274]: Int64Index([2, 3, 5], dtype='int64')
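The heart of this answer is the diff() > 1 test on the surviving ids; a minimal sketch on a bare Series (the cumsum labelling shown here is an alternative to cutting with np.split):

```python
import pandas as pd

ids = pd.Series([5, 6, 8, 12, 13, 15])

# True wherever the id jumps by more than 1, i.e. a new block starts
starts = ids.diff().fillna(0) > 1
print(starts.tolist())

# The cumulative sum of the start flags labels each contiguous block
blocks = [g.tolist() for _, g in ids.groupby(starts.cumsum())]
print(blocks)
```

This reproduces the [[5, 6], [8], [12, 13], [15]] grouping without touching positional indices.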

Here's a somewhat convoluted, and perhaps not very fast, solution using itertools.groupby (which collapses runs of consecutive identical values).
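Since everything below hinges on that run-collapsing behavior, here is a minimal illustration on a plain list (the list is made up):

```python
from itertools import groupby

flags = [True, True, False, False, True, False]

# groupby yields one (key, run) pair per run of consecutive equal values
runs = [(key, len(list(run))) for key, run in groupby(flags)]
print(runs)
```

Each run of equal flags becomes a single group, which is exactly how the contiguous blank rows are gathered below.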

 from itertools import groupby
 import io
 import pandas as pd

 # Create your sample dataframe
 data = io.StringIO("""\
 id A B C
 1 34353 917998 x
 2 34973 980340 x
 3 87365 498097 x
 4 98309 486547 x
 5 87699 475132
 6 52734 4298894
 7 8749267 4918066 x
 8 89872 18103
 9 589892 4818086 y
 10 765 4063 y
 11 32369 418165 y
 12 206 2918137
 13 554 3918072
 14 1029 1918051 x
 15 2349243 4918064
 """)
 df = pd.read_csv(data, delim_whitespace=True)

 # Create a boolean column that encodes which rows you want to keep
 df['grouper'] = df['C'].notnull()

 # Isolate the ids of the rows you want to keep, grouped by contiguity
 groups = [list(map(lambda x: x[1]['id'], l))
           for k, l in groupby(df.iterrows(), key=lambda x: x[1]['grouper'])
           if not k]
 print(groups)
 # => [[5, 6], [8], [12, 13], [15]]

 # Gather the sub-dataframes whose ids match `groups`
 dfs = []
 for g in groups:
     dfs.append(df[['A', 'B']][df['id'].isin(g)])

 # Inspect what you got
 for sub in dfs:
     print(sub)

Output:

         A        B
 4   87699   475132
 5   52734  4298894

        A      B
 7  89872  18103

       A        B
 11  206  2918137
 12  554  3918072

          A        B
 14  2349243  4918064
