Python Pandas. Big data. A messy TSV file. How to clean up the data?

So, we have dirty data stored in a TSV file that I need to analyze. Here is what it looks like:

status=200 protocol=http region_name=Podolsk datetime=2016-03-10 15:51:58 user_ip=0.120.81.243 user_agent=Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36 user_id=7885299833141807155 user_vhost=tindex.ru method=GET page=/search/ 

The problem is that some of the rows have their columns in a different order and/or are missing values, and I need to get rid of those rows with high performance (the data sets I work with are up to 100 gigabytes).

    import pandas as pd

    Data = pd.read_table('data/data.tsv', sep='\t+', header=None,
                         names=['status', 'protocol', 'region_name', 'datetime',
                                'user_ip', 'user_agent', 'user_id', 'user_vhost',
                                'method', 'page'],
                         engine='python')
    Clean_Data = (Data.dropna()).reset_index(drop=True)

Now I have gotten rid of the missing values, but one problem still remains. Here is what the data looks like (screenshot):

And here is the problem (screenshot):

As you can see, some columns are shifted. I came up with a very inefficient solution:

    import numpy as np

    ids = Clean_Data.index.tolist()
    for column in Clean_Data.columns:
        for row, i in zip(Clean_Data[column], ids):
            if np.logical_not(str(column) in row):
                Clean_Data.drop([i], inplace=True)
                ids.remove(i)

Now the data looks good... at least I can work with it! But what is a high-performance alternative to the method I used above?

Update on unutbu's code: traceback

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-4-52c9d76f9744> in <module>()
          8     df.index.names = ['index', 'num']
          9
    ---> 10     df = df.set_index('field', append=True)
         11     df.index = df.index.droplevel(level='num')
         12     df = df['value'].unstack(level=1)

    /Users/Peter/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in set_index(self, keys, drop, append, inplace, verify_integrity)
       2805         if isinstance(self.index, MultiIndex):
       2806             for i in range(self.index.nlevels):
    -> 2807                 arrays.append(self.index.get_level_values(i))
       2808         else:
       2809             arrays.append(self.index)

    /Users/Peter/anaconda/lib/python2.7/site-packages/pandas/indexes/multi.pyc in get_level_values(self, level)
        664         values = _simple_new(filled, self.names[num],
        665                              freq=getattr(unique, 'freq', None),
    --> 666                              tz=getattr(unique, 'tz', None))
        667         return values
        668

    /Users/Peter/anaconda/lib/python2.7/site-packages/pandas/indexes/range.pyc in _simple_new(cls, start, stop, step, name, dtype, **kwargs)
        124             return RangeIndex(start, stop, step, name=name, **kwargs)
        125         except TypeError:
    --> 126             return Index(start, stop, step, name=name, **kwargs)
        127
        128         result._start = start

    /Users/Peter/anaconda/lib/python2.7/site-packages/pandas/indexes/base.pyc in __new__(cls, data, dtype, copy, name, fastpath, tupleize_cols, **kwargs)
        212             if issubclass(data.dtype.type, np.integer):
        213                 from .numeric import Int64Index
    --> 214                 return Int64Index(data, copy=copy, dtype=dtype, name=name)
        215             elif issubclass(data.dtype.type, np.floating):
        216                 from .numeric import Float64Index

    /Users/Peter/anaconda/lib/python2.7/site-packages/pandas/indexes/numeric.pyc in __new__(cls, data, dtype, copy, name, fastpath, **kwargs)
        105         # with a platform int
        106         if (dtype is None or
    --> 107                 not issubclass(np.dtype(dtype).type, np.integer)):
        108             dtype = np.int64
        109

    TypeError: data type "index" not understood

Pandas Version: 0.18.0-np110py27_0

Update

Everything worked ... Thanks everyone!

3 answers

Suppose you have TSV data, such as:

    status=A    protocol=B   region_name=C   datetime=D   user_ip=E    user_agent=F   user_id=G
    user_id=G   status=A     region_name=C   user_ip=E    datetime=D   user_agent=F   protocol=B
    protocol=B  datetime=D   status=A        user_ip=E    user_agent=F user_id=G

The order of the fields may vary and some values may be missing, but you do not have to drop rows just because the fields do not appear in a fixed order. You can use the field names provided in the row data itself to put the values into the correct columns. For instance,

    import pandas as pd

    df = pd.read_table('data/data.tsv', sep='\t+', header=None, engine='python')
    df = df.stack().str.extract(r'([^=]*)=(.*)', expand=True).dropna(axis=0)
    df.columns = ['field', 'value']
    df = df.set_index('field', append=True)
    df.index = df.index.droplevel(level=1)
    df = df['value'].unstack(level=1)
    print(df)

gives

    field  datetime protocol region_name status user_agent user_id user_ip
    index
    0             D        B           C      A          F       G       E
    1             D        B           C      A          F       G       E
    2             D        B        None      A          F       G       E

To process a large TSV file, you can process the rows in chunks and then concatenate the processed chunks into a single DataFrame at the end:

    import pandas as pd

    chunksize = 10 ** 5    # the number of rows to be processed per iteration (example value)
    dfs = []
    reader = pd.read_table('data/data.tsv', sep='\t+', header=None, engine='python',
                           iterator=True, chunksize=chunksize)
    for df in reader:
        df = df.stack().str.extract(r'([^=]*)=(.*)', expand=True).dropna(axis=0)
        df.columns = ['field', 'value']
        df.index.names = ['index', 'num']
        df = df.set_index('field', append=True)
        df.index = df.index.droplevel(level='num')
        df = df['value'].unstack(level=1)
        dfs.append(df)

    df = pd.concat(dfs, ignore_index=True)
    print(df)
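If even the concatenated result is too large to hold in memory (the source files run up to 100 GB), a minimal variation on the loop above writes each processed chunk straight to disk instead of collecting the pieces in a list. This is only a sketch: the output path data/clean.tsv and the chunk size are assumptions, not part of the original answer.

    import pandas as pd

    chunksize = 10 ** 5              # rows per chunk; tune to available memory (assumption)
    out_path = 'data/clean.tsv'      # hypothetical output file

    reader = pd.read_table('data/data.tsv', sep='\t+', header=None, engine='python',
                           iterator=True, chunksize=chunksize)

    for i, chunk in enumerate(reader):
        chunk = chunk.stack().str.extract(r'([^=]*)=(.*)', expand=True).dropna(axis=0)
        chunk.columns = ['field', 'value']
        chunk.index.names = ['index', 'num']
        chunk = chunk.set_index('field', append=True)
        chunk.index = chunk.index.droplevel(level='num')
        chunk = chunk['value'].unstack(level=1)
        # append to the output file; write the header only once
        # (this assumes every chunk ends up with the same set of columns)
        chunk.to_csv(out_path, sep='\t', index=False,
                     mode='w' if i == 0 else 'a', header=(i == 0))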

Explanation: Given df:

    In [527]: df = pd.DataFrame({0: ['status=A', 'user_id=G', 'protocol=B'],
       .....:                    1: ['protocol=B', 'status=A', 'datetime=D'],
       .....:                    2: ['region_name=C', 'region_name=C', 'status=A'],
       .....:                    3: ['datetime=D', 'user_ip=E', 'user_ip=E'],
       .....:                    4: ['user_ip=E', 'datetime=D', 'user_agent=F'],
       .....:                    5: ['user_agent=F', 'user_agent=F', 'user_id=G'],
       .....:                    6: ['user_id=G', 'protocol=B', None]}); df
    Out[527]:
                0           1              2           3             4             5           6
    0    status=A  protocol=B  region_name=C  datetime=D     user_ip=E  user_agent=F   user_id=G
    1   user_id=G    status=A  region_name=C   user_ip=E    datetime=D  user_agent=F  protocol=B
    2  protocol=B  datetime=D       status=A   user_ip=E  user_agent=F     user_id=G        None

you can stack all the values into one column:

    In [449]: df.stack()
    Out[449]:
    0  0         status=A
       1       protocol=B
       2    region_name=C
       3       datetime=D
       4        user_ip=E
       5     user_agent=F
       6        user_id=G
    1  0        user_id=G
       1         status=A
       2    region_name=C
       3        user_ip=E
       4       datetime=D
       5     user_agent=F
       6       protocol=B
    2  0       protocol=B
       1       datetime=D
       2         status=A
       3        user_ip=E
       4     user_agent=F
       5        user_id=G
    dtype: object

and then apply .str.extract(r'([^=]*)=(.*)') to separate the field name from the value:

    In [450]: df = df.stack().str.extract(r'([^=]*)=(.*)', expand=True).dropna(axis=0); df
    Out[450]:
                     0  1
    0 0         status  A
      1       protocol  B
      2    region_name  C
      3       datetime  D
      4        user_ip  E
      5     user_agent  F
      6        user_id  G
    1 0        user_id  G
      1         status  A
      2    region_name  C
      3        user_ip  E
      4       datetime  D
      5     user_agent  F
      6       protocol  B
    2 0       protocol  B
      1       datetime  D
      2         status  A
      3        user_ip  E
      4     user_agent  F
      5        user_id  G

To make it easier to refer to parts of the DataFrame, give the columns and index levels descriptive names:

    In [530]: df.columns = ['field', 'value']; df.index.names = ['index', 'num']; df
    Out[530]:
                     field value
    index num
    0     0         status     A
          1       protocol     B
    ...

Now, if we move the field column to the index:

    In [531]: df = df.set_index('field', append=True); df
    Out[531]:
                          value
    index num field
    0     0   status          A
          1   protocol        B
          2   region_name     C
          3   datetime        D
    ...

and drop the num index level:

    In [532]: df.index = df.index.droplevel(level='num'); df
    Out[532]:
                      value
    index field
    0     status          A
          protocol        B
          region_name     C
          datetime        D
    ...

then we can get a DataFrame of the desired shape by moving the field index level into the column index:

    In [533]: df = df['value'].unstack(level=1); df
    Out[533]:
    field  datetime protocol region_name status user_agent user_id user_ip
    index
    0             D        B           C      A          F       G       E
    1             D        B           C      A          F       G       E
    2             D        B        None      A          F       G       E

You can use Pandas' vectorized string operations, specifically str.contains:

    import numpy as np

    # boolean index of rows to keep
    is_valid = np.ones(Clean_Data.shape[0], np.bool)
    for column in Clean_Data.columns:
        # check whether rows contain this column name
        is_valid &= Clean_Data[column].str.contains(column)

    # drop rows where not all elements contain their respective column names
    Clean_Data.drop(np.where(~is_valid)[0], inplace=True)

I cannot add comments, so I will post this as an answer (it is really a comment in reply to your comment about memory usage and runtime).

For large files (100 GB), you have to accept that you are not going to read them into memory all at once. You can either set a chunk size for pandas (see "Large data" work flows using pandas or How to read a 6 GB csv file with pandas), or use a generator built on the csv module and read the file line by line (see Reading a huge .csv in python).

Combine that with @unutbu's comment about using a regular expression to sort the records into columns: since the field names are clearly marked in every cell, something like r'(.*)=(.*)' is all that is required (though it may take some error handling), and that should be all you need. Also, as others have said, dropping whole rows because of some missing data is neither a typical nor a recommended approach. A sketch combining both ideas follows below.
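A minimal sketch of how those two ideas might fit together, assuming the file path from the question. The regex splits each cell at the first '=' (the same pattern the accepted answer uses), and the set of required field names is only illustrative:

    import csv
    import re

    # split each cell at the first '=', so values may themselves contain '='
    FIELD_RE = re.compile(r'([^=]*)=(.*)')

    def iter_records(path):
        """Yield one record (a dict of field -> value) per line, regardless of
        column order, never holding more than a single line in memory."""
        with open(path) as f:
            for cells in csv.reader(f, delimiter='\t'):
                record = {}
                for cell in cells:
                    m = FIELD_RE.match(cell)
                    if m:                      # skip cells that are not 'key=value'
                        record[m.group(1)] = m.group(2)
                yield record

    # usage: keep only records that have every required field
    required = {'status', 'protocol', 'region_name', 'datetime', 'user_ip',
                'user_agent', 'user_id', 'user_vhost', 'method', 'page'}
    complete = (r for r in iter_records('data/data.tsv') if required <= set(r))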

