Python Pandas: inferring column data types

I am reading JSON files into dataframes. A data frame may contain some columns of type String (object), some Numeric (int64 and/or float64), and some columns of type datetime. When the data is read, the data type is often incorrect (for example, datetime, int, and float columns often end up stored as object). I want to detect and report this case (i.e., the column sits in the data frame as "object" (String), but is in fact "datetime").

The problem is that when I use pd.to_numeric and pd.to_datetime, each of them will try to convert the column, and many times the result ends up depending on which of the two I call last... (I was going to use convert_objects(), which works, but it is deprecated, so I need a better option).

The code I use to evaluate a dataframe column (I understand that much of the following is redundant, but I wrote it this way for readability):

    try:
        inferred_type = pd.to_datetime(df[Field_Name]).dtype
        if inferred_type == "datetime64[ns]":
            inferred_type = "DateTime"
    except:
        pass

    try:
        inferred_type = pd.to_numeric(df[Field_Name]).dtype
        if inferred_type == int:
            inferred_type = "Integer"
        if inferred_type == float:
            inferred_type = "Float"
    except:
        pass
Tags: python, profiling, pandas
5 answers

Alternatively: pandas allows you to specify data types when reading the data into a data frame. You pass a dictionary with column names as keys and the desired data types as values.

The documentation for the standard constructor and the read functions covers this.

Or you can specify the column type after importing into the data frame,

for example: df['field_name'] = df['field_name'].astype('datetime64[ns]')
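A minimal sketch of both approaches, assuming the JSON is loaded with pd.read_json; the column names (price, quantity, order_date) are made up for illustration:

    import pandas as pd

    # Hypothetical column names, for illustration only.
    df = pd.read_json(
        "data.json",
        dtype={"price": "float64", "quantity": "int64"},  # per-column dtypes as a dict
        convert_dates=["order_date"],                      # parse this column as datetime
    )

    # Or cast after the fact, again with a column -> dtype mapping:
    df = df.astype({"price": "float64", "quantity": "int64"})
    df["order_date"] = pd.to_datetime(df["order_date"])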


Try, for example,

 df['field_name'] = df['field_name'].astype(np.float64) 

(assuming import numpy as np).


One way to get pandas to infer the dtypes is to write the data to a CSV in memory using StringIO and then read it back, as in the sketch below.
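A minimal sketch of that round trip, assuming df is the frame whose object columns you want re-inferred:

    import io
    import pandas as pd

    # Write the frame to an in-memory CSV, then let read_csv re-infer dtypes on the way back.
    buffer = io.StringIO()
    df.to_csv(buffer, index=False)
    buffer.seek(0)

    df_reread = pd.read_csv(buffer)
    print(df_reread.dtypes)  # numeric-looking object columns come back as int64 / float64

Note that date columns will still come back as object unless you list them in read_csv's parse_dates argument.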


I had to deal with the same problem of determining column types for input data where the types are not known in advance ... from a database read in my case. Could not find a good answer here on SO, or by looking at the pandas source code. Solved it with this function:

    def _get_col_dtype(col):
        """
        Infer the datatype of a pandas column; process only if the column dtype is object.
        input:  col: a pandas Series representing a df column.
        """
        if col.dtype == "object":
            # try datetime
            try:
                col_new = pd.to_datetime(col.dropna().unique())
                return col_new.dtype
            except:
                # try numeric
                try:
                    col_new = pd.to_numeric(col.dropna().unique())
                    return col_new.dtype
                except:
                    # try timedelta
                    try:
                        col_new = pd.to_timedelta(col.dropna().unique())
                        return col_new.dtype
                    except:
                        return "object"
        else:
            return col.dtype
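A usage sketch (df here is whatever dataframe you have already loaded), collecting the inferred type of every column into a dict for reporting:

    # Map each column name to its inferred dtype.
    inferred_types = {col: _get_col_dtype(df[col]) for col in df.columns}
    print(inferred_types)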

Deep in the pandas API there is actually a function that does a half-decent job of this.

    import pandas as pd

    infer_type = lambda x: pd.api.types.infer_dtype(x, skipna=True)
    df.apply(infer_type, axis=0)

    # DataFrame with column names & new types
    df_types = (
        pd.DataFrame(df.apply(pd.api.types.infer_dtype, axis=0))
        .reset_index()
        .rename(columns={'index': 'column', 0: 'type'})
    )

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.api.types.infer_dtype.html#pandas.api.types.infer_dtype
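For illustration, a small hedged example of what infer_dtype reports for a few hand-made Series; the strings in the comments are the return values I would expect:

    import pandas as pd

    pd.api.types.infer_dtype(pd.Series(["a", "b"]), skipna=True)   # 'string'
    pd.api.types.infer_dtype(pd.Series([1, 2, 3]), skipna=True)    # 'integer'
    pd.api.types.infer_dtype(pd.Series([1.0, 2.5]), skipna=True)   # 'floating'
    pd.api.types.infer_dtype(pd.to_datetime(pd.Series(["2020-01-01"])), skipna=True)  # 'datetime64'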

