What is the difference between dtype and converters in pandas.read_csv?

I know that pandas provides a read_csv() function for reading CSV files. Its documentation is here.

According to the documentation:

dtype: Type name or dict of column -> type, default None. Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32} (Unsupported with engine='python')

and

converters: dict, default None. Dict of functions for converting values in certain columns. Keys can either be integers or column labels

When using this function, I can call either pandas.read_csv('file', dtype=object) or pandas.read_csv('file', converters=object). As its name suggests, converters is used to convert the data type, but I wonder what dtype does in that case.

Could you help me? Thanks.

python pandas
1 answer

The semantic difference is that dtype lets you specify how the values should be treated, for example as a numeric or a string type.

Converters let you parse the input data with a conversion function to get the dtype you want, for example parsing a string value into a datetime or into some other desired dtype.
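As a rough illustration of that distinction, here is a minimal sketch. The CSV text, the column names and the `parse_price` helper are made up for this example and are not part of the question. `dtype` only declares a target type for the `qty` column, while the converter receives each raw `price` cell as a string and decides how to parse it:

```python
import io
import pandas as pd

# Hypothetical sample data: the price column carries a currency symbol,
# so a plain dtype cast to float would not work for it.
csv_text = "item,price,qty\nwidget,$1.50,3\ngadget,$2.25,4\n"

def parse_price(raw):
    # A converter is called with the raw cell text (a str).
    return float(raw.lstrip("$"))

df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"qty": "float64"},           # declare a type; pandas does the cast
    converters={"price": parse_price},  # custom parsing logic per cell
)

prices = df["price"].tolist()
quantities = df["qty"].tolist()
```

In this sketch `dtype` is enough for `qty` because the text is already a valid number, while `price` needs a function because the `$` has to be stripped first.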

Here we see that pandas tries to sniff the types:

    In [2]:
    import io
    import pandas as pd

    t = """int,float,date,str
    001,3.31,2015/01/01,005"""
    df = pd.read_csv(io.StringIO(t))
    df.info()

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 1 entries, 0 to 0
    Data columns (total 4 columns):
    int      1 non-null int64
    float    1 non-null float64
    date     1 non-null object
    str      1 non-null int64
    dtypes: float64(1), int64(2), object(1)
    memory usage: 40.0+ bytes

You can see from the above that 001 and 005 are treated as int64, but the date string stays as str (object dtype).

If we say that everything is object, then essentially everything is str:

    In [3]:
    pd.read_csv(io.StringIO(t), dtype=object).info()

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 1 entries, 0 to 0
    Data columns (total 4 columns):
    int      1 non-null object
    float    1 non-null object
    date     1 non-null object
    str      1 non-null object
    dtypes: object(4)
    memory usage: 40.0+ bytes
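To make the effect on the actual cell values visible, not just on the reported dtypes, here is a small sketch that reuses the same sample text and compares one cell under both calls. The variable names are only illustrative:

```python
import io
import pandas as pd

t = """int,float,date,str
001,3.31,2015/01/01,005"""

# Default behaviour: pandas infers a numeric type for the "int" column.
inferred = pd.read_csv(io.StringIO(t))

# dtype=object: no inference, every cell is kept as the text from the file.
as_object = pd.read_csv(io.StringIO(t), dtype=object)

inferred_value = inferred.loc[0, "int"]   # numeric, leading zeros are lost
object_value = as_object.loc[0, "int"]    # the original text, zeros intact
```

This is usually the reason to pass dtype=object (or str) for identifier-like columns such as zip codes or account numbers, where the leading zeros matter.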

Here we force the int column to str and tell parse_dates to parse the date column with the date parser:

    In [6]:
    pd.read_csv(io.StringIO(t), dtype={'int':'object'}, parse_dates=['date']).info()

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 1 entries, 0 to 0
    Data columns (total 4 columns):
    int      1 non-null object
    float    1 non-null float64
    date     1 non-null datetime64[ns]
    str      1 non-null int64
    dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
    memory usage: 40.0+ bytes

Similarly, we could have passed the to_datetime function as a converter to convert the dates:

    In [5]:
    pd.read_csv(io.StringIO(t), converters={'date':pd.to_datetime}).info()

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 1 entries, 0 to 0
    Data columns (total 4 columns):
    int      1 non-null int64
    float    1 non-null float64
    date     1 non-null datetime64[ns]
    str      1 non-null int64
    dtypes: datetime64[ns](1), float64(1), int64(2)
    memory usage: 40.0 bytes
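A converter does not have to be a pandas function; any callable that takes the raw cell string works. The sketch below uses a hypothetical `pad_to_five` helper, and also passes a dtype for the same column to show which one applies. The read_csv documentation states that converters are applied instead of dtype conversion; pandas may also emit a warning in that situation, so the sketch silences warnings instead of relying on the exact message:

```python
import io
import warnings
import pandas as pd

t = """int,float,date,str
001,3.31,2015/01/01,005"""

def pad_to_five(raw):
    # Hypothetical converter: receives the raw text of the cell
    # and left-pads it with zeros to a width of five characters.
    return raw.zfill(5)

with warnings.catch_warnings():
    # Assumption: a dtype/converter conflict may trigger a warning.
    warnings.simplefilter("ignore")
    df = pd.read_csv(
        io.StringIO(t),
        dtype={"str": "int64"},            # ignored for this column
        converters={"str": pad_to_five},   # this is what gets applied
    )

padded = df.loc[0, "str"]
```

So for any single column it makes sense to choose one mechanism: dtype when a plain cast is enough, converters when the value needs custom parsing.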