How to replace string values in a pandas DataFrame with integers?

I have a pandas DataFrame that contains several string values. I want to replace them with integer values in order to calculate similarities. For instance:

    stores[['CNPJ_Store_Code','region','total_facings']].head()
    Out[24]:
        CNPJ_Store_Code      region  total_facings
    1    93209765046613   Geo RS/SC       1.471690
    16   93209765046290   Geo RS/SC       1.385636
    19   93209765044084  Geo PR/SPI       0.217054
    21   93209765044831   Geo RS/SC       0.804633
    23   93209765045218  Geo PR/SPI       0.708165

I want to replace region == 'Geo RS/SC' with 1, region == 'Geo PR/SPI' with 2, and so on.

Clarification: I want the replacement to happen automatically, without first creating a dictionary, since I do not know in advance what my regions will be. Any ideas? I tried using DictVectorizer, without success.

I am sure there is a reasonable way to do this, but I just can't find it.

Does anyone know a solution?

3 answers

You can use the .apply() function with a dictionary that maps every known string value to its corresponding integer value:

    # .... stands for the remaining regions, one entry each
    region_dictionary = {'Geo RS/SC': 1, 'Geo PR/SPI': 2}  # , ....
    stores['region'] = stores['region'].apply(lambda x: region_dictionary[x])
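
Since the question asks for a mapping that does not have to be written out by hand, here is a minimal sketch of building the same kind of dictionary from the data itself. The sample frame and the `region_code` column name are assumptions made for illustration; only `region` and its values come from the question.

```python
import pandas as pd

# Hypothetical sample data mirroring the 'region' column from the question.
stores = pd.DataFrame({
    "region": ["Geo RS/SC", "Geo RS/SC", "Geo PR/SPI", "Geo RS/SC", "Geo PR/SPI"],
    "total_facings": [1.471690, 1.385636, 0.217054, 0.804633, 0.708165],
})

# Number the distinct regions from 1, in order of first appearance.
region_dictionary = {
    name: code
    for code, name in enumerate(stores["region"].unique(), start=1)
}

# map() looks each value up in the dictionary.
# 'region_code' is an illustrative column name, not one from the question.
stores["region_code"] = stores["region"].map(region_dictionary)

print(region_dictionary)
print(stores)
```

Keeping the dictionary around is useful if the integers need to be translated back to region names later.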

It seems to me that what you really want is pandas categoricals:

http://pandas-docs.imtqy.com/pandas-docs-travis/categorical.html

I think you just need to change the dtype of your text column to "category" and you are done.

    stores['region'] = stores['region'].astype('category')
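
The conversion above keeps the region names visible as labels. If actual integers are needed, one possible follow-up is to read the integer codes of the categorical through the `.cat.codes` accessor. The sample frame and the `region_code` column below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data mirroring the 'region' column from the question.
stores = pd.DataFrame({
    "region": ["Geo RS/SC", "Geo RS/SC", "Geo PR/SPI", "Geo RS/SC", "Geo PR/SPI"],
})

stores["region"] = stores["region"].astype("category")

# .cat.codes holds one integer per row, starting at 0 (missing values get -1).
# Adding 1 makes the numbering start at 1, as in the question.
stores["region_code"] = stores["region"].cat.codes + 1

print(stores["region"].cat.categories)
print(stores)
```

Note that the categories of a string column are sorted by default, so in this sketch 'Geo PR/SPI' gets 1 and 'Geo RS/SC' gets 2, which is the reverse of the numbering in the question. That is harmless if the integers only serve as identifiers.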

You can do:

    import pandas as pd

    df = pd.read_csv(filename, index_col=0)  # Assuming it is a CSV file.

    def region_to_numeric(a):
        if a == 'Geo RS/SC':
            return 1
        if a == 'Geo PR/SPI':
            return 2
        # Any other region falls through and returns None.

    df['region_num'] = df['region'].apply(region_to_numeric)
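
The function above only handles regions that are known in advance. As a possible alternative when they are not, `pd.factorize` assigns an integer to each distinct value automatically. This is a sketch with made-up sample data, not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data; in the answer above df comes from a CSV file.
df = pd.DataFrame({
    "region": ["Geo RS/SC", "Geo RS/SC", "Geo PR/SPI", "Geo RS/SC", "Geo PR/SPI"],
})

# factorize returns integer codes (starting at 0, in order of first
# appearance) together with the distinct values they stand for.
codes, uniques = pd.factorize(df["region"])

# Shift by 1 so the numbering starts at 1, as in the question.
df["region_num"] = codes + 1

print(list(uniques))
print(df)
```

Here `uniques[code]` recovers the region name for a zero-based code, so no separate lookup dictionary is needed.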
