Pandas difference between strings and objects

Numpy seems to make a distinction between the str and object types. For example, I can do:

    >>> import pandas as pd
    >>> import numpy as np
    >>> np.dtype(str)
    dtype('S')
    >>> np.dtype(object)
    dtype('O')

Here dtype('S') and dtype('O') correspond to str and object respectively.

However, pandas does not seem to make that distinction and coerces str to object::

    >>> df = pd.DataFrame({'a': np.arange(5)})
    >>> df.a.dtype
    dtype('int64')
    >>> df.a.astype(str).dtype
    dtype('O')
    >>> df.a.astype(object).dtype
    dtype('O')

Forcing dtype('S') does not help either::

    >>> df.a.astype(np.dtype(str)).dtype
    dtype('O')
    >>> df.a.astype(np.dtype('S')).dtype
    dtype('O')

Is there an explanation for this behavior?

python numpy pandas
2 answers

NumPy's string dtypes are not Python strings.

Consequently, pandas deliberately uses native Python strings, which require the object dtype.

First, let me demonstrate what I mean by NumPy strings being different:

    In [1]: import numpy as np
    In [2]: x = np.array(['Testing', 'a', 'string'], dtype='|S7')
    In [3]: y = np.array(['Testing', 'a', 'string'], dtype=object)

Now x is an array with a NumPy string dtype (fixed-width, C-like strings) and y is an array of native Python strings.

If we try to go beyond 7 characters, we see an immediate difference. The string-dtype version is truncated:

    In [4]: x[1] = 'a really really really long'
    In [5]: x
    Out[5]: array(['Testing', 'a reall', 'string'], dtype='|S7')

Whereas the object-dtype version can hold strings of arbitrary length:

    In [6]: y[1] = 'a really really really long'
    In [7]: y
    Out[7]: array(['Testing', 'a really really really long', 'string'], dtype=object)

Further, |S dtype strings cannot properly hold Unicode, although there is also a fixed-length Unicode string dtype. I will skip an example for now.

Finally, NumPy strings are actually mutable, whereas Python strings are not. For example:
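As a minimal sketch of the fixed-length Unicode dtype mentioned above (assuming Python 3, where 'S' holds raw bytes and 'U' holds text; the sample values are made up for illustration):

```python
import numpy as np

# Fixed-width Unicode dtype: at most 3 characters per element.
u = np.array(["naïve", "é"], dtype="U3")

print(u)                 # elements longer than 3 characters are truncated
print(u.dtype.kind)      # 'U' marks a fixed-width Unicode dtype
print(u.dtype.itemsize)  # NumPy stores 4 bytes per character

# Under Python 3 the 'S' dtype stores bytes, so text has to be encoded first.
b = np.array(["é".encode("utf-8")], dtype="S2")
print(b.dtype.kind)      # 'S' marks a fixed-width byte-string dtype
```

The 'U' dtype truncates just like '|S7' does; it only changes what each fixed-size slot holds.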

    In [8]: z = x.view(np.uint8)
    In [9]: z += 1
    In [10]: x
    Out[10]: array(['Uftujoh', 'b!sfbmm', 'tusjoh\x01'], dtype='|S7')

For all these reasons, pandas chose never to allow C-like, fixed-length strings as a data type. As you noticed, trying to coerce a Python string into a fixed-width NumPy string will not work in pandas. Instead, it always uses native Python strings, which behave more intuitively for most users.
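A short sketch of the pandas side, reusing the column from the question. It only checks the element type and length, since the dtype name pandas reports for string columns may differ between pandas versions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(5)})

# Convert the integer column to strings; the elements come back as Python str.
s = df["a"].astype(str)
first = s.iloc[0]
print(type(first))

# Assigning a longer value does not truncate it, unlike the fixed-width '|S7' array.
s.iloc[1] = "a really really really long"
print(s.iloc[1])
```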


How can you do a string comparison with objects then? For example, comparing two columns?
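One possible approach, as a sketch rather than a definitive answer: the usual comparison operators work elementwise on object columns. The column names `left` and `right` are made up for illustration:

```python
import pandas as pd

# Two object-dtype columns holding Python strings (hypothetical sample data).
df = pd.DataFrame(
    {"left": ["a", "b", "c"], "right": ["a", "x", "c"]},
    dtype=object,
)

# Elementwise equality returns a boolean Series, one entry per row.
matches = df["left"] == df["right"]
print(matches.tolist())

# The boolean Series can be used as a mask to select the matching rows.
same_rows = df[matches]
print(same_rows)
```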
