Replacing empty values (space) with NaN in pandas

Question

I want to find all the values in a pandas DataFrame that consist only of whitespace (any arbitrary amount) and replace those values with NaN.

Any ideas how this can be improved?

Basically I want to turn this:

                       A    B  C
    2000-01-01 -0.532681  foo  0
    2000-01-02  1.490752  bar  1
    2000-01-03 -1.387326  foo  2
    2000-01-04  0.814772  baz
    2000-01-05 -0.222552       4
    2000-01-06 -1.176781  qux

Into this:

                       A    B   C
    2000-01-01 -0.532681  foo   0
    2000-01-02  1.490752  bar   1
    2000-01-03 -1.387326  foo   2
    2000-01-04  0.814772  baz NaN
    2000-01-05 -0.222552  NaN   4
    2000-01-06 -1.176781  qux NaN

I managed to do this with the code below, but man is it ugly. It's not Pythonic, and I'm sure it's not the most efficient use of pandas either. I loop through each column and do a boolean replacement against a column mask, generated by applying a function that runs a regex search on each value and matches on whitespace.

    for i in df.columns:
        df[i][df[i].apply(lambda i: True if re.search('^\s*$', str(i)) else False)]=None

It could be optimized a bit by only iterating through the columns that could contain empty strings:

 if df[i].dtype == np.dtype('object')

But that is not much improvement.

And finally, this code sets the target strings to None, which works with pandas functions like fillna(), but it would be nice for completeness if I could insert NaN directly instead of None.

+123

python pandas dataframe

Chris Clark Nov 18 '12 at 22:22

12 answers

If you want to replace empty strings and entries that contain only whitespace, the correct answer is:

 df = df.replace(r'^\s*$', np.nan, regex=True)

The accepted answer,

 df.replace(r'\s+', np.nan, regex=True)

does not replace an empty string! You can try it yourself with this slightly updated example:

    df = pd.DataFrame([
        [-0.532681, 'foo', 0],
        [1.490752, 'bar', 1],
        [-1.387326, 'fo o', 2],
        [0.814772, 'baz', ' '],
        [-0.222552, ' ', 4],
        [-1.176781, 'qux', ''],
    ], columns='A B C'.split(), index=pd.date_range('2000-01-01','2000-01-06'))

Also note that with the anchored pattern 'fo o' is not replaced with NaN, although it contains a space. Note further that simply:

 df.replace(r'', np.nan)

doesn't work either. Try it.
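To make the difference between the three patterns concrete, here is a small sketch. The sample values and the variable names are made up for illustration, and the Series is forced to object dtype so the result does not depend on how a given pandas version infers string columns.

```python
import numpy as np
import pandas as pd

# Illustrative values: plain text, text with an inner space,
# one space, several spaces, and an empty string.
values = pd.Series(["foo", "fo o", " ", "   ", ""], dtype=object)

# Unanchored: any cell containing whitespace anywhere is replaced.
any_space = values.replace(r"\s+", np.nan, regex=True)

# Anchored, one or more: only whitespace-only cells are replaced.
only_space = values.replace(r"^\s+$", np.nan, regex=True)

# Anchored, zero or more: whitespace-only and empty cells are replaced.
blank_or_space = values.replace(r"^\s*$", np.nan, regex=True)

print(pd.DataFrame({
    "value": values,
    "any_space": any_space.isna(),
    "only_space": only_space.isna(),
    "blank_or_space": blank_or_space.isna(),
}))
```

Only the last pattern turns both the whitespace-only cells and the empty string into NaN while leaving 'fo o' alone.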

+37

Philipp Schwarz Dec 14 '17 at 10:20

What about:

 d = d.applymap(lambda x: np.nan if isinstance(x, basestring) and x.isspace() else x)

The applymap function applies a function to each cell of the DataFrame.
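For Python 3, where basestring no longer exists, the same idea might look like the sketch below. The helper name blank_to_nan and the sample frame are hypothetical, and the hasattr fallback is there on the assumption that newer pandas releases expose the cell-wise method as DataFrame.map while older ones only have applymap.

```python
import numpy as np
import pandas as pd


def blank_to_nan(value):
    """Return NaN for empty or whitespace-only strings, else the value itself."""
    if isinstance(value, str) and value.strip() == "":
        return np.nan
    return value


# Hypothetical sample data mixing text, numbers, and blank cells.
df = pd.DataFrame({"B": ["foo", " ", ""], "C": [0, "  ", 4]})

# Pick whichever cell-wise method this pandas version provides.
cellwise = df.map if hasattr(df, "map") else df.applymap
cleaned = cellwise(blank_to_nan)

print(cleaned)
```

Unlike a bare isspace() test, stripping first also catches the empty string.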

+32

BrenBarn Nov 18 '12 at 23:15

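I would do the following, provided every column holds strings:

 df = df.apply(lambda x: x.str.strip()).replace('', np.nan)

or, for a frame that mixes strings with other types:

 df = df.apply(lambda col: col.map(lambda v: v.strip() if isinstance(v, str) else v)).replace('', np.nan)

The isinstance check has to run on each cell, because DataFrame.apply hands the lambda a whole column. Either way, you strip every string and then replace the resulting empty strings with np.nan.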

+14

Xiaorong Liao Apr 29 '16 at 9:34

The simplest of all solutions:

 df = df.replace(r'^\s+$', np.nan, regex=True)

+6

Gil Baggio Mar 22 '18 at 14:44

If you are loading the data from a CSV file, it can be as simple as this:

 df = pd.read_csv(file_csv, na_values=' ')

This creates the DataFrame and replaces the blank values with NaN as it reads.
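One caveat worth checking on your own data: na_values=' ' matches a field that is exactly one space. The sketch below uses an in-memory CSV (the contents are invented for illustration) to show that a field with two spaces survives as a string.

```python
import io

import pandas as pd

# Invented CSV text: column B holds 'foo', one space, two spaces, and nothing.
csv_text = "A,B\n1,foo\n2, \n3,  \n4,\n"

df = pd.read_csv(io.StringIO(csv_text), na_values=" ")

# The single-space field and the empty field are read as NaN;
# the two-space field is kept as the string '  '.
print(df["B"].isna().tolist())
print(repr(df.loc[2, "B"]))
```

If the file can contain runs of spaces of any length, one of the regex-based replace calls from the other answers can be applied after loading.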

+5

ibrahim rupawala Jan 07 '18 at 16:07

For a very quick and easy solution where you check equality against a single value, you can use the mask method.

 df.mask(df == ' ')
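As a rough illustration of what mask does here, and of how it could be stretched to a few exact values with isin, consider this sketch. The sample frame is made up, and the isin variant is an extension that is not part of the answer above.

```python
import pandas as pd

# Made-up frame: one numeric column and one text column with blank cells.
df = pd.DataFrame({"A": [0.5, 1.5, 2.5], "B": ["foo", " ", ""]})

# Cells equal to a single space become NaN; everything else is kept.
single = df.mask(df == " ")

# Extension: treat several exact values (one space, empty string) as missing.
several = df.mask(df.isin([" ", ""]))

print(single)
print(several)
```

Because this is an equality check, it will not catch cells made of two or more spaces.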

+1

Ted Petrou Nov 03 '17 at 22:48

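You can also use a boolean filter to do this.

    df = pd.DataFrame([
        [-0.532681, 'foo', 0],
        [1.490752, 'bar', 1],
        [-1.387326, 'foo', 2],
        [0.814772, 'baz', ' '],
        [-0.222552, ' ', 4],
        [-1.176781, 'qux', ' '],
    ])
    df[df == ' '] = np.nan       # the filter picks out the single-space cells
    df[2] = df[2].astype(float)  # cast the numeric column; column 1 holds text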

+1

ERIC Feb 01 '18 at 10:14

    print(df.isnull().sum())          # count the null values in each column
    modifiedDf = df.fillna("NaN")     # replace empty/null values with "NaN"
    # modifiedDf = df.dropna()        # or remove the rows with empty values
    print(modifiedDf.isnull().sum())  # count the null values in each column again

0

Jayantha Sep 29 '18 at 20:31

This is not an elegant solution, but what does seem to work is saving to XLSX and then importing it back. The other solutions on this page did not work for me, and I don't know why.

    data.to_excel(filepath, index=False)
    data = pd.read_excel(filepath)

0

David Kong Jan 14 '19 at 5:02

All of these are close to the right answer, but I would not say any of them solves the problem while staying readable to others reading your code. I would say the answer is a combination of BrenBarn's answer and tuomasttik's comment under that answer. BrenBarn's answer uses the built-in isspace, but it does not replace empty strings, as the OP requested, and I would consider that the standard use case when replacing strings with null.

I rewrote it with .apply so you can call it on a pd.Series, where the lambda runs on each value. On a pd.DataFrame, .apply passes whole columns to the lambda, so there you need applymap or a per-column apply instead.

Python 3:

To replace empty strings or strings made up entirely of spaces:

 df = df.apply(lambda x: np.nan if isinstance(x, str) and (x.isspace() or not x) else x)

To replace only strings made up entirely of spaces:

 df = df.apply(lambda x: np.nan if isinstance(x, str) and x.isspace() else x)

To use this in Python 2, you need to replace str with basestring .

Python 2:

To replace empty strings or strings made up entirely of spaces:

 df = df.apply(lambda x: np.nan if isinstance(x, basestring) and (x.isspace() or not x) else x)

To replace only strings made up entirely of spaces:

 df = df.apply(lambda x: np.nan if isinstance(x, basestring) and x.isspace() else x)
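The difference between calling .apply on a Series and on a DataFrame is easy to miss, so here is a short Python 3 sketch of it. The helper name blank_to_nan and the sample data are hypothetical; the nested apply is one possible way to reach every cell of a frame, not the only one.

```python
import numpy as np
import pandas as pd


def blank_to_nan(x):
    """NaN for empty or whitespace-only strings; anything else is returned as is."""
    return np.nan if isinstance(x, str) and (x.isspace() or not x) else x


# Series.apply calls the function once per value.
s = pd.Series(["foo", " ", "", 4], dtype=object)
cleaned_series = s.apply(blank_to_nan)

# DataFrame.apply calls the function once per column (a Series),
# so the isinstance check never fires and nothing changes.
df = pd.DataFrame({"B": ["foo", " ", ""], "C": [0, "  ", 4]})
unchanged = df.apply(blank_to_nan)

# Applying the helper to each column's values reaches every cell.
cleaned_frame = df.apply(lambda col: col.apply(blank_to_nan))

print(cleaned_series)
print(unchanged)
print(cleaned_frame)
```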

0

spen.smith May 12 '19 at 4:05

I tried this code and it worked for me:

 df.applymap(lambda x: "NaN" if x == "" else x)

-1

Matt Naj Aug 07 '19 at 20:53

patricksurry · Accepted Answer · 2014-02-21 18:48

I think df.replace() does the job:

    df = pd.DataFrame([
        [-0.532681, 'foo', 0],
        [1.490752, 'bar', 1],
        [-1.387326, 'foo', 2],
        [0.814772, 'baz', ' '],
        [-0.222552, ' ', 4],
        [-1.176781, 'qux', ' '],
    ], columns='A B C'.split(), index=pd.date_range('2000-01-01','2000-01-06'))

    # replace any field that is entirely whitespace (or empty) with NaN
    print(df.replace(r'^\s*$', np.nan, regex=True))

Produces:

                       A    B   C
    2000-01-01 -0.532681  foo   0
    2000-01-02  1.490752  bar   1
    2000-01-03 -1.387326  foo   2
    2000-01-04  0.814772  baz NaN
    2000-01-05 -0.222552  NaN   4
    2000-01-06 -1.176781  qux NaN

As Temak pointed out, use df.replace(r'^\s+$', np.nan, regex=True) if your valid data contains whitespace.
