pandas.read_html does not support a decimal separator

I read the XML file using pandas.read_html and it works almost perfectly; the problem is that the file uses commas as decimal separators instead of dots (the default in read_html).

I could easily replace commas with dots in a single file, but I have almost 200 files with this configuration. With pandas.read_csv you can define a decimal separator, but I don't know why in pandas.read_html you can only define a thousands separator.

Any guidance on this? Is there another way to automate the comma/dot replacement before the file is opened with pandas? Thanks in advance!

+6
3 answers

Thanks @zhqiat. I think updating pandas to version 0.19 will solve the problem. Unfortunately, I could not find an easy way to do that: I found a tutorial on updating pandas, but it is for Ubuntu (I am on Windows XP).

I finally chose a workaround using the method posted here, basically converting the columns one by one to a numeric pandas.Series:

 result[col] = result[col].str.replace(".", "", regex=False).str.replace(",", ".", regex=False) 

I know that this solution is not the best, but it works. Thanks
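For reference, the workaround above could be sketched end-to-end like this. The DataFrame is invented sample data standing in for a table parsed by read_html, and regex=False (available since pandas 0.23) keeps the literal '.' from being treated as a regular expression:

```python
import pandas as pd

# Invented sample data: every column arrived as strings with a '.'
# thousands separator and a ',' decimal separator.
result = pd.DataFrame({"price": ["1.401,40", "2,50"], "qty": ["1.000", "3"]})

for col in result.columns:
    # Drop the '.' grouping, turn the ',' decimal into '.',
    # then convert the whole column to a numeric dtype.
    result[col] = pd.to_numeric(
        result[col]
        .str.replace(".", "", regex=False)
        .str.replace(",", ".", regex=False)
    )
```

After the loop, every column holds real numbers instead of strings.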

+2

Looking at the read_html source code:

 def read_html(io, match='.+', flavor=None, header=None, index_col=None,
               skiprows=None, attrs=None, parse_dates=False,
               tupleize_cols=False, thousands=',', encoding=None,
               decimal='.', converters=None, na_values=None,
               keep_default_na=True): 

The signature shows that read_html does accept a decimal separator argument.

Further, the documentation shows it was added in version 0.19 (so quite recent at the time). Can you update your pandas?

decimal : str, default '.'. Character to recognize as a decimal point (for example, use ',' for European data). .. versionadded:: 0.19.0
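As a hedged sketch of the 0.19+ behaviour: the HTML snippet below is invented, and an HTML parser such as lxml must be installed for read_html to work.

```python
import io

import pandas as pd

# Invented minimal HTML table with European-style number formatting.
html = """
<table>
  <tr><th>price</th></tr>
  <tr><td>1.401,40</td></tr>
</table>
"""

# `thousands` strips the '.' grouping; `decimal` maps ',' to the decimal point.
tables = pd.read_html(io.StringIO(html), thousands=".", decimal=",")
df = tables[0]
```

read_html always returns a list of DataFrames, one per table found, hence the `tables[0]`.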

+3

I am using pandas 0.19, but it is still not possible to convert the numbers correctly.

For instance:

 a = pd.read_html(r.text, thousands='.', decimal=',') 

recognizes the value "1.401.40" in a table cell as the float 140140.

I use a solution similar to Pablo A's, just also fixing NaN values:

 def to_numeric_comma(series):
     new = series.apply(lambda x: str(x).replace('.', '').replace(',', '.'))
     new = pd.to_numeric(new.replace('nan', pd.np.nan))
     return new 
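A self-contained variant of that helper might look like this, with pd.np (removed in pandas 2.0) swapped for numpy directly; the sample Series is invented:

```python
import numpy as np
import pandas as pd

def to_numeric_comma(series):
    # Same idea as above: strip '.' thousands separators, turn ',' into '.',
    # then restore real NaN values before the numeric conversion.
    new = series.apply(lambda x: str(x).replace(".", "").replace(",", "."))
    return pd.to_numeric(new.replace("nan", np.nan))

s = pd.Series(["1.401,40", np.nan, "2,5"])
converted = to_numeric_comma(s)
```

The str(x) call turns missing values into the string "nan", which the replace then maps back to a real NaN before pd.to_numeric runs.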
0
