How to calculate the numeric value of a string with Unicode components in Python?

Following up on my previous question, "How to convert Unicode characters to float in Python?", I would like to find a more elegant way to calculate the value of a string containing numeric values in Unicode.

For example, take the strings "1⅕" and "1 ⅕". I would like both to evaluate to 1.2.

I know that I can iterate over the string character by character, check unicodedata.category(x) == "No" for each character, and convert the matching characters with unicodedata.numeric(x). Then I would have to split the string and sum the values. However, this seems rather hacky and error-prone. Is there a more elegant solution for this in Python?

+4
4 answers

I think this is what you want ...

    import unicodedata

    def eval_unicode(s):
        # sum all the unicode fractions
        u = sum(map(unicodedata.numeric, filter(lambda x: unicodedata.category(x) == "No", s)))
        # eval the regular digits (with optional dot) as a float, or default to 0
        n = float("".join(filter(lambda x: x.isdigit() or x == ".", s)) or 0)
        return n + u

or a comprehension-based version for those who prefer that style:

    import unicodedata

    def eval_unicode(s):
        # sum all the unicode fractions
        u = sum(unicodedata.numeric(i) for i in s if unicodedata.category(i) == "No")
        # eval the regular digits (with optional dot) as a float, or default to 0
        n = float("".join(i for i in s if i.isdigit() or i == ".") or 0)
        return n + u

But beware, there are many Unicode characters that don't seem to have a numeric value assigned in Python (e.g. ⅜ and ⅝ don't work for me ... or maybe this is just an issue with my keyboard).
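One way to check this on your own interpreter is to pass a default to unicodedata.numeric; the default is returned instead of a ValueError being raised when a character has no numeric value. A minimal sketch (the helper name numeric_or_none is made up for illustration):

```python
import unicodedata

def numeric_or_none(ch):
    """Return the numeric value of ch, or None if it has none (hypothetical helper)."""
    return unicodedata.numeric(ch, None)

# Print each character with its general category and numeric value, if any.
for ch in u"⅕⅜⅝x":
    print(repr(ch), unicodedata.category(ch), numeric_or_none(ch))
```

If ⅜ and ⅝ show a numeric value here, the data is present and the problem is more likely in how the characters were typed or encoded.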

Another implementation note: it is "too robust". It works even on malformed numbers such as "123½3 ½", which it evaluates to 1234.0, but it fails if the string contains more than one decimal point.
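To make that caveat concrete, here is the function above run on a well-formed input, the malformed input, and a string with two decimal points. The function body is repeated only so the snippet runs on its own:

```python
import unicodedata

def eval_unicode(s):
    # Same logic as the answer's function, repeated so this snippet is self-contained.
    u = sum(unicodedata.numeric(i) for i in s if unicodedata.category(i) == "No")
    n = float("".join(i for i in s if i.isdigit() or i == ".") or 0)
    return n + u

print(eval_unicode(u"1⅕"))       # digits "1" plus the fraction 1/5
print(eval_unicode(u"123½3 ½"))  # digits collapse to "1233", both halves are added

try:
    eval_unicode(u"1.2.3")       # "1.2.3" reaches float() unchanged
except ValueError as exc:
    print("rejected:", exc)
```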

+2
    >>> import unicodedata
    >>> b = u'10 ⅕'
    >>> int(b[:-1]) + unicodedata.numeric(b[-1])
    10.2

    def convert_dubious_strings(s):
        try:
            return int(s)
        except ValueError:  # also covers Python 2's UnicodeEncodeError, a ValueError subclass
            return int(s[:-1]) + unicodedata.numeric(s[-1])

If the string may lack the integer part, you need to add another try-except block.
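A rough sketch of that extension, assuming the fraction is always the last character. It uses an emptiness check on the integer part rather than a second try-except, and the name convert_fraction_string is invented for this example:

```python
import unicodedata

def convert_fraction_string(s):
    """Parse 'N', 'N f' or 'f', where f is one trailing fraction character (hypothetical helper)."""
    s = s.strip()
    if not s:
        raise ValueError("empty string")
    try:
        return int(s)                      # plain integer such as "10"
    except ValueError:
        pass
    fraction = unicodedata.numeric(s[-1])  # raises ValueError if the last char is not numeric
    head = s[:-1].strip()                  # whatever precedes the fraction
    whole = int(head) if head else 0       # assumption: a missing integer part counts as 0
    return whole + fraction

print(convert_fraction_string(u"10 ⅕"))
print(convert_fraction_string(u"⅕"))
```

Anything other than digits before the fraction still raises ValueError from int(), so garbage is not silently ignored.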

+1

This may be enough for you, depending on the range of edge cases you want to handle:

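    import unicodedata

    my_unicode_string = u"1 ⅕"  # example input
    val = 0
    for c in my_unicode_string:
        if unicodedata.category(c) == 'No':
            cval = unicodedata.numeric(c)
        elif c.isdigit():
            cval = int(c)
        else:
            continue
        if cval == int(cval):
            # whole-number value: shift left so it becomes the next digit
            val *= 10
        val += cval
    print(val)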

Integer digits are treated as the next digit of the number; fraction characters are treated as fractions to add to the number. It does not do the right thing with spaces between numbers, repeated fractions, etc.

0

I think you'll need a regular expression that explicitly lists the characters you want to support. Not all numeric characters are suitable for the kind of composition you have in mind. For example, what should the numeric value of this string be?

    u"4\N{CIRCLED NUMBER FORTY TWO}2\N{SUPERSCRIPT SIX}"

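Run the following (written for Python 3; on Python 2, use unichr in place of chr) and pick from the output the characters you really want to support:

    import unicodedata

    for i in range(65536):
        if unicodedata.category(chr(i)) == 'No':
            print(hex(i), unicodedata.name(chr(i)))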
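As an illustration of the explicit-list idea, here is a minimal sketch for Python 3. The set of supported fractions, the pattern, and the name parse_mixed_number are all assumptions made for this example, not something taken from the question:

```python
import re
import unicodedata

# Assumption: only the common vulgar fractions are supported; edit this list as needed.
FRACTIONS = "¼½¾⅓⅔⅕⅖⅗⅘⅙⅚⅛⅜⅝⅞"

# Optional ASCII integer, optional whitespace, optional single fraction character.
NUMBER_RE = re.compile(r"\s*([0-9]+)?\s*([%s])?\s*" % FRACTIONS)

def parse_mixed_number(s):
    """Parse strings like '1⅕', '1 ⅕', '7' or '⅜'; reject everything else."""
    m = NUMBER_RE.fullmatch(s)
    if m is None or not (m.group(1) or m.group(2)):
        raise ValueError("unsupported number: %r" % s)
    whole = int(m.group(1)) if m.group(1) else 0
    frac = unicodedata.numeric(m.group(2)) if m.group(2) else 0.0
    return whole + frac

print(parse_mixed_number("1⅕"))
print(parse_mixed_number("1 ⅕"))
```

With this approach the circled-number string above is rejected with a ValueError instead of being given an arbitrary value.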

-1
