Strip all non-numeric characters (except ".") from a string in Python

I have a working snippet that does the job, but I was wondering if anyone has better suggestions on how to do this:

val = ''.join([c for c in val if c in '1234567890.']) 

What would you do?

+53
python
Jun 03 '09 at 23:12
6 answers

You can use a regular expression (using the re module) to accomplish the same thing. The example below matches runs of [^\d.] (any character that is not a decimal digit or a period) and replaces them with an empty string. Note that if the pattern is compiled with the UNICODE flag, the resulting string may still include non-ASCII digits. Also, the result after removing the "non-numeric" characters is not necessarily a valid number.

    >>> import re
    >>> non_decimal = re.compile(r'[^\d.]+')
    >>> non_decimal.sub('', '12.34fe4e')
    '12.344'
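
To make the two caveats concrete, here is a minimal sketch, assuming Python 3, where \d in a str pattern already matches any Unicode decimal digit unless re.ASCII is given. The sample string and the helper name to_float_or_none are made up for illustration and are not part of the answer above.

```python
import re

# Python 3 default for str patterns: \d matches any Unicode decimal digit.
non_decimal = re.compile(r'[^\d.]+')
# With re.ASCII, \d is restricted to the characters 0-9.
non_decimal_ascii = re.compile(r'[^\d.]+', re.ASCII)

# Sample input containing ARABIC-INDIC DIGIT THREE (U+0663).
sample = 'x\u0663y4.5'
kept_unicode = non_decimal.sub('', sample)      # the U+0663 digit survives
kept_ascii = non_decimal_ascii.sub('', sample)  # only ASCII digits and '.' survive


def to_float_or_none(text):
    """Hypothetical helper: strip non-numeric characters, then try to parse.

    Returns None when what is left is not a valid number
    (for example '1.2.3' or an empty string).
    """
    cleaned = non_decimal_ascii.sub('', text)
    try:
        return float(cleaned)
    except ValueError:
        return None
```

Stripping and validating are separate steps: the substitution only filters characters, so anything that must be a number still needs a parse attempt afterwards.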
+105
Jun 03 '09 at 23:14

Here is some sample code:

    $ cat a.py
    a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
    for i in xrange(1000000):
        ''.join([c for c in a if c in '1234567890.'])



    $ cat b.py
    import re

    non_decimal = re.compile(r'[^\d.]+')

    a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
    for i in xrange(1000000):
        non_decimal.sub('', a)



    $ cat c.py
    a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
    for i in xrange(1000000):
        ''.join([c for c in a if c.isdigit() or c == '.'])



    $ cat d.py
    a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
    for i in xrange(1000000):
        b = []
        for c in a:
            # keep only digits and '.'
            if c.isdigit() or c == '.':
                b.append(c)
        ''.join(b)

And the timing results:




    $ time python a.py
    real    0m24.735s
    user    0m21.049s
    sys     0m0.456s

    $ time python b.py
    real    0m10.775s
    user    0m9.817s
    sys     0m0.236s

    $ time python c.py
    real    0m38.255s
    user    0m32.718s
    sys     0m0.724s

    $ time python d.py
    real    0m46.040s
    user    0m41.515s
    sys     0m0.832s



Looks like the regex is the winner.

Personally, I find the regex just as readable as the list comprehension. If you are doing this only a few times, the cost of compiling the regex will probably outweigh the gain. Do what fits your code and coding style.
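
For anyone wanting to repeat the comparison on a current interpreter, here is a rough sketch using the standard timeit module, assuming Python 3 (where xrange no longer exists). The function names and the iteration count are my own choices, and no timings are claimed here; the numbers will differ from the 2009 figures above.

```python
import re
import timeit

a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
non_decimal = re.compile(r'[^\d.]+')


def with_comprehension(s):
    # Same filter as a.py: membership test against a literal string.
    return ''.join([c for c in s if c in '1234567890.'])


def with_regex(s):
    # Same filter as b.py: precompiled pattern, removes runs of other characters.
    return non_decimal.sub('', s)


def with_isdigit(s):
    # Same filter as c.py: str.isdigit() plus an explicit check for '.'.
    return ''.join([c for c in s if c.isdigit() or c == '.'])


candidates = [with_comprehension, with_regex, with_isdigit]

if __name__ == '__main__':
    for func in candidates:
        # Iteration count chosen arbitrarily to keep the run short.
        seconds = timeit.timeit(lambda: func(a), number=20000)
        print(f'{func.__name__}: {seconds:.3f}s')
```

Measuring inside one process with timeit leaves interpreter startup out of the figures, which the `time python x.py` runs include.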

+13
Jun 03 '09 at 23:44

Another "pythonic" approach

filter( lambda x: x in '0123456789.', s )

but regex is faster.
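
One thing to be aware of: on Python 2, filter() applied to a str returns a str, but on Python 3 it returns a lazy iterator, so the result has to be joined back together. A minimal sketch, assuming Python 3 (the function name keep_numeric is hypothetical):

```python
def keep_numeric(s):
    """Keep only ASCII digits and '.' from s, using filter()."""
    # filter() yields the characters that pass the test; join rebuilds the string.
    return ''.join(filter(lambda ch: ch in '0123456789.', s))


result = keep_numeric('12.34fe4e')
```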

+13
Jun 04 '09 at 6:24

My solution is a simple one using regex:

    import re

    # remove everything that is not a digit or a period
    re.sub("[^0-9.]", "", data)
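
One detail to watch in a negated character class: only a ^ placed immediately after the opening bracket negates the class. A ^ anywhere later in the class is a literal caret, so carets in the input would be kept rather than stripped. A small sketch of the difference (the sample string is made up for illustration):

```python
import re

data = '3^2 = 9.0'

# Second ^ is a literal: the class excludes digits, '^' and '.', so '^' is kept.
with_extra_caret = re.sub(r'[^0-9^.]', '', data)

# Only the leading ^ negates: everything except digits and '.' is removed.
digits_and_dot_only = re.sub(r'[^0-9.]', '', data)
```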
+5
Feb 22 '16 at 11:34
    import string

    filter(lambda c: c in string.digits + '.', s)
+3
Jan 03 '12 at 21:15

If the set of characters were larger, using a set as shown below might be faster. As it is, this is a bit slower than a.py.

    dec = set('1234567890.')

    a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
    for i in xrange(1000000):
        ''.join(ch for ch in a if ch in dec)

At least on my system, you can save a tiny bit of time (and memory, if your string is long enough to matter) by using a generator expression instead of the list comprehension in a.py:

    a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
    for i in xrange(1000000):
        ''.join(c for c in a if c in '1234567890.')

Oh, and here is the fastest way I have found by far on this test string (much faster than the regex), if you are doing this many times and are willing to put up with the overhead of building a couple of character tables.

    # identity translation table covering all 256 byte values
    chrs = ''.join(chr(i) for i in xrange(256))
    # every character that is not a digit or '.'
    deletable = ''.join(ch for ch in chrs if ch not in '1234567890.')

    a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
    for i in xrange(1000000):
        a.translate(chrs, deletable)

On my system, this runs in about 1.0 second, whereas the regex version in b.py runs in about 4.3 seconds.
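
The two-argument a.translate(chrs, deletable) call is the Python 2 str API. On Python 3, str.translate takes a single mapping, while the table-plus-delete form is available on bytes.translate. The sketch below shows one possible adaptation of each, assuming Python 3; the class and function names are hypothetical, and I make no claim about how the timings compare.

```python
KEEP = '1234567890.'


class KeepOnly(dict):
    """Translation table: listed characters map to themselves, the rest are deleted."""

    def __init__(self, keep):
        super().__init__((ord(ch), ch) for ch in keep)

    def __missing__(self, codepoint):
        # str.translate deletes a character when the table returns None for it.
        return None


keep_table = KeepOnly(KEEP)


def strip_str(s):
    # Works for any str, including non-ASCII input.
    return s.translate(keep_table)


# bytes variant: no translation table (None), just a set of byte values to delete.
KEEP_BYTES = KEEP.encode('ascii')
DELETABLE = bytes(b for b in range(256) if b not in KEEP_BYTES)


def strip_bytes(data):
    return data.translate(None, DELETABLE)


a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
from_str = strip_str(a)
from_bytes = strip_bytes(a.encode('ascii'))
```

The bytes variant stays closest to the original idea of precomputing what to delete; the str variant relies on a dict subclass so that code points outside the kept set do not have to be enumerated.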

+2
Jun 04 '09 at 16:49


