Strip all non-numeric characters (except ".") from a string in Python

Question

I have code that works fine, but I was wondering if anyone has a better suggestion for how to do this:

val = ''.join([c for c in val if c in '1234567890.'])

What would you do?

+53

python

adam Jun 03 '09 at 23:12

6 answers

Here is some sample code:

    $ cat a.py
    a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
    for i in xrange(1000000):
        ''.join([c for c in a if c in '1234567890.'])

    $ cat b.py
    import re
    non_decimal = re.compile(r'[^\d.]+')
    a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
    for i in xrange(1000000):
        non_decimal.sub('', a)

    $ cat c.py
    a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
    for i in xrange(1000000):
        ''.join([c for c in a if c.isdigit() or c == '.'])

    $ cat d.py
    a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
    for i in xrange(1000000):
        b = []
        for c in a:
            # keep only digits and the period
            if c.isdigit() or c == '.':
                b.append(c)
        ''.join(b)

And the timing results:

    $ time python a.py
    real    0m24.735s
    user    0m21.049s
    sys     0m0.456s

    $ time python b.py
    real    0m10.775s
    user    0m9.817s
    sys     0m0.236s

    $ time python c.py
    real    0m38.255s
    user    0m32.718s
    sys     0m0.724s

    $ time python d.py
    real    0m46.040s
    user    0m41.515s
    sys     0m0.832s

It looks like the regex is the winner.

Personally, I find the regex just as readable as the list comprehension. If you only do this a few times, the cost of compiling the regex will probably outweigh the gain. Do what jives with your code and coding style.
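
The scripts above are Python 2 (they use xrange). A minimal sketch of the same comparison on Python 3 using timeit might look like the following. The function names and the benchmark helper are hypothetical, and no timings are claimed here, since they depend on the interpreter and machine.

```python
import re
import timeit

SAMPLE = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
NON_DECIMAL = re.compile(r'[^\d.]+')  # compiled once, as in b.py


def with_comprehension(s):
    # a.py: membership test against a literal string
    return ''.join([c for c in s if c in '1234567890.'])


def with_regex(s):
    # b.py: delete runs of characters that are neither digits nor '.'
    return NON_DECIMAL.sub('', s)


def with_isdigit(s):
    # c.py: per-character isdigit() check
    return ''.join([c for c in s if c.isdigit() or c == '.'])


def benchmark(number=100000):
    """Return {function name: seconds} for `number` calls on SAMPLE."""
    candidates = (with_comprehension, with_regex, with_isdigit)
    return {
        func.__name__: timeit.timeit(lambda func=func: func(SAMPLE), number=number)
        for func in candidates
    }


if __name__ == '__main__':
    for name, seconds in benchmark().items():
        print('%s: %.3fs' % (name, seconds))
```

Using timeit inside one process avoids counting interpreter start-up, which the `time python x.py` measurements include.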

+13

Colin Burnett Jun 03 '09 at 23:44

Another "Pythonic" approach:

filter( lambda x: x in '0123456789.', s )

but regex is faster.
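
One thing to keep in mind: on Python 2, filter() applied to a str returns a str, but on Python 3 it returns a lazy iterator, so the result has to be joined back together. A small sketch, where keep_numeric and ALLOWED are illustrative names rather than anything from the answer:

```python
ALLOWED = '0123456789.'


def keep_numeric(s):
    # filter() yields the allowed characters one by one; join rebuilds the str
    return ''.join(filter(lambda ch: ch in ALLOWED, s))
```

For example, keep_numeric('12.34fe4e') gives '12.344'.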

+13

maxp Jun 04 '09 at 6:24

A simple solution using regex:

    import re
    re.sub("[^0-9.]", "", data)
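
A detail worth checking in character classes like this one: only a leading ^ negates the class. Any later ^ is a literal caret, so it ends up among the characters that are kept. A short sketch with a made-up input string:

```python
import re

data = 'a^1.5^b'  # hypothetical input containing carets

# A second ^ inside the class is literal, so carets survive the substitution.
with_extra_caret = re.sub(r'[^0-9^.]', '', data)

# With a single leading ^, only digits and '.' survive.
digits_and_dot = re.sub(r'[^0-9.]', '', data)
```

Here with_extra_caret is '^1.5^' while digits_and_dot is '1.5'.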

+5

Midhun Mohan Feb 22 '16 at 11:34

    import string
    filter(lambda c: c in string.digits + '.', s)

+3

minism Jan 03 '12 at 21:15

If the set of characters were larger, using a set as below might be faster. As it is, this is a bit slower than a.py.

    dec = set('1234567890.')
    a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
    for i in xrange(1000000):
        ''.join(ch for ch in a if ch in dec)

At least on my system, you can save a tiny bit of time (and memory, if your string were long enough to matter) by using a generator expression instead of the list comprehension in a.py:

    a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
    for i in xrange(1000000):
        ''.join(c for c in a if c in '1234567890.')

Oh, and here is the fastest way I have found by far on this test string (much faster than the regex), if you are doing this many, many times and are willing to put up with the overhead of building a couple of character tables.

    # identity translation table covering all 256 byte values
    chrs = ''.join(chr(i) for i in xrange(256))
    # every character that is not a digit or '.'
    deletable = ''.join(ch for ch in chrs if ch not in '1234567890.')
    a = '27893jkasnf8u2qrtq2ntkjh8934yt8.298222rwagasjkijw'
    for i in xrange(1000000):
        a.translate(chrs, deletable)

On my system, this runs in ~1.0 seconds, whereas the regex in b.py runs in ~4.3 seconds.
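
The two-argument str.translate call above is Python 2 only. On Python 3, a roughly comparable sketch can use bytes.translate, which still accepts a set of bytes to delete. The helper name strip_non_numeric and the ASCII-only handling are assumptions of this sketch, not part of the answer, and it has not been timed against the regex.

```python
KEEP = b'0123456789.'
# every byte value that is not a digit or '.'
DELETE = bytes(b for b in range(256) if b not in KEEP)


def strip_non_numeric(text):
    # Assumption: only ASCII digits matter, so non-ASCII characters are
    # dropped during encoding; they would be deleted anyway.
    raw = text.encode('ascii', 'ignore')
    # table=None means "no remapping"; DELETE lists the bytes to remove
    return raw.translate(None, DELETE).decode('ascii')
```

For the test string used above, strip_non_numeric returns '2789382289348.298222'.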

+2

Jun 04 '09 at 16:49

Miles · Accepted Answer · 2009-06-03 23:14

You can use a regular expression (using the re module) to accomplish the same thing. The example below matches runs of [^\d.] (any character that is not a decimal digit or a period) and replaces them with the empty string. Note that if the pattern is compiled with the UNICODE flag, the resulting string could still include non-ASCII digits. Also, the result after removing "non-numeric" characters is not necessarily a valid number.

    >>> import re
    >>> non_decimal = re.compile(r'[^\d.]+')
    >>> non_decimal.sub('', '12.34fe4e')
    '12.344'
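
Both caveats can be seen on Python 3, where str patterns are Unicode-aware by default and the re.ASCII flag restricts \d to 0-9. A small sketch, using a made-up sample string and a hypothetical is_valid_number helper:

```python
import re

unicode_digits = re.compile(r'[^\d.]+')          # \d matches any Unicode decimal digit
ascii_digits = re.compile(r'[^\d.]+', re.ASCII)  # \d matches only 0-9

# Sample containing ARABIC-INDIC DIGIT THREE (U+0663) and two periods
sample = 'v\u06634.5.6x'

kept_unicode = unicode_digits.sub('', sample)  # the non-ASCII digit survives
kept_ascii = ascii_digits.sub('', sample)      # only ASCII digits and '.' survive


def is_valid_number(text):
    # float() rejects strings such as '4.5.6'
    try:
        float(text)
    except ValueError:
        return False
    return True
```

Here kept_ascii is '4.5.6', which contains only digits and periods but is still not a valid number, so a validation step such as is_valid_number is worth adding if the result is going to be parsed.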
