The fastest way to check if a string contains certain characters in any of the list items

Question

The fastest way to check if a string contains certain characters in any of the list items

What is the fastest way to check if a string contains any characters from any list items?

I am currently using this method:

lestring = "Text123" lelist = ["Text", "foo", "bar"] for x in lelist: if lestring.count(x): print 'Yep. "%s" contains characters from "%s" item.' % (lestring, x)

Is there a way to do this without iteration (which will make it faster, I suppose.)?

+8

performance python list iteration

Muhammad taha Jan 19 '13 at 6:01

source share

4 answers

 filter(lambda x: lestring.count(x), lelist)

This will return all the lines you are trying to find in a list.

+1

Joe Jan 19 '13 at 6:08

source share

if the test is to see if there are any common characters (not words or segments), create a set of letters in the list, and then check the letters again for a string:

 char_list = set(''.join(list_of_words)) test_set = set(string_to_teat) common_chars = char_list.intersection(test_set)

However, I assume that you are looking for just one character ...

0

theodox Jan 19 '13 at 6:20

source share

The esmre library does the trick. In your case it’s simpler, esm (part of esmre) is what you want.

https://pypi.python.org/pypi/esmre/

https://code.google.com/p/esmre/

They have good documentation and examples: From their examples:

 >>> import esm >>> index = esm.Index() >>> index.enter("he") >>> index.enter("she") >>> index.enter("his") >>> index.enter("hers") >>> index.fix() >>> index.query("this here is history") [((1, 4), 'his'), ((5, 7), 'he'), ((13, 16), 'his')] >>> index.query("Those are his sheep!") [((10, 13), 'his'), ((14, 17), 'she'), ((15, 17), 'he')] >>>

I conducted several performance tests:

 import random, timeit, string, esm def uz(lelist, lestring): for x in lelist: if lestring.count(x): return 'Yep. "%s" contains characters from "%s" item.' % (lestring, x) def ab(lelist, lestring): return [e for e in lelist if e in lestring] def use_esm(index, lestring): return index.query(lestring) for TEXT_LEN in [5, 50, 1000]: for SEARCH_LEN in [5, 20]: for N in [5, 50, 1000, 10000]: if TEXT_LEN < SEARCH_LEN: continue print 'TEXT_LEN:', TEXT_LEN, 'SEARCH_LEN:', SEARCH_LEN, 'N:', N lestring = ''.join((random.choice(string.ascii_uppercase + string.digits) for _ in range(TEXT_LEN))) lelist = [''.join((random.choice(string.ascii_uppercase + string.digits) for _ in range(SEARCH_LEN))) for _ in range(N)] index = esm.Index() for i in lelist: index.enter(i) index.fix() t_ab = timeit.Timer("ab(lelist, lestring)", setup="from __main__ import lelist, lestring, ab") t_uz = timeit.Timer("uz(lelist, lestring)", setup="from __main__ import lelist, lestring, uz") t_esm = timeit.Timer("use_esm(index, lestring)", setup="from __main__ import index, lestring, use_esm") ab_time = t_ab.timeit(1000) uz_time = t_uz.timeit(1000) esm_time = t_esm.timeit(1000) min_time = min(ab_time, uz_time, esm_time) print ' ab%s: %f' % ('*' if ab_time == min_time else '', ab_time) print ' uz%s: %f' % ('*' if uz_time == min_time else '', uz_time) print ' esm%s %f:' % ('*' if esm_time == min_time else '', esm_time)

And the results obtained depend mainly on the number of items you are looking for (in my case, "N"):

 TEXT_LEN: 1000 SEARCH_LEN: 20 N: 5 ab*: 0.001733 uz: 0.002512 esm 0.126853: TEXT_LEN: 1000 SEARCH_LEN: 20 N: 50 ab*: 0.017564 uz: 0.023701 esm 0.079925: TEXT_LEN: 1000 SEARCH_LEN: 20 N: 1000 ab: 0.370371 uz: 0.489523 esm* 0.133783: TEXT_LEN: 1000 SEARCH_LEN: 20 N: 10000 ab: 3.678790 uz: 4.883575 esm* 0.259605:

0

moomima May 12, '14 at 10:32

source share

Abhijit · Accepted Answer · 2013-01-19T06:08:58+0000

You can check the membership verification list.

 >>> lestring = "Text123" >>> lelist = ["Text", "foo", "bar"] >>> [e for e in lelist if e in lestring] ['Text']

Compared to your implementation, although LC has an implicit loop, it is faster because there is no explicit function call, as in your case with count

Compared to the Joe implementation, your method is faster since the filter function will require calling two functions in a loop, lambda and count

 >>> def joe(lelist, lestring): return ''.join(random.sample(x + 'b'*len(x), len(x))) >>> def uz(lelist, lestring): for x in lelist: if lestring.count(x): return 'Yep. "%s" contains characters from "%s" item.' % (lestring, x) >>> def ab(lelist, lestring): return [e for e in lelist if e in lestring] >>> t_ab = timeit.Timer("ab(lelist, lestring)", setup="from __main__ import lelist, lestring, ab") >>> t_uz = timeit.Timer("uz(lelist, lestring)", setup="from __main__ import lelist, lestring, uz") >>> t_joe = timeit.Timer("joe(lelist, lestring)", setup="from __main__ import lelist, lestring, joe") >>> t_ab.timeit(100000) 0.09391469893125759 >>> t_uz.timeit(100000) 0.1528471407273173 >>> t_joe.timeit(100000) 1.4272649857800843

Jamie commented on the solution more slowly for shorter lines. Here is the test result

 >>> def jamie(lelist, lestring): return next(itertools.chain((e for e in lelist if e in lestring), (None,))) is not None >>> t_jamie = timeit.Timer("jamie(lelist, lestring)", setup="from __main__ import lelist, lestring, jamie") >>> t_jamie.timeit(100000) 0.22237164127909637

If you need boolean values, for shorter lines just change the specified LC expression

 [e in lestring for e in lelist if e in lestring]

Or for longer lines you can do the following

 >>> next(e in lestring for e in lelist if e in lestring) True

or

 >>> any(e in lestring for e in lelist)

The fastest way to check if a string contains certain characters in any of the list items

More articles: