Is it faster for connection sets or checks the entire list for duplicate?

Question

Is it faster for connection sets or checks the entire list for duplicate?

Sorry for the poorly worded title, but I asked about getting a unique list of items from two lists. People told me to make a list → sets, and then combine.

So now I am wondering if this is faster:

When adding a single item to the list, view the entire list for duplicates.
Make one element a set, and then sets of unions.

I probably should just read the sets retroactively ...

In Python, by the way, it’s a pity that I didn’t specify it.

+6

python set

funk-shun Jan 12 '11 at 21:22

source share

4 answers

I really like the virhilo approach, but this is a rather specific set of data that it tested. In all of this, do not just check the functions, but check them how you do it. I put together a much more comprehensive test suite. It performs each function you specify (using a small decorator) through a list of comparisons and determines how long each function takes, and therefore how slower it is. As a result, it is not always clear what function you should perform without knowing more about the size, overlap, and type of your data.

Here is my test program, below will be the output.

 from timeit import Timer from copy import copy import random import sys funcs = [] class timeMe(object): def __init__(self, f): funcs.append(f) self.f = f def __call__(self, *args, **kwargs): return self.f(*args, **kwargs) @timeMe def extend_list_then_set(input1, input2): """ extending one list by another end then remove duplicates by making set """ l1 = copy(input1) l2 = copy(input2) l1.extend(l2) set(l1) @timeMe def per_element_append_to_list(input1, input2): """ checking if element is on one list end adding it only if not """ l1 = copy(input1) l2 = copy(input2) for elem in l2: if elem not in l1: l1.append(elem) @timeMe def union_sets(input1, input2): """ making sets from both lists and then union from them """ l1 = copy(input1) l2 = copy(input2) set(l1) | set(l2) @timeMe def set_from_one_add_from_two(input1, input2): """ make set from list 1, then add elements for set 2 """ l1 = copy(input1) l2 = copy(input2) l1 = set(l1) for element in l2: l1.add(element) @timeMe def set_from_one_union_two(input1, input2): """ make set from list 1, then union list 2 """ l1 = copy(input1) l2 = copy(input2) x = set(l1).union(l2) @timeMe def chain_then_set(input1, input2): """ chain l1 & l2, then make a set out of that """ l1 = copy(input1) l2 = copy(input2) set(itertools.chain(l1, l2)) def run_results(l1, l2, times): for f in funcs: t = Timer('%s(l1, l2)' % f.__name__, 'from __main__ import %s; l1 = %s; l2 = %s' % (f.__name__, l1, l2)) yield (f.__name__, t.timeit(times)) test_datasets = [ ('original, small, some overlap', range(200), range(150, 250), 10000), ('no overlap: l1 = [1], l2 = [2..100]', [1], range(2, 100), 10000), ('lots of overlap: l1 = [1], l2 = [1]*100', [1], [1]*100, 10000), ('50 random ints below 2000 in each', [random.randint(0, 2000) for x in range(50)], [random.randint(0, 2000) for x in range(50)], 10000), ('50 elements in each, no overlap', range(50), range(51, 100), 10000), ('50 elements in each, total overlap', range(50), range(50), 10000), ('500 random ints below 500 in each', [random.randint(0, 500) for x in range(500)], [random.randint(0, 500) for x in range(500)], 1000), ('500 random ints below 2000 in each', [random.randint(0, 2000) for x in range(500)], [random.randint(0, 2000) for x in range(500)], 1000), ('500 random ints below 200000 in each', [random.randint(0, 200000) for x in range(500)], [random.randint(0, 200000) for x in range(500)], 1000), ('500 elements in each, no overlap', range(500), range(501, 1000), 10000), ('500 elements in each, total overlap', range(500), range(500), 10000), ('10000 random ints below 200000 in each', [random.randint(0, 200000) for x in range(10000)], [random.randint(0, 200000) for x in range(10000)], 50), ('10000 elements in each, no overlap', range(10000), range(10001, 20000), 10), ('10000 elements in each, total overlap', range(10000), range(10000), 10), ('original lists 100 times', range(200)*100, range(150, 250)*100, 10), ] fullresults = [] for description, l1, l2, times in test_datasets: print "Now running %s times: %s" % (times, description) results = list(run_results(l1, l2, times)) speedresults = [x for x in sorted(results, key=lambda x: x[1])] for name, speed in results: finish = speedresults.index((name, speed)) + 1 timesslower = speed / speedresults[0][1] fullresults.append((description, name, speed, finish, timesslower)) print '\t', finish, ('%.2fx' % timesslower).ljust(10), name.ljust(40), speed print import csv out = csv.writer(sys.stdout) out.writerow(('Test', 'Function', 'Speed', 'Place', 'timesslower')) out.writerows(fullresults)

results

My point is to encourage you to test your data, so I do not want to describe the specifics. However ... The first extension method is the fastest middle method, but set_from_one_union_two ( x = set(l1).union(l2) ) wins several times. You can get more details if you run the script yourself.

The numbers I'm reporting are this number of times slower than this function than the most complete function of this test. If it were the fastest, it would be 1.

  Functions Tests extend_list_then_set per_element_append_to_list set_from_one_add_from_two set_from_one_union_two union_sets chain_then_set original, small, some overlap 1 25.04 1.53 1.18 1.39 1.08 no overlap: l1 = [1], l2 = [2..100] 1.08 13.31 2.10 1 1.27 1.07 lots of overlap: l1 = [1], l2 = [1]*100 1.10 1.30 2.43 1 1.25 1.05 50 random ints below 2000 in each 1 7.76 1.35 1.20 1.31 1 50 elements in each, no overlap 1 9.00 1.48 1.13 1.18 1.10 50 elements in each, total overlap 1.08 4.07 1.64 1.04 1.41 1 500 random ints below 500 in each 1.16 68.24 1.75 1 1.28 1.03 500 random ints below 2000 in each 1 102.42 1.64 1.43 1.81 1.20 500 random ints below 200000 in each 1.14 118.96 1.99 1.52 1.98 1 500 elements in each, no overlap 1.01 145.84 1.86 1.25 1.53 1 500 elements in each, total overlap 1 53.10 1.95 1.16 1.57 1.05 10000 random ints below 200000 in each 1 2588.99 1.73 1.35 1.88 1.12 10000 elements in each, no overlap 1 3164.01 1.91 1.26 1.65 1.02 10000 elements in each, total overlap 1 1068.67 1.89 1.26 1.70 1.05 original lists 100 times 1.11 2068.06 2.03 1 1.04 1.17 Average 1.04 629.25 1.82 1.19 1.48 1.06 Standard Deviation 0.05 1040.76 0.26 0.15 0.26 0.05 Max 1.16 3164.01 2.43 1.52 1.98 1.20

+8

chmullig Jan 13 '11 at 3:08

source share

The fastest thing you can do is build two sets of lists and accept combining them. Both sets of builds from the list and union sets are implemented at runtime in a very optimized C, so it is very fast.

In code, if lists are l1 and l2 , you can do

 unique_elems = set(l1) | set(l2)

EDIT: as @kriss notes, extending l1 with l2 is faster. However, this code does not change l1 and also works if l1 and l2 are universal iterators.

+1

Giuseppe ottaviano Jan 12 '11 at 21:29

source share

It all depends on what you have as input and want as output.

If you have the li list at the beginning and want to get the modified list at the end, then the faster method if not elt in li: li.append(elt) problem is converting the original list to a set, and then converting back to a list that is way too slow.

But if you can work with set s at any time (you don't need the list order, and the methods that get it just need some iterable), just doing s.add(elt) is faster.

If at the beginning you need a list and want to get the list at the end, even with the final conversion from the list installed in the list, it’s faster to manage the unity of elements with sets, but you can easily check by @virhilo in it than concatenating two lists with using the extension, then converting the result for the installation is faster than converting two lists into a set and performing the union.

I don’t know exactly what the limitations of your programs are, but if unity is as important as it seems, and if preserving the insertion order is not required, it will be useful for you to use sets at all times, never changing them to lists. Most algorithms will work anyway, thanks to Duck Typing, as both are different types of iterations.

+1

kriss Jan 12 '11 at 22:14

source share

virhilo · Accepted Answer · 2011-01-12T21:31:46+0000

since you can see that one of them is being forwarded to the other end, then remove duplicates, making the set this is the fastest way (at least in python;))

>>> def foo(): ... """ ... extending one list by another end then remove duplicates by making set ... """ ... l1 = range(200) ... l2 = range(150, 250) ... l1.extend(l2) ... set(l1) ... >>> def bar(): ... """ ... checking if element is on one list end adding it only if not ... """ ... l1 = range(200) ... l2 = range(150, 250) ... for elem in l2: ... if elem not in l1: ... l1.append(elem) ... >>> def baz(): ... """ ... making sets from both lists and then union from them ... """ ... l1 = range(200) ... l2 = range(150, 250) ... set(l1) | set(l2) ... >>> from timeit import Timer >>> Timer(foo).timeit(10000) 0.265153169631958 >>> Timer(bar).timeit(10000) 7.921358108520508 >>> Timer(baz).timeit(10000) 0.3845551013946533 >>>

Is it faster for connection sets or checks the entire list for duplicate?

results

More articles: