The fastest way to get list merging is Python

There is a C ++ comparison for combining lists from list lists: The fastest way to find a union of sets

And there are a few more python related questions, but none of them offer the fastest way to merge lists:

  • Finding combining list of lists in Python
  • Python small list smoothing

From the answers, I realized that there are at least 2 ways to do this:

>>> from itertools import chain >>> x = [[1,2,3], [3,4,5], [1,7,8]] >>> list(set().union(*x)) [1, 2, 3, 4, 5, 7, 8] >>> list(set(chain(*x))) [1, 2, 3, 4, 5, 7, 8] 

Please note that I force the list after the list, because I need the list list to be fixed for further processing.

After some comparison, it seems that list(set(chain(*x))) more stable and takes less time:

 from itertools import chain import time import random # Dry run. x = [[random.choice(range(10000)) for i in range(10)] for j in range(10)] list(set().union(*x)) list(set(chain(*x))) y_time = 0 z_time = 0 for _ in range(1000): x = [[random.choice(range(10000)) for i in range(10)] for j in range(10)] start = time.time() y = list(set().union(*x)) y_time += time.time() - start #print 'list(set().union(*x)):\t', y_time start = time.time() z = list(set(chain(*x))) z_time += time.time() - start #print 'list(set(chain(*x))):\t', z_time assert sorted(y) == sorted(z) #print print y_time / 1000. print z_time / 1000. 

[exit]:

 1.39586925507e-05 1.09834671021e-05 

Taking out the variable of casting kits to the list:

 y_time = 0 z_time = 0 for _ in range(1000): x = [[random.choice(range(10000)) for i in range(10)] for j in range(10)] start = time.time() y = set().union(*x) y_time += time.time() - start start = time.time() z = set(chain(*x)) z_time += time.time() - start assert sorted(y) == sorted(z) print y_time / 1000. print z_time / 1000. 

[exit]:

 1.22241973877e-05 1.02684497833e-05 

Here's the full conclusion when I try to print intermediate timings (without listing the list): http://pastebin.com/raw/y3i6dXZ8

Why does this list(set(chain(*x))) take less time than list(set().union(*x)) ?

Is there any other way to achieve a unified list combining? Using numpy or pandas or sframe or something else? Is the alternative faster?

+8
python list set numpy union
source share
1 answer

The fastest depends on the nature of x - whether it is a long list or a short list, with many sublists or several sublists, whether the sublists are long or short and whether there are many duplicates or several duplicates.

Below are some results of time comparing some alternatives. There are so many possibilities that this is far from a complete analysis, but perhaps it will give you the basis for exploring your use case.

 func | x | time unique_concatenate | many_uniques | 0.863 empty_set_union | many_uniques | 1.191 short_set_union_rest | many_uniques | 1.192 long_set_union_rest | many_uniques | 1.194 set_chain | many_uniques | 1.224 func | x | time long_set_union_rest | many_duplicates | 0.958 short_set_union_rest | many_duplicates | 0.969 empty_set_union | many_duplicates | 0.971 set_chain | many_duplicates | 1.128 unique_concatenate | many_duplicates | 2.411 func | x | time empty_set_union | many_small_lists | 1.023 long_set_union_rest | many_small_lists | 1.028 set_chain | many_small_lists | 1.032 short_set_union_rest | many_small_lists | 1.036 unique_concatenate | many_small_lists | 1.351 func | x | time long_set_union_rest | few_large_lists | 0.791 empty_set_union | few_large_lists | 0.813 unique_concatenate | few_large_lists | 0.814 set_chain | few_large_lists | 0.829 short_set_union_rest | few_large_lists | 0.849 

Be sure to run time tests on your machine, as the results may vary.


 from __future__ import print_function import random import timeit from itertools import chain import numpy as np def unique_concatenate(x): return np.unique(np.concatenate(x)) def short_set_union_rest(x): # This assumes x[0] is the shortest list in x return list(set(x[0]).union(*x[1:])) def long_set_union_rest(x): # This assumes x[-1] is the longest list in x return list(set(x[-1]).union(*x[1:])) def empty_set_union(x): return list(set().union(*x)) def set_chain(x): return list(set(chain(*x))) big_range = list(range(10**7)) small_range = list(range(10**5)) many_uniques = [[random.choice(big_range) for i in range(j)] for j in range(10, 10000, 10)] many_duplicates = [[random.choice(small_range) for i in range(j)] for j in range(10, 10000, 10)] many_small_lists = [[random.choice(big_range) for i in range(10)] for j in range(10, 10000, 10)] few_large_lists = [[random.choice(big_range) for i in range(1000)] for j in range(10, 100, 10)] if __name__=='__main__': for x, n in [('many_uniques', 1), ('many_duplicates', 4), ('many_small_lists', 800), ('few_large_lists', 800)]: timing = dict() for func in [ 'unique_concatenate', 'short_set_union_rest', 'long_set_union_rest', 'empty_set_union', 'set_chain']: timing[func, x] = timeit.timeit( '{}({})'.format(func, x), number=n, setup='from __main__ import {}, {}'.format(func, x)) print('{:20} | {:20} | {}'.format('func', 'x', 'time')) for key, t in sorted(timing.items(), key=lambda item: item[1]): func, x = key print('{:20} | {:20} | {:.3f}'.format(func, x, t)) print(end='\n') 
+7
source share

All Articles