Finding the difference between two very large lists

I have two very large lists: one of them has a length of 331991, allows you to call one a, and the other - 99171, longer, call it b. I want to compare a with b, and then return a list of elements in a that are not in b. It should also be as efficient as possible and in the order in which they appear, which is probably given, but I thought I could also drop it there.

0
source share
3 answers

This can be done in O (m + n) time, where m and n correspond to the lengths of two lists:

exclude = set(b) # O(m) new_list = [x for x in a if x not in exclude] # O(n) 

The key here is that kits have ongoing containment tests. Perhaps you could consider b be a collection starting with.

See also: List of Understanding


Using your example :

 >>> a = ['a','b','c','d','e'] >>> b = ['a','b','c','f','g'] >>> >>> exclude = set(b) >>> new_list = [x for x in a if x not in exclude] >>> >>> new_list ['d', 'e'] 
+7
source

Suppose:

 book = ["once", "upon", "time", ...., "end", "of", "very", "long", "story"] dct = ["alfa", "anaconda", .., "zeta-jones"] 

And you want to remove from the list of books all the elements that are present in the dct.

Fast decision:

 short_story = [word in book if word not in dct] 

Speeding up a dct search: turning a dct into a set is a quick search:

 dct = set(dct) short_story = [word in book if word not in dct] 

If the book is very long and does not fit into memory, you can process it word for word. To do this, we can use the generator:

 def story_words(fname): """fname is name of text file with a story""" with open(fname) as f: for line in f: for word in line.split() yield word #print out shortened story for word in story_words("alibaba.txt"): if word not in dct: print word 

And in case your dictionary is too large, you will have to give up speed and repeat the content as well. But I'll skip this now.

+1
source

Here is one way to convert b into a set, and then filter the elements from a that aren't there:

 from itertools import ifilterfalse a = ['a','b','c','d','e'] b = ['a','b','c'] c = list(ifilterfalse(set(b).__contains__, a)) # ['d', 'e'] 
0
source

All Articles