Python: A quick and minimal way to pin and match matching items in two lists

I have:

>>> As = [1, 2, 5, 6] >>> Bs = [2, 3, 4, 5] 

I need something like zip_fn below:

 >>> Rs = zip_fn(As, Bs, cmp) >>> Rs [(1, None), (2, 2), (None, 3), (None, 4), (5, 5), (6, None)] 

In other words, given the two arbitrary sequences As and Bs , I want to create a list of Rs tuples so that options that satisfy cmp(a, b) == 0 are combined into their own set (a, b) , but those that are in As and Bs without coincidence, are displayed as (a, None) and (None, b) respectively.

Some moments:

  • Duplicates in As or Bs don't bother me, as they won't be.
  • Rs may be an iterator producing the same sequence.
  • The order of Rs does not matter.

I have already implemented something that satisfies a functional requirement using a straightforward preliminary loop, but these are approximately 30 lines. I am looking for something that makes better use of the built-in or itertools esque libraries for shorter code and faster (C native).

Edit:

I should have made it clearer. Although I used the list of primes in the above example for brevity, the elements I actually work with are tuples, and cmp only checks for part of the tuple for equality. Perhaps it would be easier to think of the elements as entries and cmp as something that matches key fields. I could wrap the elements in the class and make it hashed by key, but this configuration is not required for anything else, so any solution that requires this will receive this as additional code and runtime overhead.

To summarize this as additional points to the above:

  • It is vital that cmp be used for comparison, as this is not a test for basic equality.
  • As a result, [(a, b)] , a must be the same instance as one of the elements in As and b the same instance of one of the elements in Bs if they are not None .
  • Elements in As and Bs not hashed.
+4
source share
5 answers

I decided to go with something like @ thg435 and use deque (change: changed to list as @EOL pointed out), which helped me cut a lot of code in what I originally had.

I chose this for the following reasons:

  • The calculation of time complexity is simplified and quite simple for forecasting.
  • This did not require an interface change.
  • I felt that there was a good balance between the amount of code and the expected performance.

The following is done:

 def izip_pairs(xs, ys, cmp): xs = list(reversed(sorted(xs, cmp))) ys = list(reversed(sorted(ys, cmp))) while xs or ys: delta = ((not xs) - (not ys)) or cmp(xs[-1], ys[-1]) x = xs.pop() if delta <= 0 else None y = ys.pop() if delta >= 0 else None yield x, y 

For comparison:

Inspired by all the answer sets, and for comparison, I changed the interface to host and implement a solution based on a hash search.

 def izip_keys(xs, ys, key): xs = {key(x): x for x in xs} ys = {key(y): y for y in ys} for k in set(xs.keys() + ys.keys()): yield xs.get(k, None), ys.get(k, None) 

Difference in time:

My conclusion about the following results is that the loop for sorted lists scales much better for more items, while the hash search only works for small Ns.

 >>> xs = range(100, 200) >>> ys = range(150, 250) >>> xs = map(lambda n: tuple(range(n, n + 10)), xs) >>> ys = map(lambda n: tuple(range(n, n + 10)), ys) >>> def profile_pairing(): ... list(izip_pairs(xs, ys, lambda x, y: cmp(x[:4], y[:4]))) ... >>> def profile_keying(): ... list(izip_keys(xs, ys, lambda v: v[:4])) ... >>> from timeit import Timer >>> Timer(profile_pairing).timeit(1000) 0.575916051864624 >>> Timer(profile_keying).timeit(1000) 0.39961695671081543 >>> xs = map(lambda n: tuple(range(n, n + 10)), range(1000, 2000)) >>> ys = map(lambda n: tuple(range(n, n + 10)), range(1500, 2500)) >>> Timer(profile_pairing).timeit(100) 0.5289111137390137 >>> Timer(profile_keying).timeit(100) 0.4951910972595215 >>> xs = map(lambda n: tuple(range(n, n + 10)), range(10000, 20000)) >>> ys = map(lambda n: tuple(range(n, n + 10)), range(15000, 25000)) >>> Timer(profile_pairing).timeit(10) 0.6034290790557861 >>> Timer(profile_keying).timeit(10) 0.9461970329284668 >>> xs = map(lambda n: tuple(range(n, n + 10)), range(100000, 200000)) >>> ys = map(lambda n: tuple(range(n, n + 10)), range(150000, 250000)) >>> Timer(profile_pairing).timeit(1) 0.6421248912811279 >>> Timer(profile_keying).timeit(1) 1.253270149230957 

(Note: using deque * instead of list(reverse(...)) was 1% faster in all cases. * See a few changes back)

+1
source

how about this:

 As = set([1, 2, 5, 6]) Bs = set([2, 3, 4, 5]) values = [(a, a) if a in Bs else (a, None) for a in As] + [(None, b) for b in Bs if b not in As] >>> values [(1, None), (2, 2), (5, 5), (6, None), (None, 3), (None, 4)] 

or using sets:

 >>> values = [(a, a) for a in As.intersection(Bs)] + [(a, None) for a in As - Bs] + [(None, b) for b in Bs - As] >>> values [(2, 2), (5, 5), (1, None), (6, None), (None, 3), (None, 4)] >>> 

I am not worried about duplicates in As or B, as there will be none.

therefore, creation then gives an almost constant search time.

The order of Rs is unimportant.

so we can check A and then just check B :)

I donโ€™t know if this is right, itโ€™s only from the head if I have problems that I do.

UPDATE
this is a little more complicated since we cannot hash it we are basically stuck with O (n ** 2) sorry: (

I tried to make it as optimal as possible.

 As = [1, 2, 5, 6] Bs = [2, 3, 4, 5] indicies_a, indicies_b, values = [], [], [] for index_a, a in enumerate(As): for index_b, b in enumerate(Bs): if cmp(a, b) == 0: values.append((a, b)) indicies_a.append(index_a) indicies_b.append(index_b) values += [(As[index], None) for index in set(range(len(As))) - set(indicies_a)] + [(None, Bs[index]) for index in set(range(len(Bs))) - set(indicies_b)] >>> values [(2, 2), (5, 5), (1, None), (6, None), (None, 3), (None, 4)] >>> 

Sorry, I couldnโ€™t come up with a more concise version, I tried my vest to do this as quickly as possible.

+1
source

I understand that this does not account for the cmp() operation, but hopefully this helps a bit.

 >>> As = [1, 2, 5, 6] >>> Bs = [2, 3, 4, 5] >>> result = [] >>> for n in set(As + Bs): ... result.append((n if n in As else None, n if n in Bs else None)) ... >>> result [(1, None), (2, 2), (None, 3), (None, 4), (5, 5), (6, None)] >>> 

Same as comp list:

 >>> result = [ (n if n in As else None, n if n in Bs else None) for n in set(As + Bs)] >>> result [(1, None), (2, 2), (None, 3), (None, 4), (5, 5), (6, None)] >>> 
+1
source

I think this is similar to what you already have:

 from collections import deque def pairs(xs, ys): xs = deque(sorted(xs)) ys = deque(sorted(ys)) while xs and ys: c = cmp(xs[0], ys[0]) if c == 0: yield xs.popleft(), ys.popleft() elif c < 0: yield xs.popleft(), None else: # c > 0: yield None, ys.popleft() for x in xs: yield x, None for y in ys: yield None, y xs = [1, 2, 5, 6] ys = [2, 3, 4, 5] print list(pairs(xs, ys)) # [(1, None), (2, 2), (None, 3), (None, 4), (5, 5), (6, None)] 
+1
source
 ([(i,j) for i in As for j in Bs if cmp(i,j) == 0] + [(i,None) for i in As if all(cmp(i,j) !=0 for j in Bs)] + [(None, j) for j in Bs if all(cmp(i,j) !=0 for i in As)]) 

however, the time complexity will be n ^ 2, because we could not determine if the element is larger than the other according to your description.

+1
source

All Articles