Remove duplicate dict in list in Python

I have a list of dicts and I would like to remove dicts with identical key and value pairs.

For this list: [{'a': 123}, {'b': 123}, {'a': 123}]

I would like to return this: [{'a': 123}, {'b': 123}]

Another example:

For this list: [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]

I would like to return this: [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]


Try this:

 [dict(t) for t in {tuple(d.items()) for d in l}]

The strategy is to convert the list of dictionaries into a list of tuples, where each tuple holds a dictionary's items. Since tuples can be hashed, you can remove duplicates using set (a set comprehension is used here; an older Python alternative would be set(tuple(d.items()) for d in l)) and then recreate the dictionaries from the tuples with dict.

Where:

  • l is the original list
  • d is one of the dictionaries in the list
  • t is one of the tuples created from the dictionary
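
For example, with the first list from the question (a quick check; the intermediate set is unordered, so the result order may vary):

 >>> l = [{'a': 123}, {'b': 123}, {'a': 123}]
 >>> [dict(t) for t in {tuple(d.items()) for d in l}]
 [{'a': 123}, {'b': 123}]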

Edit: If you want to preserve order, the one-liner above will not work, since a set will not keep it. However, with a few lines of code you can also do this:

 l = [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]
 seen = set()
 new_l = []
 for d in l:
     t = tuple(d.items())
     if t not in seen:
         seen.add(t)
         new_l.append(d)
 print(new_l)

Example output:

 [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}] 

Note: as @alexis points out, two dictionaries with the same keys and values may not yield the same tuple. This can happen if they went through a different history of key additions and removals. If that applies to your problem, consider sorting d.items() as he suggests.
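
A minimal sketch of that variant (my illustration, not part of the original answer), assuming all keys are mutually sortable; sorting fails for mixed key types such as int and str:

 # sort each dict's items so equal dicts always produce the same tuple
 [dict(t) for t in {tuple(sorted(d.items())) for d in l}]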


Another one-liner, based on a list comprehension:

 >>> d = [{'a': 123}, {'b': 123}, {'a': 123}]
 >>> [i for n, i in enumerate(d) if i not in d[n + 1:]]
 [{'b': 123}, {'a': 123}]

Here, since dicts can be compared for equality, we keep only the elements that do not appear in the rest of the original list (this notion is only accessible through the index n, hence the use of enumerate).


Sometimes old-fashioned loops are still useful. This code is slightly longer than jcollado's, but very easy to read:

 a = [{'a': 123}, {'b': 123}, {'a': 123}]
 b = []
 for i in range(0, len(a)):
     if a[i] not in a[i+1:]:
         b.append(a[i])

The other answers will not work if you are dealing with nested dictionaries, such as deserialized JSON objects. For that case you can use:

 import json

 set_of_jsons = {json.dumps(d, sort_keys=True) for d in X}
 X = [json.loads(t) for t in set_of_jsons]
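
For example, with nested dicts (a sketch, assuming all values are JSON-serializable; the result order is arbitrary because it passes through a set):

 import json

 X = [{'a': {'x': 1}, 'b': 2}, {'b': 2, 'a': {'x': 1}}, {'c': 3}]
 set_of_jsons = {json.dumps(d, sort_keys=True) for d in X}
 X = [json.loads(t) for t in set_of_jsons]
 print(X)  # the first two dicts collapse into one: sort_keys sorts nested keys too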

If you want to preserve the order, you can do:

 from collections import OrderedDict

 data = [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]
 print(list(OrderedDict((frozenset(item.items()), item) for item in data).values()))
 # [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]

If the order doesn't matter, you can do:

 print(list({frozenset(item.items()): item for item in data}.values()))
 # [{'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]

If you can use a third-party package, then you can use iteration_utilities.unique_everseen:

 >>> from iteration_utilities import unique_everseen
 >>> l = [{'a': 123}, {'b': 123}, {'a': 123}]
 >>> list(unique_everseen(l))
 [{'a': 123}, {'b': 123}]

It preserves the order of the original list, and it can also handle unhashable items such as dictionaries by falling back on a slower algorithm (O(n*m), where n is the number of elements in the original list and m the number of unique elements, instead of O(n)). If both keys and values are hashable, you can use the key argument of this function to create hashable items for the "uniqueness test" (so that it runs in O(n)).

In the case of dictionaries (which compare equal regardless of key order), you need to map them to another data structure that compares the same way, for example frozenset:

 >>> list(unique_everseen(l, key=lambda item: frozenset(item.items())))
 [{'a': 123}, {'b': 123}]

Note that you should not use a plain tuple approach (without sorting), because equal dictionaries do not necessarily have the same order (even in Python 3.7, where insertion order, not absolute order, is guaranteed):

 >>> d1 = {1: 1, 9: 9}
 >>> d2 = {9: 9, 1: 1}
 >>> d1 == d2
 True
 >>> tuple(d1.items()) == tuple(d2.items())
 False

And even sorting the tuple may not work if the dictionary keys are not sortable:

 >>> d3 = {1: 1, 'a': 'a'}
 >>> tuple(sorted(d3.items()))
 TypeError: '<' not supported between instances of 'str' and 'int'
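
A frozenset sidesteps both problems, since it is unordered and needs no ordering comparisons between its elements (a quick check with the dicts above):

 >>> frozenset(d1.items()) == frozenset(d2.items())
 True
 >>> frozenset(d3.items()) == frozenset({'a': 'a', 1: 1}.items())  # no TypeError
 True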

Benchmark

I thought it would be useful to see how these approaches compare, so I ran a small benchmark. The graphs plot time against list size, using lists that contain no duplicates (chosen arbitrarily; the runtime does not change significantly if a few or many duplicates are added). It is a log-log plot, so the entire range is covered.

Absolute times:

[plot: absolute timings vs. list size]

Timings relative to the fastest approach:

[plot: timings relative to the fastest approach]

thefourtheye's second approach is the fastest here. The unique_everseen approach with a key function comes second; however, it is the fastest of the order-preserving approaches. The other approaches from jcollado and thefourtheye are almost as fast. The approach using unique_everseen without a key and the solutions from Emmanuel and Scorpil are very slow for longer lists and scale much worse: O(n*n) instead of O(n). stpk's approach with json is not O(n*n), but it is much slower than the similar O(n) approaches.

Code to reproduce the benchmarks:

 from simple_benchmark import benchmark
 import json
 from collections import OrderedDict
 from iteration_utilities import unique_everseen

 # one function per answer, named after its author
 def jcollado_1(l):
     return [dict(t) for t in {tuple(d.items()) for d in l}]

 def jcollado_2(l):
     seen = set()
     new_l = []
     for d in l:
         t = tuple(d.items())
         if t not in seen:
             seen.add(t)
             new_l.append(d)
     return new_l

 def Emmanuel(d):
     return [i for n, i in enumerate(d) if i not in d[n + 1:]]

 def Scorpil(a):
     b = []
     for i in range(0, len(a)):
         if a[i] not in a[i+1:]:
             b.append(a[i])

 def stpk(X):
     set_of_jsons = {json.dumps(d, sort_keys=True) for d in X}
     return [json.loads(t) for t in set_of_jsons]

 def thefourtheye_1(data):
     return OrderedDict((frozenset(item.items()), item) for item in data).values()

 def thefourtheye_2(data):
     return {frozenset(item.items()): item for item in data}.values()

 def iu_1(l):
     return list(unique_everseen(l))

 def iu_2(l):
     return list(unique_everseen(l, key=lambda inner_dict: frozenset(inner_dict.items())))

 funcs = (jcollado_1, Emmanuel, stpk, Scorpil, thefourtheye_1, thefourtheye_2, iu_1, jcollado_2, iu_2)
 arguments = {2**i: [{'a': j} for j in range(2**i)] for i in range(2, 12)}
 b = benchmark(funcs, arguments, 'list size')

 %matplotlib widget  # Jupyter magic
 import matplotlib as mpl
 import matplotlib.pyplot as plt
 plt.style.use('ggplot')
 mpl.rcParams['figure.figsize'] = '8, 6'
 b.plot(relative_to=thefourtheye_2)

For completeness, here are the timings for a list containing only duplicates:

 # this is the only change for the benchmark
 arguments = {2**i: [{'a': 1} for j in range(2**i)] for i in range(2, 12)}

[plot: timings for a list containing only duplicates]

The timings do not change significantly, except for unique_everseen without a key function, which in this case is the fastest solution. However, that is just the best case (and therefore not representative) for this function with unhashable values, because its runtime depends on the number of unique values in the list: O(n*m), where m here is just 1, so it runs in O(n).

Disclaimer: I am the author of iteration_utilities.


If you use Pandas in your workflow, one option is to feed the list of dictionaries directly to the pd.DataFrame constructor, then use drop_duplicates and to_dict to get the desired result.

 import pandas as pd

 d = [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]
 d_unique = pd.DataFrame(d).drop_duplicates().to_dict('records')
 print(d_unique)
 # [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}]

Not a universal answer, but if your list is sorted by some key, for example like this:

 l = [{'a': {'b': 31}, 't': 1},
      {'a': {'b': 31}, 't': 1},
      {'a': {'b': 145}, 't': 2},
      {'a': {'b': 25231}, 't': 2},
      {'a': {'b': 25231}, 't': 2},
      {'a': {'b': 25231}, 't': 2},
      {'a': {'b': 112}, 't': 3}]

then the solution is as simple as:

 import itertools

 result = [a[0] for a in itertools.groupby(l)]

Result:

 [{'a': {'b': 31}, 't': 1},
  {'a': {'b': 145}, 't': 2},
  {'a': {'b': 25231}, 't': 2},
  {'a': {'b': 112}, 't': 3}]

It works with nested dictionaries and (obviously) keeps order.
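
Since groupby only collapses adjacent duplicates, an unsorted list can be sorted first. A rough sketch (using json.dumps as the sort key is my assumption, not part of the original answer; it requires JSON-serializable dicts and discards the original order):

 import itertools
 import json

 l_unsorted = [{'a': 1}, {'b': 2}, {'a': 1}]
 l_sorted = sorted(l_unsorted, key=lambda d: json.dumps(d, sort_keys=True))
 result = [k for k, _ in itertools.groupby(l_sorted)]
 # result == [{'a': 1}, {'b': 2}]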


You can use a set, but you need to turn the dicts into a hashable type.

 seq = [{'a': 123, 'b': 1234}, {'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]
 unique = set()
 for d in seq:
     t = tuple(d.items())
     unique.add(t)

unique now contains:

 {(('a', 3222), ('b', 1234)), (('a', 123), ('b', 1234))}

To get the dicts back:

 [dict(x) for x in unique] 
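
Running that returns the two unique dicts (a quick check; the result order may vary, since sets are unordered):

 [{'a': 3222, 'b': 1234}, {'a': 123, 'b': 1234}]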
