Delete duplicate values and summarize the corresponding column values

Question

I have a list from which I need to remove duplicate values and sum the corresponding column values. List:

lst = [['20150815171000', '1', '2'], ['20150815171000', '2', '3'], ['20150815172000', '3', '4'], ['20150815172000', '4', '5'], ['20150815172000', '5', '6'], ['20150815173000', '6', '7']]

Now I need to go through the list and produce output like this:

 lst2 = [['20150815171000', '3', '5'], ['20150815172000', '12', '15'], ['20150815173000', '6', '7']]

How can I do that? I tried the code shown below, but it only compares consecutive values, not all the relevant ones.

    lst2 = []
    ws = wr = power = 0
    for i in range(len(lst)):
        if lst[i][0] == lst[i+1][0]:
            time = lst[i][0]
            ws = (float(lst[i][1])+float(lst[i+1][1]))
            wr = (float(lst[i][2])+float(lst[i+1][2]))
        else:
            time = lst[i][0]
            ws = lst[i][1]
            wr = lst[i][2]
        lst2.append([time, ws, wr, power])

Can someone tell me how I can do this?

python list duplicates

Vinod MS Sep 09 '15 at 9:20

5 answers

Anand s kumar · Answer 1 · 2015-09-09T09:34:40+0000

I would use itertools.groupby, grouping on the first item of each inner list.

First sort the list by its first element, then group on that same element. (If the list is already sorted by this element, you do not need to sort again and can group directly.)

Example -

    import itertools

    new_lst = []
    # groupby only merges adjacent rows, so sort by the key first
    for k, g in itertools.groupby(sorted(lst, key=lambda x: x[0]), lambda x: x[0]):
        l = list(g)
        new_lst.append([k, str(sum([int(x[1]) for x in l])), str(sum([int(x[2]) for x in l]))])

Demo -

    >>> import itertools
    >>>
    >>> lst = [['20150815171000', '1', '2'],
    ...        ['20150815171000', '2', '3'],
    ...        ['20150815172000', '3', '4'],
    ...        ['20150815172000', '4', '5'],
    ...        ['20150815172000', '5', '6'],
    ...        ['20150815173000', '6', '7']]
    >>>
    >>> new_lst = []
    >>> for k,g in itertools.groupby(sorted(lst,key=lambda x:x[0]) , lambda x:x[0]):
    ...     l = list(g)
    ...     new_lst.append([k,str(sum([int(x[1]) for x in l])), str(sum([int(x[2]) for x in l]))])
    ...
    >>> new_lst
    [['20150815171000', '3', '5'], ['20150815172000', '12', '15'], ['20150815173000', '6', '7']]
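As a small illustration of the remark about sorting: itertools.groupby only merges adjacent items that share a key, so unsorted input can produce the same key more than once. The sketch below uses a shortened, deliberately shuffled list (rows) and a helper named first; both are made up for this example and are not part of the question.

```python
import itertools

# Rows deliberately out of order: the two '20150815171000' rows are not adjacent.
rows = [['20150815171000', '1', '2'],
        ['20150815172000', '3', '4'],
        ['20150815171000', '2', '3']]


def first(row):
    """Return the grouping key (the timestamp string) of a row."""
    return row[0]


# Without sorting, groupby starts a new group every time the key changes.
unsorted_keys = [k for k, _ in itertools.groupby(rows, key=first)]

# After sorting by the same key, each key appears exactly once.
sorted_keys = [k for k, _ in itertools.groupby(sorted(rows, key=first), key=first)]

print(unsorted_keys)
print(sorted_keys)
```

The first list contains '20150815171000' twice, which is why the sort step matters whenever the input is not already ordered by key.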

m00am · Answer 2 · 2015-09-09T09:40:20+0000

You can use a dictionary to keep track of the unique keys in your list. For each row, check whether its key is already in the dict. If it is, add the row's values to the sums stored under that key; otherwise, create a new entry in the dict.

Try the following:

    #!/usr/bin/env python3
    sums = dict()
    for key, *values in lst:
        try:
            # add to an already present entry in the dict
            sums[key] = [int(x)+y for x, y in zip(values, sums[key])]
        except KeyError:
            # if the entry is not already present add it to the dict
            # and cast the values to int to make the adding easier
            # (list() so the entry is a real list, not a one-shot map iterator)
            sums[key] = list(map(int, values))

    # build the output list from dictionary
    # also cast back the values to strings
    lst2 = sorted([[key]+list(map(str, values)) for key, values in sums.items()])

The sorted call on the last line is optional, depending on whether you need the result list ordered by the dict keys.

Note that this should work for any number of values after the key.
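To illustrate the point about any number of values, here is a minimal sketch that wraps the same idea in a function and runs it on rows with three value columns. The function name sum_by_key and the sample data in wide are assumptions made for this example.

```python
def sum_by_key(rows):
    """Sum the string-encoded integer columns of rows that share a first element."""
    sums = {}
    for key, *values in rows:
        if key in sums:
            # key seen before: add this row's values to the running totals
            sums[key] = [total + int(v) for total, v in zip(sums[key], values)]
        else:
            # first time this key appears: store its values as ints
            sums[key] = [int(v) for v in values]
    # rebuild rows with string values, ordered by key
    return sorted([key] + [str(t) for t in totals] for key, totals in sums.items())


# Hypothetical rows with three value columns instead of two.
wide = [['a', '1', '2', '3'],
        ['b', '10', '20', '30'],
        ['a', '4', '5', '6']]

result = sum_by_key(wide)
print(result)
```

Because the row is unpacked as key, *values, nothing in the function depends on how many columns follow the key, as long as every row with the same key has the same number of them.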

Anzel · Answer 3 · 2015-09-09T10:27:07+0000

As an alternative, I would suggest using pandas, which handles this directly with groupby and sum. Here is one way to do it:

    In [1]: import pandas as pd

    In [2]: df = pd.DataFrame(
       ...:     [['20150815171000', '1', '2'],
       ...:      ['20150815171000', '2', '3'],
       ...:      ['20150815172000', '3', '4'],
       ...:      ['20150815172000', '4', '5'],
       ...:      ['20150815172000', '5', '6'],
       ...:      ['20150815173000', '6', '7']],
       ...:     columns=['group', 'field1', 'field2'])

    In [3]: df
    Out[3]:
                group field1 field2
    0  20150815171000      1      2
    1  20150815171000      2      3
    2  20150815172000      3      4
    3  20150815172000      4      5
    4  20150815172000      5      6
    5  20150815173000      6      7

    # need to convert from '1', '2'... to integer type
    In [4]: df['field1'] = df['field1'].astype('int')

    In [5]: df['field2'] = df['field2'].astype('int')

    # this groupby(to_group_field) and sum() can achieve what you want
    In [6]: df.groupby('group').sum()
    Out[6]:
                    field1  field2
    group
    20150815171000       3       5
    20150815172000      12      15
    20150815173000       6       7

    # convert to the list of lists format as you expected
    In [7]: df.groupby('group').sum().reset_index().values.tolist()
    Out[7]:
    [['20150815171000', 3, 5],
     ['20150815172000', 12, 15],
     ['20150815173000', 6, 7]]

Hope this helps.

Mark shuster · Answer 4 · 2015-09-09T11:37:11+0000

A clean approach using lambda and sorted() with a dictionary. No additional libraries.

    lst = [['20150815171000', '1', '2'], ['20150815171000', '2', '3'],
           ['20150815172000', '3', '4'], ['20150815172000', '4', '5'],
           ['20150815172000', '5', '6'], ['20150815173000', '6', '7']]

    dct = dict()
    for a, b, c in lst:
        if a not in dct:
            dct[a] = [b, c]
        else:
            # list() is needed on Python 3, where map returns an iterator
            # that cannot be indexed with v[0] / v[1] below
            dct[a] = list(map(lambda x, y: str(int(x)+int(y)), dct[a], [b, c]))

    lst2 = sorted([[k, v[0], v[1]] for k, v in dct.items()])
    print(lst2)

Output:

 [['20150815171000', '3', '5'], ['20150815172000', '12', '15'], ['20150815173000', '6', '7']]

Sparkas · Answer 5 · 2015-09-09T10:15:55+0000

As a comment on your question suggested, I would also use a dictionary. I am not a good programmer, and there are certainly better ways, but this works:

    dct = dict()
    for x, y, z in lst:
        if x not in dct:
            dct[x] = [y, z]
        else:
            dct[x] = [str(int(dct[x][0]) + int(y)), str(int(dct[x][1]) + int(z))]

    lst2 = []
    for k, v in dct.items():
        lst2.append([k, v[0], v[1]])

Basically, you iterate over the list and add a new entry to the dictionary if the key (for example, "20150815171000") does not exist yet; otherwise you update the values already stored for it. At the end, you build another list from the results in the dictionary.
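The iterate-and-update loop described above could also be written with collections.defaultdict, which creates the entry for a missing key automatically, so the explicit "not in" check is no longer needed. This is only a sketch of a variant, not the answer's original code; the name totals is an assumption made here.

```python
from collections import defaultdict

lst = [['20150815171000', '1', '2'], ['20150815171000', '2', '3'],
       ['20150815172000', '3', '4'], ['20150815172000', '4', '5'],
       ['20150815172000', '5', '6'], ['20150815173000', '6', '7']]

# Each new key starts with running totals [0, 0] for the two value columns.
totals = defaultdict(lambda: [0, 0])
for key, a, b in lst:
    totals[key][0] += int(a)
    totals[key][1] += int(b)

# Convert the integer totals back to strings to match the requested format.
lst2 = [[key, str(a), str(b)] for key, (a, b) in totals.items()]
print(lst2)
```

Keeping the totals as integers inside the dictionary avoids converting between str and int on every update; the conversion back to strings happens once at the end.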
