Python group by

Suppose I have a list of pairs where index 0 is the value and index 1 is the type:

input = [ ('11013331', 'KAT'), ('9085267', 'NOT'), ('5238761', 'ETH'), ('5349618', 'ETH'), ('11788544', 'NOT'), ('962142', 'ETH'), ('7795297', 'ETH'), ('7341464', 'ETH'), ('9843236', 'KAT'), ('5594916', 'ETH'), ('1550003', 'ETH') ] 

I want to group them by type (the string at index 1), like this:

 result = [ { type:'KAT', items: ['11013331', '9843236'] }, { type:'NOT', items: ['9085267', '11788544'] }, { type:'ETH', items: ['5238761', '962142', '7795297', '7341464', '5594916', '1550003'] } ] 

How can I achieve this in an efficient way?

Thanks.

+69
python group-by
Sep 20 '10 at 7:50
4 answers

Do it in 2 steps. First create a dictionary.

 >>> input = [('11013331', 'KAT'), ('9085267', 'NOT'), ('5238761', 'ETH'),
 ...          ('5349618', 'ETH'), ('11788544', 'NOT'), ('962142', 'ETH'),
 ...          ('7795297', 'ETH'), ('7341464', 'ETH'), ('9843236', 'KAT'),
 ...          ('5594916', 'ETH'), ('1550003', 'ETH')]
 >>> from collections import defaultdict
 >>> res = defaultdict(list)
 >>> for v, k in input:
 ...     res[k].append(v)
 ...

Then convert this dictionary to the expected format.

 >>> [{'type':k, 'items':v} for k,v in res.items()]
 [{'items': ['9085267', '11788544'], 'type': 'NOT'},
  {'items': ['5238761', '5349618', '962142', '7795297', '7341464', '5594916', '1550003'], 'type': 'ETH'},
  {'items': ['11013331', '9843236'], 'type': 'KAT'}]



This is also possible with itertools.groupby, but you need to sort the input first.

 >>> from itertools import groupby
 >>> from operator import itemgetter
 >>> sorted_input = sorted(input, key=itemgetter(1))
 >>> groups = groupby(sorted_input, key=itemgetter(1))
 >>> [{'type':k, 'items':[x[0] for x in v]} for k, v in groups]
 [{'items': ['5238761', '5349618', '962142', '7795297', '7341464', '5594916', '1550003'], 'type': 'ETH'},
  {'items': ['11013331', '9843236'], 'type': 'KAT'},
  {'items': ['9085267', '11788544'], 'type': 'NOT'}]



Note that neither of them preserves the original key order. You need an OrderedDict if you need to keep the order (on Python 3.7 and later, a plain dict or defaultdict also keeps insertion order).

 >>> from collections import OrderedDict
 >>> res = OrderedDict()
 >>> for v, k in input:
 ...     if k in res: res[k].append(v)
 ...     else: res[k] = [v]
 ...
 >>> [{'type':k, 'items':v} for k,v in res.items()]
 [{'items': ['11013331', '9843236'], 'type': 'KAT'},
  {'items': ['9085267', '11788544'], 'type': 'NOT'},
  {'items': ['5238761', '5349618', '962142', '7795297', '7341464', '5594916', '1550003'], 'type': 'ETH'}]
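As a minimal sketch of the same two steps wrapped in one function: the name group_by_type is hypothetical, and the code assumes Python 3.7 or later, where a plain dict keeps insertion order, so no OrderedDict is needed.

```python
def group_by_type(pairs):
    """Group (value, type) pairs into [{'type': ..., 'items': [...]}, ...]."""
    grouped = {}  # assumption: Python 3.7+, plain dicts keep insertion order
    for value, kind in pairs:
        # setdefault creates the list the first time a type is seen
        grouped.setdefault(kind, []).append(value)
    return [{'type': kind, 'items': items} for kind, items in grouped.items()]


sample = [('11013331', 'KAT'), ('9085267', 'NOT'),
          ('5238761', 'ETH'), ('9843236', 'KAT')]
result = group_by_type(sample)
print(result)
```

The types come out in first-seen order (KAT, NOT, ETH for this sample), and the input does not have to be sorted.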
+99
Sep 20 '10 at 7:54

Python's built-in itertools module actually has a groupby function that you could use, but the input must first be sorted so that the elements to be grouped are adjacent in the list:

 sortkeyfn = lambda s: s[1]
 input = [('11013331', 'KAT'), ('9085267', 'NOT'), ('5238761', 'ETH'),
          ('5349618', 'ETH'), ('11788544', 'NOT'), ('962142', 'ETH'),
          ('7795297', 'ETH'), ('7341464', 'ETH'), ('9843236', 'KAT'),
          ('5594916', 'ETH'), ('1550003', 'ETH')]
 input.sort(key=sortkeyfn)

Now the input is as follows:

 [('5238761', 'ETH'), ('5349618', 'ETH'), ('962142', 'ETH'), ('7795297', 'ETH'), ('7341464', 'ETH'), ('5594916', 'ETH'), ('1550003', 'ETH'), ('11013331', 'KAT'), ('9843236', 'KAT'), ('9085267', 'NOT'), ('11788544', 'NOT')] 
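To see why the sort matters, here is a small sketch on a made-up three-element list (the pairs data is illustrative only, not taken from the question). groupby only merges runs of adjacent equal keys, so an unsorted input can yield the same key more than once.

```python
from itertools import groupby
from operator import itemgetter

pairs = [('1', 'KAT'), ('2', 'NOT'), ('3', 'KAT')]  # illustrative data

# Unsorted: 'KAT' is split into two separate groups.
unsorted_keys = [k for k, _ in groupby(pairs, key=itemgetter(1))]

# Sorted by the same key: each type forms exactly one group.
sorted_pairs = sorted(pairs, key=itemgetter(1))
sorted_keys = [k for k, _ in groupby(sorted_pairs, key=itemgetter(1))]

print(unsorted_keys)  # ['KAT', 'NOT', 'KAT']
print(sorted_keys)    # ['KAT', 'NOT']
```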

groupby returns a sequence of 2-tuples of the form (key, values_iterator). We want to turn this into a list of dicts, where "type" is the key and "items" is a list of the 0th elements of the tuples yielded by values_iterator. Like this:

 from itertools import groupby

 result = []
 for key, valuesiter in groupby(input, key=sortkeyfn):
     result.append(dict(type=key, items=list(v[0] for v in valuesiter)))

Now result contains the list of dicts you asked for in your question (with the types in sorted order).

You might also consider building a single dict instead, keyed by type, with each value holding the list of values. In your current form, to find the values for a particular type, you have to iterate over the list to find the dict containing the matching "type" key, and then get the "items" element from it. If you use a single dict instead of a list of dicts, you can find the items for a particular type with a single keyed lookup into that dict. Using groupby, it would look like this:

 result = {}
 for key, valuesiter in groupby(input, key=sortkeyfn):
     result[key] = list(v[0] for v in valuesiter)

result now contains this dict (similar to the intermediate res defaultdict in @KennyTM's answer):

 {'NOT': ['9085267', '11788544'], 'ETH': ['5238761', '5349618', '962142', '7795297', '7341464', '5594916', '1550003'], 'KAT': ['11013331', '9843236']} 

If you want to reduce this to a single line, you can write:

 result = dict((key, list(v[0] for v in valuesiter)) for key, valuesiter in groupby(input, key=sortkeyfn))

or, using the newfangled dict-comprehension form:

 result = {key: list(v[0] for v in valuesiter) for key, valuesiter in groupby(input, key=sortkeyfn)}
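The lookup difference between the two shapes can be sketched as follows. The names as_list and as_dict are hypothetical and the data is a shortened version of the question's input.

```python
# Shape 1: a list of {'type': ..., 'items': [...]} dicts, as in the question.
as_list = [
    {'type': 'KAT', 'items': ['11013331', '9843236']},
    {'type': 'NOT', 'items': ['9085267', '11788544']},
]

# Shape 2: a single dict keyed by type.
as_dict = {d['type']: d['items'] for d in as_list}

# With the list, finding one type means scanning until a match is found.
kat_from_list = next(d['items'] for d in as_list if d['type'] == 'KAT')

# With the dict, it is a single keyed lookup.
kat_from_dict = as_dict['KAT']

print(kat_from_list == kat_from_dict)  # True
```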
+35
Sep 20 '10 at 8:28

The following function groups tuples of any length by a key at any index, without sorting:

 # Given a sequence of tuples like [(3,'c',6),(7,'a',2),(88,'c',4),(45,'a',0)],
 # returns a dict grouping tuples by their idx-th element. With idx=1 we get:
 #   if merge is True   {'c':(3,6,88,4), 'a':(7,2,45,0)}
 #   if merge is False  {'c':((3,6),(88,4)), 'a':((7,2),(45,0))}
 def group_by(seqs, idx=0, merge=True):
     d = dict()
     for seq in seqs:
         k = seq[idx]
         # everything in the tuple except the key element
         rest = seq[:idx] + seq[idx+1:]
         v = d.get(k, tuple()) + (rest if merge else (rest,))
         d[k] = v
     return d

In the case of your question, the index of the key you want to group by is 1, so:

 group_by(input,1) 

gives

 {'ETH': ('5238761','5349618','962142','7795297','7341464','5594916','1550003'), 'KAT': ('11013331', '9843236'), 'NOT': ('9085267', '11788544')} 

which is not exactly the result you requested, but might suit your needs as well.
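One caveat: concatenating tuples builds a new tuple on every step, so for large groups a list-based variant may scale better. The following is a sketch of that idea only; group_by_lists is a hypothetical name, and it covers just the merged form of the output.

```python
from collections import defaultdict


def group_by_lists(seqs, idx=0):
    """Group tuples by their idx-th element, collecting the rest into lists."""
    groups = defaultdict(list)
    for seq in seqs:
        # everything in the tuple except the key element
        rest = seq[:idx] + seq[idx + 1:]
        # extend appends in place instead of rebuilding the whole group
        groups[seq[idx]].extend(rest)
    return dict(groups)


data = [(3, 'c', 6), (7, 'a', 2), (88, 'c', 4), (45, 'a', 0)]
grouped = group_by_lists(data, 1)
print(grouped)  # {'c': [3, 6, 88, 4], 'a': [7, 2, 45, 0]}
```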

+1
Jun 13 '16 at 11:22

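I also like pandas' simple grouping. It is powerful, simple, and well suited to large data sets:

 import pandas
 result = pandas.DataFrame(input).groupby(1).groups

Note that .groups maps each type to the row labels of its members rather than to the values themselves. To get the values, something like pandas.DataFrame(input).groupby(1)[0].apply(list).to_dict() should work.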

0
Nov 02 '16 at 5:06