Approach #1
Here's a NumPythonic vectorized way based on np.bincount -
import numpy as np

N = A.shape[1]-1                                       # number of data columns
unqA1, IDs = np.unique(A[:, 0], return_inverse=True)   # unique keys & group ID per row
subs = np.arange(N)*(IDs.max()+1) + IDs[:, None]       # one distinct bin per (column, group) pair
sums = np.bincount(subs.ravel(), weights=A[:, 1:].ravel())  # weighted bincount does the summing
out = np.append(unqA1[:, None], sums.reshape(N, -1).T, 1)   # prepend keys to the group sums
Example input, output -
In [66]: A
Out[66]:
array([[1, 2, 3],
       [1, 4, 6],
       [2, 3, 5],
       [2, 6, 2],
       [7, 2, 1],
       [2, 0, 3]])

In [67]: out
Out[67]:
array([[  1.,   6.,   9.],
       [  2.,   9.,  10.],
       [  7.,   2.,   1.]])
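To make the bin bookkeeping concrete, here's what the key intermediates from the code above come out to on this example - the group IDs, the per-(row, column) bin indices, and the flat bin sums before reshaping:

In [68]: IDs
Out[68]: array([0, 0, 1, 1, 2, 1])

In [69]: subs
Out[69]:
array([[0, 3],
       [0, 3],
       [1, 4],
       [1, 4],
       [2, 5],
       [1, 4]])

In [70]: sums
Out[70]: array([  6.,   9.,   2.,   9.,  10.,   1.])

Bins 0-2 hold the column-1 sums for keys 1, 2, 7 and bins 3-5 hold the column-2 sums, which is exactly what the final reshape(N,-1).T untangles.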
Approach #2
Here's another one based on np.cumsum and np.diff -
sA = A[np.argsort(A[:, 0]), :]                         # sort rows so each group is contiguous
row_mask = np.append(np.diff(sA[:, 0]) != 0, [True])   # mask of the last row in each group
cumsum_grps = sA.cumsum(0)[row_mask, 1:]               # running sums sampled at group ends
sum_grps = np.diff(cumsum_grps, axis=0)                # differencing recovers per-group sums
counts = np.concatenate((cumsum_grps[0, :][None], sum_grps), axis=0)
out = np.concatenate((sA[row_mask, 0][:, None], counts), axis=1)  # prepend unique keys
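As a quick sanity check, running this on the same example A reproduces the earlier result (as integers this time, since nothing in this approach forces a float cast):

In [71]: out
Out[71]:
array([[ 1,  6,  9],
       [ 2,  9, 10],
       [ 7,  2,  1]])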
Benchmarking
Here are some runtime tests for the NumPy-based approaches posted so far on this question, wrapped up as the functions sketched below -
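bincount and cumsum_diff wrap Approaches #1 and #2 from above; add_at refers to an np.add.at based solution posted elsewhere on this question, so the version below is a plausible reconstruction rather than the exact posted code:

import numpy as np

def bincount(A):                      # Approach #1
    N = A.shape[1]-1
    unqA1, IDs = np.unique(A[:, 0], return_inverse=True)
    subs = np.arange(N)*(IDs.max()+1) + IDs[:, None]
    sums = np.bincount(subs.ravel(), weights=A[:, 1:].ravel())
    return np.append(unqA1[:, None], sums.reshape(N, -1).T, 1)

def cumsum_diff(A):                   # Approach #2
    sA = A[np.argsort(A[:, 0]), :]
    row_mask = np.append(np.diff(sA[:, 0]) != 0, [True])
    cumsum_grps = sA.cumsum(0)[row_mask, 1:]
    sum_grps = np.diff(cumsum_grps, axis=0)
    counts = np.concatenate((cumsum_grps[0, :][None], sum_grps), axis=0)
    return np.concatenate((sA[row_mask, 0][:, None], counts), axis=1)

def add_at(A):                        # reconstructed np.add.at-based version
    unqA1, IDs = np.unique(A[:, 0], return_inverse=True)
    sums = np.zeros((len(unqA1), A.shape[1]-1))
    np.add.at(sums, IDs, A[:, 1:])    # unbuffered scatter-add of each row into its group
    return np.column_stack((unqA1, sums))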
In [319]: A = np.random.randint(0,1000,(100000,10))
In [320]: %timeit cumsum_diff(A)
100 loops, best of 3: 12.1 ms per loop
In [321]: %timeit bincount(A)
10 loops, best of 3: 21.4 ms per loop
In [322]: %timeit add_at(A)
10 loops, best of 3: 60.4 ms per loop
In [323]: A = np.random.randint(0,1000,(100000,20))
In [324]: %timeit cumsum_diff(A)
10 loops, best of 3: 32.1 ms per loop
In [325]: %timeit bincount(A)
10 loops, best of 3: 32.3 ms per loop
In [326]: %timeit add_at(A)
10 loops, best of 3: 113 ms per loop
So, Approach #2 (cumsum + diff) seems to work pretty well.