Can an array be grouped more efficiently than sorted?

While working on sample code for an algorithm question, I came across a situation where I sorted an input array even though I only needed the identical elements grouped together, not in any specific order. For example:

{1,2,4,1,4,3,2} → {1,1,2,2,4,4,3} or {1,1,2,2,3,4,4} or {3,1,1,2,2,4,4} or ...

This made me wonder: is it possible to group identical elements in an array more efficiently than by sorting the array?

On the one hand, the fact that the elements do not need to end up in any specific position means more freedom to find an ordering that requires fewer swaps. On the other hand, tracking where each element of a group is located, and where it should optimally end up, may require more computation than simply sorting the array.

An obvious candidate would be some kind of counting sort, but what if the length of the array and/or the range of values were impractically large?

For the sake of argument, suppose the array is large (say, a million elements), contains 32-bit integers, and the number of identical elements per value can range from 1 to a million.
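To make the target condition concrete, here is a small sketch of the property being asked for (the helper name is_grouped is mine, for illustration): every value occupies one contiguous block, in no particular order.

def is_grouped(arr):
    # True iff equal elements are contiguous: once we leave a value's
    # run, that value must never appear again later in the array.
    seen = set()
    for i, x in enumerate(arr):
        if i == 0 or x != arr[i - 1]:  # start of a new run
            if x in seen:
                return False
            seen.add(x)
    return True

assert is_grouped([1, 1, 2, 2, 4, 4, 3])
assert not is_grouped([1, 2, 4, 1, 4, 3, 2])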


UPDATE: for languages that support dictionaries, Salvador Dali's answer is obviously the way to go. I would still be interested to learn about old-fashioned comparison-and-swap methods, or about methods that use less space, if there are any.

+6
4 answers

Yes, all you have to do is create a dictionary and count the number of occurrences of each element. After that, simply iterate over the keys of this dictionary and output each key as many times as its count.

Quick Python implementation:

from collections import Counter

arr = [1, 2, 4, 1, 4, 3, 2]
cnt, grouped = Counter(arr), []  # Counter builds a dictionary that counts the occurrences of each element
for k, v in cnt.items():         # cnt.iteritems() in Python 2
    grouped += [k] * v           # [k] * v creates a list of length v whose elements all equal k
print(grouped)

This groups all the elements in O(n) time, using potentially O(n) extra space. That is more efficient (in terms of time complexity) than sorting, which achieves the same result in O(n log n) time but can be done in place.
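As a side note, the standard library packages this same loop as Counter.elements(), which yields each element repeated by its count, in first-encountered order (a sketch of the same idea, not part of the original answer):

from collections import Counter

# elements() repeats each key as many times as its count
grouped = list(Counter([1, 2, 4, 1, 4, 3, 2]).elements())
print(grouped)  # [1, 1, 2, 2, 4, 4, 3]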

+3

Since you asked about comparison-based methods, I'm going to make the usual assumptions that (1) elements can be compared but not hashed and (2) the only resource of interest is three-way comparisons.

In an absolute sense, grouping is easier than sorting. Here is an algorithm for grouping three elements that uses one comparison (sorting them requires three). Given input x, y, z: if x = y, return x, y, z. Otherwise, return x, z, y.
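A minimal sketch of that one-comparison procedure (the helper name group3 is mine, not from the answer):

def group3(x, y, z):
    # One equality test. If x == y they are already adjacent;
    # otherwise, moving z between x and y puts z next to whichever
    # of x or y it might equal, so equal values always end up adjacent.
    if x == y:
        return (x, y, z)
    return (x, z, y)

print(group3(1, 2, 1))  # (1, 1, 2): the two 1s are adjacent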

Asymptotically, however, grouping, like sorting, requires Omega(n log n) comparisons. The lower-bound technique is information-theoretic: we prove that every grouping algorithm, expressed as a decision tree, has 3^Omega(n log n) leaves, which implies that the height of the tree (and hence the worst-case running time of the algorithm) is Omega(n log n).

Fix an arbitrary leaf of the decision tree at which no two input elements were found to be equal. The input positions are partially ordered by the inequalities observed on the path to that leaf.

Suppose for contradiction that i, j, k are pairwise incomparable input positions. If we let x = input[i], y = input[j], z = input[k], then the possibilities x = y < z, y = z < x, and z = x < y are all consistent with what the algorithm has observed. That cannot be, since no single output order chosen by the leaf can simultaneously place x next to y, y next to z, and z next to x. We conclude that the partial order has no antichain of size three.

By Dilworth's theorem, the partial order can therefore be covered by two chains. Considering all possible ways of merging those two chains into a total order, at most (n choose m) ≤ 2^n input permutations map to any one leaf, where m is the length of one of the chains. Hence the number of leaves is at least n!/2^n = 3^Omega(n log n).
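For the final step, a quick bound using Stirling's inequality n! >= (n/e)^n (my elaboration, not in the original answer):

log3(n!/2^n) >= n log3(n/e) - n log3(2) = n log3(n) - O(n) = Omega(n log n)

so a ternary decision tree with that many leaves must have height Omega(n log n).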

+6

Any sorting algorithm, even the most efficient ones, will require you to traverse the array multiple times. Grouping, on the other hand, can be done in exactly one pass, or, depending on how you insist your result be formatted, two:

groups = {}
for i in arr:
    if i not in groups:
        groups[i] = []
    groups[i].append(i)

This is a deliberately primitive loop, ignoring many of the optimizations and idioms probably available in your language of choice, but after only one iteration we get:

 {1: [1, 1], 2: [2, 2], 3: [3], 4: [4, 4]} 

If you have complex objects, you can pick any arbitrary attribute to group on by using it as the dictionary key, so this is a very general algorithm.
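For instance, here is a sketch of grouping by an arbitrary key function (the helper group_by and the sample data are mine, for illustration):

from collections import defaultdict

def group_by(items, key):
    # Group items by an arbitrary key function in one pass.
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    return groups

words = ["apple", "avocado", "banana", "cherry", "blueberry"]
print(group_by(words, key=lambda w: w[0]))
# defaultdict(<class 'list'>, {'a': ['apple', 'avocado'],
#                              'b': ['banana', 'blueberry'], 'c': ['cherry']})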

If you insist that your result be a flat list, you can easily achieve this:

result = []
for l in groups.values():  # iterate over the grouped lists, not the keys
    result += l

(Again, ignoring language-specific optimizations and idioms.)

So there you have it: a grouping that requires no more than one complete pass over the input plus one shorter pass over the intermediate grouping data structure. The space requirements depend on the specifics of the language, but in general they amount to just the modest overhead that the dictionary and list data structures carry.

+1

How about using a two-dimensional array, with the first dimension holding the frequency of each value and the second dimension the value itself? We can use the Boolean data type and indexing. This also effectively sorts the original array, iterating over it exactly once, giving us an O(n) solution. I think this approach will translate well to other languages. Observe the following basic R code (NB: in R there are much more efficient ways than the below; I am just giving the more general approach).

GroupArray <- function(arr.in) {
    maxVal <- max(arr.in)
    arr.out.val <- rep(FALSE, maxVal)   ## F, F, F, F, ...
    arr.out.freq <- rep(0L, maxVal)     ## 0, 0, 0, 0, ...
    for (i in arr.in) {
        arr.out.freq[i] <- arr.out.freq[i] + 1L
        arr.out.val[i] <- TRUE
    }
    myvals <- which(arr.out.val)        ## "which" returns the TRUE indices
    array(c(arr.out.freq[myvals], myvals),
          dim = c(length(myvals), 2),
          dimnames = list(NULL, c("freq", "vals")))
}

A small example of the above code:

set.seed(11)
arr1 <- sample(10, 10, replace = TRUE)
arr1
 [1]  3  1  6  1  1 10  1  3  9  2   ## unsorted array
GroupArray(arr1)
     freq vals                       ## nicely sorted, with frequencies
[1,]    4    1
[2,]    1    2
[3,]    2    3
[4,]    1    6
[5,]    1    9
[6,]    1   10

A larger example:

set.seed(101)
arr2 <- sample(10^6, 10^6, replace = TRUE)
arr2[1:10]          ## first 10 elements of the random unsorted array
 [1] 372199  43825 709685 657691 249856 300055 584867 333468 622012 545829
arr2[999990:10^6]   ## last 11 elements of the random unsorted array
 [1] 999555 468102 851922 244806 192171 188883 821262 603864  63230  29893 664059
t2 <- GroupArray(arr2)
head(t2)
     freq vals      ## nicely sorted, with frequencies
[1,]    2    1
[2,]    2    2
[3,]    2    3
[4,]    2    6
[5,]    2    8
[6,]    1    9
tail(t2)
          freq    vals
[632188,]    3  999989
[632189,]    1  999991
[632190,]    1  999994
[632191,]    2  999997
[632192,]    2  999999
[632193,]    2 1000000
+1
