Finding the nth largest product in a large matrix of numbers, quickly

I am working on a sorting / ranking algorithm that works with a fairly large number of elements, and I need to implement the following step efficiently to make it work:


There are two lists of numbers. They are equally long, about 100-500 thousand items each. From these I need to find the nth largest product between the lists, i.e. the nth largest cell of the matrix you get by writing one list along the top and the other down the side, where each cell is the product of the number above it and the number beside it.

Example: lists A=[1, 3, 4] and B=[2, 2, 5]. Then the products are [2, 2, 5, 6, 6, 15, 8, 8, 20]. If I wanted the 3rd largest of these, it would be 8.

A naive solution would be to simply generate all these products, sort them, and then pick the nth largest. But this is O(m^2 * log m^2), where m is the number of items in each list, and that is simply not fast enough.
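For reference, the naive approach can be written in a few lines of Python. This is only a baseline sketch; the function name `nth_largest_product_naive` is made up for illustration, and `n` is taken to be 1-based as in the example above.

```python
def nth_largest_product_naive(a, b, n):
    """Return the n-th largest (1-based) product x * y with x from a and y from b.

    Builds all len(a) * len(b) products, so it is only usable for small inputs.
    """
    products = sorted((x * y for x in a for y in b), reverse=True)
    return products[n - 1]


print(nth_largest_product_naive([1, 3, 4], [2, 2, 5], 3))  # 8
```

It is mainly useful for checking faster versions against small inputs.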

I think I need to sort the two lists first, in descending order. This is O(m * log m). Then I know for sure that the largest product is A[0] * B[0]. The second largest is either A[0] * B[1] or A[1] * B[0], and so on.

It seems to me that this part can be done in O(f(n)) steps, regardless of the size of the matrix. But I cannot find an efficient way to do it.


Edit: there was an answer that suggested keeping a position in each of the two sorted lists, then looking at A[a] * B[b + 1] and A[a + 1] * B[b], returning the larger one and incrementing a or b accordingly. I was going to post this comment before it was deleted:

This will not work. Imagine two lists A = B = [3, 2, 1]. This gives you the matrix [9,6,3; 6,4,2; 3,2,1]. So you start at (0,0) = 9, go to (0,1) = 6, and then the choice is between (0,2) = 3 and (1,1) = 4. However, this skips (1,0) = 6, which is larger than both. So you can't just look at the two neighbors; you also need to be able to backtrack.
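The counterexample can be reproduced with a small Python sketch of the two-neighbor walk. The function `greedy_walk` is a hypothetical reconstruction of the deleted answer's idea, written only to show where it goes wrong; ties are broken by moving along B first, as in the comment above.

```python
def greedy_walk(a, b, n):
    """Flawed approach: follow only the larger of the two forward neighbors."""
    a = sorted(a, reverse=True)
    b = sorted(b, reverse=True)
    i = j = 0
    visited = [a[0] * b[0]]
    while len(visited) < n:
        right = a[i] * b[j + 1] if j + 1 < len(b) else None
        down = a[i + 1] * b[j] if i + 1 < len(a) else None
        if right is None and down is None:
            break  # reached the bottom-right corner
        if down is None or (right is not None and right >= down):
            j += 1
        else:
            i += 1
        visited.append(a[i] * b[j])
    return visited


print(greedy_walk([3, 2, 1], [3, 2, 1], 3))  # [9, 6, 4]
```

The three largest products are actually 9, 6, 6, so the walk has already lost the second 6 by the third step.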

+7
3 answers

I think this can be done in O(n log n + n log m). Here is a sketch of my algorithm, which I think will work. It is a little rough.

  • Sort A in descending order. (takes O(m log m) )
  • Sort B in descending order. (takes O(m log m) )
  • Let s be min(m, n) . (takes O(1) )
  • Create s lazy iterators L[0] through L[s-1] . L[i] iterates over the s values A[i]*B[0] , A[i]*B[1] , ..., A[i]*B[s-1] . (takes O(s) )
  • Put the iterators in a priority queue q . Iterators are prioritized by their current value. (takes O(s) , because initially they are already in order)
  • Extract n values from q . The last value extracted is the desired result. When an iterator is extracted, advance it and re-insert it into q , using its next value as the new priority. If the iterator is exhausted, do not re-insert it. (takes O(n log s) )

Overall, this algorithm takes O(m log m + (s + n) log s) , where s is either m or n .
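The steps above can be sketched in Python with `heapq`. This is a minimal sketch, not the answerer's own code: it assumes all numbers are non-negative (otherwise sorting in descending order does not make the rows and columns of the product matrix descending), it represents each lazy iterator L[i] as a `(row, column)` index pair stored in the heap, and the function name is made up for illustration.

```python
import heapq


def nth_largest_product(a, b, n):
    """Return the n-th largest (1-based) product a[i] * b[j].

    Assumes non-negative inputs. Runs in O(m log m + n log s), s = min(m, n).
    """
    if not a or not b or not 1 <= n <= len(a) * len(b):
        raise ValueError("n must be between 1 and len(a) * len(b)")
    a = sorted(a, reverse=True)
    b = sorted(b, reverse=True)
    rows = min(len(a), n)  # rows beyond the first n can never be reached
    cols = min(len(b), n)  # same for columns
    # heapq is a min-heap, so store negated products.
    # Each entry (-product, i, j) is the current position of iterator L[i].
    heap = [(-(a[i] * b[0]), i, 0) for i in range(rows)]
    heapq.heapify(heap)
    for _ in range(n - 1):
        _, i, j = heapq.heappop(heap)
        if j + 1 < cols:  # re-insert the iterator unless it is exhausted
            heapq.heappush(heap, (-(a[i] * b[j + 1]), i, j + 1))
    return -heap[0][0]


print(nth_largest_product([1, 3, 4], [2, 2, 5], 3))  # 8
```

Storing index pairs instead of generator objects keeps the heap entries comparable and avoids any per-iterator state beyond the column index.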

+4

You do not need to sort 500,000 items to get the top 3.

Just take the first 3, put them in a SortedList, and iterate over the rest of the list, replacing the smallest of the three elements with the new value if it is larger, and re-sorting the resulting list.

Do this for both lists, and you will end up with a 3 * 3 matrix, from which it should be easy to take the third value.
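As a rough Python sketch of this idea (not the implementation referred to in this answer), `heapq.nlargest` can stand in for the hand-maintained SortedList. It assumes non-negative numbers, so that the nth largest product always lies among the products of the top n elements of each list; the function name is made up for illustration.

```python
import heapq


def nth_largest_product_small_n(a, b, n):
    """Return the n-th largest (1-based) product, intended for small n.

    Assumes non-negative inputs. Keeps only the n largest values of each
    list, then sorts the resulting (at most) n * n products.
    """
    if not 1 <= n <= len(a) * len(b):
        raise ValueError("n must be between 1 and len(a) * len(b)")
    top_a = heapq.nlargest(n, a)  # O(m log n), no full sort needed
    top_b = heapq.nlargest(n, b)
    products = sorted((x * y for x in top_a for y in top_b), reverse=True)
    return products[n - 1]


print(nth_largest_product_small_n([1, 3, 4], [2, 2, 5], 3))  # 8
```

This costs O(m log n + n^2 log n), so it only pays off while n stays much smaller than m.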

Here is an implementation in Scala.

Assuming that n is less than m and A = [1, 3, 4] and B = [2, 2, 5], n = 2:

You would take (3, 4) => sort them (4, 3)
Then we take (2, 5) => sort them (5, 2)

Now you can search in a zipper-like fashion. Of course, the biggest product is now 5 * 4. But the next one is either 4 * 2 or 5 * 3. For longer lists, you can remember the result of 4 * 2 and compare it only with the next product taken the other way. That way you would compute only one product too many.

0

I do not think that there is an O(f(n)) algorithm that is independent of m.

But there is a relatively fast O(n * log m) algorithm:

First, we sort the two arrays, so that A[0] >= A[1] >= ... >= A[m-1] and B[0] >= B[1] >= ... >= B[m-1]. (This, of course, is O(m log m).)

Then we construct a max-heap whose elements are A[0] * B[0], A[0] * B[1], ..., A[0] * B[m-1]. We also maintain a "pointer array" P[0], P[1], ..., P[m-1]. P[i] = x means that B[i] * A[x] is currently in the heap. At first, all P[i] are equal to zero.

At each iteration, we pop the max element from the heap, which is the next largest product. Assuming it comes from B[i] * A[P[i]] (we can record, for each item on the heap, which B[i] it comes from), we then move the corresponding pointer forward: P[i] += 1, and push the new B[i] * A[P[i]] into the heap. (If P[i] moves out of range (>= m), we simply push -inf into the heap.)

After the nth iteration, we get the nth largest product.

There are n iterations, each of which costs O(log m).
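A minimal Python sketch of this pointer-array variant follows. It assumes non-negative numbers and non-empty lists, uses negated values because `heapq` is a min-heap, and the function name is hypothetical.

```python
import heapq


def nth_largest_with_pointers(a, b, n):
    """Return the n-th largest (1-based) product using a heap of m entries.

    Assumes non-negative inputs. p[i] = x means b[i] * a[x] is in the heap.
    """
    if not a or not b or not 1 <= n <= len(a) * len(b):
        raise ValueError("n must be between 1 and len(a) * len(b)")
    a = sorted(a, reverse=True)
    b = sorted(b, reverse=True)
    p = [0] * len(b)
    # Each heap entry is (-product, i), where i records which b[i] it came from.
    heap = [(-(a[0] * b[i]), i) for i in range(len(b))]
    heapq.heapify(heap)
    result = None
    for _ in range(n):
        neg_product, i = heap[0]  # current maximum product
        result = -neg_product
        p[i] += 1
        # Out-of-range pointers are replaced by -inf so they are never chosen.
        nxt = a[p[i]] * b[i] if p[i] < len(a) else float("-inf")
        heapq.heapreplace(heap, (-nxt, i))  # pop the max and push its successor
    return result


print(nth_largest_with_pointers([1, 3, 4], [2, 2, 5], 3))  # 8
```

Including the sort and the O(m) heap construction, the total cost is O(m log m + n log m).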

Edit: added some details

0
