Let the number of transactions n , the number of elements k , and the average transaction size - d .
A naive approach (check the pair in all entries) will give you a solution O(k^2 * n * d) , which is not very optimal. But we can improve it to O(k*n*d) , and if we accept a uniform distribution of elements (i.e., each element is repeated on average O(n*d/k) times) - we could improve it to O(d^2 * n + k^2) (which is much better, since most probably d << k ).
This can be done by constructing the inverted index of your transactions, that is - create a map from the elements for their transactions (creating an index O(nd + k) ).
Example if you have transactions
transaction1 = ('apple','grape') transaction2 = ('apple','banana','mango') transaction3 = ('grape','mango')
The inverted index will be:
'apple' -> [1,2] 'grape' -> [1,3] 'banana' -> [2] 'mango' -> [2,3]
So, understanding what an inverted index is, here are some recommendations to solve:
Difficulty Analysis:
- Building the inverted index
O(nd+k) - Assuming that each element is repeated in
O(nd/k) transactions, each iteration takes O(nd/k * d) time, and there are k iterations in this step, so you get O(nd^2 + k) for this step . - List processing can be done in O (k ^ 2logk), if you want to completely order or just want to print the top elements of X, this can be done in
O(k^2) .
The totality in the solution is O(nd^2 + k^2) for obtaining elements of the upper X, which is much better than the naive approach, assuming d << k .
Also, note that the bottleneck (step 2) can be efficiently parallelized and distributed between threads if necessary.