Heuristics for finding items that often appear together in a large dataset

Problem:

I have a list of millions of transactions. Each transaction contains elements (for example, "carrots", "apples"), the goal is to create a list of pairs of elements that are often found together in separate transactions. As far as I can tell, an exhaustive search is not possible.

Decision:

So far I have two ideas. 1) Randomly give an example of the corresponding share of transactions and check only those, or 2) count how often each element appears, use this data to calculate how often the elements should appear together randomly and use them to change the estimate from 1.

Any advice, alternative approaches, turnkey solutions or just general reading recommendations are welcome.

Edit:

Additional information from comments

The number of different items: from 1000 to 100 000

Memory limit: several gigabytes of ram for several hours.

Frequency of use: more or less than once.

Available resources: 20-100 hours of a novice programmer.

The format of the list of desired results is a pair of elements and some measures, how often they appear, for the n most frequent pairs.

Distribution of goods per transaction: Unknown at this time.

+7
source share
2 answers

Let the number of transactions n , the number of elements k , and the average transaction size - d .

A naive approach (check the pair in all entries) will give you a solution O(k^2 * n * d) , which is not very optimal. But we can improve it to O(k*n*d) , and if we accept a uniform distribution of elements (i.e., each element is repeated on average O(n*d/k) times) - we could improve it to O(d^2 * n + k^2) (which is much better, since most probably d << k ).

This can be done by constructing the inverted index of your transactions, that is - create a map from the elements for their transactions (creating an index O(nd + k) ).

Example if you have transactions

 transaction1 = ('apple','grape') transaction2 = ('apple','banana','mango') transaction3 = ('grape','mango') 

The inverted index will be:

 'apple' -> [1,2] 'grape' -> [1,3] 'banana' -> [2] 'mango' -> [2,3] 

So, understanding what an inverted index is, here are some recommendations to solve:

Difficulty Analysis:

  • Building the inverted index O(nd+k)
  • Assuming that each element is repeated in O(nd/k) transactions, each iteration takes O(nd/k * d) time, and there are k iterations in this step, so you get O(nd^2 + k) for this step .
  • List processing can be done in O (k ^ 2logk), if you want to completely order or just want to print the top elements of X, this can be done in O(k^2) .

The totality in the solution is O(nd^2 + k^2) for obtaining elements of the upper X, which is much better than the naive approach, assuming d << k .

Also, note that the bottleneck (step 2) can be efficiently parallelized and distributed between threads if necessary.

+7
source

If the number of items ordered in one purchase is small (<10), follow these steps:
have a map of cards (dictionary of dictionaries): the key on the first map is an element,
the value on the first card is the card whose key is the second element, the value counts how many times it appeared when buying with the first key.

So, go through each order and each card update pair. At the end, go through the map of the cards and find "large values" in the "second value"

Note: depending on the size and "distribution" of the input data, you may lose insufficient memory

0
source

All Articles