Heuristics for finding items that often appear together in a large dataset

Question

Heuristics for finding items that often appear together in a large dataset

Problem:

I have a list of millions of transactions. Each transaction contains elements (for example, "carrots", "apples"), the goal is to create a list of pairs of elements that are often found together in separate transactions. As far as I can tell, an exhaustive search is not possible.

Decision:

So far I have two ideas. 1) Randomly give an example of the corresponding share of transactions and check only those, or 2) count how often each element appears, use this data to calculate how often the elements should appear together randomly and use them to change the estimate from 1.

Any advice, alternative approaches, turnkey solutions or just general reading recommendations are welcome.

Edit:

Additional information from comments

The number of different items: from 1000 to 100 000

Memory limit: several gigabytes of ram for several hours.

Frequency of use: more or less than once.

Available resources: 20-100 hours of a novice programmer.

The format of the list of desired results is a pair of elements and some measures, how often they appear, for the n most frequent pairs.

Distribution of goods per transaction: Unknown at this time.

+7

algorithm statistics heuristics

Raceg Jan 4 '13 at 14:14

source share

2 answers

If the number of items ordered in one purchase is small (<10), follow these steps:
have a map of cards (dictionary of dictionaries): the key on the first map is an element,
the value on the first card is the card whose key is the second element, the value counts how many times it appeared when buying with the first key.

So, go through each order and each card update pair. At the end, go through the map of the cards and find "large values" in the "second value"

Note: depending on the size and "distribution" of the input data, you may lose insufficient memory

0

NoSenseEtAl Jan 4 '13 at 16:36

source share

amit · Accepted Answer · 2013-01-04T16:13:25+0000

Let the number of transactions n , the number of elements k , and the average transaction size - d .

A naive approach (check the pair in all entries) will give you a solution O(k^2 * n * d) , which is not very optimal. But we can improve it to O(k*n*d) , and if we accept a uniform distribution of elements (i.e., each element is repeated on average O(n*d/k) times) - we could improve it to O(d^2 * n + k^2) (which is much better, since most probably d << k ).

This can be done by constructing the inverted index of your transactions, that is - create a map from the elements for their transactions (creating an index O(nd + k) ).

Example if you have transactions

 transaction1 = ('apple','grape') transaction2 = ('apple','banana','mango') transaction3 = ('grape','mango')

The inverted index will be:

 'apple' -> [1,2] 'grape' -> [1,3] 'banana' -> [2] 'mango' -> [2,3]

So, understanding what an inverted index is, here are some recommendations to solve:

Create an inverted index for your data
For each element x, iterate over all the documents that it appears, and create a histogram for all pairs (x,y) such that y matches x .
When you are done, you have a histogram containing k ^ 2 elements that you need to process. This question discusses how to get top-k elements from an unsorted list.

Difficulty Analysis:

Building the inverted index O(nd+k)
Assuming that each element is repeated in O(nd/k) transactions, each iteration takes O(nd/k * d) time, and there are k iterations in this step, so you get O(nd^2 + k) for this step .
List processing can be done in O (k ^ 2logk), if you want to completely order or just want to print the top elements of X, this can be done in O(k^2) .

The totality in the solution is O(nd^2 + k^2) for obtaining elements of the upper X, which is much better than the naive approach, assuming d << k .

Also, note that the bottleneck (step 2) can be efficiently parallelized and distributed between threads if necessary.

Heuristics for finding items that often appear together in a large dataset

More articles: