Data mining and frequent itemsets

Question

I'm doing some revision for my exams and going through some past papers, but unfortunately there are no answers. I've answered one question, and I was wondering if anyone could tell me whether I'm right.

My question

(c) The transaction data set, T, is shown below:
t1: milk, chicken, beer
t2: chicken, cheese
t3: cheese, boots
t4: cheese, chicken, beer
t5: chicken, beer, clothes, cheese, milk
t6: clothes, beer, milk
t7: beer, milk, clothes
Suppose the minimum support is 0.5 (minsup = 0.5).
(i) Find all frequent itemsets.

Here is how I worked it out:

Item: Count
Milk: 4
Chicken: 4
Beer: 5
Cheese: 4
Boots: 1
Clothes: 3

Since minsup is 0.5 and there are 7 transactions, an item has to appear in at least 4 of them (0.5 × 7 = 3.5), so you eliminate boots and clothes and make combinations of the remaining items:

{Items}: Count
{Milk, Chicken}: 2
{Milk, Beer}: 4
{Milk, Cheese}: 1
{Chicken, Beer}: 3
{Chicken, Cheese}: 3
{Beer, Cheese}: 2

Does that leave {Milk, Beer} as the only frequent itemset, since it is the only one above minsup?

+4

data mining

Nanor Jan 4 '13 at 20:56

3 answers

I agree that you should go for the Apriori algorithm.

The Apriori algorithm is based on the idea that for a pair of items to be frequent, each individual item must also be frequent. If the pair {hamburger, ketchup} is frequent, then hamburger on its own must also appear frequently in the baskets. The same goes for ketchup.

So a threshold X is set for the algorithm to decide what is or is not frequent. If an item appears more than X times, it is considered frequent.

The first step of the algorithm is to pass over every item in every basket and compute its frequency (count how many times it appears). This can be done with a hash table of size N, where position y of the hash holds the frequency of item y.

If item y has a frequency greater than X, it is said to be frequent.

In the second step of the algorithm, we pass over the baskets again, this time computing the frequency of pairs. The catch is that we only count pairs whose items are individually frequent. So if item y and item z are each frequent on their own, we then compute the frequency of the pair. This condition greatly reduces the number of pairs to count and the amount of memory needed.

Once this is computed, the pairs with frequencies above the threshold are the frequent itemsets.
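The two passes described above can be sketched in a few lines of Python. This is only a minimal illustration run on the seven transactions from the question, not a tuned implementation; the names `baskets`, `min_count` and `frequent_pairs` are my own, and I am assuming that minsup = 0.5 over 7 baskets means "appears in at least 4 baskets".

```python
from collections import Counter
from itertools import combinations

# Transactions from the question (item names lower-cased).
baskets = [
    {"milk", "chicken", "beer"},
    {"chicken", "cheese"},
    {"cheese", "boots"},
    {"cheese", "chicken", "beer"},
    {"chicken", "beer", "clothes", "cheese", "milk"},
    {"clothes", "beer", "milk"},
    {"beer", "milk", "clothes"},
]

# Assumption: 0.5 * 7 = 3.5, so an itemset needs a count of at least 4.
min_count = 4

# Pass 1: count every single item (the Counter plays the role of the hash).
item_counts = Counter(item for basket in baskets for item in basket)
frequent_items = {item for item, n in item_counts.items() if n >= min_count}

# Pass 2: count only the pairs whose two members are both frequent.
pair_counts = Counter()
for basket in baskets:
    kept = sorted(basket & frequent_items)  # drop infrequent items first
    pair_counts.update(combinations(kept, 2))

frequent_pairs = {pair for pair, n in pair_counts.items() if n >= min_count}
```

Sorting the kept items before building pairs makes each pair come out in one canonical order, so ("beer", "milk") and ("milk", "beer") are never counted separately.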

( http://girlincomputerscience.blogspot.com.br/2013/01/frequent-itemset-problem-for-mapreduce.html )

+2

Renata Feb 06 '13 at 10:34

OK, to get started, you must first understand that data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

Now, the amount of raw data stored in corporate databases is exploding. From trillions of point-of-sale transactions and credit card purchases to pixel-by-pixel images of galaxies, databases are now measured in gigabytes and terabytes. (One terabyte = one trillion bytes. A terabyte is equivalent to about 2 million books!) For example, Wal-Mart uploads 20 million point-of-sale transactions every day to an A&T massively parallel system with 483 processors running a centralized database. Raw data by itself, however, does not provide much information. In today's fiercely competitive business environment, companies need to rapidly turn these terabytes of raw data into significant insights about their customers and markets to guide their marketing, investment and management strategies.

You should now understand that association rule mining is an important data mining model. Its mining algorithms discover all item associations (or rules) in the data that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf) constraints. Minsup controls the minimum number of data cases that a rule must cover. Minconf controls the predictive strength of the rule. Since only one minsup is used for the whole database, the model implicitly assumes that all items in the data are of the same nature and/or have similar frequencies in the data. However, this is rarely the case in real-world applications. In many applications, some items appear very frequently in the data, while others rarely appear. If minsup is set too high, rules that involve rare items will not be found. To find rules that involve both frequent and rare items, minsup has to be set very low. This may cause a combinatorial explosion, because the frequent items will be associated with one another in all possible ways. This dilemma is called the rare item problem. This article proposes a new method for solving this problem. The method allows the user to specify multiple minimum supports to reflect the nature of the items and their varied frequencies in the database. In rule mining, different rules may need to satisfy different minimum supports, depending on which items are in the rules.
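To make the multiple-minimum-supports idea a bit more concrete, here is a small Python sketch. The text above does not say how the threshold for a rule is derived from its items, so I am assuming one common convention: each item gets its own minimum item support (MIS), and an itemset must meet the lowest MIS among its items. The names `mis`, `itemset_minsup` and `is_frequent`, and the MIS values themselves, are made up for illustration.

```python
# Hypothetical per-item minimum supports; these numbers are invented.
mis = {"milk": 0.5, "beer": 0.5, "boots": 0.1}

def itemset_minsup(itemset, mis):
    """Threshold for an itemset: the lowest MIS among its items (assumed rule)."""
    return min(mis[item] for item in itemset)

def is_frequent(itemset, support, mis):
    """True if the observed support meets the itemset's own threshold."""
    return support >= itemset_minsup(itemset, mis)
```

With these values, an itemset containing the rare item boots only has to reach a support of 0.1, while {milk, beer} still has to reach 0.5, so rare items can show up in rules without lowering the bar for everything else.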

Given a set of transactions T (the database), the problem of mining association rules is to discover all association rules whose support and confidence are greater than the user-specified minimum support (called minsup) and minimum confidence (called minconf).
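As a rough illustration of the two measures, the sketch below computes support and confidence on the seven transactions from the question. It uses the usual definitions (support = fraction of transactions containing the itemset, confidence = support of the whole rule divided by support of its left-hand side); the function names `support` and `confidence` are my own.

```python
from fractions import Fraction

# Transactions from the question (item names lower-cased).
transactions = [
    {"milk", "chicken", "beer"},
    {"chicken", "cheese"},
    {"cheese", "boots"},
    {"cheese", "chicken", "beer"},
    {"chicken", "beer", "clothes", "cheese", "milk"},
    {"clothes", "beer", "milk"},
    {"beer", "milk", "clothes"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def confidence(antecedent, consequent):
    """Support of (antecedent and consequent together) / support of antecedent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"milk", "beer"}))       # 4/7
print(confidence({"milk"}, {"beer"}))  # 1
print(confidence({"beer"}, {"milk"}))  # 4/5
```

So the rule milk -> beer holds in every transaction that contains milk, while beer -> milk holds in 4 of the 5 transactions that contain beer.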

I hope that once you understand the basics of data mining, the answer to this question will become obvious.

0

user1949882 Jan 4 '13 at 9:38

ankita · Accepted Answer · 2013-01-05T09:52:56+0000

There are two ways to solve the problem:

Using the Apriori algorithm
Using FP counting

Assuming you are using Apriori, the answer you got is correct. The algorithm is simple:
First, you count the single items and exclude the ones below the minimum support.
Then you count the 2-item sets, built by combining the frequent items from the previous iteration, and exclude the sets below the support threshold.
The algorithm continues in the same way until no itemsets meet the threshold.
In the problem you were given, only one 2-item set meets the threshold, so you cannot go any further.
There is a worked example of the further steps on Wikipedia.
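The steps above can be sketched as a short level-wise loop in Python. This is a minimal sketch for checking small exercises by hand, not an efficient implementation; the helper names `support_count` and `apriori` are my own, and I am assuming that minsup = 0.5 over 7 transactions translates to a count of at least 4.

```python
from itertools import combinations

# Transactions from the question (item names lower-cased).
transactions = [
    {"milk", "chicken", "beer"},
    {"chicken", "cheese"},
    {"cheese", "boots"},
    {"cheese", "chicken", "beer"},
    {"chicken", "beer", "clothes", "cheese", "milk"},
    {"clothes", "beer", "milk"},
    {"beer", "milk", "clothes"},
]
min_count = 4  # assumption: 0.5 * 7 = 3.5, rounded up

def support_count(itemset, transactions):
    """Number of transactions containing every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)

def apriori(transactions, min_count):
    """Return {frozenset: count} for every itemset with count >= min_count."""
    items = {item for t in transactions for item in t}
    level = [frozenset([i]) for i in items
             if support_count(frozenset([i]), transactions) >= min_count]
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = support_count(itemset, transactions)
        k += 1
        previous = set(level)
        # Join: merge two frequent (k-1)-item sets that together have k items.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # Prune: keep a candidate only if all its (k-1)-item subsets are frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in previous for s in combinations(c, k - 1))}
        level = [c for c in candidates
                 if support_count(c, transactions) >= min_count]
    return frequent

result = apriori(transactions, min_count)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

On the question's data this returns the four frequent single items (milk, chicken, beer, cheese) plus {milk, beer}. The loop stops at size 3 because a single frequent pair cannot be joined with anything to form a candidate.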

For more information, see Han and Kamber, "Data Mining: Concepts and Techniques."
