Data situation

Question

Suppose I have data as follows.

11:00 AM user1 Brush
11:05 AM user1 Prep breakfast
11:10 AM user1 Eat breakfast
11:15 AM user1 Take a bath
11:30 AM user1 Leave for the office

12:00 PM user2 Brush
12:05 PM user2 Prep breakfast
12:10 PM user2 Eat breakfast
12:15 PM user2 Take a bath
12:30 PM user2 Leave for the office

11:00 AM user3 Take a bath
11:05 AM user3 Prep breakfast
11:10 AM user3 Brush
11:15 AM user3 Eat breakfast
11:30 AM user3 Leave for the office

12:00 PM user4 Take a bath
12:05 PM user4 Prep breakfast
12:10 PM user4 Brush
12:15 PM user4 Eat breakfast
12:30 PM user4 Leave for the office

These logs describe the daily routines of different people. user1 and user2 appear to behave the same way: they perform their activities at different times, but in the same order. By the same reasoning, user3 and user4 behave alike. I need to put these users into groups. In this example, group 1 would contain user1 and user2, and group 2 would contain user3 and user4.

How should I approach a problem like this? I am trying to learn data mining, and this is an example I came across as a data mining problem. I have been trying to work out an approach, but I can't come up with one. I believe the data contains a pattern, but I can't think of a technique that would reveal it. The approach also has to work on my actual data set, which is much larger but has the same structure: logs that record events with timestamps. I want to find groups of users that share a similar sequence of events.

Any pointers would be appreciated.

data-mining text-mining

user722856 30 sept '11 at 17:23

2 answers

Using an itemset mining algorithm such as Apriori, as suggested in the other answer, is not the best solution, because Apriori takes neither time nor sequential order into account. To account for order, an additional preprocessing step is required.
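To see why order matters here, a small sketch using the activity names from the question: treated as unordered itemsets, user1 and user3 are indistinguishable, even though their sequences differ. The variable names are only illustrative.

```python
# Activity sequences for two users from the question, in time order.
user1 = ["Brush", "Prep breakfast", "Eat breakfast", "Take a bath", "Leave for the office"]
user3 = ["Take a bath", "Prep breakfast", "Brush", "Eat breakfast", "Leave for the office"]

# An itemset view (what plain Apriori sees) discards the order.
same_items = set(user1) == set(user3)
# A sequence view keeps it.
same_order = user1 == user3

print(same_items, same_order)  # True False
```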

A better solution is to use a sequential pattern mining algorithm such as PrefixSpan, SPADE or CM-SPADE. Such an algorithm directly finds the subsequences that appear frequently in a set of sequences.

You can then apply clustering to the sequential patterns that were found.
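As a rough illustration of sequential pattern mining, here is a minimal sketch in the spirit of PrefixSpan. It is simplified: each event is a single item (no itemsets), gaps between matched events are allowed, and there are no time constraints. The function name `prefixspan`, the shortened activity labels and the `min_support` value are my own choices for the example, not part of any library.

```python
from collections import defaultdict

def prefixspan(sequences, min_support):
    """Return {pattern: support} for every subsequence (gaps allowed)
    that occurs in at least `min_support` of the input sequences."""
    results = {}

    def mine(prefix, projected):
        # `projected` holds (sequence index, start position) pairs: the part
        # of each sequence that follows the first match of `prefix`.
        first_pos = defaultdict(dict)
        for seq_idx, start in projected:
            seq = sequences[seq_idx]
            for pos in range(start, len(seq)):
                # Remember only the first occurrence of each item per sequence.
                first_pos[seq[pos]].setdefault(seq_idx, pos)
        for item, occurrences in first_pos.items():
            if len(occurrences) >= min_support:
                pattern = prefix + (item,)
                results[pattern] = len(occurrences)
                # Grow the pattern using what follows this item.
                mine(pattern, [(i, p + 1) for i, p in occurrences.items()])

    mine((), [(i, 0) for i in range(len(sequences))])
    return results

logs = {
    "user1": ["Brush", "Prep", "Eat", "Bath", "Leave"],
    "user2": ["Brush", "Prep", "Eat", "Bath", "Leave"],
    "user3": ["Bath", "Prep", "Brush", "Eat", "Leave"],
    "user4": ["Bath", "Prep", "Brush", "Eat", "Leave"],
}
patterns = prefixspan(list(logs.values()), min_support=2)
for pattern, support in sorted(patterns.items()):
    if len(pattern) >= 2:
        print(" -> ".join(pattern), support)
```

Patterns supported by only some of the users (for example `Brush -> Prep`) are the ones that separate the groups, so they are natural features for the clustering step.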

Phil Apr 11 '15 at 10:22

ffriend · Accepted Answer · 2011-09-30T20:43:04+0000

This looks like clustering on top of association mining, more precisely the Apriori algorithm. Something like this:

1. Enumerate all possible associations between actions, i.e. Brush → Prep Breakfast, Prep Breakfast → Eat Breakfast, ..., Brush → Prep Breakfast → Eat Breakfast, and so on: every pair, triplet, quadruple, etc. that you can find in your data.
2. Create a separate attribute for each such sequence. For better performance, add a boost of 2 for pair attributes, 3 for triplets, and so on.
3. At this point you should have an attribute vector with a corresponding boost vector. You can now compute a feature vector for each user: set 1 * boost at each position if that sequence occurs in the user's actions, and 0 otherwise. This gives you a vector representation of each user.
4. Run the clustering algorithm that best suits your needs on these vectors. Each cluster found is one of the groups you are looking for.

Example:

Label each action with a letter:

a - Brush
b - Prep breakfast
c - Eat breakfast
d - Take a bath ...

Your attributes will look like

a1: a-> b
a2: a-> c
a3: a-> d
...
a10: b-> a
a11: b-> c
a12: b-> d
...
a30: a-> b-> c-> d
a31: a-> b-> d-> c
...

In this case, the users' feature vectors will be:

attributes = a1, a2, a3, a4, ..., a10, a11, a12, ..., a30, a31, ...
user1      =  2,  0,  0,  0, ...,   0,   2,   0, ...,   4,   0, ...
user2      =  2,  0,  0,  0, ...,   0,   2,   0, ...,   4,   0, ...
user3      =  0,  2,  0,  0, ...,   2,   0,   0, ...,   0,   0, ...
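The feature construction can be sketched as follows. This is only one possible reading of the steps: it assumes an attribute is a run of consecutive actions, it uses the run length as the boost, and it creates attributes only for runs that actually occur in the data. The function names, the `max_len` parameter and the letter `e` for the last action are made up for the example.

```python
def ngram_features(actions, max_len=4):
    """Map every run of 2..max_len consecutive actions to its boost,
    where the boost is simply the length of the run."""
    features = {}
    for n in range(2, max_len + 1):
        for i in range(len(actions) - n + 1):
            features[tuple(actions[i:i + n])] = n
    return features

def to_vectors(user_actions, max_len=4):
    """Return (attributes, {user: feature vector}) over all observed runs."""
    per_user = {u: ngram_features(a, max_len) for u, a in user_actions.items()}
    attributes = sorted({run for feats in per_user.values() for run in feats})
    vectors = {u: [feats.get(run, 0) for run in attributes]
               for u, feats in per_user.items()}
    return attributes, vectors

# Actions encoded as letters; "e" is a hypothetical label for the last action.
user_actions = {
    "user1": list("abcde"),
    "user2": list("abcde"),
    "user3": list("dbace"),
}
attributes, vectors = to_vectors(user_actions)
for user, vector in vectors.items():
    print(user, vector)
```

Limiting the attributes to observed runs keeps the vectors short. Enumerating every possible sequence grows very quickly with the number of distinct actions.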

To compare two users you need a distance measure. The simplest is cosine similarity, i.e. the cosine of the angle between the two feature vectors. If two users have exactly the same sequence of actions, their similarity is 1. If they have nothing in common, their similarity is 0.
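A minimal sketch of that measure in plain Python, assuming both vectors have the same length. Returning 0 for an all-zero vector is a convention chosen here, not something the measure itself defines.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention: an empty vector is similar to nothing
    return dot / (norm_u * norm_v)

same = cosine_similarity([2, 0, 4], [2, 0, 4])      # identical behaviour
disjoint = cosine_similarity([2, 0, 0], [0, 3, 0])  # nothing in common
print(same, disjoint)
```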

With a distance measure in place, use a clustering algorithm (for example, k-means) to form the user groups.
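For illustration, a bare-bones k-means in plain Python. Standard k-means works with Euclidean distance, so this sketch first scales every vector to unit length, which makes Euclidean distance rank pairs the same way cosine similarity does. The deterministic farthest-point initialisation and the toy vectors are assumptions made to keep the example reproducible. In practice a library implementation with random restarts would be the usual choice.

```python
import math

def normalize(v):
    """Scale a vector to unit length (all-zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def sq_dist(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; returns one cluster label per point."""
    # Start from the first point, then repeatedly add the point that is
    # farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy feature vectors: two users share one set of sequences, two share another.
users = ["user1", "user2", "user3", "user4"]
raw = [[2, 2, 4, 0, 0], [2, 2, 4, 0, 0], [0, 0, 0, 2, 2], [0, 0, 0, 2, 2]]
labels = kmeans([normalize(v) for v in raw], k=2)
print(dict(zip(users, labels)))
```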
