Scikit-learn - how to model an individual feature consisting of many independent values

My data set consists of millions of rows and several (10) features.

One feature is a label that takes 1000 different values (imagine that each row is a user and this feature is the username):

Firstname,Feature1,Feature2,....
Quentin,1,2
Marc,0,2
Gaby,1,0
Quentin,1,0

What would be the best representation for this feature (in order to perform clustering)? Here are the options I see (each is sketched in code after the list):

  • I could convert the column as a whole using LabelEncoder , but that does not make sense here, since there is no logical "order" between two different labels:

     Firstname,F1,F2,....
     0,1,2
     1,0,2
     2,1,0
     0,1,0
  • I could split the feature into 1000 features (one per label), with 1 when the label matches and 0 otherwise. However, this results in a very large matrix (too large if I cannot use a sparse matrix in my classifier):

     Quentin,Marc,Gaby,F1,F2,....
     1,0,0,1,2
     0,1,0,0,2
     0,0,1,1,0
     1,0,0,1,0
  • I could write the LabelEncoder value in binary across N columns. This reduces the dimensionality of the final matrix compared to the previous idea, but I am not sure about the result:

     LabelEncoder(Quentin) = 0 = 0,0
     LabelEncoder(Marc)    = 1 = 0,1
     LabelEncoder(Gaby)    = 2 = 1,0

     A,B,F1,F2,....
     0,0,1,2
     0,1,0,2
     1,0,1,0
     0,0,1,0
  • ... any other idea?
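A minimal sketch of the three encodings above, assuming pandas and scikit-learn (the toy DataFrame and all variable names are illustrative; note that OneHotEncoder's sparse_output argument is called sparse in scikit-learn versions before 1.2):

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder

    df = pd.DataFrame({
        "Firstname": ["Quentin", "Marc", "Gaby", "Quentin"],
        "F1": [1, 0, 1, 1],
        "F2": [2, 2, 0, 0],
    })

    # Option 1: one integer code per label (imposes an artificial order)
    labels = LabelEncoder().fit_transform(df["Firstname"])

    # Option 2: one-hot encoding; sparse output keeps a 1000-column matrix cheap
    onehot = OneHotEncoder(sparse_output=True).fit_transform(df[["Firstname"]])

    # Option 3: the integer code written in binary over ceil(log2(n)) columns
    n_bits = int(np.ceil(np.log2(labels.max() + 1)))
    binary = (labels[:, None] >> np.arange(n_bits - 1, -1, -1)) & 1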

What do you think of solution 3?


Edit: additional explanations

I should have mentioned it in my first post, but in the real data set this feature is more like the last level of a classification tree (Aa1, Aa2, etc. in the example below; it is not a binary tree):

  A
    Aa: Aa1, Aa2
    Ab: Ab1, Ab2, Ab3
  B
    Ba: Ba1, Ba2
    Bb: Bb1, Bb2
  C
    Ca: Ca1, Ca2
    Cb: Cb1, Cb2

Thus, there is similarity between terms at the same level ( Aa1 and Aa2 are very similar, and Aa1 is less different from Ba1 than from Cb2 ).
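A hedged sketch of one way to use that hierarchy (the column names and string slicing are assumptions based on the three-character labels above): one-hot encode each level separately, so the number of mismatching columns between two leaves reflects how early their paths diverge:

    import pandas as pd

    # Toy leaves; in the real data these would come from the label column.
    leaves = pd.Series(["Aa1", "Aa2", "Ba1", "Cb2"])
    levels = pd.DataFrame({
        "level1": leaves.str[0],   # A / B / C
        "level2": leaves.str[:2],  # Aa / Ab / Ba / ...
        "level3": leaves,          # full leaf label
    })
    # One-hot per level: Aa1 vs Aa2 then differ only at the deepest level,
    # while Aa1 vs Cb2 differ at all three levels.
    encoded = pd.get_dummies(levels)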

The ultimate goal is to find objects similar to a smaller data set: we train a OneClassSVM on the smaller data set and then compute the distance of each member of the entire data set to it.
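A minimal sketch of that workflow (random toy matrices stand in for the encoded small and full data sets):

    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(0)
    X_small = rng.normal(size=(100, 5))   # encoded features of the small set
    X_full = rng.normal(size=(1000, 5))   # encoded features of the entire set

    # Fit on the small reference set, then score every row of the full set;
    # decision_function returns a signed distance to the learned boundary.
    oc = OneClassSVM(kernel="rbf", gamma="scale").fit(X_small)
    distances = oc.decision_function(X_full)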

2 answers

This problem is largely one of one-hot encoding: how do we represent many categorical values in such a way that we can use clustering algorithms without throwing off the distance computation your algorithm has to perform? (You could use some kind of probabilistic mixture model instead, but I digress.) As user3914041's answer says, there really is no definitive answer, but I will go through each solution you propose and give my impression:

Solution 1

If you convert the categorical column into a numeric column, as you mention, you run into the rather big problem you already describe: you basically lose the meaning of the column. What does it even mean if Quentin is 0, Marc is 1, and Gaby is 2? At that point, why even include the column in the clustering? As user3914041's answer notes, this is the easiest way to turn your categorical values into numeric ones, but the resulting numbers are simply not useful and can even be harmful to the clustering results.
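To make that concrete (toy names only; LabelEncoder assigns codes in sorted order):

    from sklearn.preprocessing import LabelEncoder

    codes = LabelEncoder().fit_transform(["Quentin", "Marc", "Gaby"])
    # Gaby -> 0, Marc -> 1, Quentin -> 2, so |Quentin - Gaby| = 2 while
    # |Marc - Gaby| = 1, although all three names are equally unrelated.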

Solution 2

In my opinion, depending on how you implement all of this and on your goals for the clustering, this is your best bet. Since I assume you plan on using sklearn and something like k-means, you can make good use of sparse matrices. However, as imaluengo suggests, you should consider using a different distance metric. What you can consider is scaling all your numerical features to the same range as the one-hot categorical features and then using something like cosine distance. Or a combination of distance metrics, as I mention below. But this is likely to be the most useful representation of your categorical data for your clustering algorithm.
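A sketch of that setup (toy data again; scikit-learn's KMeans is Euclidean, so normalizing rows to unit length is used here as a stand-in for cosine distance, since squared Euclidean distance between unit vectors is a monotone function of cosine similarity):

    import pandas as pd
    from scipy import sparse
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, normalize

    df = pd.DataFrame({"Firstname": ["Quentin", "Marc", "Gaby", "Quentin"],
                       "F1": [1, 0, 1, 1], "F2": [2, 2, 0, 0]})

    cat = OneHotEncoder().fit_transform(df[["Firstname"]])  # sparse 0/1 columns
    num = MinMaxScaler().fit_transform(df[["F1", "F2"]])    # scaled to [0, 1]
    X = normalize(sparse.hstack([cat, sparse.csr_matrix(num)]).tocsr())
    clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)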

Solution 3

I agree with user3914041 that this is not useful and introduces some of the same problems mentioned in #1 - you lose meaning when two (possibly) completely different names share a column value.

Solution 4

An additional solution is to follow the recommendations here . You could consider implementing your own version of a k-means-like algorithm that combines distance metrics (Hamming distance for the one-hot encoded categorical data and Euclidean for the rest). There appears to be some work on developing k-means-like algorithms for mixed categorical and numerical data, for example here .
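A sketch of such a combined metric (the function and its gamma weight are assumptions in the spirit of k-prototypes, not any particular library's API):

    import numpy as np

    def mixed_distance(a_num, a_cat, b_num, b_cat, gamma=1.0):
        """Euclidean distance on the numeric part plus a gamma-weighted
        Hamming count on the categorical part."""
        euclidean = np.sqrt(np.sum((a_num - b_num) ** 2))
        hamming = np.sum(a_cat != b_cat)
        return euclidean + gamma * hamming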

I think it is also important to consider whether you need to cluster on this categorical data at all. What do you hope to see?


Solution 3:
I would say it has the same drawback as the 1..N encoding (solution 1), just in a less obvious way. You will have names that get a 1 in some column for no other reason than the encoding order...
Therefore, I would recommend against it.

Solution 1:
The 1..N encoding is the "easy way" to solve the format problem; as you note, it is probably not the best.

Solution 2:
It looks like the best way to do it, but it is a little cumbersome, and in my experience classifiers do not always work very well with a large number of categories.

Solution 4+:
I think the encoding depends on what you want: if you think that similar names (like John and Johnny) should be close, you can use character n-grams to represent them. I doubt that this is the case in your application.
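If spelling similarity did matter, a sketch along those lines (the n-gram range is an arbitrary choice):

    from sklearn.feature_extraction.text import CountVectorizer

    names = ["John", "Johnny", "Quentin"]
    vec = CountVectorizer(analyzer="char", ngram_range=(2, 3))
    X_names = vec.fit_transform(names)  # sparse counts of character 2- and 3-grams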

Another approach is to encode a name by its frequency in the (training) data set. In effect you are saying: "popular names should be close to each other; whether it is Sophia or Jackson does not matter."
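A short sketch of frequency encoding (toy data; in practice the frequencies would be learned on the training set only):

    import pandas as pd

    df = pd.DataFrame({"Firstname": ["Quentin", "Marc", "Gaby", "Quentin"]})
    freq = df["Firstname"].value_counts(normalize=True)
    df["name_freq"] = df["Firstname"].map(freq)  # Quentin -> 0.5, others -> 0.25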

Hope the suggestions help; there is no definitive answer to this question, so I am looking forward to seeing what other people do.

