My dataset consists of millions of rows and about 10 features.
One feature is a label that takes 1000 different values (imagine that each row is a user and this feature is the username):
Firstname,Feature1,Feature2,...
Quentin,1,2
Marc,0,2
Gaby,1,0
Quentin,1,0
What would be the best representation of this feature (to perform clustering)?
Solution 1: I could convert the whole column with LabelEncoder, but that does not make sense here, since there is no logical "order" between two different labels:
Firstname,F1,F2,...
0,1,2
1,0,2
2,1,0
0,1,0
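For reference, a minimal sketch of solution 1 with scikit-learn (toy data matching the example above; note that LabelEncoder assigns codes alphabetically, so the exact integers may differ from the illustration):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# toy frame matching the example above
df = pd.DataFrame({
    "Firstname": ["Quentin", "Marc", "Gaby", "Quentin"],
    "F1": [1, 0, 1, 1],
    "F2": [2, 2, 0, 0],
})

# solution 1: replace each name by an arbitrary integer code
# (the codes carry a spurious order, which is exactly the objection above)
df["Firstname"] = LabelEncoder().fit_transform(df["Firstname"])
print(df)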
Solution 2: I could split the feature into 1000 features (one per label), set to 1 when the label matches and 0 otherwise. However, this produces a very wide matrix (too large if I cannot use a sparse matrix in my classifier):
Quentin,Marc,Gaby,F1,F2,...
1,0,0,1,2
0,1,0,0,2
0,0,1,1,0
1,0,0,1,0
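A sketch of solution 2 that keeps the result sparse, so the ~1000 extra columns stay manageable as long as the downstream estimator accepts scipy sparse input (again using the toy example data):

import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import OneHotEncoder

names = np.array(["Quentin", "Marc", "Gaby", "Quentin"]).reshape(-1, 1)
other_features = np.array([[1, 2], [0, 2], [1, 0], [1, 0]])

# solution 2: one column per distinct name, stored as a sparse matrix
onehot = OneHotEncoder().fit_transform(names)           # sparse, shape (n_rows, n_names)
X = sp.hstack([onehot, sp.csr_matrix(other_features)])  # reattach the numeric features
print(X.shape)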
Solution 3: I could write the LabelEncoder value in binary across N columns. This reduces the dimensionality of the final matrix compared to the previous idea, but I am not sure about the effect on the result:
LabelEncoder(Quentin) = 0 = 0,0
LabelEncoder(Marc)    = 1 = 0,1
LabelEncoder(Gaby)    = 2 = 1,0

A,B,F1,F2,...
0,0,1,2
0,1,0,2
1,0,1,0
0,0,1,0
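Solution 3 can be built by hand from the LabelEncoder codes, using ceil(log2(n_labels)) binary columns (about 10 columns for 1000 labels); a sketch:

import numpy as np
from sklearn.preprocessing import LabelEncoder

names = ["Quentin", "Marc", "Gaby", "Quentin"]

codes = LabelEncoder().fit_transform(names)        # integers 0 .. n_labels-1
n_bits = int(np.ceil(np.log2(len(set(names)))))    # binary columns needed (2 here)

# solution 3: write each integer code in base 2, one bit per column
bits = (codes[:, None] >> np.arange(n_bits)[::-1]) & 1
print(bits)  # shape (n_rows, n_bits)

The third-party category_encoders package also provides a BinaryEncoder that does essentially this, if an extra dependency is acceptable.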
... any other idea?
What do you think of solution 3?
Edit: additional explanations
I should have mentioned it in my first post: in the real dataset, this feature is actually the last leaf of a classification tree (Aa1, Aa2, etc. in the example below; it is not a binary tree):
A
    Aa: Aa1, Aa2
    Ab: Ab1, Ab2, Ab3
B
    Ba: Ba1, Ba2
    Bb: Bb1, Bb2
C
    Ca: Ca1, Ca2
    Cb: Cb1, Cb2
Thus, there is similarity between terms at the same level (Aa1 and Aa2 are very similar, while Aa1 is as different from Ba1 as it is from Cb2).
The ultimate goal is to find objects similar to those in a smaller dataset: we train a OneClassSVM on the smaller dataset and then compute the distance of every member of the entire dataset to it.
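For context, the intended pipeline would look roughly like this (X_small and X_full are placeholders for the small training set and the full dataset in whichever encoding is chosen; random data stands in for the real features):

import numpy as np
from sklearn.svm import OneClassSVM

# placeholder data: X_small is the small training set,
# X_full is the full dataset in the same representation
rng = np.random.default_rng(0)
X_small = rng.normal(size=(100, 12))
X_full = rng.normal(size=(10_000, 12))

model = OneClassSVM(kernel="rbf", gamma="scale").fit(X_small)

# signed distance to the learned boundary: larger means more similar
# to the training objects, negative means outside the boundary
distances = model.decision_function(X_full)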