Parameter Estimation in DBSCAN

I need to find natural classes of nouns based on their distribution with different prepositions (for example agent, instrument, time, place, etc.). I tried k-means clustering, but it did not help much: there was a lot of overlap between the classes I was looking for (probably because the classes are not globular in shape and because of the random initialization in k-means).

Now I am trying DBSCAN, but I have trouble understanding the epsilon and minPts parameters of this clustering algorithm. Can I use random values, or do I need to compute them? Can someone help, in particular with epsilon: how do I compute it, if I need to?

1 answer

Use domain knowledge to choose the parameters. Epsilon is a radius: points within epsilon of each other count as neighbors. You can think of it as a minimum cluster size, measured as a distance.

Obviously, random values will not work very well. As a heuristic, you can look at a k-distance plot, but choosing a value from it is not automatic.
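To make the k-distance heuristic concrete, here is a minimal sketch that computes the values such a plot would show, using only the standard library. The toy points and the choice k=3 are assumptions for illustration; k is usually picked close to the minPts you intend to use. You would still inspect the sorted values (or plot them) and pick epsilon near the "knee" by eye.

```python
import math

def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbor, sorted descending."""
    result = []
    for i, p in enumerate(points):
        # Distances to all other points, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result, reverse=True)

# Hypothetical toy data: two tight groups of four points plus one outlier.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
          (50.0, 50.0)]

kd = k_distances(points, k=3)
for rank, d in enumerate(kd):
    # A sharp drop between consecutive values marks the knee.
    print(rank, round(d, 3))
```

In this toy data the outlier has a very large 3-distance and all other points share a small one, so any epsilon between the two levels separates noise from clusters. Real data rarely shows such a clean step.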

The first thing to do anyway is to choose a good distance function for your data and to normalize the data appropriately.
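As an illustration of why normalization matters for the noun/preposition data in the question, here is a small sketch. The nouns, the four prepositions, and all counts are made up; converting counts to relative frequencies is just one possible normalization, not necessarily the right one for your corpus.

```python
import math

def to_relative_frequencies(counts):
    """Turn raw co-occurrence counts into a distribution that sums to 1."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]

# Hypothetical counts of each noun with the prepositions (by, with, at, in).
noun_counts = {
    "teacher": [40, 5, 1, 2],
    "doctor": [400, 50, 10, 20],   # same profile as "teacher", 10x more frequent
    "hammer": [2, 30, 0, 1],
}

raw = noun_counts
norm = {noun: to_relative_frequencies(c) for noun, c in noun_counts.items()}

# On raw counts, overall word frequency dominates the Euclidean distance.
raw_teacher_doctor = math.dist(raw["teacher"], raw["doctor"])
raw_teacher_hammer = math.dist(raw["teacher"], raw["hammer"])

# After normalization, only the shape of the distribution matters.
norm_teacher_doctor = math.dist(norm["teacher"], norm["doctor"])
norm_teacher_hammer = math.dist(norm["teacher"], norm["hammer"])

print(raw_teacher_doctor > raw_teacher_hammer)    # frequency effect on raw counts
print(norm_teacher_doctor < norm_teacher_hammer)  # profile effect after normalizing
```

On the raw counts, "teacher" ends up closer to "hammer" than to "doctor" simply because "doctor" is more frequent. After normalization the two agent-like nouns coincide. Epsilon only has a sensible meaning once the distance reflects the similarity you care about.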

As for minPts, it again depends on your data and your needs. One user may want a very different value than another. And, of course, minPts and epsilon are coupled. If you double epsilon, you will need to increase minPts by a factor of roughly 2^d (for Euclidean distance in d dimensions, because that is how the volume of a hypersphere grows).
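The coupling between epsilon and minPts can be written as a one-line rule of thumb. The helper below, scale_min_pts, is a hypothetical name and not part of any library; it assumes Euclidean distance and roughly uniform local density, so treat the result as a starting point rather than an exact value.

```python
def scale_min_pts(min_pts, eps_old, eps_new, dim):
    """Rescale minPts when epsilon changes, using the hypersphere volume ratio.

    Assumes Euclidean distance in `dim` dimensions and locally uniform density.
    """
    factor = (eps_new / eps_old) ** dim
    return max(1, round(min_pts * factor))

# Doubling epsilon in 2 dimensions multiplies minPts by 2^2 = 4.
print(scale_min_pts(4, eps_old=0.5, eps_new=1.0, dim=2))  # 16
# Doubling epsilon in 3 dimensions multiplies minPts by 2^3 = 8.
print(scale_min_pts(4, eps_old=0.5, eps_new=1.0, dim=3))  # 32
```

The factor grows very quickly with the number of dimensions, which is one reason why it is hard to pick sensible parameters for high-dimensional data.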

If you want many small, finely detailed clusters, choose a low minPts. If you want larger and fewer clusters (and more noise), use a larger minPts. If you don't want any clusters at all, choose minPts larger than your data set size...
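The effect of minPts can be seen on a tiny example. The following is a minimal, unoptimized DBSCAN sketch written only for illustration (quadratic neighbor search, hypothetical toy points, epsilon fixed at 1.5); for real data you would use an existing implementation. As in the original algorithm, a point counts as its own neighbor here.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one label per point, -1 meaning noise."""
    n = len(points)
    # Epsilon-neighborhood of every point (includes the point itself).
    neighbors = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        labels[i] = cluster            # i is a core point: start a new cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is also core: keep expanding
        cluster += 1
    return labels

# Hypothetical toy data: a small group of 3 points and a larger group of 6.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (11.0, 11.0), (12.0, 10.0), (12.0, 11.0)]

for min_pts in (3, 5, 10):
    labels = dbscan(points, eps=1.5, min_pts=min_pts)
    n_clusters = len(set(labels) - {NOISE})
    print(min_pts, n_clusters, labels.count(NOISE))
```

With minPts=3 both groups form clusters; with minPts=5 the small group turns into noise and only the large group survives; with minPts=10, which is larger than the data set, every point is noise.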
