I have 2000 datasets, each containing no more than 1000 2D points. I want to group these datasets into anywhere from 20 to 100 clusters based on similarity. However, I am having trouble finding a reliable method for comparing datasets. I have tried several (rather primitive) approaches and done a lot of research, but I cannot find anything that matches what I need to do.
I have included an image below showing three of my datasets. The data is bounded to 0-1 on the y axis and, in practice, falls within 0-0.10 on the x axis (in theory it may exceed 0.10).
The shape and relative proportions of the data are probably the most important factors for comparison. However, the absolute location of each dataset also matters. In other words, the closer the relative positions of the points in one dataset are to those in another, the more similar the two datasets are; after that, their absolute positions should be taken into account.
Green and red should be considered very different but, if push comes to shove, they should be more similar than blue and red.

I tried:
- comparing based on total kinks and deviations
- dividing the points into coordinate regions (i.e. (0-0.10, 0-0.10), (0.10-0.20, 0.10-0.20) ... (0.90-1.0, 0.90-1.0)) and comparing similarity based on the points shared within each region
- measuring the average Euclidean distance to the nearest neighbors between datasets
All of them gave erroneous results. The closest answer I could find in my research is "Appropriate similarity metrics for several sets of 2D coordinates". However, the answer to that question involves comparing the average distance of the nearest neighbors from the centroid, which I do not think will work for me, since direction is as important as distance for my purposes.
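For illustration, here is a minimal Python sketch of the second and third approaches listed above. The function names, the cell sizes, and the use of a Jaccard index to score region overlap are assumptions made for the example, not the exact code that was tried. It assumes both datasets are non-empty lists of `(x, y)` tuples.

```python
import math

def mean_nn_distance(a, b):
    """Average Euclidean distance from each point in a to its nearest point in b."""
    return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)

def symmetric_nn_distance(a, b):
    """Average of both directions, so the result does not depend on argument order."""
    return 0.5 * (mean_nn_distance(a, b) + mean_nn_distance(b, a))

def grid_overlap(a, b, cell_x=0.01, cell_y=0.1):
    """Jaccard index of the grid cells occupied by each dataset.

    cell_x and cell_y are hypothetical cell sizes; x uses a finer cell
    because the data mostly falls within 0-0.10 on that axis.
    """
    def occupied(points):
        return {(math.floor(x / cell_x), math.floor(y / cell_y)) for x, y in points}

    cells_a, cells_b = occupied(a), occupied(b)
    union = cells_a | cells_b
    return len(cells_a & cells_b) / len(union)
```

Both measures ignore the direction of the offset between matched points, which may be part of why they gave poor results here. The nearest-neighbor version is also quadratic in the number of points per pair of datasets, so a spatial index would likely be needed to run it over all pairs.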
I should add that this will be used to generate input data for another program and will be run only sporadically (mainly to generate different datasets with different numbers of clusters), so time-consuming algorithms are not out of the question.
language-agnostic algorithm cluster-analysis graphics similarity
mcnulty