Comparison of 2D datasets / scatterplots

I have about 2000 datasets, each containing no more than 1000 2D points. I want to group these datasets into anywhere from 20 to 100 clusters based on similarity. However, I am having trouble finding a reliable method for comparing the datasets. I have tried several (rather primitive) approaches and done a lot of research, but I cannot find anything that matches what I need to do.

The image below shows three of my datasets. The data are bounded between 0 and 1 on the y axis, and fall within 0-0.10 on the x axis in practice (though they may exceed 0.10 in theory).

The shape and relative proportions of the data are probably the most important things to compare, but the absolute locations of each dataset matter too. In other words, the closer the individual points of one dataset lie to the individual points of another, the more similar the two datasets are; absolute positions should then be taken into account as a secondary factor.

Green and red should be considered very different, but if push came to shove, they should be considered more similar than blue and red.

http://img153.imageshack.us/img153/6730/screenshot20110204at004.png

I tried:

  • comparing based on overall kinks and deviations
  • dividing the points into coordinate regions (i.e. (0-0.10, 0-0.10), (0.10-0.20, 0.10-0.20), ..., (0.90-1.0, 0.90-1.0)) and comparing similarity based on shared points in those regions (a sketch of this approach follows the list)
  • measuring the average Euclidean distance to nearest neighbours across datasets
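For concreteness, here is a minimal sketch of the second approach above (the function name, grid size, and Jaccard scoring are my own illustrative choices, not from the question):

    import numpy as np

    def grid_overlap_similarity(A, B, nbins=10):
        # Hypothetical sketch: bin each dataset's points into an
        # nbins x nbins grid over [0, 1] x [0, 1], then score similarity
        # as the Jaccard overlap of the occupied cells.
        def occupied(points):
            ij = np.clip((np.asarray(points) * nbins).astype(int), 0, nbins - 1)
            grid = np.zeros((nbins, nbins), dtype=bool)
            grid[ij[:, 0], ij[:, 1]] = True
            return grid
        ga, gb = occupied(A), occupied(B)
        return (ga & gb).sum() / max((ga | gb).sum(), 1)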

All of them gave erroneous results. The closest thing I could find in my research is "Appropriate similarity metrics for several sets of 2D coordinates". However, the answer to that question involves comparing the average distances of nearest neighbours from the centroid, which I don't think will work for me, since direction is as important as distance for my purposes.

I might add that this will be used to generate input data for another program and will only be run sporadically (mainly to generate datasets with varying numbers of clusters), so somewhat time-consuming algorithms are not out of the question.

language-agnostic algorithm cluster-analysis graphics similarity
2 answers

In two stages

1) First: tell the blues apart.

Compute each point's distance to its nearest neighbour and average them. Then choose a cutoff, for example the black distance in the following image:

[image]

Blue configurations, being more scattered, will give you much larger values than red and green.
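A minimal sketch of this first step, assuming scipy (the function name and the cutoff are my own, not from the answer):

    import numpy as np
    from scipy.spatial import cKDTree

    def mean_nn_distance(points):
        # k=2 because the nearest hit of each query point is itself;
        # column 1 holds the true nearest-neighbour distances.
        d, _ = cKDTree(points).query(points, k=2)
        return d[:, 1].mean()

    # Scattered (blue-like) datasets come out well above a cutoff
    # that red- and green-like datasets fall under:
    # is_blue = mean_nn_distance(points) > cutoff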

2) Second: tell the reds and greens apart.

Discard all points whose nearest-neighbour distance is greater than some smaller cutoff (for example, a quarter of the previous one). Then cluster by proximity to get shape clusters:

[image] and [image]

Drop clusters with fewer than 10 points (or so). For each cluster, perform a linear fit and compute the covariance. The mean covariance for red will be much higher than for green, since the greens are well aligned at this scale.
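A rough sketch of this second step, under my own assumptions: single-linkage clustering stands in for "cluster by proximity", and the residual of a per-cluster line fit stands in for the covariance computation; the thresholds are illustrative.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial import cKDTree

    def shape_cluster_spread(points, cutoff):
        # Drop isolated points, cluster the rest by proximity, then score
        # each surviving cluster by how far its points stray from a line.
        points = np.asarray(points, dtype=float)
        d, _ = cKDTree(points).query(points, k=2)
        dense = points[d[:, 1] <= cutoff / 4]        # quarter of the step-1 cutoff
        if len(dense) < 2:
            return 0.0
        labels = fcluster(linkage(dense, method="single"),
                          t=cutoff, criterion="distance")
        spreads = []
        for lab in np.unique(labels):
            c = dense[labels == lab]
            if len(c) < 10:                          # drop tiny clusters
                continue
            resid = np.polyfit(c[:, 0], c[:, 1], 1, full=True)[1]
            spreads.append(resid[0] / len(c) if resid.size else 0.0)
        return float(np.mean(spreads)) if spreads else 0.0   # high -> red-like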

There you go.

HTH!


Although belisarius answered it well, here are a few comments:

If you can reduce each set of 1000 points to, say, 32 clusters of roughly 32 points each (or 20 x 50, or ...), then you can work in 32-space instead of 1000-space. Try k-means clustering for this; see also SO questions/tagged/k-means.
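For instance, with scipy's kmeans (a sketch; the reduce_dataset name and k=32 are illustrative choices):

    import numpy as np
    from scipy.cluster.vq import kmeans

    def reduce_dataset(points, k=32):
        # Summarize up to 1000 2D points by k cluster centres,
        # so each dataset lives in k-space instead of 1000-space.
        centres, distortion = kmeans(np.asarray(points, dtype=float), k)
        return centres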

One way to measure the distance between sets A, B (of points, or of cluster centres) is to take nearest pairs:

    import numpy as np
    from scipy.spatial import KDTree  # http://docs.scipy.org/doc/scipy/reference/spatial.html

    def nearestpairsdistance(A, B, p=np.inf, eps=0.1, verbose=False):
        """ large point sets A, B -> nearest b for each a, nearest a for each b """
        Atree = KDTree(A)
        Btree = KDTree(B)
        a_nearestb, ixab = Btree.query(A, k=1, p=p, eps=eps)  # p=inf is fast
        b_nearesta, ixba = Atree.query(B, k=1, p=p, eps=eps)
        if verbose:
            print("a_nearestb quantiles:", np.percentile(a_nearestb, [0, 25, 50, 75, 100]))
            print("b_nearesta quantiles:", np.percentile(b_nearesta, [0, 25, 50, 75, 100]))
        return (np.median(a_nearestb) + np.median(b_nearesta)) / 2
            # means are sensitive to outliers; fast approx median?
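A quick illustrative usage, on random data matching the question's ranges:

    A = np.random.rand(1000, 2) * [0.10, 1.0]   # x in 0-0.10, y in 0-1
    B = np.random.rand(1000, 2) * [0.10, 1.0]
    print(nearestpairsdistance(A, B, verbose=True))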

Then you can cluster your 2000 points-in-32-space down to 20 cluster centres in one shot:

 centres, labels = kmeans( points, k=20, iter=3, distance=nearestpairsdistance ) 

(the usual Euclidean distance will not work here at all).
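Note that off-the-shelf k-means implementations (e.g. scipy.cluster.vq.kmeans) support Euclidean distance only, so the call above is schematic; with a set-level metric like nearestpairsdistance you would use a medoid-based variant instead. A minimal, illustrative sketch (kmedoids and its parameters are my naming, not a library API; items is the list of per-dataset centre sets from above):

    import numpy as np

    def kmedoids(items, k=20, iters=3, distance=nearestpairsdistance):
        # Lloyd-style iteration on medoids: assign each item to its nearest
        # medoid, then re-pick each medoid as the member minimizing total
        # within-group distance.  Quadratic in group size, but fine for
        # ~2000 datasets used only sporadically.
        rng = np.random.default_rng(0)
        medoids = list(rng.choice(len(items), size=k, replace=False))
        for _ in range(iters):
            labels = [min(range(k), key=lambda j: distance(x, items[medoids[j]]))
                      for x in items]
            for j in range(k):
                members = [i for i, lab in enumerate(labels) if lab == j]
                if members:
                    medoids[j] = min(members, key=lambda m: sum(
                        distance(items[i], items[m]) for i in members))
        return medoids, labels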

Please follow up: tell us what worked in the end, and what didn't.

