K-fold cross-validation to determine k in k-means?

In the process of clustering documents, as a preprocessing step, I first apply a singular value decomposition to obtain U, S and Vt; then, choosing an appropriate number of singular values, I truncate Vt, which now gives me good document-document correlations, following what I read here. I then cluster the columns of the Vt matrix to group similar documents together, and for this I chose k-means. The initial results looked acceptable to me (with k = 10 clusters), but I wanted to dig a little deeper into the choice of k itself. To determine the number of clusters k in k-means, it was suggested that I look at cross-validation.

Before implementing it, I wanted to find out whether there is a built-in way to achieve this using numpy or scipy. Currently, I run k-means by simply calling a function from scipy:

    import numpy
    from scipy.linalg import svd
    from scipy.cluster.vq import whiten, kmeans2

    # Preprocess the data and compute the SVD
    U, S, Vt = svd(A)  # A is the TF-IDF representation of the original term-document matrix

    # Obtain the document-document correlations from Vt;
    # the 50 is the threshold obtained after examining a scree plot of S
    docvectors = numpy.transpose(Vt[0:50, :])

    # Prepare the data and run k-means
    whitened = whiten(docvectors)
    res, idx = kmeans2(whitened, 10, iter=20)

Assuming my methodology is correct so far (please correct me if I have skipped a step), what is the standard way to use this output to perform cross-validation? Any references/implementations/suggestions on how this would apply to k-means would be greatly appreciated.

2 answers

In order to perform k-fold cross-validation, you need some quality measure to optimize. This can be either a classification metric, such as accuracy or F1, or a specialized one, such as the V-measure.

Even the clustering quality measures that I know of need labeled datasets ("ground truth") to work; the difference from classification is that you only need labels for part of your data for the evaluation, while the k-means algorithm can use all the data to determine the centroids and, therefore, the clusters.

The V-measure and several other evaluation metrics are implemented in scikit-learn, as well as generic cross-validation code and a "grid search" module that optimizes according to a specified evaluation metric using k-fold CV. Disclaimer: I am involved in scikit-learn development, although I did not write any of the code mentioned.
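As a rough illustration of the idea, here is a minimal sketch of selecting k by k-fold cross-validation with the V-measure, using scikit-learn. The synthetic `make_blobs` data and the helper `mean_v_measure` are stand-ins of my own; in the question's setting, X would be the truncated-SVD document vectors and y the labels of a labeled subset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import v_measure_score
from sklearn.model_selection import KFold

# Toy stand-in for the (labeled) document vectors
X, y = make_blobs(n_samples=300, centers=4, random_state=0)

def mean_v_measure(X, y, k, n_splits=5):
    """Average V-measure over k-fold CV for a given number of clusters."""
    scores = []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in kf.split(X):
        # Fit centroids on the training fold (labels are not needed here)
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[train_idx])
        # Score the held-out fold against its ground-truth labels
        pred = km.predict(X[test_idx])
        scores.append(v_measure_score(y[test_idx], pred))
    return np.mean(scores)

# Pick the k with the best cross-validated V-measure
best_k = max(range(2, 11), key=lambda k: mean_v_measure(X, y, k))
```

Note that only the scoring step uses labels; the k-means fit itself is unsupervised, which is the asymmetry described above.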


Indeed, to do traditional cross-validation with the F1-score or V-measure as the scoring function, you would need some labeled data as ground truth. But in that case, you could simply count the number of classes in the ground-truth dataset and use that as your optimal value of k, so there would be no need for cross-validation.

Alternatively, you could use a cluster stability measure as an unsupervised performance metric and do some kind of cross-validation procedure with it. However, this is not yet implemented in scikit-learn, although it is still on my personal to-do list.
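A minimal sketch of one such stability measure, under assumptions of my own (subsample fraction, number of pairs, agreement via the adjusted Rand index): fit k-means on two random subsamples, assign the full dataset with each model, and measure how well the two clusterings agree. A k that yields stable clusterings should score close to 1.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Toy stand-in for the document vectors (no labels required)
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

def stability(X, k, n_pairs=10, frac=0.8, seed=0):
    """Mean adjusted Rand index between k-means clusterings fitted on
    pairs of random subsamples and compared on the full dataset."""
    rng = np.random.RandomState(seed)
    n = len(X)
    scores = []
    for _ in range(n_pairs):
        a = rng.choice(n, int(frac * n), replace=False)
        b = rng.choice(n, int(frac * n), replace=False)
        la = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[a]).predict(X)
        lb = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[b]).predict(X)
        scores.append(adjusted_rand_score(la, lb))
    return np.mean(scores)
```

One caveat from the stability literature: small values of k can be trivially stable, so scores for different k should be compared with care rather than taken at face value.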

Further information on this approach can be found in the following answer on metaoptimize.com/qa. In particular, you should read "Clustering Stability: An Overview" by Ulrike von Luxburg.

