While clustering documents, as a data preprocessing step I first applied a singular value decomposition to obtain U, S and Vt. Then, choosing an appropriate number of singular values, I truncated Vt, which now gives me a good document-document correlation, based on what I read here. I am now clustering on the columns of the Vt matrix to group similar documents together. For this I chose k-means, and the initial results looked acceptable to me (with k = 10 clusters), but I wanted to dig a little deeper into choosing the value of k itself. To determine the number of clusters k in k-means, it was suggested that I look at cross-validation.
Before implementing it myself, I wanted to find out whether there is a built-in way to do this with numpy or scipy. Currently, I run k-means by simply calling a function from scipy:
import numpy, scipy
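For concreteness, the pipeline described above might look roughly like the following. This is only a minimal sketch under assumptions: the matrix `A` is random stand-in data rather than a real term-document matrix, the truncation rank `r = 5` is an arbitrary illustrative choice, and `kmeans2` from `scipy.cluster.vq` is just one of the scipy k-means functions that could be used here.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(0)
# Stand-in term-document matrix: 50 terms x 30 documents (random, for illustration only).
A = rng.random((50, 30))

# Thin SVD: A = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(A, full_matrices=False)

r = 5                # number of singular values kept (assumed value)
Vt_r = Vt[:r, :]     # truncated Vt, shape (r, n_docs); each column is one document

# scipy's k-means functions expect one observation per row, so transpose.
docs = Vt_r.T        # shape (n_docs, r)

k = 10
centroids, labels = kmeans2(docs, k, minit="++")
# labels[i] is the cluster index assigned to document i
```

Note the transpose: since the documents are the columns of Vt, they have to be turned into rows before being passed to scipy.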
Assuming my methodology is correct so far (please correct me if I am missing a step), what is the standard way to use the output at this stage to perform cross-validation? Any references, implementations or suggestions on how this would be applied to k-means would be greatly appreciated.
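To make the question concrete, one possible reading of "cross-validation for k" is sketched below: fit the centroids on part of the documents and measure how far the held-out documents are from their nearest centroid. This is a hand-rolled sketch, not a built-in numpy or scipy facility. The helper `heldout_distortion`, the fold count and the candidate values of k are all hypothetical choices, and the data is a random stand-in for the truncated document vectors.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

def heldout_distortion(docs, k, n_folds=5, seed=0):
    """Average distance from held-out rows to the nearest centroid fitted on the other folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(docs)), n_folds)
    scores = []
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # Fit centroids on the training folds only.
        centroids, _ = kmeans(docs[train_idx], k)
        # Distance from each held-out row to its nearest centroid.
        _, dists = vq(docs[test_idx], centroids)
        scores.append(dists.mean())
    return float(np.mean(scores))

rng = np.random.default_rng(1)
docs = rng.random((60, 5))          # stand-in for the rows of Vt_r.T
candidate_ks = [2, 4, 6, 8, 10]     # assumed range of k to compare
scores = {k: heldout_distortion(docs, k) for k in candidate_ks}
for k, s in scores.items():
    print(k, round(s, 4))
```

The held-out distortion will often keep shrinking as k grows, so in practice one would probably look for the point where the improvement levels off rather than for a strict minimum.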