Using k-means to cluster documents: should clustering be based on cosine similarity or on the term vectors?

Question

Sorry if the answer to this is obvious; please be kind, this is my first time here :-)

I would appreciate it if someone could give me guidance on the appropriate input structure for k-means. I am working on a master's thesis in which I propose a new approach to TF-IDF term weighting that is specific to my domain. I want to use k-means to cluster the results, and then apply a series of internal and external evaluation criteria to find out whether my new term weighting method makes any sense.

My steps so far (implemented in PHP), all working:

Step 1: Read the document collection
Step 2: Clean the document collection, feature extraction, feature selection
Step 3: Term frequency (TF)
Step 4: Inverse document frequency (IDF)
Step 5: TF * IDF
Step 6: Normalize the TF-IDF vectors to a fixed length
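Steps 3 to 6 above can be sketched roughly as follows. This is only an illustration, written in Python for brevity rather than the PHP used in the question. It assumes the documents are already tokenized, uses the plain `log(N / df)` form of IDF, and normalizes to unit (L2) length; the function name `tfidf_vectors` and the toy documents are made up for the example, and a custom term weighting would replace the `idf` line.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, L2-normalized TF-IDF vectors)."""
    vocab = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for doc in docs for t in set(doc))
    # Assumed IDF variant: log(N / df). A domain-specific weighting would go here.
    idf = {t: math.log(n_docs / df[t]) for t in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec))
        # Scale to unit length; an all-zero vector is left unchanged.
        vectors.append([x / norm for x in vec] if norm > 0 else vec)
    return vocab, vectors

# Hypothetical toy corpus, already cleaned and tokenized.
docs = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
    ["cherry", "cherry", "date"],
]
vocab, vectors = tfidf_vectors(docs)
```

Each row of `vectors` is one document, with one column per term in `vocab`.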

Where I am stuck

Step 7: The Vector Space Model - Cosine Similarity

The only examples I can find compare an input query against each document to find the similarities. When there is no input query (this is not an information retrieval system), do I compare every document in the corpus with every other document in the corpus (every pair of documents)? I cannot find any example of cosine similarity applied to a complete document collection rather than to a single example/query compared against the collection.
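For illustration, comparing every pair of documents could look like the sketch below. It is not taken from any answer in this thread; the helper names `cosine` and `pairwise_cosine` and the three toy vectors are assumptions made for the example. The result is a symmetric N x N similarity matrix.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def pairwise_cosine(vectors):
    """Symmetric matrix holding the cosine similarity of every document pair."""
    n = len(vectors)
    # The diagonal is set to 1.0, which assumes no document vector is all zeros.
    sim = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    # Each unordered pair is computed once and mirrored.
    for i, j in combinations(range(n), 2):
        sim[i][j] = sim[j][i] = cosine(vectors[i], vectors[j])
    return sim

# Hypothetical document vectors (for example, TF-IDF rows).
vectors = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
sim = pairwise_cosine(vectors)
```

Note that this needs N * (N - 1) / 2 comparisons, so the matrix grows quadratically with the size of the collection.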

Step 8: K-Means

, , k- ( ). k- . , , k-, . , ..

- K- , - .

- , .

php cluster-analysis k-means cosine-similarity tf-idf

Claire McMahon, May 11 '15 at 12:51

Anony-Mousse · Answer 1 · 2015-05-11T13:06:47+0000

K- .

k- , .

k -, : L2.

, k- . k- , .

, PHP. . -, .

carence · Answer 2 · 2015-05-11T13:24:12+0000

I Anony-Mousse , PHP Python, :

Numpy: .

SciPy: k-: .

Theano: , .

k . Python. , , , , , , , .

Claire McMahon · Answer 3 · 2015-05-12T12:17:46+0000

- , , k- , ( ), . , , , , k-, , , , , k- , 0 . .

Computergodzilla · Answer 4 · 2015-05-17T18:21:44+0000

Use the TF-IDF vectors to calculate cosine similarity. Use the cosine similarity scores as input to your clustering algorithm.
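One possible way to let cosine similarity drive a k-means-style clustering is sketched below. This is an assumption about how the suggestion might be applied, not something spelled out in the answer. It is a spherical k-means variant rather than textbook k-means: documents and centroids are kept at unit length, and each document goes to the centroid with the highest dot product. For unit vectors, the squared Euclidean distance equals 2 - 2 * cos(a, b), so the nearest centroid in Euclidean terms is also the most cosine-similar one. The function names, the toy vectors, and the fixed starting centroids are all made up for the example.

```python
import math

def normalize(v):
    """Scale a vector to unit (L2) length; all-zero vectors are returned as-is."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def spherical_kmeans(vectors, init_indices, max_iter=20):
    """Cluster by cosine similarity. init_indices picks the starting centroids."""
    points = [normalize(v) for v in vectors]
    centroids = [points[i] for i in init_indices]
    labels = None
    for _ in range(max_iter):
        # Assignment step: for unit vectors the dot product is the cosine similarity.
        new_labels = [
            max(range(len(centroids)), key=lambda c: dot(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: mean of the members, rescaled to unit length.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return labels, centroids

# Hypothetical document vectors: two near the first axis, two near the third.
vectors = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [0.0, 0.1, 1.0], [0.1, 0.0, 0.9]]
labels, centroids = spherical_kmeans(vectors, init_indices=[0, 2])
```

The starting centroids are fixed here only to keep the example deterministic; in practice they would usually be chosen at random or by a seeding heuristic, and the run repeated several times.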

Eyes charming · Answer 5 · 2015-05-30T12:19:38+0000

Look .. Simple search: vector space model
