Using k-means for document clustering: should the clustering be on cosine similarity or on term vectors?

Sorry if the answer to this is obvious; please bear with me, this is my first time here :-)

I would appreciate it if someone could give me guidance on the appropriate input structure for k-means. I am working on a master's thesis in which I propose a new approach to TF-IDF term weighting that is specific to my domain. I want to use k-means to cluster the results, and then apply a series of internal and external evaluation criteria to find out whether my new term weighting method makes any sense.

My steps so far (implemented in PHP), all working:

Step 1: Read the document collection
Step 2: Clean the document collection, feature extraction, feature selection
Step 3: Term frequency (TF)
Step 4: Inverse document frequency (IDF)
Step 5: TF * IDF
Step 6: Normalize the TF-IDF vectors to a fixed length
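For readers following along, steps 3 to 6 can be sketched roughly as below. This is only an illustration in Python, not the asker's PHP code: the `tfidf_vectors` helper, the toy `docs` list, and the plain `log(N / df)` IDF variant are all assumptions, and a domain-specific weighting would replace the `weights` line.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one L2-normalized TF-IDF dict per tokenized document.

    Hypothetical helper: assumes raw counts for TF and log(N / df) for IDF.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # step 3: raw term counts
        # Steps 4-5: weight = TF * IDF
        weights = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        # Step 6: scale the vector to unit Euclidean (L2) length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors

# Toy, already-tokenized documents (made up for illustration).
docs = [
    ["cat", "sat", "mat"],
    ["cat", "ate", "fish"],
    ["dog", "ate", "bone"],
]
vectors = tfidf_vectors(docs)
```

Each entry of `vectors` is a sparse term-to-weight mapping with unit length, so terms that occur in fewer documents (such as "sat") end up with a larger weight than shared ones (such as "cat").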

Where I am stuck:

Step 7: Vector space model, cosine similarity

The only examples I can find compare an input query against each document to find similarities. Where there is no input query (this is not an information retrieval system), do I compare every document in the corpus with every other document in the corpus (each pair of documents)? I cannot find any example of cosine similarity applied to a full collection of documents rather than to a single example/query compared against the collection.
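To make the "each pair of documents" reading concrete, here is a minimal sketch of an all-pairs cosine similarity matrix. It is written in Python for brevity; the `cosine` and `pairwise_cosine` helpers and the `toy_vectors` data are hypothetical, and the documents are assumed to be sparse term-to-weight dicts.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty/all-zero vectors
    return dot / (norm_a * norm_b)

def pairwise_cosine(vectors):
    """Symmetric n x n matrix with the similarity of every document pair."""
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Cosine is symmetric, so only the upper triangle is computed.
        for j in range(i, n):
            s = cosine(vectors[i], vectors[j])
            sim[i][j] = s
            sim[j][i] = s
    return sim

# Toy weighted documents (made up for illustration).
toy_vectors = [
    {"cat": 1.0, "mat": 1.0},
    {"cat": 1.0, "fish": 1.0},
    {"dog": 1.0, "bone": 1.0},
]
sim = pairwise_cosine(toy_vectors)
```

Here `sim[0][1]` is about 0.5 (one shared term out of two) and `sim[0][2]` is 0 (no shared terms). Note that the matrix grows quadratically with the number of documents, which matters for a large corpus.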

Step 8: K-means

, , k- ( ). k- . , , k-, . , ..

- K- , - .

- , .


K- .

k- , .

k -, : L2.

, k- . k- , .

, PHP. . -, .


I Anony-Mousse , PHP Python, :

Numpy: .

SciPy: k-: .

Theano: , .

k . Python. , , , , , , , .


- , , k- , ( ), . , , , , k-, , , , , k- , 0 . .


Use the TF-IDF vectors to calculate cosine similarity. Then use the cosine similarity scores as input to your clustering algorithm.
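For what it's worth, when the TF-IDF vectors have been normalized to unit length (step 6 in the question), cosine similarity and squared Euclidean distance carry the same information, so clustering on the normalized term vectors and clustering on cosine similarity are closely related. A short derivation, assuming $\|a\| = \|b\| = 1$:

```latex
\|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2\, a \cdot b = 2 - 2\cos(a, b)
```

So a pair with cosine similarity 0.5 is at squared distance 1, and identical directions are at distance 0.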
