Sorry if the answer to this is obvious; please bear with me, this is my first time here :-)
I would appreciate it if someone could give me guidance on the appropriate input structure for k-means. I am working on a master's thesis in which I propose a new approach to TF-IDF term weighting that is specific to my domain. I want to use k-means to cluster the results, and then apply a series of internal and external evaluation criteria to find out whether my new term weighting method makes any sense.
My steps so far (implemented in PHP, all working):
Step 1: Read the document collection
Step 2: Clean the document collection, feature extraction, feature selection
Step 3: Term frequency (TF)
Step 4: Inverse document frequency (IDF)
Step 5: TF * IDF
Step 6: Normalize the TF-IDF vectors to a fixed length
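To make Steps 3 to 6 concrete, here is a minimal sketch in Python (not my actual PHP code). It assumes the documents are already tokenized, uses the plain `log(N / df)` form of IDF, and normalizes each vector to unit (L2) length; the function name `tfidf_vectors` and the toy documents are made up for illustration, and my own weighting differs from this baseline.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocab, unit-length TF-IDF vectors)."""
    vocab = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for doc in docs for t in set(doc))
    # Assumption: plain IDF = log(N / df); smoothed variants are also common.
    idf = {t: math.log(n_docs / df[t]) for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1  # guard against empty documents
        # TF (count scaled by document length) times IDF, one weight per vocab term.
        vec = [(counts[t] / length) * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in vec))
        # Assumption: normalize to unit (L2) length; all-zero vectors are left as-is.
        vectors.append([w / norm for w in vec] if norm > 0 else vec)
    return vocab, vectors

# Hypothetical toy collection, already cleaned and tokenized.
docs = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
    ["cherry", "cherry", "date"],
]
vocab, vectors = tfidf_vectors(docs)
```

Each row of `vectors` is one document, and each column corresponds to one term in `vocab`.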
Where I am stuck:
Step 7: Vector Space Model - Cosine Similarity
The only examples I can find compare an input query with each document and compute the similarities. When there is no input query (this is not an information retrieval system), do I compare each document in the corpus with every other document in the corpus (every pair of documents)? I cannot find any example of cosine similarity applied to a complete collection of documents, rather than to a single example / query compared against the collection.
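To show what I mean by comparing every pair of documents, here is a rough sketch in Python (again, not my PHP code). The names `cosine` and `pairwise_cosine` and the toy vectors are invented for illustration. Whether this N x N matrix is the right thing to compute at this point is exactly what I am unsure about.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def pairwise_cosine(vectors):
    """Symmetric N x N matrix of cosine similarities between all document pairs."""
    n = len(vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Cosine is symmetric, so each pair (i, j) is computed once and mirrored.
        for j in range(i, n):
            sim[i][j] = sim[j][i] = cosine(vectors[i], vectors[j])
    return sim

# Hypothetical document vectors (e.g. normalized TF-IDF rows from Step 6).
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
similarity = pairwise_cosine(toy_vectors)
```

Here `similarity[i][j]` holds the similarity between document `i` and document `j`, so no query is involved.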
Step 8: K-Means