Systematic threshold for cosine similarity with TF-IDF weights

I am analyzing several thousand (say, 10,000) text documents. I computed TF-IDF weights and now have a matrix of pairwise cosine similarities. I want to treat the documents as a graph, both to analyze various graph properties (for example, the path length separating groups of documents) and to visualize the connections as a network.

The problem is that there are too many similarities, and most of them are too small to be meaningful. I see that many people deal with this by dropping all similarities below a certain threshold, for example, everything below 0.5.

However, 0.5 (or 0.6, or 0.7, etc.) is an arbitrary cutoff, and I am looking for more objective or systematic methods of getting rid of the tiny similarities.

I am open to many different strategies. For example, is there an alternative to tf-idf that would make most of the small similarities 0? Are there other methods for keeping only the significant similarities?

1 answer

In short, take the average cosine similarity of an initial clustering, or even of all the initial sentences, and accept or reject clusters based on something like the following.

- Loosely speaking, alpha can range from 1.5 (about the 86th percentile, if the similarities are normally distributed) up to 3 (the 99.9th percentile, i.e., an outlier), depending on how aggressively you want to prune. Pairs below the resulting cutoff are rejected; pairs above it are accepted.

In informal notation, the threshold is:

average(cosine_similarities) + alpha * standard_deviation(cosine_similarities)
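
For concreteness, here is a minimal sketch of that cutoff. It assumes scikit-learn and NumPy (neither is named in the answer), and names like alpha and adjacency are illustrative, not from the original:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["first document ...", "second document ...", "third document ..."]

    tfidf = TfidfVectorizer().fit_transform(docs)   # n_docs x n_terms TF-IDF matrix
    sims = cosine_similarity(tfidf)                 # n_docs x n_docs cosine similarities

    # Estimate the distribution from the upper triangle only, so each
    # pair is counted once and self-similarities are excluded.
    pairwise = sims[np.triu_indices_from(sims, k=1)]

    alpha = 1.5   # ~86th percentile under normality; 3 would keep only outliers
    threshold = pairwise.mean() + alpha * pairwise.std()

    # Zero out weak edges; what remains is the graph's weighted adjacency.
    adjacency = np.where(sims >= threshold, sims, 0.0)
    np.fill_diagonal(adjacency, 0.0)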

For what it's worth, I did something similar myself using NLTK. I clustered sentences by their pairwise similarity. At first I used simple 1-gram overlap. Later I moved to LSA and LDA. An alpha of 1.5 standard deviations worked well for me, scoring pairs as 1 + the Wu-Palmer similarity (so the scores stay strictly positive) and using K-means for the initial clustering.
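
As an aside, here is what that 1 + Wu-Palmer score might look like with NLTK's WordNet interface (the helper name shifted_wup is hypothetical, and the wordnet corpus must be downloaded first):

    from nltk.corpus import wordnet as wn

    def shifted_wup(word_a, word_b):
        """Best Wu-Palmer similarity between any senses of two words, plus 1."""
        scores = [sa.wup_similarity(sb)
                  for sa in wn.synsets(word_a)
                  for sb in wn.synsets(word_b)]
        scores = [s for s in scores if s is not None]   # wup_similarity can return None
        return 1 + max(scores) if scores else None

    print(shifted_wup("dog", "cat"))   # roughly 1.86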

One word of warning about performance: pairwise comparison is quadratic in the number of documents, so 10,000 documents already means a very large number of similarity computations. My own corpus was about 15,000 sentences, and a run over 20,000 took on the order of 20 minutes, with most of that time spent in the WordNet API calls. The same caveat applies to SentiWordNet.

Finally, if you want a more formal cutoff than mean-plus-alpha-sigma, use a confidence interval.

Rather than fixing alpha by hand, you can compute a confidence interval over the similarities and keep only the pairs that fall significantly above the mean. I did this with t-statistics over the Wu-Palmer / SOV scores and it held up well. Commons Math3 provides the distributions for Java/Scala, scipy does for Python, and R has them built in.

Xbar +/- t_(alpha/2) * sample_std / sqrt(sample_size)
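
A sketch of that interval using scipy (which the answer does name), reusing the pairwise array from the first snippet; the function name similarity_ci is illustrative:

    import numpy as np
    from scipy import stats

    def similarity_ci(x, confidence=0.95):
        """Two-sided t confidence interval for the mean similarity."""
        x = np.asarray(x, dtype=float)
        n = x.size
        t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
        half_width = t_crit * x.std(ddof=1) / np.sqrt(n)
        return x.mean() - half_width, x.mean() + half_width

    lo, hi = similarity_ci(pairwise)
    # Keep only pairs whose similarity lies significantly above the mean.
    significant = pairwise > hi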

Whichever variant you choose, the point is to derive the cutoff from the distribution of your own similarities rather than hard-coding 0.5 or 0.7: the right alpha depends on the corpus, on the similarity measure, and on how aggressively you want to prune. Hope this helps.
