How to efficiently calculate the similarity between documents in a document stream

I am collecting text documents (in Node.js) where each document i is represented as a list of words. What is an efficient way to calculate the similarity between these documents, given that new documents keep arriving as a kind of document stream?

Currently, I am using cosine similarity on the normalized word frequencies of each document. I do not use TF-IDF (term frequency, inverse document frequency) because of the scalability problem it creates as I receive more and more documents.
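The current approach can be sketched as follows: each document becomes a term-frequency map, and similarity is the dot product divided by the two norms. (The helper names here are illustrative, not the poster's actual code.)

```javascript
// Build a term-frequency map from a list of words.
function termFreq(words) {
  const tf = new Map();
  for (const w of words) tf.set(w, (tf.get(w) || 0) + 1);
  return tf;
}

// Euclidean norm of a frequency map.
function norm(tf) {
  let s = 0;
  for (const v of tf.values()) s += v * v;
  return Math.sqrt(s);
}

// Cosine similarity between two documents given as word lists.
function cosineSim(wordsA, wordsB) {
  const a = termFreq(wordsA), b = termFreq(wordsB);
  let dot = 0;
  for (const [w, v] of a) if (b.has(w)) dot += v * b.get(w);
  return dot / (norm(a) * norm(b));
}
```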

Originally

My first version was to start with the documents available at the time, build the large term-document matrix A, and then compute S = A^T x A so that S(i, j) (after normalizing by norm(doc(i)) and norm(doc(j))) is the cosine similarity between documents i and j, whose word lists are doc(i) and doc(j) respectively.
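A minimal sketch of this batch approach, with L2-normalized columns so that the Gram matrix S = A^T A directly contains the cosine similarities (function names are my own, not from the question):

```javascript
// Build the term-document matrix A: rows are vocabulary terms, columns
// are documents, and each column is L2-normalized.
function buildMatrix(docs) {
  const vocab = [...new Set(docs.flat())];
  const index = new Map(vocab.map((w, i) => [w, i]));
  const A = Array.from({ length: vocab.length }, () => new Array(docs.length).fill(0));
  docs.forEach((doc, j) => {
    for (const w of doc) A[index.get(w)][j] += 1;
    const n = Math.sqrt(A.reduce((s, row) => s + row[j] * row[j], 0));
    for (const row of A) row[j] /= n;
  });
  return A;
}

// S = A^T A: S[i][j] is the cosine similarity of documents i and j.
function gram(A) {
  const n = A[0].length;
  const S = Array.from({ length: n }, () => new Array(n).fill(0));
  for (const row of A)
    for (let i = 0; i < n; i++)
      for (let j = 0; j < n; j++) S[i][j] += row[i] * row[j];
  return S;
}
```

This is the O(n^2 · |vocab|) computation that becomes the bottleneck below.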

For new documents

What should I do when a new document doc(k) arrives? I have to calculate the similarity of this document with all the previous ones, which does not require rebuilding the whole matrix. I can just take the inner product doc(k) dot doc(j) for every previous j, and this gives me S(k, j), which is great.
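The streaming update can be sketched like this: keep each previous document as a normalized sparse vector, and for a new document compute only its dot products against the stored vectors, i.e. one new row of S. (This is an illustrative structure, not the poster's actual implementation.)

```javascript
const stored = []; // normalized sparse vectors of all documents seen so far

// Turn a word list into an L2-normalized sparse vector (a Map).
function normalizedVector(words) {
  const v = new Map();
  for (const w of words) v.set(w, (v.get(w) || 0) + 1);
  let n = 0;
  for (const x of v.values()) n += x * x;
  n = Math.sqrt(n);
  for (const [w, x] of v) v.set(w, x / n);
  return v;
}

// Returns S(k, j) for all previous j, then stores doc k.
function addDocument(words) {
  const vk = normalizedVector(words);
  const row = stored.map(vj => {
    let dot = 0;
    // Iterate over the smaller map for speed.
    const [small, big] = vk.size <= vj.size ? [vk, vj] : [vj, vk];
    for (const [w, x] of small) if (big.has(w)) dot += x * big.get(w);
    return dot;
  });
  stored.push(vk);
  return row;
}
```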

Problems

  • The calculation of S in Node.js is really slow. Far too slow! So I wrote a C++ module that does all of it much faster. And it does! But I cannot afford to wait for it, because I need intermediate results in the meantime. By "not wait" I mean both

    a. waiting for the computation to complete, but also b. waiting for matrix A to be built (it is large).

  • Computing the new row S(k, j) can exploit the fact that each document contains far fewer words than the whole vocabulary (which I use to build the entire matrix A). That way it can actually be faster in Node.js, while avoiding a lot of extra data-access overhead.
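One way to push this sparsity observation further (my own suggestion, not something the poster describes) is an inverted index from each word to the documents containing it: a new document then only needs to be compared against documents that share at least one term, since every other similarity is zero by construction.

```javascript
const postings = new Map(); // word -> Set of document ids containing it
const vectors = [];         // docId -> normalized sparse vector (Map)

// Add a document; returns a Map of docId -> cosine similarity
// (documents sharing no term are omitted: their similarity is 0).
function addDoc(words) {
  // Build the L2-normalized sparse vector.
  const v = new Map();
  for (const w of words) v.set(w, (v.get(w) || 0) + 1);
  let n = 0;
  for (const x of v.values()) n += x * x;
  n = Math.sqrt(n);
  for (const [w, x] of v) v.set(w, x / n);

  // Gather candidate documents via the inverted index.
  const candidates = new Set();
  for (const w of v.keys())
    for (const id of postings.get(w) || []) candidates.add(id);

  // Compute similarities only for candidates.
  const sims = new Map();
  for (const id of candidates) {
    let dot = 0;
    for (const [w, x] of v) {
      const y = vectors[id].get(w);
      if (y !== undefined) dot += x * y;
    }
    sims.set(id, dot);
  }

  // Register the new document in the index.
  const id = vectors.length;
  vectors.push(v);
  for (const w of v.keys()) {
    if (!postings.has(w)) postings.set(w, new Set());
    postings.get(w).add(id);
  }
  return sims;
}
```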

But is there a better way to do this?

Note: The reason I started calculating S this way is that I can easily build A in Node.js, where I have access to all the data, then do the matrix multiplication in C++ and return the result to Node.js, which speeds everything up considerably. But now that computing the full S has become infeasible, this approach looks useless.

Note 2: Yes, I do not need to calculate the entire S; since it is symmetric, the upper (or lower) triangle is enough. But that is not the question: the computation time is still of the wrong order.

1 answer

If you need something that works today, just use pre-trained word vectors from fastText or word2vec.
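A common way to use such vectors (an assumption on my part; the answer does not spell it out) is to take the mean of a document's word vectors as its document vector and compare documents by cosine similarity. The tiny 2-dimensional embedding table below is a toy stand-in for real pre-trained fastText/word2vec vectors:

```javascript
// Toy stand-in for a table of pre-trained word vectors.
const embeddings = new Map([
  ['cat', [1, 0]],
  ['dog', [0.9, 0.1]],
  ['car', [0, 1]],
]);

// Document vector = mean of its word vectors (OOV words are skipped).
function docVector(words) {
  const dim = embeddings.values().next().value.length;
  const v = new Array(dim).fill(0);
  let count = 0;
  for (const w of words) {
    const e = embeddings.get(w);
    if (!e) continue;
    for (let i = 0; i < dim; i++) v[i] += e[i];
    count++;
  }
  return count ? v.map(x => x / count) : v;
}

// Cosine similarity between two dense vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}
```

This sidesteps the growing-vocabulary problem entirely, since the embedding dimension is fixed no matter how many documents arrive.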

