Combining TF-IDF (cosine similarity) with PageRank?

Given a query, I have a cosine similarity score for each document. I also have a PageRank score for each document. Is there a standard, well-established way to combine the two?

I thought about multiplying them:

Total_Score = cosine-score * pagerank 

The reasoning: if either the PageRank or the cosine score is near zero, the document is not interesting.

Or is a weighted sum preferable?

 Total_Score = weight1 * cosine-score + weight2 * pagerank 

Is that better? With a weighted sum, a document could have a zero cosine score but a high PageRank, and still show up among the results.
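To make the behavioral difference concrete, here is a minimal sketch of both combinations (Python; the weights 0.7/0.3 are arbitrary placeholders for illustration, not recommendations):

```python
def product_score(cosine_score, pagerank):
    """Multiplicative combination: a zero in either score zeroes the total."""
    return cosine_score * pagerank

def weighted_sum_score(cosine_score, pagerank, w1=0.7, w2=0.3):
    """Additive combination: a document can rank on PageRank alone."""
    return w1 * cosine_score + w2 * pagerank

# A document that matches the query well but has low PageRank:
print(product_score(0.9, 0.1))       # ~0.09
print(weighted_sum_score(0.9, 0.1))  # ~0.66

# A document with zero query match but high PageRank:
print(product_score(0.0, 0.8))       # 0.0  -- never surfaces
print(weighted_sum_score(0.0, 0.8))  # ~0.24 -- still surfaces
```

The last two lines show exactly the trade-off in question: the product filters out zero-cosine documents, while the weighted sum lets a popular but irrelevant page leak into the results.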

+6
4 answers

A weighted sum is probably better as the ranking rule.

It helps to break the problem into a retrieval/filtering step and a ranking step; the issue you identified with the weighted-sum method then no longer occurs.

The process described in this paper by Sergey Brin and Lawrence Page uses a variant of the vector cosine model for retrieval and appears to use a weighted sum for ranking, where the weights are determined by user feedback (see section 4.5.1). With this approach, a document with a zero cosine score never makes it past the retrieval/filtering stage, and is therefore never considered for ranking.
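A minimal sketch of this two-stage approach (Python; `min_cosine`, the equal weights, and the dict-based inputs are all illustrative assumptions, not anything prescribed by the paper):

```python
def search(query_scores, pagerank, w1=0.5, w2=0.5, min_cosine=1e-6):
    """Two-stage retrieval: filter on the cosine score, then rank by a
    weighted sum of cosine score and PageRank.

    query_scores: dict mapping doc id -> cosine score for the current query.
    pagerank:     dict mapping doc id -> PageRank score.
    """
    # Stage 1: retrieval/filtering -- drop documents with (near-)zero cosine.
    candidates = [d for d, c in query_scores.items() if c > min_cosine]
    # Stage 2: ranking -- weighted sum over the surviving candidates only.
    return sorted(
        candidates,
        key=lambda d: w1 * query_scores[d] + w2 * pagerank.get(d, 0.0),
        reverse=True,
    )
```

With `query_scores = {"a": 0.9, "b": 0.0, "c": 0.4}` and `pagerank = {"a": 0.1, "b": 0.9, "c": 0.8}`, document `b` is filtered out despite its high PageRank, and the remaining documents are ranked `["c", "a"]`.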

+1

As I understand it, you are trading off relevance against importance. This is a multi-objective optimization problem.

I think your second solution will work. It is known as linear scalarization. The open question is how to choose the weights; methods for this exist under various philosophies and are somewhat subjective, depending on the priority of each objective in your setting. In fact, how to choose the weights in such problems is an active area of mathematical research, so it is hard to say which model or method suits your case best. You may want to follow the wiki links above, look for principles that apply to problems like yours, and then use them to answer your own question.
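One pragmatic (and entirely hypothetical) way to pick the scalarization weight is a grid search against judged queries; this sketch uses precision@k as a stand-in evaluation metric, and all the data and function names are invented for illustration:

```python
def precision_at_k(ranking, relevant, k=3):
    """Fraction of the top-k ranked doc ids that a human judged relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def tune_weight(docs, relevant, steps=11, k=3):
    """Grid-search the scalarization weight w in [0, 1], scoring each
    document as w * cosine + (1 - w) * pagerank.

    docs: list of (doc_id, cosine_score, pagerank) tuples for one query.
    relevant: set of doc ids judged relevant for that query.
    """
    best_w, best_p = 0.0, -1.0
    for i in range(steps):
        w = i / (steps - 1)
        ranking = sorted(docs, key=lambda d: w * d[1] + (1 - w) * d[2],
                         reverse=True)
        p = precision_at_k([d[0] for d in ranking], relevant, k)
        if p > best_p:  # keep the first weight that achieves the best score
            best_w, best_p = w, p
    return best_w, best_p
```

In practice one would average the metric over many judged queries rather than a single one, and use a graded metric such as NDCG, but the search-over-weights idea is the same.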

0

You could use the harmonic mean. With the harmonic mean of the two scores, the values are essentially averaged, but a low score pulls the result down far more than it would in an arithmetic mean.

You can use:

 Total_Score = 2*(cosine-score * pagerank) / (cosine-score + pagerank) 

Say PageRank scores 0.1 and cosine scores 0.9: the arithmetic mean of the two is (0.1 + 0.9)/2 = 0.5, while the harmonic mean is 2*(0.9*0.1)/(0.9 + 0.1) = 0.18.
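As a quick sanity check of those numbers (Python; the zero-denominator guard is an added safety detail, not part of the original formula):

```python
def harmonic_mean(cosine_score, pagerank):
    """Harmonic mean of two scores; defined as 0.0 when both scores are 0."""
    if cosine_score + pagerank == 0:
        return 0.0
    return 2 * (cosine_score * pagerank) / (cosine_score + pagerank)

print((0.1 + 0.9) / 2)          # arithmetic mean: 0.5
print(harmonic_mean(0.9, 0.1))  # harmonic mean: ~0.18
```

Note that the harmonic mean, like the plain product, sends the combined score to zero whenever either input is zero, so it shares the filtering behavior of the multiplicative approach while staying on the same scale as the inputs.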

0

I cannot imagine a single case where this would be useful. PageRank measures how "important" a document is via its links to other important documents (I assume you mean a graph whose edges are document-to-document links based on term matches; if you mean something else, please specify).

The cosine score is a measure of similarity between two documents. So your idea is to combine a pairwise metric with a per-node metric to find only the important documents similar to a given document? Why not just run PageRank on that document's ego network?

-1
