Calculation of IDF (as in TF-IDF) during testing?

Question

As I understand it, IDF reflects how many documents contain a term (roughly speaking). You can compute the IDF (along with TF) on the training set, since you have all the documents in advance. But what if I don't have the test set in advance and instead receive test documents one at a time (for example, from a web crawler)? How do I then compute the IDF for the words in a test document?

+5

text classification information-retrieval tf-idf

Killbill Apr 11 '12 at 14:39

2 answers

If you only run tests after indexing/crawling a whole batch of documents, you can compute the IDF once the crawl is complete. You do not need to recompute the IDF every time you come across a new document or a new term. You can compute it on the fly when you need it for TF-IDF or other calculations.
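To illustrate the "on the fly" idea, here is a minimal sketch (not part of the original answer): keep only the document count and per-term document frequencies while crawling, and derive the IDF when it is requested. The class name `IncrementalIdf` and the smoothed formula `log((1 + N) / (1 + df)) + 1` are assumptions chosen for the example; other IDF variants work the same way.

```python
import math
from collections import Counter


class IncrementalIdf:
    """Hypothetical helper: stores counts only; IDF is derived on request."""

    def __init__(self):
        self.num_docs = 0          # N: documents seen so far
        self.doc_freq = Counter()  # df: number of documents containing each term

    def add_document(self, tokens):
        # Each term counts once per document, however often it occurs in it.
        self.num_docs += 1
        self.doc_freq.update(set(tokens))

    def idf(self, term):
        # Assumed smoothed variant; the +1 terms avoid division by zero
        # for a term that has not been seen in any document yet.
        df = self.doc_freq[term]
        return math.log((1 + self.num_docs) / (1 + df)) + 1.0


index = IncrementalIdf()
index.add_document(["tf", "idf", "tf"])
index.add_document(["tf", "crawler"])
print(index.idf("tf"), index.idf("crawler"))
```

Because only counts are stored, adding a document is cheap and no IDF value ever goes stale; the cost is one logarithm per lookup.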

0

Felipe Hummel 11 . '12 20:52

MRFS · Accepted Answer · 2012-05-03T20:54:48+0000

In this situation, if your data set is large enough, you can use only the training set to compute the IDF. At the testing stage, if a term in the test document also appears in the training set, use its IDF from training; if the term is new, compute its IDF from the number of documents in the training set. In some cases, smoothing techniques can give better results.
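A rough sketch of this approach, under stated assumptions: the function names `fit_idf` and `tfidf` are hypothetical, documents are assumed to be pre-tokenized lists of strings, and add-one smoothing is used as one possible smoothing choice (the answer does not name a specific technique).

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Compute IDF from the training set only (hypothetical helper)."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for tokens in train_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    # Assumed add-one smoothing: log((1 + N) / (1 + df)).
    idf = {t: math.log((1 + n_docs) / (1 + df)) for t, df in doc_freq.items()}
    # A term never seen in training has df = 0, so only N is needed.
    unseen_idf = math.log((1 + n_docs) / 1)
    return idf, unseen_idf


def tfidf(tokens, idf, unseen_idf):
    """Weight a test document using the training-set IDF values."""
    tf = Counter(tokens)
    return {t: count * idf.get(t, unseen_idf) for t, count in tf.items()}


train = [["cat", "sat"], ["cat", "ran"], ["dog", "ran"]]
idf, unseen_idf = fit_idf(train)
print(tfidf(["cat", "cat", "zebra"], idf, unseen_idf))
```

The IDF table is frozen after training, so test documents arriving one at a time never change the weights of earlier ones.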
