Calculation of IDF (as in TF-IDF) during testing?

As I understand it, IDF reflects how rare a term is, based on the number of documents that contain it (roughly speaking). You can calculate the IDF (along with TF) on the training set, since you have all the documents in advance. But what if I don't have the test set in advance and instead receive test documents one at a time (for example, from a web crawler)? How do I calculate the IDF for the words in a document at test time?

+5
2 answers

In this situation, if your training set is large enough, you can compute the IDF from the training set alone. At test time, if a term occurred in the training set, use the IDF it had in training; if the term is new, calculate its IDF from the number of documents in the training set. For some purposes, you can use smoothing techniques to get better results.
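A minimal sketch of this idea, assuming documents are already tokenized into lists of strings. The function names (`fit_document_frequencies`, `smoothed_idf`, `tfidf_vector`) are made up for illustration, and the add-one smoothing shown is just one common choice, not necessarily the one meant above:

```python
import math
from collections import Counter

def fit_document_frequencies(train_docs):
    """Count, for each term, how many training documents contain it."""
    df = Counter()
    for tokens in train_docs:
        df.update(set(tokens))  # set(): count each term once per document
    return len(train_docs), df

def smoothed_idf(term, n_docs, df):
    """Add-one smoothed IDF; an unseen term has df = 0 and still gets a finite value."""
    return math.log((1 + n_docs) / (1 + df.get(term, 0))) + 1.0

def tfidf_vector(tokens, n_docs, df):
    """TF-IDF weights for one incoming test document, using training statistics only."""
    tf = Counter(tokens)
    return {t: (c / len(tokens)) * smoothed_idf(t, n_docs, df) for t, c in tf.items()}

# Hypothetical toy data: three training documents.
train_docs = [["a", "b"], ["a", "c"], ["a"]]
n_docs, df = fit_document_frequencies(train_docs)

# A test document that arrives later; "zzz" never occurred in training.
weights = tfidf_vector(["a", "zzz"], n_docs, df)
```

Here the unseen term `zzz` receives the highest possible IDF for this training set, because only the training document count enters its calculation.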

+2

If you only run tests after indexing or crawling a whole batch of documents, you can calculate the IDF once the crawl is complete. You do not need to recalculate the IDF every time you come across a new document or a new term. You can calculate it on the fly when you need it for TF-IDF or other calculations.
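One way to make "on the fly" concrete is to store only the raw counts while crawling and derive the IDF at query time. This is a sketch under that assumption; the class `RunningIdf` and its methods are hypothetical names, and the smoothed formula is an illustrative choice:

```python
import math
from collections import Counter

class RunningIdf:
    """Keeps document-frequency counts; IDF is derived only when asked for."""

    def __init__(self):
        self.n_docs = 0
        self.doc_freq = Counter()

    def add_document(self, tokens):
        # Cheap per-document update: no IDF values are recomputed here.
        self.n_docs += 1
        self.doc_freq.update(set(tokens))

    def idf(self, term):
        # Computed from the counts as they stand at the moment of the call.
        return math.log((1 + self.n_docs) / (1 + self.doc_freq.get(term, 0))) + 1.0

# Hypothetical crawl of two documents.
index = RunningIdf()
index.add_document(["x", "y", "x"])
index.add_document(["x"])

idf_x = index.idf("x")  # "x" appears in both documents
idf_y = index.idf("y")  # "y" appears in one document
```

Because only counts are stored, adding a document is cheap, and every IDF value reflects all documents seen up to the point it is requested.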


0
