Smoothing IDF scores for n-grams

I am trying to use IDF scores to find interesting phrases in my rather large corpus of documents.
I basically need something like Amazon's Statistically Improbable Phrases, i.e. phrases that distinguish a document from all the others. The problem I am facing is that some 3- and 4-grams in my data that have a very high IDF actually consist of component unigrams and bigrams with really low IDF.
For example, "you never tried" has a very high IDF, while each of its component unigrams has a very low IDF.
I need to come up with a function that takes the document frequencies of an n-gram and of all its component (n-k)-grams and returns a more meaningful measure of how well the phrase distinguishes the parent document from the rest.
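As a toy illustration of the mismatch (made-up corpus and helper functions, just to show what I mean):

```python
# Toy illustration (not a solution): document frequencies and IDF per
# n-gram order, showing how a 3-gram can hit the maximum IDF while its
# component unigrams sit near the bottom.
import math
from collections import defaultdict

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def idf_table(docs, max_n=3):
    """Map each n-gram (n = 1..max_n) to its IDF over the document collection."""
    df = defaultdict(int)
    for doc in docs:
        tokens = doc.lower().split()
        seen = set()
        for n in range(1, max_n + 1):
            seen.update(ngrams(tokens, n))
        for gram in seen:
            df[gram] += 1
    num_docs = len(docs)
    return {gram: math.log(num_docs / count) for gram, count in df.items()}

docs = [
    "you never tried the new recipe",
    "you said you never liked it",
    "i never tried to argue",
    "you tried hard",
]
idf = idf_table(docs)
print(idf[("you", "never", "tried")])                    # df = 1 -> log(4), the maximum
print(idf[("you",)], idf[("never",)], idf[("tried",)])   # each common -> low IDF
```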
If I were dealing with probabilities, I would try interpolation or back-off models. I am not sure what assumptions and intuitions make these models work well, and therefore how well they would carry over to IDF scores.
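For example, a naive back-off-style interpolation of IDF scores might look like the sketch below (reusing the idf table from the sketch above); I have no idea whether something like this is sound, which is really what I'm asking:

```python
# A naive guess at "interpolating" IDF scores, not an established method:
# mix an n-gram's own IDF with the recursively smoothed scores of its two
# (n-1)-gram components, so phrases built entirely from common words get
# pulled back down. `idf` is the table from the sketch above; `lam` is an
# arbitrary mixing weight.
def smoothed_idf(gram, idf, lam=0.6):
    own = idf.get(gram, 0.0)
    if len(gram) == 1:
        return own
    parts = [gram[:-1], gram[1:]]  # the two (n-1)-gram components
    backoff = sum(smoothed_idf(p, idf, lam) for p in parts) / len(parts)
    return lam * own + (1 - lam) * backoff

print(smoothed_idf(("you", "never", "tried"), idf))  # pulled below the raw IDF
print(smoothed_idf(("you",), idf))                   # unigrams are unchanged
```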
Anyone have any better ideas?

1 answer

I believe "you never tried" is a phrase that you don't want to extract, even though it has a high IDF. The problem is that there will be a huge number of n-grams that occur in only one document and therefore get the highest possible IDF score (with a document frequency of 1, the IDF is log N, the ceiling).

NLP has many smoothing methods. This paper (Chen & Goodman) is a pretty good summary of many of them. In particular, you may be interested in the Kneser-Ney smoothing algorithm, which works the way you suggest (backing off to shorter n-grams).
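Kneser-Ney itself is a bit fiddly; just to show the back-off structure, here is a rough sketch of interpolated Kneser-Ney for bigrams (simplified: a single fixed discount rather than the estimated discounts in Chen & Goodman, and no handling of histories that never occur):

```python
# Rough sketch of interpolated Kneser-Ney for bigrams: a discounted bigram
# estimate plus a back-off weight times a lower-order "continuation"
# probability (how many distinct contexts a word appears in).
from collections import Counter

def kneser_ney_bigram(tokens, d=0.75):
    bigrams = Counter(zip(tokens, tokens[1:]))
    history_counts = Counter(tokens[:-1])                  # count of w1 as a bigram history
    continuations = Counter(w2 for (w1, w2) in bigrams)    # distinct left contexts of w2
    followers = Counter(w1 for (w1, w2) in bigrams)        # distinct right contexts of w1
    total_bigram_types = len(bigrams)

    def prob(w1, w2):
        higher = max(bigrams[(w1, w2)] - d, 0) / history_counts[w1]
        lam = d * followers[w1] / history_counts[w1]       # mass reserved for back-off
        p_cont = continuations[w2] / total_bigram_types
        return higher + lam * p_cont

    return prob

tokens = "you never tried you never know you tried".split()
p = kneser_ney_bigram(tokens)
print(p("you", "never"), p("you", "tried"))
```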

These methods are normally used for language modeling, i.e. to estimate the probability of an n-gram given a really large corpus of the language. I honestly don't know how you would integrate them with IDF scores, or even whether that is really what you want to do.


Source: https://habr.com/ru/post/1312476/

