I believe "you never tried" is a phrase you don't want to extract, yet it has a high IDF. The problem is that a huge number of n-grams occur in only one document and therefore all receive the highest possible IDF score.
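To see the saturation concretely, here is a minimal sketch (the toy corpus and helper names are my own, just for illustration): with standard document-frequency IDF, every singleton trigram ties at the maximum score, so IDF cannot rank them.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy corpus of three "documents" (hypothetical, for illustration only).
docs = [
    "you never tried this before".split(),
    "you never know until you try".split(),
    "this is a different document entirely".split(),
]

N = len(docs)
df = Counter()
for doc in docs:
    # Document frequency: count each distinct trigram once per document.
    df.update(set(ngrams(doc, 3)))

idf = {g: math.log(N / df[g]) for g in df}
# Every trigram here occurs in exactly one document, so all of them
# share the maximum IDF value log(3) -- the ranking is uninformative.
```

This is exactly the degenerate case described above: once most n-grams are singletons, IDF stops discriminating between them.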
NLP has many smoothing methods; this article by Chen & Goodman is a pretty good summary of several of them. In particular, you may be interested in Kneser-Ney smoothing, which works as you suggest (backing off to n-grams of shorter length).
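To illustrate the backoff idea without the full Kneser-Ney machinery, here is a sketch of stupid backoff (Brants et al. 2007), a simpler relative that also falls back to shorter n-grams with a fixed penalty. The corpus, function names, and the penalty value 0.4 are illustrative assumptions, not part of the question.

```python
from collections import Counter

def ngram_counts(docs, max_n=3):
    """Count all n-grams of order 1..max_n across a tokenized corpus."""
    counts = Counter()
    for doc in docs:
        for n in range(1, max_n + 1):
            for i in range(len(doc) - n + 1):
                counts[tuple(doc[i:i + n])] += 1
    return counts

def backoff_score(ngram, counts, alpha=0.4):
    """Stupid backoff: relative frequency if the n-gram was seen,
    otherwise recurse on the shorter suffix with penalty alpha.
    (A simplified stand-in for Kneser-Ney smoothing.)"""
    if len(ngram) == 1:
        unigram_total = sum(c for g, c in counts.items() if len(g) == 1)
        return counts.get(ngram, 0) / unigram_total
    if counts.get(ngram, 0) > 0:
        return counts[ngram] / counts[ngram[:-1]]
    return alpha * backoff_score(ngram[1:], counts, alpha)

# Hypothetical two-document corpus.
counts = ngram_counts([
    "you never tried".split(),
    "you never know".split(),
])
```

A seen trigram like `("you", "never", "tried")` gets its plain relative frequency, while an unseen one such as `("tried", "you", "never")` falls back to the bigram `("you", "never")`, discounted by `alpha`, instead of scoring zero.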
These methods are typically used for language modeling, i.e. estimating the probability of an n-gram given a large corpus of the language. I don't know offhand how you would integrate them with IDF scores, or whether that is really what you want to do.