I am working on summarizing texts using the nltk library. I can extract the bigrams and trigrams and rank them by frequency.
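For context, this is roughly what I do now (a minimal sketch; the lowercasing and punctuation filtering are my own choices, not requirements):

```python
from collections import Counter

import nltk
from nltk.util import ngrams

# nltk.download('punkt')  # needed once for word_tokenize

text = ("A more principled way to estimate sentence importance is "
        "using random walks and eigenvector centrality.")

# Tokenize, lowercase, and drop punctuation tokens.
tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalnum()]

# Count bigrams and trigrams and rank them by raw frequency.
bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

print(bigram_counts.most_common(5))
print(trigram_counts.most_common(5))
```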
Since I am very new to this area (NLP), I was wondering whether there is a statistical model that would let me automatically select the appropriate N-gram size (by size I mean the length of the N-gram: one word is a unigram, two words a bigram, three words a trigram).
For example, say I have this text that I want to summarize, keeping only the 5 most relevant N-grams as the summary:
"A more principled way to estimate sentence importance is using random walks
and eigenvector centrality. LexRank[5] is an algorithm essentially identical
to TextRank, and both use this approach for document summarization. The two
methods were developed by different groups at the same time, and LexRank
simply focused on summarization, but could just as easily be used for
keyphrase extraction or any other NLP ranking task." (Wikipedia)
Then, as output, I would want: "random walks", "TextRank", "LexRank", "document summarization", "keyphrase extraction", "NLP ranking task"
In other words, my question is: how can I decide whether a unigram is more relevant than a bigram or a trigram? (Using frequency alone as the measure of relevance will not give me the results I want, since longer N-grams are naturally rarer than shorter ones.)
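For what it is worth, I also tried nltk's collocation finders, which rank N-grams by an association measure such as PMI instead of raw frequency (again just a sketch; whether PMI scores of bigrams and trigrams are directly comparable is exactly the part I am unsure about):

```python
import nltk
from nltk.collocations import (
    BigramAssocMeasures,
    BigramCollocationFinder,
    TrigramAssocMeasures,
    TrigramCollocationFinder,
)

# nltk.download('punkt')  # needed once for word_tokenize

text = ("LexRank is an algorithm essentially identical to TextRank, "
        "and both use this approach for document summarization.")
tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalnum()]

# Score N-grams by pointwise mutual information instead of frequency.
bigram_finder = BigramCollocationFinder.from_words(tokens)
trigram_finder = TrigramCollocationFinder.from_words(tokens)

print(bigram_finder.nbest(BigramAssocMeasures.pmi, 5))
print(trigram_finder.nbest(TrigramAssocMeasures.pmi, 5))
```

This ranks candidates within one N-gram size, but I still don't see a principled way to compare a bigram's score against a trigram's.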
Can someone point me to a research article, algorithm or course where such a method is already used or explained?
Thanks in advance.