I am working on summarizing texts using the nltk library. I can extract the bigrams and trigrams and rank them by frequency.
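For context, this is roughly what I do now (a minimal sketch; the lowercasing and punctuation filtering are my own choices, not requirements):

```python
from collections import Counter

import nltk
from nltk.util import ngrams

# nltk.download('punkt')  # needed once for word_tokenize

text = ("A more principled way to estimate sentence importance is "
        "using random walks and eigenvector centrality.")

# Tokenize, lowercase, and drop punctuation tokens.
tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalnum()]

# Count bigrams and trigrams and rank them by raw frequency.
bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

print(bigram_counts.most_common(5))
print(trigram_counts.most_common(5))
```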
Since I am very new to this area (NLP), I was wondering whether there is a statistical model that would let me automatically select the appropriate N-gram size (by size I mean the length of the N-gram: one word is a unigram, two words a bigram, three words a trigram).
For example, say I have this text that I want to summarize, keeping only the 5 most relevant N-grams as the summary:
"A more principled way to estimate sentence importance is using random walks
and eigenvector centrality. LexRank[5] is an algorithm essentially identical
to TextRank, and both use this approach for document summarization. The two
methods were developed by different groups at the same time, and LexRank
simply focused on summarization, but could just as easily be used for
keyphrase extraction or any other NLP ranking task." (Wikipedia)
Then, as output, I would want: "random walks", "TextRank", "LexRank", "document summarization", "keyphrase extraction", "NLP ranking task"
In other words, my question is: how can I decide whether a unigram is more relevant than a bigram or a trigram? (Using frequency alone as the measure of relevance will not give me the results I want, since longer N-grams are naturally rarer than shorter ones.)
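For what it is worth, I also tried nltk's collocation finders, which rank N-grams by an association measure such as PMI instead of raw frequency (again just a sketch; whether PMI scores of bigrams and trigrams are directly comparable is exactly the part I am unsure about):

```python
import nltk
from nltk.collocations import (
    BigramAssocMeasures,
    BigramCollocationFinder,
    TrigramAssocMeasures,
    TrigramCollocationFinder,
)

# nltk.download('punkt')  # needed once for word_tokenize

text = ("LexRank is an algorithm essentially identical to TextRank, "
        "and both use this approach for document summarization.")
tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalnum()]

# Score N-grams by pointwise mutual information instead of frequency.
bigram_finder = BigramCollocationFinder.from_words(tokens)
trigram_finder = TrigramCollocationFinder.from_words(tokens)

print(bigram_finder.nbest(BigramAssocMeasures.pmi, 5))
print(trigram_finder.nbest(TrigramAssocMeasures.pmi, 5))
```

This ranks candidates within one N-gram size, but I still don't see a principled way to compare a bigram's score against a trigram's.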
Can someone point me to a research article, algorithm or course where such a method is already used or explained?
Thanks in advance.