How to extract common / meaningful phrases from a series of text entries

I have a series of text items: raw HTML from a MySQL database. I want to find the most common phrases in these entries (not just the single most common phrase, and ideally, without matching word for word).

My example is any review page on Yelp.com that shows 3 fragments from hundreds of reviews about a restaurant, in the format:

"Try the Hamburger" (in 44 reviews)

For example, the Highlights section of this page:

http://www.yelp.com/biz/sushi-gen-los-angeles/

I have NLTK installed, and I played around with it a bit, but honestly, it’s overloaded with options. This seems like a fairly common problem, and I could not find a direct solution by doing a search here.

+59
nlp nltk text-extraction text-analysis
Mar 16 '10
4 answers

I suspect you want not just the most common phrases, but the most interesting ones. Otherwise, you can end up with an over-representation of phrases made of common words, and fewer interesting and informative phrases.

To do this, you essentially need to extract n-grams from your data and then find the ones that have the highest pointwise mutual information (PMI). That is, you want to find words that occur together much more often than you would expect by chance.
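For reference, PMI for a bigram is just log( p(x, y) / (p(x) * p(y)) ); a toy calculation (the counts below are made up for illustration) looks like this:

    import math

    def pmi(count_xy, count_x, count_y, total_words):
        # PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) )
        p_xy = count_xy / total_words
        p_x = count_x / total_words
        p_y = count_y / total_words
        return math.log2(p_xy / (p_x * p_y))

    # "spicy tuna" seen 40 times in a 10,000-word corpus, where "spicy"
    # appears 50 times and "tuna" appears 45 times, scores highly:
    print(pmi(40, 50, 45, 10000))   # about 7.5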

The NLTK collocations how-to describes how to do this in about 7 lines of code, e.g.:

    import nltk
    from nltk.collocations import *

    bigram_measures = nltk.collocations.BigramAssocMeasures()
    trigram_measures = nltk.collocations.TrigramAssocMeasures()

    # change this to read in your data
    finder = BigramCollocationFinder.from_words(
        nltk.corpus.genesis.words('english-web.txt'))

    # only bigrams that appear 3+ times
    finder.apply_freq_filter(3)

    # return the 10 n-grams with the highest PMI
    finder.nbest(bigram_measures.pmi, 10)
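The Genesis corpus is just NLTK's demo data; if your input is the review text from your database, you would tokenize it yourself first. A rough sketch under that assumption (the sample reviews below are placeholders):

    import nltk
    from nltk.collocations import BigramCollocationFinder
    from nltk.metrics import BigramAssocMeasures

    # Placeholder data -- substitute the review texts pulled from your
    # MySQL table, with the HTML already stripped.
    reviews = [
        "Try the hamburger, the hamburger is great.",
        "Great spicy tuna roll and a friendly staff.",
    ]

    tokens = [w.lower() for text in reviews for w in nltk.word_tokenize(text)]

    finder = BigramCollocationFinder.from_words(tokens)
    finder.apply_freq_filter(2)                  # keep bigrams seen at least twice
    print(finder.nbest(BigramAssocMeasures.pmi, 10))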
+83
Mar 16 '10 at 9:35

If you just want to get n-grams larger than 3, you can try this. I assume that you have already removed all the junk like HTML, etc.

    import nltk

    ngramlist = []
    raw = <yourtextfile here>
    x = 1
    ngramlimit = 6

    tokens = nltk.word_tokenize(raw)

    while x <= ngramlimit:
        ngramlist.extend(nltk.ngrams(tokens, x))
        x += 1

Probably not very pythonic, as I've only been doing this for a month or so, but it might help!
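To get from that list of n-grams to the "most common phrases" part of the question, you could count them afterwards; here is a rough sketch along the same lines (the sample string is mine):

    from collections import Counter

    import nltk

    raw = "Try the hamburger. The hamburger here is great. Great service too."
    tokens = nltk.word_tokenize(raw.lower())

    ngramlist = []
    for n in range(1, 7):                 # unigrams up to 6-grams
        ngramlist.extend(nltk.ngrams(tokens, n))

    # print the 10 most frequent n-grams of any length
    for gram, count in Counter(ngramlist).most_common(10):
        print(count, " ".join(gram))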

+3
Mar 28 '10 at 9:12

I think what you are looking for is chunking. I recommend reading chapter 7 of the NLTK book, or maybe my own article on chunk extraction. Both of these assume knowledge of part-of-speech tagging, which is covered in chapter 5.
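For what it's worth, a minimal chunking sketch with a hand-written regexp grammar (the grammar is illustrative, not the one from the book or the article) looks something like this:

    import nltk

    sentence = "Try the spicy tuna roll at Sushi Gen."
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

    # NP = optional determiner, any adjectives, one or more nouns
    grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
    chunker = nltk.RegexpParser(grammar)
    tree = chunker.parse(tagged)

    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
        print(" ".join(word for word, tag in subtree.leaves()))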

+3
Apr 15 '10

Well, for starters, you will probably have to remove all the HTML tags (search for "<[^>]*>" and replace it with ""). After that, you could try a naive approach of finding the longest common substrings between every two text items, but I don't think you would get very good results. You might do better by normalizing the words first (reducing them to their base form, removing all accents, setting everything to lower or upper case) and then analyzing. Again, depending on what you want to accomplish, you might be able to cluster the text items better if you allow some flexibility in word order, i.e. treat the text items as bags of normalized words and measure the similarity of the bag contents.
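A rough sketch of that bag-of-words comparison (lowercasing only; stemming and accent removal are left out, and the two sample reviews are made up):

    import re

    def normalize(html):
        text = re.sub(r"<[^>]*>", "", html)        # strip HTML tags
        return set(re.findall(r"[a-z']+", text.lower()))

    def jaccard(a, b):
        return len(a & b) / len(a | b) if (a | b) else 0.0

    review_a = "<p>Try the hamburger, it's great!</p>"
    review_b = "<div>The hamburger here is really great.</div>"
    print(jaccard(normalize(review_a), normalize(review_b)))   # 0.375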

I commented on a similar (though not identical) topic here.

0
Mar 16 '10 at 9:21


