How can I get the most popular phrases from a lot of text?

Question

I am setting up a Twitter-style "trending" box for my forum. I already have the most popular words, but I can't work out how to get popular phrases the way Twitter does.

As it stands, I just take the contents of the last 200 posts, put them into one string, split it into words, and then sort by which words are used most. How can I turn this from the most popular words into the most popular phrases?

+6

php

katoth Oct 13 '10 at 20:19

3 answers

Instead of splitting the text into individual words, split it into individual phrases. It is just as simple:

    $popular = array();

    foreach ($tweets as $tweet)
    {
        // split by common punctuation chars
        $sentences = preg_split('~[.!?]+~', $tweet);

        foreach ($sentences as $sentence)
        {
            $sentence = strtolower(trim($sentence)); // normalize sentences

            if ($sentence === '')
            {
                continue; // skip the empty pieces left by trailing punctuation
            }

            if (isset($popular[$sentence]) === false)
            //if (array_key_exists($sentence, $popular) === false)
            {
                $popular[$sentence] = 0;
            }

            $popular[$sentence]++;
        }
    }

    arsort($popular);

    echo '<pre>';
    print_r($popular);
    echo '</pre>';

It will be much slower if you instead treat every run of n consecutive words as a phrase, because each message then produces many more candidates to count.
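For comparison, here is a minimal sketch of the "n consecutive words" variant mentioned above. The function name `count_phrases`, the 2 to 3 word window, and the sample messages are assumptions made for illustration and are not part of the original answer. Words are split on anything that is not a letter or digit, so contractions such as "don't" are broken apart.

```php
<?php
// Hypothetical helper: counts every run of $minWords..$maxWords consecutive words.
function count_phrases(array $messages, int $minWords = 2, int $maxWords = 3): array
{
    $counts = array();

    foreach ($messages as $message) {
        // Lowercase (ASCII only) and split on anything that is not a letter or digit.
        $words = preg_split('~[^\p{L}\p{N}]+~u', strtolower($message), -1, PREG_SPLIT_NO_EMPTY);
        if ($words === false) {
            continue; // preg_split failed, e.g. on invalid UTF-8
        }

        $total = count($words);
        for ($n = $minWords; $n <= $maxWords; $n++) {
            // Slide a window of $n words across the message.
            for ($i = 0; $i + $n <= $total; $i++) {
                $phrase = implode(' ', array_slice($words, $i, $n));
                $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
            }
        }
    }

    arsort($counts); // most frequent first
    return $counts;
}

// Sample input, invented for this example.
$messages = array(
    'I love open source software',
    'Open source software is great',
    'We all love open source',
);

print_r(array_slice(count_phrases($messages), 0, 5, true));
```

Capping the window at a few words keeps the number of candidates roughly proportional to the text length. Leaving it unbounded makes the count grow quadratically with the length of each message.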

+1

Alix axel Oct 13 '10 at 20:36

I'm not sure what type of answer you were looking for, but have a look at Laconica:

http://status.net/?source=laconica

It is an open-source Twitter clone (a much simpler version).

Perhaps you could reuse part of its code to build your own popular-phrases feature?

Good luck

+1

Trufa Oct 14 '10 at 4:20

mattbasta · Accepted Answer · 2010-10-14T04:13:23+0000

One approach you might consider is using a ZSET (sorted set) in Redis for something like this. If you have very large data sets, you'll find you can do something like this:

    $words = explode(" ", $input); // Rough way of breaking a block of data into individual words.
    $word_count = count($words);

    $r = new Redis(); // Owlient PHPRedis PECL extension
    $r->connect("127.0.0.1", 6379);

    // Joins a run of words into one phrase and bumps its score in the sorted set.
    function process_phrase($phrase) {
        global $r;
        $phrase = implode(" ", $phrase);
        $r->zIncrBy("trending_phrases", 1, $phrase);
    }

    // Every starting position $i, every length $j that still fits in the input.
    for ($i = 0; $i < $word_count; $i++)
        for ($j = 1; $j <= $word_count - $i; $j++)
            process_phrase(array_slice($words, $i, $j));

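To get the top phrases, you'd use this:

    // Assume $r is instantiated like it is above.
    // Ranges are inclusive, so 0 to 9 returns ten members.
    // zRevRange is the method name in current phpredis (the original answer used zReverseRange).
    $trending_phrases = $r->zRevRange("trending_phrases", 0, 9);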

$trending_phrases will be an array of the ten most popular phrases. To track recent phrases (as opposed to a persistent, global set of phrases), duplicate all of the Redis interactions above. For each interaction, use a key that identifies the day, for example the number of days since January 1, 1970, so that today and tomorrow each get their own key. When retrieving results, fetch both today's and yesterday's keys and use array_merge and array_unique to find the union.
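A minimal sketch of the per-day keys described above, kept free of a live Redis connection so it can run on its own. The key format `trending_phrases:<day number>` and the helper names `day_key`, `recent_keys` and `merge_phrase_lists` are assumptions for illustration. In practice each key from `recent_keys` would be passed to the same zIncrBy and range calls shown earlier.

```php
<?php
// Hypothetical key scheme: "<prefix>:<days since 1970-01-01 UTC>".
function day_key(string $prefix, int $timestamp): string
{
    return $prefix . ':' . intdiv($timestamp, 86400);
}

// Keys for the day containing $timestamp and the $daysBack days before it.
function recent_keys(string $prefix, int $timestamp, int $daysBack = 1): array
{
    $keys = array();
    for ($d = 0; $d <= $daysBack; $d++) {
        $keys[] = day_key($prefix, $timestamp - $d * 86400);
    }
    return $keys;
}

// Union of several top-phrase lists via array_merge and array_unique.
function merge_phrase_lists(array ...$lists): array
{
    return array_values(array_unique(array_merge(...$lists)));
}

$now = 1287029603; // sample timestamp (2010-10-14 04:13:23 UTC)
print_r(recent_keys('trending_phrases', $now));

// Sample data standing in for the top phrases read from two daily keys.
$today     = array('open source', 'php redis');
$yesterday = array('php redis', 'sorted set');
print_r(merge_phrase_lists($today, $yesterday));
```

Note that a plain union drops the scores, so the merged list is not ranked across days. If a combined ranking matters, the scores would have to be fetched and summed as well.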

Hope this helps!
