Determining the most used words in a MySQL dataset (PHP, MySQL)

I am trying to figure out how to identify the most used words in a MySQL dataset.

I'm not sure how to do this, or whether there is a simpler approach. I have read a couple of posts where some people suggest using an algorithm.

Example:

From the 24,500 entries, find the 10 most popular words.

+6
5 answers

Right, this runs like a dog (it is very slow) and only works with a single delimiter, but hopefully it gives you an idea.

SELECT aWord, COUNT(*) AS WordOccuranceCount
FROM (
    SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(SomeColumn, ' '), ' ', aCnt), ' ', -1) AS aWord
    FROM SomeTable
    CROSS JOIN (
        -- generates the numbers 1 to 1000 from three copies of the integers table
        SELECT a.i + b.i * 10 + c.i * 100 + 1 AS aCnt
        FROM integers a, integers b, integers c
    ) Sub1
    -- keep only word positions that exist in this row (number of spaces + 1)
    WHERE (LENGTH(SomeColumn) + 1 - LENGTH(REPLACE(SomeColumn, ' ', ''))) >= aCnt
) Sub2
WHERE Sub2.aWord != ''
GROUP BY aWord
ORDER BY WordOccuranceCount DESC
LIMIT 10

It relies on a table called integers, with a single column i and 10 rows holding the values 0 to 9. It copes with up to about 1000 words per field, but can easily be altered to handle more (although it will get even slower).

+13

Why not do it all in PHP? The steps would be:

  • Create a dictionary (word => qty)
  • Read the data in PHP
  • Split it into words
  • Add each word to the dictionary (you may want to lowercase and trim it first)
  • If the word is already in the dictionary, increment its counter. If not, add it with a count of 1
  • Iterate over the dictionary entries to find the 10 highest values

I would not do this in SQL, mainly because it would be more complex.
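
The steps above could be sketched roughly as follows. This is only a minimal sketch, not a tuned implementation: the function name topWords is hypothetical, the rows are assumed to already be loaded as plain strings, the mbstring extension is assumed to be available, and the definition of a "word" (runs of letters, digits and apostrophes) is an assumption you may want to change.

```php
<?php
// Hypothetical helper: counts words across all rows and returns the top $limit.
function topWords(iterable $rows, int $limit = 10): array
{
    $counts = []; // the dictionary: word => qty

    foreach ($rows as $text) {
        // Lowercase first, then split on anything that is not a letter,
        // a digit or an apostrophe (assumed definition of a word).
        $words = preg_split(
            "/[^\\p{L}\\p{N}']+/u",
            mb_strtolower((string) $text),
            -1,
            PREG_SPLIT_NO_EMPTY
        ) ?: []; // preg_split returns false on invalid UTF-8

        foreach ($words as $word) {
            // Increment the counter, starting from 0 for a new word.
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
    }

    arsort($counts);                              // highest counts first, keys kept
    return array_slice($counts, 0, $limit, true); // keep only the top $limit
}

// Example with in-memory rows instead of a database result.
$rows = ['the cat and the dog', 'The dog barked'];
print_r(topWords($rows, 2)); // ['the' => 3, 'dog' => 2]
```

With a real table, the rows could come from something like a PDO query fetched with PDO::FETCH_COLUMN on the text column. For 24,500 entries the whole dictionary should fit in memory comfortably, unless the individual texts are very large.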

+4

The general idea would be to find out how many delimiters (e.g. spaces) there are in each field, and to run SUBSTRING_INDEX() in a loop for each such field. Writing the results into a temporary table has the added benefit that the work can be run in chunks, in parallel, and so on. It should not be too cumbersome to throw together a few stored procedures for this.

+1
 SELECT `COLUMNNAME`, COUNT(*) FROM `TABLENAME` GROUP BY `COLUMNNAME` 

It's very simple and it works... :)

+1

To improve this a bit, remove stop words from the result by adding AND Sub2.aWord NOT IN (list of stop words):

SELECT aWord, COUNT(*) AS WordOccuranceCount
FROM (
    SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(txt_msg, ' '), ' ', aCnt), ' ', -1) AS aWord
    FROM mensagens
    CROSS JOIN (
        SELECT a.i + b.i * 10 + c.i * 100 + 1 AS aCnt
        FROM integers a, integers b, integers c
    ) Sub1
    WHERE (LENGTH(txt_msg) + 1 - LENGTH(REPLACE(txt_msg, ' ', ''))) >= aCnt
) Sub2
WHERE Sub2.aWord != ''
  AND Sub2.aWord NOT IN ('a','about','above', .....)
GROUP BY aWord
ORDER BY WordOccuranceCount DESC
LIMIT 10
0
