Determining the most used words in a MySQL dataset with PHP

Question

I am trying to figure out how to determine the most used words in a MySQL dataset.

I'm not sure how to go about this, or whether there is a simpler approach. I've read a couple of posts in which some people suggest using an algorithm.

Example:

From the 24,500 entries, find the 10 most popular words.

+6

string mysql

Codex73 Nov 02 '12 at 2:56

5 answers

Why not do it all in PHP? The steps would be:

1. Create a dictionary (word => count).
2. Read the data into PHP.
3. Split it into words.
4. Add each word to the dictionary (you may want to lowercase and trim it first).
5. If the word is already in the dictionary, increment its counter. If not, add it with a count of 1.
6. Iterate over the dictionary entries to find the 10 highest counts.
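The steps above could be sketched roughly as follows. This is only a minimal illustration, not the answerer's own code: the function name `topWords` is made up, the rows are passed in as plain strings (in practice they would come from a MySQL query, e.g. via PDO), and the rule for what counts as a "word" (runs of letters, digits and apostrophes) is an assumption.

```php
<?php
/**
 * Count how often each word occurs across all rows and return the
 * $limit most frequent ones as [word => count], highest count first.
 * (Hypothetical helper; name and word-splitting rule are assumptions.)
 */
function topWords(iterable $rows, int $limit = 10): array
{
    $counts = []; // the "dictionary": word => count

    foreach ($rows as $text) {
        // Lowercase, then split on anything that is not a letter, digit or
        // apostrophe. Note: strtolower() only reliably folds ASCII letters.
        $words = preg_split(
            "/[^\\p{L}\\p{N}']+/u",
            strtolower((string) $text),
            -1,
            PREG_SPLIT_NO_EMPTY
        ) ?: []; // preg_split() returns false on invalid UTF-8

        foreach ($words as $word) {
            // Increment if already present, otherwise start at 1.
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
    }

    arsort($counts);                               // sort by count, descending
    return array_slice($counts, 0, $limit, true);  // keep the top $limit words
}

// Example with in-memory rows standing in for the database result.
$rows = ['The cat and the dog', 'the cat'];
print_r(topWords($rows, 2));
```

With 24,500 rows this keeps only the dictionary in memory as long as the rows are streamed one at a time rather than loaded all at once.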

I would not do this in SQL, mainly because it would be more complex.

+4

Ege akpinar Feb 25 '13 at 23:57

The general idea would be to find out how many delimiters (e.g. spaces) there are in each field, and run SUBSTRING_INDEX() in a loop for each such field. Writing the results into a temporary table has the added benefit that you can run this in chunks, in parallel, etc. It shouldn't be too cumbersome to throw together some stored procedures to do this.

+1

fenway Feb 21 '13 at 1:55

 SELECT `COLUMNNAME`, COUNT(*) FROM `TABLENAME` GROUP BY `COLUMNNAME`

It's very simple and it works... :)

+1

Mahdi Malekian Jul 28 '17 at 0:10

To improve on this a bit, remove stop words from the result by adding AND Sub2.aWord NOT IN (list of stop words):

    SELECT aWord, COUNT(*) AS WordOccuranceCount
    FROM (
        SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(txt_msg, ' '), ' ', aCnt), ' ', -1) AS aWord
        FROM mensagens
        CROSS JOIN (
            SELECT a.i + b.i*10 + c.i*100 + 1 AS aCnt
            FROM integers a, integers b, integers c
        ) Sub1
        WHERE (LENGTH(txt_msg) + 1 - LENGTH(REPLACE(txt_msg, ' ', ''))) >= aCnt
    ) Sub2
    WHERE Sub2.aWord != ''
      AND Sub2.aWord NOT IN ('a','about','above', .....)
    GROUP BY aWord
    ORDER BY WordOccuranceCount DESC
    LIMIT 10

0

Eduardo de souza Nov 14 '16 at 16:09

Kickstart · Accepted Answer · 2013-02-19T16:53:06+0000

Right, this runs like a dog and is limited to a single delimiter, but hopefully it gives you an idea.

    SELECT aWord, COUNT(*) AS WordOccuranceCount
    FROM (
        SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(CONCAT(SomeColumn, ' '), ' ', aCnt), ' ', -1) AS aWord
        FROM SomeTable
        CROSS JOIN (
            SELECT a.i + b.i*10 + c.i*100 + 1 AS aCnt
            FROM integers a, integers b, integers c
        ) Sub1
        WHERE (LENGTH(SomeColumn) + 1 - LENGTH(REPLACE(SomeColumn, ' ', ''))) >= aCnt
    ) Sub2
    WHERE Sub2.aWord != ''
    GROUP BY aWord
    ORDER BY WordOccuranceCount DESC
    LIMIT 10

This relies on a table called integers with a single column i, containing 10 rows with the values 0 to 9. The three-way join generates the numbers 1 to 1000, so it copes with up to ~1000 words per row; it can easily be altered to cope with more (but it will slow down even more).
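To see what the query does for a single row, the same logic can be imitated in PHP. This is only an illustrative sketch: `substringIndex` is a hypothetical re-implementation of MySQL's SUBSTRING_INDEX() for a non-empty delimiter, and `wordsViaSubstringIndex` is a made-up name for the loop that the join against the generated numbers performs.

```php
<?php
/**
 * Rough PHP imitation of MySQL's SUBSTRING_INDEX(str, delim, count).
 * Assumes $delim is non-empty (explode() rejects an empty delimiter).
 */
function substringIndex(string $s, string $delim, int $count): string
{
    $parts = explode($delim, $s);
    if ($count > 0) {
        // Everything to the left of the $count-th delimiter.
        return implode($delim, array_slice($parts, 0, $count));
    }
    if ($count < 0) {
        // Everything to the right of the $count-th delimiter from the end.
        return implode($delim, array_slice($parts, $count));
    }
    return '';
}

/**
 * Mirror of the subquery: for n = 1 .. (number of spaces + 1),
 * take the n-th space-separated piece and drop empty ones.
 */
function wordsViaSubstringIndex(string $text): array
{
    // Same as LENGTH(col) + 1 - LENGTH(REPLACE(col, ' ', '')) in the query.
    $pieces = strlen($text) + 1 - strlen(str_replace(' ', '', $text));

    $words = [];
    for ($n = 1; $n <= $pieces; $n++) {
        // Inner call keeps the first $n pieces, outer call keeps the last of those.
        $word = substringIndex(substringIndex($text . ' ', ' ', $n), ' ', -1);
        if ($word !== '') {          // the query's WHERE Sub2.aWord != ''
            $words[] = $word;
        }
    }
    return $words;
}

print_r(wordsViaSubstringIndex('the  cat sat'));
```

Consecutive spaces produce empty pieces, which is why the query filters out empty strings before grouping.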
