The simplest approach I can think of would be:
- Create two new tables:
keywords (id, word) and keywords_comments (keyword_id, comment_id, count). keywords stores a unique id together with each keyword you find in a text. keywords_comments stores one row for each link between a keyword and a comment that contains it; in count you store how many times the keyword occurs in that comment. The two columns keyword_id + comment_id together form a unique key, or can directly serve as the primary key. (A minimal schema sketch follows after this list.)
- Get all comments from the database
- Parse all comments and split them on non-word characters (or other word boundaries).
- Record these entries in your tables.
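For illustration, here is a minimal sketch of that schema in Python with SQLite. The column types, the UNIQUE constraint on word, and the database file name are my assumptions; adapt them to your actual database.

```python
import sqlite3

# Hypothetical database file; the original setup is not specified.
conn = sqlite3.connect("comments.db")

# keywords: one row per unique keyword.
conn.execute("""
    CREATE TABLE IF NOT EXISTS keywords (
        id   INTEGER PRIMARY KEY,
        word TEXT NOT NULL UNIQUE
    )
""")

# keywords_comments: one row per (keyword, comment) pair, with the
# number of occurrences of that keyword in that comment.
conn.execute("""
    CREATE TABLE IF NOT EXISTS keywords_comments (
        keyword_id INTEGER NOT NULL,
        comment_id INTEGER NOT NULL,
        count      INTEGER NOT NULL,
        PRIMARY KEY (keyword_id, comment_id)
    )
""")
conn.commit()
```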
Example
You have the following two comments:
Hello how are you?!
Wow, hello. My name is Stefan.
Now you iterate over both of them and split them on non-word characters. This results in the following lowercase words for each text:
- First text: hello, how, are, you
- Second text: wow, hello, my, name, is, stefan
Once you have parsed one of these texts, you can insert its results into the database right away; I assume you do not want to load all 100,000 comments into RAM at once.
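A minimal sketch of the splitting step, assuming Python purely for illustration (the question does not specify a language): split on non-word characters, lowercase, and count occurrences.

```python
import re
from collections import Counter

def extract_keywords(text):
    """Split a comment on non-word characters, lowercase the parts,
    and count how often each keyword occurs."""
    words = re.split(r"\W+", text.lower())
    return Counter(w for w in words if w)

print(extract_keywords("Hello how are you?!"))
# Counter({'hello': 1, 'how': 1, 'are': 1, 'you': 1})
print(extract_keywords("Wow, hello. My name is Stefan."))
# Counter({'wow': 1, 'hello': 1, 'my': 1, 'name': 1, 'is': 1, 'stefan': 1})
```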
So it would go like this:
- Parse the first text and get the keywords listed above.
- Write each keyword into the table keywords if it does not exist there yet.
- Set the link from the keyword to the comment (keywords_comments) and set the count correctly (in our example each word occurs only once per text, but you must account for repeats); see the sketch after this list.
- Parse the second text.
- ...
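Continuing the sketch above (same assumed SQLite connection and the hypothetical extract_keywords helper), one way to write a single parsed comment back to the two tables could look like this. INSERT OR IGNORE / INSERT OR REPLACE are SQLite syntax; MySQL, for example, would use INSERT IGNORE and INSERT ... ON DUPLICATE KEY UPDATE instead.

```python
def store_keywords(conn, comment_id, text):
    """Insert the comment's keywords and its keyword/comment links."""
    for word, count in extract_keywords(text).items():
        # Insert the keyword only if it does not exist yet.
        conn.execute("INSERT OR IGNORE INTO keywords (word) VALUES (?)", (word,))
        keyword_id = conn.execute(
            "SELECT id FROM keywords WHERE word = ?", (word,)
        ).fetchone()[0]
        # Link keyword and comment, storing how often the word occurred.
        conn.execute(
            "INSERT OR REPLACE INTO keywords_comments (keyword_id, comment_id, count) "
            "VALUES (?, ?, ?)",
            (keyword_id, comment_id, count),
        )
    conn.commit()
```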
Slight improvement
A very simple improvement that you will probably need for 100,000 comments is to use a counting variable or to add a new analyzed field to each comment, so that you can read the comments from the database in batches.
I usually use a counting variable when I read the data batch-wise and know that the data can only change in one direction, away from where I started reading (i.e., everything up to the point I have currently reached stays the same). Then I do something like:
```sql
SELECT * FROM table ORDER BY created ASC LIMIT 0, 100
SELECT * FROM table ORDER BY created ASC LIMIT 100, 100
SELECT * FROM table ORDER BY created ASC LIMIT 200, 100
...
```
Note that this only works if we know for sure that no data can be added at positions we think we have already read. Using DESC, for example, will not work, because new rows can be inserted at the front; the whole offset would then shift, and we would read one comment twice and never read the newly inserted one.
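For illustration, a minimal loop around these queries might look like this (Python with SQLite; the comments table name and its id, text and created columns are assumptions, and store_keywords is the hypothetical helper sketched earlier):

```python
def process_in_batches(conn, batch_size=100):
    """Read comments in fixed-size chunks using an increasing offset.
    Only safe if no rows can appear before the current offset."""
    offset = 0
    while True:
        rows = conn.execute(
            "SELECT id, text FROM comments ORDER BY created ASC LIMIT ? OFFSET ?",
            (batch_size, offset),
        ).fetchall()
        if not rows:
            break
        for comment_id, text in rows:
            store_keywords(conn, comment_id, text)
        offset += batch_size
```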
If you cannot make sure that the external counting variable stays consistent, you can add the analyzed field mentioned above, which you set to true as soon as you have read the comment. Then you can always see which comments have already been read and which have not. The SQL query would then look like this:
```sql
SELECT * FROM table WHERE analyzed = 0 LIMIT 100 /* Reading chunks of 100 */
```
This works as long as you do not parallelize the workload (across multiple clients or threads). Otherwise you would have to make sure that reading a comment and setting its analyzed flag to true happens atomically (synchronized).
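A matching sketch for the flag-based variant, again Python/SQLite and single-threaded (so reading and marking need no extra synchronization); it assumes the hypothetical comments table has an analyzed column defaulting to 0:

```python
def process_unanalyzed(conn, batch_size=100):
    """Repeatedly fetch comments that have not been analyzed yet,
    process them, and mark them as analyzed."""
    while True:
        rows = conn.execute(
            "SELECT id, text FROM comments WHERE analyzed = 0 LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            break
        for comment_id, text in rows:
            store_keywords(conn, comment_id, text)
            # Mark the comment as done so it is not picked up again.
            conn.execute("UPDATE comments SET analyzed = 1 WHERE id = ?", (comment_id,))
            conn.commit()
```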