The simplest approach I can think of would be:
- Create two new tables:
keywords (id, word) and keywords_comments (keyword_id, comment_id, count). keywords stores a unique id together with each keyword you find in a text. keywords_comments stores one row for each link between a keyword and a comment that contains it; in count you store how many times the keyword occurs in that comment. The two columns keyword_id + comment_id together form a unique key, or can directly serve as the primary key. (A minimal schema sketch follows after this list.)
- Get all comments from the database
- Parse all comments and split them on non-word characters (or other word boundaries).
- Record these entries in your tables.
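For illustration, here is a minimal sketch of that schema in Python with SQLite. The column types, the UNIQUE constraint on word, and the database file name are my assumptions; adapt them to your actual database.

```python
import sqlite3

# Hypothetical database file; the original setup is not specified.
conn = sqlite3.connect("comments.db")

# keywords: one row per unique keyword.
conn.execute("""
    CREATE TABLE IF NOT EXISTS keywords (
        id   INTEGER PRIMARY KEY,
        word TEXT NOT NULL UNIQUE
    )
""")

# keywords_comments: one row per (keyword, comment) pair, with the
# number of occurrences of that keyword in that comment.
conn.execute("""
    CREATE TABLE IF NOT EXISTS keywords_comments (
        keyword_id INTEGER NOT NULL,
        comment_id INTEGER NOT NULL,
        count      INTEGER NOT NULL,
        PRIMARY KEY (keyword_id, comment_id)
    )
""")
conn.commit()
```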
Example
You have the following two comments:
Hello how are you?!
Wow, hello. My name is Stefan.
Now you iterate over both of them and split them on non-word characters. This results in the following lowercase words for each text:
- First text: hello, how, are, you
- Second text: wow, hello, my, name, is, stefan
Once you have parsed one of these texts, you can insert its results into the database right away; I assume you do not want to load all 100,000 comments into RAM at once.
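A minimal sketch of the splitting step, assuming Python purely for illustration (the question does not specify a language): split on non-word characters, lowercase, and count occurrences.

```python
import re
from collections import Counter

def extract_keywords(text):
    """Split a comment on non-word characters, lowercase the parts,
    and count how often each keyword occurs."""
    words = re.split(r"\W+", text.lower())
    return Counter(w for w in words if w)

print(extract_keywords("Hello how are you?!"))
# Counter({'hello': 1, 'how': 1, 'are': 1, 'you': 1})
print(extract_keywords("Wow, hello. My name is Stefan."))
# Counter({'wow': 1, 'hello': 1, 'my': 1, 'name': 1, 'is': 1, 'stefan': 1})
```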
So it would go like this:
- Parse the first text and get the keywords listed above.
- Write each keyword into the table keywords if it does not exist there yet.
- Set the link from the keyword to the comment (keywords_comments) and set the count correctly (in our example each word occurs only once per text, but you must account for repeats); see the sketch after this list.
- Parse the second text.
- ...
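Continuing the sketch above (same assumed SQLite connection and the hypothetical extract_keywords helper), one way to write a single parsed comment back to the two tables could look like this. INSERT OR IGNORE / INSERT OR REPLACE are SQLite syntax; MySQL, for example, would use INSERT IGNORE and INSERT ... ON DUPLICATE KEY UPDATE instead.

```python
def store_keywords(conn, comment_id, text):
    """Insert the comment's keywords and its keyword/comment links."""
    for word, count in extract_keywords(text).items():
        # Insert the keyword only if it does not exist yet.
        conn.execute("INSERT OR IGNORE INTO keywords (word) VALUES (?)", (word,))
        keyword_id = conn.execute(
            "SELECT id FROM keywords WHERE word = ?", (word,)
        ).fetchone()[0]
        # Link keyword and comment, storing how often the word occurred.
        conn.execute(
            "INSERT OR REPLACE INTO keywords_comments (keyword_id, comment_id, count) "
            "VALUES (?, ?, ?)",
            (keyword_id, comment_id, count),
        )
    conn.commit()
```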
Slight improvement
A very simple improvement that you will probably need for 100,000 comments is to use a counting variable or to add a new analyzed field to each comment, so that you can read the comments from the database in batches.
I usually use a counting variable when I read the data batch-wise and know that the data can only change in one direction, away from where I started reading (i.e., everything up to the point I have currently reached stays the same). Then I do something like:
```sql
SELECT * FROM table ORDER BY created ASC LIMIT 0, 100
SELECT * FROM table ORDER BY created ASC LIMIT 100, 100
SELECT * FROM table ORDER BY created ASC LIMIT 200, 100
...
```
Note that this only works if we know for sure that no data can be added at positions we think we have already read. Using DESC, for example, will not work, because new rows can be inserted at the front; the whole offset would then shift, and we would read one comment twice and never read the newly inserted one.
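For illustration, a minimal loop around these queries might look like this (Python with SQLite; the comments table name and its id, text and created columns are assumptions, and store_keywords is the hypothetical helper sketched earlier):

```python
def process_in_batches(conn, batch_size=100):
    """Read comments in fixed-size chunks using an increasing offset.
    Only safe if no rows can appear before the current offset."""
    offset = 0
    while True:
        rows = conn.execute(
            "SELECT id, text FROM comments ORDER BY created ASC LIMIT ? OFFSET ?",
            (batch_size, offset),
        ).fetchall()
        if not rows:
            break
        for comment_id, text in rows:
            store_keywords(conn, comment_id, text)
        offset += batch_size
```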
If you cannot make sure that the external counting variable stays consistent, you can add the analyzed field mentioned above, which you set to true as soon as you have read the comment. Then you can always see which comments have already been read and which have not. The SQL query would then look like this:
```sql
SELECT * FROM table WHERE analyzed = 0 LIMIT 100 /* Reading chunks of 100 */
```
This works as long as you do not parallelize the workload (across multiple clients or threads). Otherwise you would have to make sure that reading a comment and setting its analyzed flag to true happens atomically (synchronized).
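A matching sketch for the flag-based variant, again Python/SQLite and single-threaded (so reading and marking need no extra synchronization); it assumes the hypothetical comments table has an analyzed column defaulting to 0:

```python
def process_unanalyzed(conn, batch_size=100):
    """Repeatedly fetch comments that have not been analyzed yet,
    process them, and mark them as analyzed."""
    while True:
        rows = conn.execute(
            "SELECT id, text FROM comments WHERE analyzed = 0 LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            break
        for comment_id, text in rows:
            store_keywords(conn, comment_id, text)
            # Mark the comment as done so it is not picked up again.
            conn.execute("UPDATE comments SET analyzed = 1 WHERE id = ?", (comment_id,))
            conn.commit()
```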