How can I cluster short messages [tweets] based on topic? [Topical clustering]

I am planning an application that will cluster short messages / tweets by topic. The number of topics will be limited, for example Sports [NBA, NFL, Cricket, Soccer], Entertainment [films, music], etc.

I can think of two approaches to this.

  • Ask users to tag their messages, as Stack Overflow does for questions. Users select tags from a predefined list, and on the server side I cluster the messages by tag. Pros: simple design, less code complexity. Cons: the choice for users is limited and the clusters are not dynamic. If a new event occurs, the predefined tags will miss it.
  • Take the message, remove the stop words [predefined in a dictionary], apply some clustering algorithm to the remaining text to create clusters, and display each cluster according to its popularity. A cluster is displayed for as long as it remains popular [many messages per minute]. New messages are classified and assigned to the respective clusters. Pros: dynamic clustering based on the popularity of the event / incident. Cons: increased complexity, more server resources required.
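The second approach could be sketched roughly as follows. This is only an illustration under stated assumptions: the post does not name a clustering algorithm, so the sketch uses greedy single-pass clustering with Jaccard similarity on word sets, and `STOP_WORDS`, `tokens`, `jaccard`, `cluster` and the `threshold` value are all hypothetical names and choices, not part of the original design.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real dictionary would be much larger.
STOP_WORDS = {"a", "an", "the", "is", "in", "on", "at", "to", "of", "and", "for", "was"}


def tokens(message):
    """Lowercase the message, split it into words, and drop stop words."""
    words = re.findall(r"[a-z0-9']+", message.lower())
    return {w for w in words if w not in STOP_WORDS}


def jaccard(a, b):
    """Overlap between two word sets, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster(messages, threshold=0.2):
    """Put each message into the most similar existing cluster, or start a new one."""
    clusters = []  # each cluster is {"words": set of words, "messages": list of str}
    for msg in messages:
        words = tokens(msg)
        best, best_score = None, 0.0
        for c in clusters:
            score = jaccard(words, c["words"])
            if score > best_score:
                best, best_score = c, score
        if best is not None and best_score >= threshold:
            best["messages"].append(msg)
            best["words"] |= words  # the cluster vocabulary grows with each message
        else:
            clusters.append({"words": set(words), "messages": [msg]})
    # Popularity here is simply the message count; most popular cluster first.
    return sorted(clusters, key=lambda c: len(c["messages"]), reverse=True)


if __name__ == "__main__":
    sample = [
        "Lakers win the NBA finals",
        "NBA finals: Lakers beat Celtics",
        "New film tops the box office",
    ]
    for c in cluster(sample):
        print(len(c["messages"]), c["messages"])
```

A production version would also need a time window (messages per minute) to decide when a cluster stops being popular; that part is left out here.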

I would like to know if there are any other approaches to this problem. Or are there any ways to improve the above methods?

Please also suggest some good clustering algorithms. I think the K-Nearest Clustering algorithm is suitable for this situation.

+7
cluster-analysis tagging
3 answers

Use Bayesian classification. Train the filter with a predefined corpus and (optionally) give users the ability to refine it further by flagging messages that were classified incorrectly.

There are some examples of using a Bayesian classifier in NLTK.
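To make the idea concrete, here is a minimal sketch of a naive Bayes topic classifier written with only the Python standard library. It is a stand-in for illustration and not NLTK's API; the class name `NaiveBayes`, its methods, and the training sentences are all assumptions made up for this example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes over word counts, with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of training messages
        self.vocab = set()                       # every word seen during training

    def train(self, text, label):
        """Add one labelled message; can also be called later with user corrections."""
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        """Return the most probable label, or None if nothing has been trained."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            # Work in log space: log prior plus the sum of log word likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for w in words:
                count = self.word_counts[label][w]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    nb = NaiveBayes()
    nb.train("lakers win nba finals", "sports")
    nb.train("new film tops box office", "entertainment")
    print(nb.classify("nba finals tonight"))
```

The optional user refinement mentioned above would amount to calling `train` again with the corrected label whenever a user flags a misclassified message.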

+2

Check out Carrot2; this tool extracts tags from text and clusters it. You can download it and look at the algorithms it implements (mainly Lingo).

Hope this helps you.

+3

I am working on this too. I think hashtags are a good approach if you are talking specifically about Twitter. You could also do some classification, but it should be enriched with an external knowledge base such as Wikipedia. In any case, if you find a better solution, please post it here.
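As a rough illustration of the hashtag idea, the sketch below groups messages by the hashtags they contain. The function name `group_by_hashtag` and the regular expression are assumptions for this example; Twitter's own rules for what counts as a hashtag are more detailed than a plain `\w+` match.

```python
import re
from collections import defaultdict


def group_by_hashtag(messages):
    """Map each lowercase hashtag to the list of messages that mention it."""
    groups = defaultdict(list)
    for msg in messages:
        # A set avoids adding the same message twice when a tag is repeated.
        tags = {t.lower() for t in re.findall(r"#(\w+)", msg)}
        for tag in tags:
            groups[tag].append(msg)
    return dict(groups)


if __name__ == "__main__":
    sample = ["Great game #NBA", "#nba finals tonight #Lakers", "no tags here"]
    for tag, msgs in group_by_hashtag(sample).items():
        print(tag, msgs)
```

Messages without any hashtag fall through this grouping entirely, which is where a classifier or an external knowledge base would have to take over.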

0