I am planning an application that will cluster short messages / tweets by topic. The number of topics will be limited, e.g. Sports [NBA, NFL, Cricket, Soccer], Entertainment [films, music], etc.
I can think of two approaches to this.
- Ask users to tag their messages, as on Stack Overflow. Users select tags from a predefined list, and on the server side I cluster the messages by tag.
  Pros: Simple design. Less code complexity.
  Cons: The choice for users will be limited. Clusters will not be dynamic: if a new event occurs, the predefined tags will miss it.
- Take the message, remove the stop words [predefined in a dictionary], apply some clustering algorithm to the remaining text to create clusters, and display each cluster depending on its popularity. A cluster will be displayed for as long as it remains popular [many messages / minute]. New messages will be classified and assigned to the respective clusters.
  Pros: Clustering is dynamic and based on the popularity of the event / incident.
  Cons: Increased complexity. More server resources are required.
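To make the second approach concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not a recommended design: the stop-word list, the `Cluster` class, the `assign` function, the Jaccard similarity measure, the 0.2 threshold and the 60-second window are all hypothetical choices made for this example. It uses a simple single-pass scheme (each message joins the most similar existing cluster or starts a new one), which is not the K-Nearest approach mentioned in the question.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field

# Hypothetical stop-word list; a real one would be much larger.
STOP_WORDS = {"a", "an", "the", "is", "in", "on", "at", "to", "of", "and", "for"}


def tokenize(message: str) -> set[str]:
    """Lowercase, split on non-alphanumerics, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", message.lower())
    return {w for w in words if w not in STOP_WORDS}


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


@dataclass
class Cluster:
    terms: set[str] = field(default_factory=set)
    timestamps: list[float] = field(default_factory=list)

    def rate(self, now: float, window: float = 60.0) -> int:
        """Popularity: messages received in the last `window` seconds."""
        return sum(1 for t in self.timestamps if now - t <= window)


def assign(clusters: list[Cluster], message: str, now: float,
           threshold: float = 0.2) -> Cluster:
    """Put the message in the most similar cluster, or start a new one."""
    tokens = tokenize(message)
    best = max(clusters, key=lambda c: jaccard(tokens, c.terms), default=None)
    if best is None or jaccard(tokens, best.terms) < threshold:
        best = Cluster()  # assumed policy: low similarity means a new topic
        clusters.append(best)
    best.terms |= tokens
    best.timestamps.append(now)
    return best
```

Calling `assign(clusters, text, now)` for each incoming message and then sorting `clusters` by `rate(now)` would give the popularity ordering described above. One known weakness of this sketch is that a cluster's term set only grows, so similarity scores drift downward over time; a real system would likely need weighting (for example TF-IDF) and expiry of old terms.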
I would like to know whether there are any other approaches to this problem, or any ways to improve the two approaches above.
Please also suggest some good clustering algorithms. I think the K-Nearest Clustering algorithm would be suitable for this situation.
cluster-analysis tagging
Jagira