If I understand your question correctly, you want to cluster the tags and then assign the feeds to these clusters based on the tags in each feed.
To do this, you can build a similarity measure between tags based on the number of feeds in which the tags appear together. For your example, it would look something like this:
#earthquake |
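Here the value in cell (i, j) equals the number of feeds in which tags i and j appear together, divided by the number of feeds that contain tag i. Note that this makes the matrix asymmetric in general.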
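As a rough illustration, the matrix could be computed along these lines. This is only a sketch: the `feeds` dictionary, its feed names and its tags are made up for the example, and `tag_similarity` is a hypothetical helper name.

```python
from collections import Counter
from itertools import permutations


def tag_similarity(feeds):
    """Return {(i, j): cooccurrence(i, j) / frequency(i)} for co-occurring tags."""
    tag_count = Counter()   # number of feeds containing each tag
    pair_count = Counter()  # number of feeds containing both tags of an ordered pair
    for tags in feeds.values():
        unique = set(tags)  # count each tag at most once per feed
        tag_count.update(unique)
        pair_count.update(permutations(sorted(unique), 2))
    return {(i, j): n / tag_count[i] for (i, j), n in pair_count.items()}


# Hypothetical sample data, not taken from the question.
feeds = {
    "feed_a": ["#earthquake", "#japan", "#tsunami"],
    "feed_b": ["#earthquake", "#japan"],
    "feed_c": ["#earthquake", "#chile"],
    "feed_d": ["#football", "#worldcup"],
}
sim = tag_similarity(feeds)
print(sim[("#earthquake", "#japan")])  # 2 shared feeds / 3 feeds with #earthquake
print(sim[("#japan", "#earthquake")])  # 2 shared feeds / 2 feeds with #japan
```

Pairs that never appear together are simply absent from the dictionary, which keeps it sparse when the number of tags is large.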
Now you have a similarity matrix between the tags, and you can use almost any clustering algorithm that suits your needs. Since the number of tags can be very large and it is difficult to estimate the number of clusters before running the algorithm, I would suggest a hierarchical clustering algorithm such as Fast Modularity clustering, which is also very fast (see some details here). However, if you have an estimate of the number of clusters you would like to split the tags into, then spectral clustering can also be useful (see here for details).
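To make the hierarchical idea concrete, here is a small sketch of greedy average-linkage agglomerative clustering. It is not Fast Modularity or spectral clustering, just a simple stand-in that also does not need the number of clusters in advance. The `threshold` parameter, the `cluster_tags` name and the hand-written similarity values are assumptions for the example.

```python
from itertools import combinations


def cluster_tags(tags, sim, threshold=0.5):
    """Merge the two most similar clusters until none reach the threshold.

    sim maps ordered tag pairs (i, j) to a similarity; missing pairs count as 0.
    """
    def pair_sim(a, b):
        # Symmetrise the asymmetric matrix by averaging both directions.
        return (sim.get((a, b), 0.0) + sim.get((b, a), 0.0)) / 2

    def linkage(c1, c2):
        # Average similarity over all cross-cluster tag pairs.
        return sum(pair_sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [frozenset([t]) for t in sorted(tags)]
    while len(clusters) > 1:
        best = max(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        if linkage(*best) < threshold:
            break
        clusters = [c for c in clusters if c not in best] + [best[0] | best[1]]
    return clusters


# Hand-written similarities for illustration only.
sim = {
    ("#earthquake", "#japan"): 0.67, ("#japan", "#earthquake"): 1.0,
    ("#earthquake", "#tsunami"): 0.33, ("#tsunami", "#earthquake"): 1.0,
    ("#japan", "#tsunami"): 0.5, ("#tsunami", "#japan"): 1.0,
    ("#football", "#worldcup"): 1.0, ("#worldcup", "#football"): 1.0,
}
tags = {"#earthquake", "#japan", "#tsunami", "#football", "#worldcup"}
clusters = cluster_tags(tags, sim)
print(clusters)
```

This naive version recomputes every linkage on each pass, so it is only suitable for small tag sets; for a large number of tags a dedicated library implementation would be the better choice.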
After clustering the tags, you can use a simple approach to assign each feed to a cluster. It can be very simple: for example, count the number of tags from each cluster that appear in the feed and assign the feed to the cluster with the maximum number of matching tags.
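The counting rule might look like the following sketch. The cluster contents and the `assign_feed` name are hypothetical, and returning `None` for a feed with no matching tags is my own choice rather than something the approach prescribes.

```python
def assign_feed(feed_tags, clusters):
    """Return the index of the cluster sharing the most tags with the feed.

    Returns None when no cluster shares any tag with the feed.
    Ties go to the cluster with the lowest index.
    """
    overlaps = [len(set(feed_tags) & set(cluster)) for cluster in clusters]
    best = max(range(len(clusters)), key=overlaps.__getitem__, default=None)
    if best is None or overlaps[best] == 0:
        return None
    return best


# Hypothetical tag clusters for illustration.
clusters = [
    {"#earthquake", "#japan", "#tsunami"},
    {"#football", "#worldcup"},
]
print(assign_feed(["#japan", "#tsunami", "#football"], clusters))  # two matches in cluster 0
print(assign_feed(["#cooking"], clusters))  # no matching tags
```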
If you are flexible in your clustering strategy, you can also try to cluster the feeds directly in a similar way: build a similarity matrix between feeds based on the number of tags they have in common, and then apply a clustering algorithm to that matrix.
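A feed-to-feed similarity of this kind could be sketched as follows, using the raw count of shared tags. The feed names and tags are invented for the example, and `feed_similarity` is a hypothetical helper name.

```python
from itertools import combinations


def feed_similarity(feeds):
    """Return {(feed_a, feed_b): number of tags the two feeds have in common}."""
    tag_sets = {name: set(tags) for name, tags in feeds.items()}
    return {
        (a, b): len(tag_sets[a] & tag_sets[b])
        for a, b in combinations(sorted(tag_sets), 2)
    }


# Hypothetical sample data.
feeds = {
    "feed_a": ["#earthquake", "#japan", "#tsunami"],
    "feed_b": ["#earthquake", "#japan"],
    "feed_c": ["#football", "#worldcup"],
}
feed_sim = feed_similarity(feeds)
print(feed_sim[("feed_a", "feed_b")])  # shared: #earthquake, #japan
```

Each unordered pair is stored once, with the feed names in sorted order, because this measure is symmetric.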