Tag Based Clustering Algorithm

I am going to copy a lot of channels based on their tags. A typical example would be twitter. Each channel will have user-defined tags associated with it. By analyzing tags, you can copy channels into different groups and tell how many feeds are based on so many tags. An example would be:

  • Feed1 - Indonesia Earthquake #earthquake #asia #bad
  • Feed2 - In my area a large earthquake #earthquake #bad
  • Feed3 - My parents went to Singapore #asia #tour
  • Feed4 - XYZ Fires Many People #XYZ #layoff #bear
  • Feed5 - XYZ is getting poorly planning to lay off #XYZ #layoff #bad
  • Feed6 - XYZ is in dismissal mode #layoff #XYZ #worst

After clustering

  • #asia, #earthquake - Feed1, Feed2
  • #XYZ, #fire - Feed4, Feed 5, Feed6

Here, clustering is found based only on tags. Is there a good algorithm to achieve this

+4
source share
2 answers

If I understand your question correctly, you want to group the tags together and then put the channels in these clusters based on the tags in the feed.

To do this, you can create a similarity dimension between tags based on the number of feeds that tags appear together. For your example, it will be something like this

#earthquake | #asia | #bad | ... #earthquake 1 | 1/2 | 2/2 #asia 1/2 | 1 | 1/2 #bad 2/3 | 1/3 | 1 ... 

Here the value in (i,j) is equal to frequency of (i,j)/frequency of (i) .

Now you have a similarity matrix between the tags, and you can almost any clustering algorithm that suits your needs. Since the number of tags can be very large and it is difficult to estimate the number of clusters before running the algorithm, I would suggest using some hierarchical clustering algorithm, such as Fast Modularity clustering, which is also very fast ( see some details here ). However, if you have an estimate of the number of clusters that you would like to split into this, then spectral clustering can also be useful ( see here for details ).

After merging tags together, you can use a simple approach to assign each ribbon to a cluster. It can be very simple, for example, counting the number of tags from each cluster in the feed and assigning a cluster with the maximum number of matching tags.

If you are flexible in your clustering strategy, you can also try to combine joins together in a similar way, creating similarities between feeds based on the number of common tags between feeds and then applying a clustering algorithm to the similarity matrix.

+5
source

Interest Ask. I am doing everything here, but I think it will work.

Algorithm

For each feed, display a complete list of tag combinations (lengths> = 2), which are probably sorted for consistency. For instance:

  • Feed1: (asia-bad), (asia-earthquake), (bad earthquake), (asia-bad earthquake)
  • Feed2: (bad earthquake)
  • Feed3: (asia tour)
  • Feed4: (bear retention), (bear-XYZ), (dismissal-XYZ), (bear-dismissal-XYZ)
  • Feed5: (bad dismissal), (bad-XYZ), (dismissal-XYZ), (bad-layoff-XYZ)
  • Feed6: (slaughter-worst), (dismissal-XYZ), (worst-XYZ), (dismissal-worst-XYZ)

Then change the display:

  • (asia-bad): Feed1
  • (asia earthquake): Feed1
  • (bad earthquake): Feed1, Feed2
  • (Asia is a bad earthquake): Feed1
  • (asia tour): Feed3
  • (bear carry): Feed4
  • ...
  • (dismissal-XYZ): Feed4, Feed5, Feed6
  • ...

Then you can select all records with a frequency above a certain threshold. In this case, if we take the frequency threshold 2, you will get (a bad earthquake) with Feed1 and Feed2 and (firing-XYZ) with Feed4, Feed5 and Feed6.

Performance issues

A naive implementation of this will have extremely low performance - an exponential number of tags per channel (not to mention space requirements). However, there are various ways to apply heuristics to improve this. For instance:

  • Identify the most popular X tags by scanning all channels (or random selection of X-channels) - this is linear in the number of tags per channel. Then consider only the Y most popular tags for each feed.
  • Determine the frequency of all (or most) tags. Then for each post, consider only the X most popular tags in this post. This prevents situations where you have, say, fifteen tags for some post, which leads to a huge list of combinations, most of which will never happen.
  • For each message, consider only combinations of length & lt = X. For example, if there were fifteen tags in the feed, you could get a huge number of combinations, but most of them would have very few cases, especially long ones. Therefore, we consider only combinations of two or three tags.
  • Check only random selection of X-channels.

Hope this helps!

+2
source

All Articles