Blindly classify new trends in incoming data

How do news aggregators, such as Google News, automatically classify and rank documents on emerging topics, such as "obama 2011 budget"?

I have a bunch of articles tagged with baseball data, such as player names and relevance to the article (thanks, opencalais), and I would really like to build a Google News-style interface that ranks and displays new posts as they arrive, especially on emerging topics. I believe a naive Bayes classifier could be trained on some static categories, but that does not really allow tracking trends such as "this player has just been traded to this team, and these other players were also involved."

2 answers

Google News undoubtedly uses other tricks as well (or a combination of them), but one computationally cheap way to infer topics from free text relies on the NLP notion that a word gets its meaning only in connection with other words.
An algorithm capable of discovering new topic categories from several documents could be outlined as follows:

  • POS (part-of-speech) tag the text
    We probably want to focus more on nouns and perhaps even more on named entities (such as Obama or New England).

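  • Normalize the text
    In particular, replace inflected words with their common stem, and perhaps replace some adjectives with the corresponding Named Entity (e.g. Parisian ==> Paris, legal ==> law).
    Also remove noise words and noise expressions.
  • Identify words from a manually maintained list of "current / recurring hot words" (Superbowl, Elections, scandal...)
    These can be used in later steps to give more weight to some N-grams.
  • Enumerate all N-grams found in each document (where N ranges from 1 to, say, 4 or 5)
    Count separately the number of occurrences of each N-gram within a given document and the number of documents that contain it.
  • The most frequently cited N-grams (i.e. those cited in the most documents) are probably the topics.
  • Identify the existing topics (from a list of known topics)
  • [Optionally] review the new topics manually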
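
The N-gram counting at the heart of the outline above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not how Google News works: the `STOP_WORDS`, `HOT_WORDS` and `KNOWN_TOPICS` lists, the function names, the `min_docs` threshold and the scoring formula (document count times phrase length, doubled for hot words) are all made up for the example. POS tagging and stemming are left out because they would need an NLP library.

```python
import re
from collections import Counter

# Hypothetical, hand-maintained lists used only for this illustration.
STOP_WORDS = {"the", "a", "an", "to", "of", "in", "and", "for", "on", "is", "by"}
HOT_WORDS = {"superbowl", "elections", "scandal"}
KNOWN_TOPICS = {"world series"}


def tokenize(text):
    """Lowercase the text, split on non-alphanumerics, drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def ngrams(tokens, max_n=4):
    """Yield every N-gram (as a space-joined string) for N = 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def candidate_topics(documents, min_docs=2, hot_boost=2.0, max_n=4):
    """Return (ngram, doc_count, total_count, score) tuples, best first."""
    total_count = Counter()  # occurrences of each N-gram over all documents
    doc_count = Counter()    # number of documents containing each N-gram
    for text in documents:
        grams = list(ngrams(tokenize(text), max_n))
        total_count.update(grams)
        doc_count.update(set(grams))  # set(): count each document once

    scored = []
    for gram, docs in doc_count.items():
        # Skip rare N-grams and topics that are already known.
        if docs < min_docs or gram in KNOWN_TOPICS:
            continue
        words = gram.split()
        # Assumed heuristic: longer phrases shared by many documents
        # are more topic-like than single words.
        score = docs * len(words)
        if any(w in HOT_WORDS for w in words):
            score *= hot_boost
        scored.append((gram, docs, total_count[gram], score))

    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda item: (-item[3], item[0]))
    return scored
```

Called on a small batch such as `["Obama unveils 2011 budget proposal", "Congress reacts to the Obama 2011 budget", "Yankees win again"]`, the top candidate is "2011 budget", since it is the longest phrase that appears in two documents. Because stop words are removed before the N-grams are built, a phrase may span words that were not adjacent in the original text.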

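This general recipe can also be adapted to take other attributes of the documents into account. For example, the origin of a document (say, cnn/sports vs. cnn/policy...) can be used to select a domain-specific lexicon.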


Google Google:
