Undoubtedly, Google News may use other tricks (or even a combination of them), but one relatively cheap trick, computationally speaking, is to derive topics from free text using an idea from NLP: a word acquires its meaning only when connected to other words.
An algorithm capable of discovering new topic categories from a set of documents can be outlined as follows:
- POS (part-of-speech) tag the text
We probably want to focus more on nouns and perhaps even more on named entities (such as Obama or New England).
- Normalize words where possible, for example by replacing an adjective with the corresponding Named Entity (e.g. Parisian ==> Paris, legal ==> ...)
- Look for words from a list of topical terms (Superbowl, Elections, scandal...)
- Enumerate the N-grams found in each document (with N from 1 to, say, 4 or 5)
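Count how often each N-gram occurs within a document and, separately, how many documents contain it; the N-grams found in the most documents are the likely topics.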
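The N-gram part of the outline above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method Google News actually uses: the `STOPWORDS` and `HOT_WORDS` lists, the function names, the `hot_boost` weight and the "at least two documents" cutoff are all hypothetical, and POS tagging and named-entity normalization are left out because they would need an NLP library.

```python
import re
from collections import Counter

# Hypothetical, hand-maintained lists; real ones would be far larger.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is", "at", "for"}
HOT_WORDS = {"superbowl", "elections", "scandal"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def ngrams(tokens, max_n=3):
    """Yield every N-gram (as a tuple of tokens) for N = 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def count_ngrams(documents, max_n=3):
    """Return (per-document N-gram counts, number of documents per N-gram)."""
    term_freq = []
    doc_freq = Counter()
    for text in documents:
        counts = Counter(ngrams(tokenize(text), max_n))
        term_freq.append(counts)
        doc_freq.update(counts.keys())  # each document counts once per N-gram
    return term_freq, doc_freq


def candidate_topics(documents, max_n=3, hot_boost=2.0, top_k=5):
    """Rank N-grams by document frequency, boosting those with a hot word."""
    _, doc_freq = count_ngrams(documents, max_n)
    scores = {}
    for gram, df in doc_freq.items():
        if df < 2:  # assumption: a topic must appear in at least two documents
            continue
        boost = hot_boost if HOT_WORDS & set(gram) else 1.0
        scores[gram] = df * boost
    # Highest score first; on ties prefer longer N-grams, then alphabetical.
    ranked = sorted(scores, key=lambda g: (-scores[g], -len(g), g))
    return [" ".join(g) for g in ranked[:top_k]]


if __name__ == "__main__":
    docs = [
        "Elections scandal rocks the capital",
        "New poll ahead of the elections",
        "Superbowl ads draw record audience",
        "Elections scandal widens",
    ]
    print(candidate_topics(docs))
```

Because stopwords are dropped before the N-grams are built, an N-gram here may span a removed word. That keeps the sketch short, but a real pipeline might prefer to break N-grams at those positions.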
Other attributes of a document, such as its origin (say, cnn/sports vs. cnn/policy...), can also be taken into account.