How does Google News automatically categorize articles into Tech / Science / Health / Entertainment / etc.?

Let's say I pick a random source such as CNN. Would it be better to automatically sort the scraped articles into categories based on keywords, or to scrape specific parts of the website for different categories, for example cnn.com/tech or /entertainment? The second option does not scale well: I would not want to manually configure URLs for every source. How does Google News solve this problem?

algorithm machine-learning web-scraping google-news
1 answer

Here is a 2005 Google patent:

"Systems and methods for improving the ranking of news articles"

And an update from 2012:

SYSTEMS AND METHODS FOR IMPROVING THE RANKING OF NEWS

If you want to create a simple system yourself, I would do something like this:

Take a bunch of news stories that are already labeled as sports / technology / whatever.

Tokenize them into separate words and grams (short sequences of words).

Create a really large table with unique words and grams as columns and separate stories as rows:

StoryId  Class   word1  word2  gram1  gram2  ...
1        sports  0      0.2    0.01   0
2        tech    0.5    0.01   0      0.3
3        sports  0      0.1    0.3    0.01

Here the values in the cells represent the frequency, binary occurrence, or TF-IDF score of the words and grams in the documents.
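As a rough illustration of the tokenizing and table-building steps, here is a minimal sketch in plain Python. The whitespace tokenizer, the use of 2-word grams only, the particular TF-IDF variant (relative term frequency times log(N / document frequency)), and the toy stories are all assumptions made for the example, not something taken from the patents.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase, split on whitespace, and return words plus 2-word grams."""
    words = [w.strip(".,!?;:\"'()") for w in text.lower().split()]
    words = [w for w in words if w]
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + bigrams

def tfidf_table(stories):
    """Return (columns, rows): one row of TF-IDF scores per story."""
    counts = [Counter(tokenize(text)) for text in stories]
    columns = sorted(set().union(*counts))
    n = len(stories)
    # Document frequency: in how many stories each term appears.
    df = {term: sum(1 for c in counts if term in c) for term in columns}
    rows = []
    for c in counts:
        total = sum(c.values()) or 1  # avoid dividing by zero on empty stories
        rows.append([(c[t] / total) * math.log(n / df[t]) for t in columns])
    return columns, rows

# Hypothetical toy stories, invented for illustration.
stories = [
    "The team won the match",
    "New phone chip released",
    "The team lost the final match",
]
columns, rows = tfidf_table(stories)
```

Each entry of `rows` lines up with `columns`, so `rows[1][columns.index("phone")]` is the score of "phone" in the second story. A term that appears in every story gets a score of 0 under this variant, which is the usual reason for preferring TF-IDF over raw counts.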

Use a classification algorithm such as Naive Bayes or Support Vector Machines to learn the weights of the columns relative to the class labels. The result is your model.

When you receive a new, unclassified document, tokenize it the same way as before and apply the previously created model; it will give you the most likely class label for the document.
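To make the train-then-classify loop concrete, here is a small sketch of a multinomial Naive Bayes classifier in plain Python. It works on raw term counts rather than TF-IDF, uses add-one smoothing, and skips terms never seen in training; those choices, the function names `train` and `classify`, and the four training stories are assumptions for illustration only. A real system would more likely use a library implementation and far more data.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase, split on whitespace, and return words plus 2-word grams."""
    words = [w.strip(".,!?;:\"'()") for w in text.lower().split()]
    words = [w for w in words if w]
    return words + [f"{a} {b}" for a, b in zip(words, words[1:])]

def train(labeled_stories):
    """Build the model: per-class term counts, per-class story counts, vocabulary."""
    term_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in labeled_stories:
        term_counts[label].update(tokenize(text))
        class_counts[label] += 1
    vocab = set()
    for counts in term_counts.values():
        vocab.update(counts)
    return term_counts, class_counts, vocab

def classify(model, text):
    """Return the class label with the highest log-probability for the text."""
    term_counts, class_counts, vocab = model
    n_stories = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        total = sum(term_counts[label].values())
        score = math.log(class_counts[label] / n_stories)  # class prior
        for term in tokenize(text):
            if term not in vocab:
                continue  # ignore terms never seen in training
            # Add-one (Laplace) smoothing so unseen terms do not zero out a class.
            score += math.log((term_counts[label][term] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled stories, invented for illustration.
training = [
    ("The team won the match", "sports"),
    ("The striker scored a late goal", "sports"),
    ("New phone chip released", "tech"),
    ("The laptop ships with a faster chip", "tech"),
]
model = train(training)
print(classify(model, "A faster phone chip"))  # prints "tech"
```

The important detail is that `classify` reuses the same `tokenize` function as `train`; if the new document is tokenized differently, its terms will not match the columns the model learned.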

Here is my video series that includes a video on automatic document classification:

http://vancouverdata.blogspot.ca/2010/11/text-analytics-with-rapidminer-loading.html
