News Articles Data Sets

I am working on a news classification project. The system will classify news articles into predetermined topics (for example, sports, politics, international). To build it, I need free datasets for training.

So far, after a few hours of googling and following links from here, the only suitable dataset I could find is this one. I hope it will be enough, but I would still like to find more.

Note that the datasets I am looking for should:

  • Contain full news articles, not just titles
  • Be in English
  • Be in .txt format, not XML or database format

Can someone help me?

+5
2 answers

Have you tried Reuters-21578? It is one of the most commonly used datasets for text classification. It is distributed in SGML, but it is fairly simple to parse and convert to .txt format.
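
As a rough illustration of that conversion, here is a minimal Python sketch. It assumes the usual Reuters-21578 layout, where each article sits in a `<REUTERS>` element with its topic labels in `<TOPICS><D>...</D></TOPICS>` and its text in `<TITLE>` and `<BODY>`. The sample string is invented for the example, and the function names and output file naming are hypothetical choices, not part of the dataset.

```python
import html
import re
from pathlib import Path

# Invented sample that imitates the assumed layout of a Reuters-21578 .sgm file.
SAMPLE = """<REUTERS TOPICS="YES" NEWID="1">
<TOPICS><D>cocoa</D><D>trade</D></TOPICS>
<TEXT>
<TITLE>EXAMPLE COCOA HEADLINE</TITLE>
<BODY>Invented body text about cocoa prices.
Acme &lt;ACME> raised its forecast.</BODY>
</TEXT>
</REUTERS>
<REUTERS TOPICS="NO" NEWID="2">
<TOPICS></TOPICS>
<TEXT TYPE="BRIEF">
<TITLE>EXAMPLE BRIEF WITHOUT BODY</TITLE>
</TEXT>
</REUTERS>
"""

ARTICLE_RE = re.compile(r"<REUTERS[^>]*>(.*?)</REUTERS>", re.DOTALL)
TOPICS_RE = re.compile(r"<TOPICS>(.*?)</TOPICS>", re.DOTALL)
TOPIC_RE = re.compile(r"<D>(.*?)</D>")
TITLE_RE = re.compile(r"<TITLE>(.*?)</TITLE>", re.DOTALL)
BODY_RE = re.compile(r"<BODY>(.*?)</BODY>", re.DOTALL)


def parse_articles(sgml):
    """Return a list of dicts (topics, title, body) for articles that have a body."""
    articles = []
    for match in ARTICLE_RE.finditer(sgml):
        chunk = match.group(1)
        body = BODY_RE.search(chunk)
        if body is None:
            continue  # skip entries that carry only a title
        topics_block = TOPICS_RE.search(chunk)
        topics = TOPIC_RE.findall(topics_block.group(1)) if topics_block else []
        title = TITLE_RE.search(chunk)
        articles.append({
            "topics": topics,
            # html.unescape turns entities such as &lt; back into characters
            "title": html.unescape(title.group(1)).strip() if title else "",
            "body": html.unescape(body.group(1)).strip(),
        })
    return articles


def write_txt(articles, out_dir):
    """Write one .txt file per article, named by index and first topic."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, art in enumerate(articles):
        label = art["topics"][0] if art["topics"] else "unlabeled"
        path = out / f"{i:05d}_{label}.txt"
        path.write_text(art["title"] + "\n\n" + art["body"], encoding="utf-8")


if __name__ == "__main__":
    write_txt(parse_articles(SAMPLE), "reuters_txt")
```

For the real files you would read each `.sgm` file into a string first. Opening them with `encoding="latin-1"` is a guess on my part that avoids decode errors, so check it against the dataset's README.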

+1

You can build one yourself: write a Python, Perl, or PHP script that runs searches for news articles and then extracts the attributes you need from the results with regular expressions. I think this is the best option. It's not easy, but it should be fun, and in the end you can share the dataset with us.
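
To give an idea of the extraction step, here is a minimal Python sketch that pulls a title and body out of an already downloaded page with regular expressions. The page markup (headline in `<h1>`, text in `<p>`) and the names `extract_article` and `clean` are assumptions made up for the example. Real sites differ, and regexes on HTML are fragile, so treat this as a starting point only.

```python
import html
import re

# Invented page; a real script would download this text from a news site.
SAMPLE_PAGE = """<html><head><title>Example News</title></head><body>
<h1 class="headline">Local team wins <em>final</em></h1>
<p>Goals from Smith &amp; Jones sealed the win.</p>
<p class="byline"></p>
<p>Fans celebrated   in the
city centre.</p>
</body></html>
"""

# \b keeps <p ...> from also matching tags such as <pre>.
TITLE_RE = re.compile(r"<h1\b[^>]*>(.*?)</h1>", re.DOTALL | re.IGNORECASE)
PARA_RE = re.compile(r"<p\b[^>]*>(.*?)</p>", re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def clean(fragment):
    """Strip inner tags, decode entities, and collapse whitespace."""
    text = TAG_RE.sub("", fragment)
    return " ".join(html.unescape(text).split())


def extract_article(page):
    """Return the headline and the non-empty paragraphs of one page."""
    title = TITLE_RE.search(page)
    paragraphs = [clean(p) for p in PARA_RE.findall(page)]
    return {
        "title": clean(title.group(1)) if title else "",
        "body": "\n\n".join(p for p in paragraphs if p),
    }


if __name__ == "__main__":
    article = extract_article(SAMPLE_PAGE)
    print(article["title"])
    print(article["body"])
```

Each extracted article can then be saved as a .txt file under a folder named after its topic, which gives you labeled training data.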

0