Where can I find a dump of raw text on the Internet?

I want to do some text analysis in a program I am writing. I am looking for alternative sources of raw text, similar to what is contained in the Wikipedia dumps (download.wikimedia.com).

That way I would not have to deal with crawling websites, parsing HTML, extracting text, and so on.

+5
3 answers

What kind of text are you looking for?

Many free e-books (fiction and non-fiction) are available in .txt format at Project Gutenberg.

They also offer DVD images, if you would rather not download the books one at a time.
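One thing to watch for with these plain-text files: they usually wrap the book in license text, delimited by `*** START OF ...` and `*** END OF ...` marker lines. Here is a rough sketch of trimming that off before analysis. The function name `strip_gutenberg_boilerplate` and the sample string are made up for illustration, and the marker wording varies between files, so treat the patterns as an assumption to adjust rather than a guaranteed format.

```python
import re

# Marker lines commonly seen in Project Gutenberg plain-text files.
# The exact wording differs between files, so the patterns are kept loose.
_START = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.MULTILINE | re.IGNORECASE,
)
_END = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.MULTILINE | re.IGNORECASE,
)


def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines.

    If a marker is missing, that side of the text is left untouched.
    """
    start = _START.search(text)
    end = _END.search(text)
    begin = start.end() if start else 0
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()


# Hypothetical miniature file, standing in for a downloaded .txt book.
sample = (
    "Some license header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Call me Ishmael.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Some license footer\n"
)
print(strip_gutenberg_boilerplate(sample))
```

If neither marker is found, the text comes back unchanged apart from surrounding whitespace, so it should be safe to run over files from other sources too.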

+7

NLTK provides a Python API to many corpora.

>>> from nltk.corpus import brown
>>> brown.words()
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
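Once you have a token list like that, a simple word-frequency count is only a few lines. This is a sketch, not anything NLTK-specific: `top_words` is a name invented here, and the short `tokens` list stands in for `brown.words()` so the snippet runs without NLTK. To use the real corpus you would probably need to fetch it once with `nltk.download('brown')` first.

```python
from collections import Counter


def top_words(tokens, n=3):
    """Count case-folded alphabetic tokens and return the n most common."""
    counts = Counter(t.lower() for t in tokens if t.isalpha())
    return counts.most_common(n)


# Stand-in token list; with NLTK installed and the corpus downloaded,
# brown.words() could be passed in instead.
tokens = ["The", "jury", "said", "the", "election", "was", "fair", ".", "The", "end"]
print(top_words(tokens, 1))
```

Punctuation tokens are dropped by the `isalpha()` check, which also drops numbers and hyphenated words, so loosen that filter if those matter for your analysis.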
+3

Project Gutenberg has a huge number of e-books in various formats (including plain text).

0