I want to do some text analysis in the program I am writing. I am looking for alternative sources of text in its original form, similar to what is contained in Wikipedia dumps (download.wikimedia.com).
I would rather not have to crawl websites, parse HTML, extract the text, and so on.
What text are you looking for?
Many free e-books (fiction and non-fiction) are available in .txt format at Project Gutenberg.
Project Gutenberg also offers DVD images of its collection.
NLTK provides a Python API to many text corpora.
>>> from nltk.corpus import brown
>>> brown.words()
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
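Once a corpus is available as a list of tokens, as with brown.words() above, a frequency count is a common first analysis step. The following is only a minimal sketch using the standard library: the helper name top_words is hypothetical, and the short sample list is a stand-in assumption so the snippet runs without NLTK or its data installed.

```python
from collections import Counter


def top_words(tokens, n=3):
    """Return the n most common alphabetic tokens (lowercased) with their counts."""
    # Lowercase so 'The' and 'the' are counted together; skip punctuation and numbers.
    counts = Counter(t.lower() for t in tokens if t.isalpha())
    return counts.most_common(n)


# Stand-in for brown.words() (an assumption for illustration);
# swap in the real corpus if NLTK and its data are installed.
sample = ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said',
          'the', 'jury', 'said', 'the']

print(top_words(sample))
```

With a real corpus you would probably also want to filter out stop words before counting, since words like "the" tend to dominate the top of the list.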
Project Gutenberg has a huge number of e-books in various formats (including plain text).
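One practical detail if you go the plain-text route: Project Gutenberg files usually wrap the book in a license header and footer, delimited by lines beginning with "*** START OF" and "*** END OF". Here is a rough sketch of stripping them before analysis. The function name strip_gutenberg_boilerplate is hypothetical, and the marker pattern is an assumption based on commonly seen files; the exact wording varies between files, so check it against the ones you download.

```python
import re

# Marker lines as commonly seen in Project Gutenberg plain-text files.
# The wording varies ("THE" vs "THIS", optional space after the asterisks),
# so these patterns are deliberately loose and may still miss some files.
_START = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
_END = re.compile(
    r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines.

    If a marker is missing, fall back to the start or end of the input.
    """
    start = _START.search(text)
    end = _END.search(text)
    begin = start.end() if start else 0
    stop = end.start() if end else len(text)
    return text[begin:stop].strip()


# Tiny made-up sample for illustration (not a real e-book).
sample_book = (
    "Header and license notes\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "\n"
    "Call me Ishmael.\n"
    "\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "More license text\n"
)

print(strip_gutenberg_boilerplate(sample_book))
```

For a real file, read it with open(path, encoding="utf-8") first; older files may use a different encoding, so that is worth checking too.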