NLP: Building (small) corpora, or "Where to get lots of not-too-specialized English-language text files?"

Does anyone have a suggestion for where to find archives or collections of everyday English text for use in a small corpus? I have been using Project Gutenberg books for a working prototype and would like to incorporate more contemporary language. A recent answer pointed indirectly to a huge archive of Usenet movie reviews, which had not occurred to me and is very good. For this particular program, technical Usenet archives or programming mailing lists would tilt the results and be harder to analyze, but any kind of general blog text, chat transcripts, or anything else that has been useful to others would help. A partial or downloadable research corpus that is not too heavily marked up, some heuristic for finding a suitable subset of Wikipedia articles, or any other idea would also be much appreciated.

(By the way, I am being a good citizen about downloading: I use a deliberately slow script that is not demanding on the servers hosting such material, in case you see a moral hazard in pointing me to something huge.)

UPDATE: User S0rin points out that Wikipedia asks people not to crawl it and provides an export tool instead. Project Gutenberg also has a stated robot policy. The bottom line is: try not to crawl, but if you need to: "Configure your robot to wait at least 2 seconds between requests."
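For anyone who does need to fetch pages one by one, here is a minimal sketch of the kind of deliberately slow script described above. It is not the script I actually used. The class name `PoliteFetcher`, the User-Agent string, and the injectable `clock`/`sleep` parameters are all assumptions made for illustration; only the 2-second minimum comes from the policy quoted above.

```python
import time
import urllib.request

# Minimum pause taken from the "wait at least 2 seconds" guidance quoted above.
MIN_DELAY_SECONDS = 2.0


class PoliteFetcher:
    """Fetch URLs one at a time, pausing between consecutive requests."""

    def __init__(self, min_delay=MIN_DELAY_SECONDS,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self._clock = clock          # injectable so the pacing can be tested
        self._sleep = sleep
        self._last_request = None    # time of the previous request, if any

    def wait_turn(self):
        """Sleep until at least min_delay has passed since the last request."""
        if self._last_request is not None:
            remaining = self.min_delay - (self._clock() - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
        self._last_request = self._clock()

    def fetch(self, url, timeout=30):
        """Return the raw bytes at url, after waiting for our turn."""
        self.wait_turn()
        request = urllib.request.Request(
            url,
            # Hypothetical identifier; use something that names you or your project.
            headers={"User-Agent": "small-corpus-builder/0.1"},
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
```

Calling `PoliteFetcher().fetch(url)` in a loop over a list of URLs keeps the requests at least two seconds apart, however quickly each download finishes.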

UPDATE 2: The Wikipedia dumps are the way to go; thanks to the answerers who pointed them out. I ended up using the English version from here: http://download.wikimedia.org/enwiki/20090306/ and a Spanish dump about half the size. They take some work to clean up, but it is well worth it, and they contain a lot of useful data in the links.
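To give an idea of the cleanup involved, here is a rough sketch of streaming through a dump and pulling out each page's title, raw wikitext, and link targets. It is an illustration rather than my actual cleanup code, and it makes some assumptions: that the dump is the bz2-compressed XML export with `<page>`, `<title>` and `<text>` elements, and that one revision per page is present. The function names are made up, and the link regex is deliberately crude (it ignores templates and nested markup).

```python
import bz2
import re
import xml.etree.ElementTree as ET

# Crude pattern: captures the target of [[Target]], [[Target|label]], [[Target#Section]].
WIKILINK = re.compile(r"\[\[([^\]\[|#]+)")


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) per <page> without loading the whole dump."""
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # release the text of pages already handled


def link_targets(wikitext):
    """Return the page names that the given wikitext links to."""
    return [match.group(1).strip() for match in WIKILINK.finditer(wikitext)]


def pages_from_dump(path):
    """Convenience wrapper for a .xml.bz2 dump file on disk."""
    with bz2.open(path, "rb") as dump:
        yield from iter_pages(dump)
```

Stripping the namespace from each tag avoids hard-coding the export schema version, which differs between dumps. Turning the wikitext into plain sentences is a separate and messier step that this sketch does not attempt.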


+5
7

, : HTML- , RSS-.

, LDC .

+8

. API , , , . wget.

, RSS-. RSS, HTML- .

/ Usenet : AOLbonics Techspeak, .

Penn Treebank British National Corpus, . Corpora . , , Web Corpus.

+4

, , , Penn Treebank.

0

You can get quotations content (in limited form) here: http://quotationsbook.com/services/

This content is also hosted on Freebase.

0