I think you will need to do one of the following:
- a) analyze an existing list of the English words on Wiktionary that was extracted from a database dump, or
- b) load the database dump itself (and not just the headers) and extract the terms yourself (a rough sketch of this is at the end of this answer).
I only tried option a), because option b) would mean downloading several GB of data. It's very simple; in fact, I am including a quick JS implementation that you can use as a base for your own script in your preferred language.
var baseURL="http://en.wiktionary.org/wiki/Index:English/" var letters=['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'] for(i=0;i<letters.length;i++) { var letter = letters[i]; console.log(letter); $.get(baseURL+letter, function(response) { $(response).find('ol li a').each( function (k,v) { console.log(v.text) }) }) }
**EDIT:** I was very curious about this, so I wrote a Python script. Just in case someone finds it useful:
```python
from lxml.cssselect import CSSSelector
from lxml.html import fromstring
import urllib2

url = 'http://en.wiktionary.org/wiki/Index:English/'
letters = ['a','b','c','d','e','f','g','h','i','j','k','l','m',
           'n','o','p','q','r','s','t','u','v','w','x','y','z']

for l in letters:
    # Send a browser-like User-Agent so the request is not rejected
    req = urllib2.Request(url + l, headers={'User-Agent': "Magic Browser"})
    con = urllib2.urlopen(req)
    response = con.read()
    h = fromstring(response)
    # Every entry on an index page is a link inside an <ol> list item
    sel = CSSSelector("ol li a")
    for x in sel(h):
        print x.text.encode('utf-8')
```
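(The script is Python 2, which is where `urllib2` lives; redirect its output to a file to collect the full word list.)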
I would paste the results to Pastebin myself, but its 500k character limit won't allow it.
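For completeness, here is roughly what option b) could look like. This is an untested sketch: it assumes a locally downloaded and unpacked dump named `enwiktionary-latest-pages-articles.xml` (the actual filename may differ), and it uses the presence of an `==English==` section heading to skip the non-English entries that the English Wiktionary also hosts:

```python
# Untested sketch of option b): stream-parse a multi-GB MediaWiki dump
# and print main-namespace titles whose page text has an English section.
from xml.etree.cElementTree import iterparse

title, text = None, ''
for event, elem in iterparse('enwiktionary-latest-pages-articles.xml'):
    tag = elem.tag.rsplit('}', 1)[-1]   # drop the MediaWiki XML namespace
    if tag == 'title':
        title = elem.text
    elif tag == 'text':
        text = elem.text or ''
    elif tag == 'page':
        # Titles containing a colon belong to other namespaces
        # (Category:, Appendix:, Wiktionary:, ...), so skip them
        if title and ':' not in title and '==English==' in text:
            print title.encode('utf-8')
        title, text = None, ''
        elem.clear()                    # keep memory bounded while streaming
```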