I think you will need to do one of the following:
- a) analyze an existing list of the English words on Wiktionary that was extracted from a database dump, or
- b) load the database dump itself (and not just the headers) and extract the terms yourself (a rough sketch of this is at the end of this answer).
I only tried option a), because option b) would mean downloading several GB of data. It's very simple; in fact, I am including a quick JS implementation that you can use as a base for your own script in your preferred language.
var baseURL="http://en.wiktionary.org/wiki/Index:English/" var letters=['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'] for(i=0;i<letters.length;i++) { var letter = letters[i]; console.log(letter); $.get(baseURL+letter, function(response) { $(response).find('ol li a').each( function (k,v) { console.log(v.text) }) }) }
**EDIT:** I was very curious about this, so I wrote a Python script. Just in case someone finds it useful:
```python
from lxml.cssselect import CSSSelector
from lxml.html import fromstring
import urllib2

url = 'http://en.wiktionary.org/wiki/Index:English/'
letters = ['a','b','c','d','e','f','g','h','i','j','k','l','m',
           'n','o','p','q','r','s','t','u','v','w','x','y','z']

for l in letters:
    # Send a browser-like User-Agent so the request is not rejected
    req = urllib2.Request(url + l, headers={'User-Agent': "Magic Browser"})
    con = urllib2.urlopen(req)
    response = con.read()
    h = fromstring(response)
    # Every entry on an index page is a link inside an <ol> list item
    sel = CSSSelector("ol li a")
    for x in sel(h):
        print x.text.encode('utf-8')
```
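(The script is Python 2, which is where `urllib2` lives; redirect its output to a file to collect the full word list.)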
I would paste the results to Pastebin myself, but its 500k character limit won't allow it.
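For completeness, here is roughly what option b) could look like. This is an untested sketch: it assumes a locally downloaded and unpacked dump named `enwiktionary-latest-pages-articles.xml` (the actual filename may differ), and it uses the presence of an `==English==` section heading to skip the non-English entries that the English Wiktionary also hosts:

```python
# Untested sketch of option b): stream-parse a multi-GB MediaWiki dump
# and print main-namespace titles whose page text has an English section.
from xml.etree.cElementTree import iterparse

title, text = None, ''
for event, elem in iterparse('enwiktionary-latest-pages-articles.xml'):
    tag = elem.tag.rsplit('}', 1)[-1]   # drop the MediaWiki XML namespace
    if tag == 'title':
        title = elem.text
    elif tag == 'text':
        text = elem.text or ''
    elif tag == 'page':
        # Titles containing a colon belong to other namespaces
        # (Category:, Appendix:, Wiktionary:, ...), so skip them
        if title and ':' not in title and '==English==' in text:
            print title.encode('utf-8')
        title, text = None, ''
        elem.clear()                    # keep memory bounded while streaming
```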