Easy way to get page titles in only one language?

I can easily get a dump with all the page titles in Wiktionary, but this dump contains every word, including non-English ones.

For example, you will find souris (mouse in French): https://en.wiktionary.org/wiki/souris

Is there an easy way, or an existing script, to get only the titles in one particular language? I would like to get all the English words from Wiktionary, excluding those that do not exist in English.

So far, my only idea is to parse the page text and check whether the string ==English== exists, but that is too slow to be usable.

+7
3 answers

I think you will need to do one of the following:

  • a) parse an existing list of English words in Wiktionary that was extracted from a database dump.
  • b) download the full database dump (not just the titles) and extract the terms yourself.

I tried only option a), because option b) would mean downloading multiple GBs. It's very simple; in fact, here is a quick JS implementation that you can use as a base for your own script in your preferred language.

    // Assumes jQuery ($) is loaded on the page.
    var baseURL = "http://en.wiktionary.org/wiki/Index:English/";
    var letters = ['a','b','c','d','e','f','g','h','i','j','k','l','m',
                   'n','o','p','q','r','s','t','u','v','w','x','y','z'];

    for (var i = 0; i < letters.length; i++) {
        var letter = letters[i];
        console.log(letter);
        // Fetch the index page for this letter and log every linked word.
        $.get(baseURL + letter, function (response) {
            $(response).find('ol li a').each(function (k, v) {
                console.log(v.text);
            });
        });
    }

EDIT: I was very curious about this, so I wrote a Python script, just in case someone finds it useful:

    # Python 3; requires the lxml and cssselect packages.
    from urllib.request import Request, urlopen

    from lxml.cssselect import CSSSelector
    from lxml.html import fromstring

    url = 'http://en.wiktionary.org/wiki/Index:English/'
    letters = ['a','b','c','d','e','f','g','h','i','j','k','l','m',
               'n','o','p','q','r','s','t','u','v','w','x','y','z']
    sel = CSSSelector("ol li a")

    for l in letters:
        # Send a custom User-Agent header with the request.
        req = Request(url + l, headers={'User-Agent': "Magic Browser"})
        response = urlopen(req).read()
        h = fromstring(response)
        # Print the text of every link found inside an ordered list.
        for x in sel(h):
            print(x.text_content())

I would have posted the results on Pastebin myself, but its 500k size limit would not allow it.
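
For option b), a minimal sketch of the idea from the question (keep a title only if its page text contains ==English==) could stream the dump one page at a time instead of loading it whole. This is only an illustration under some assumptions: the function names and the `SAMPLE` data are made up, and the layout of `<page>` elements holding a `<title>` and a `<text>` is assumed to match the real dump.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a dump; the element layout is an assumption.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>mouse</title>
    <revision><text>==English==
===Noun===</text></revision></page>
  <page><title>souris</title>
    <revision><text>==French==
===Noun===</text></revision></page>
  <page><title>pain</title>
    <revision><text>==English==
----
==French==</text></revision></page>
</mediawiki>"""


def local_name(tag):
    """Strip an XML namespace prefix: '{ns}page' -> 'page'."""
    return tag.rsplit('}', 1)[-1]


def english_titles(xml_file, marker='==English=='):
    """Yield the title of every page whose text contains the marker."""
    for _event, elem in ET.iterparse(xml_file, events=('end',)):
        if local_name(elem.tag) != 'page':
            continue
        title, text = None, ''
        for child in elem.iter():
            name = local_name(child.tag)
            if name == 'title':
                title = child.text
            elif name == 'text':
                text = child.text or ''
        if title and marker in text:
            yield title
        # Drop the page's children once it has been inspected.
        elem.clear()


if __name__ == '__main__':
    for word in english_titles(io.BytesIO(SAMPLE)):
        print(word)
```

For a real dump, the same function should accept any binary file object, for example one returned by `bz2.open(path, 'rb')`, where `path` points to the downloaded file.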

+5

seran's solution and code examples were great, but I was not able to run his Python code.

I followed his example and wrote a Ruby version:

    #!/usr/bin/env ruby
    require 'net/http'
    require "rexml/document"

    url = 'http://en.wiktionary.org/wiki/Index:English/'

    ('a'..'z').to_a.each do |letter|
      response = Net::HTTP.get(URI(url + letter))
      doc = REXML::Document.new(response)
      REXML::XPath.each(doc, "//ol/li/a") do |element|
        puts element.text
      end
    end

0

Following @seran's answer, I created a GitHub Gist that does the same in Swift:

https://gist.github.com/ashleymills/549ab8aff05ec90f4350#file-wiktionaryfetcher-swift

0