Search and download images in a Wikipedia dump

I am trying to build an exhaustive list of all the images on Wikipedia, which I can then filter down to the ones that are in the public domain. I downloaded the SQL dumps from here:

http://dumps.wikimedia.org/enwiki/latest/

And studied the database schema:

http://upload.wikimedia.org/wikipedia/commons/thumb/4/42/MediaWiki_1.20_%2844edaa2%29_database_schema.svg/2193px-MediaWiki_1.20_%2844edaa2%29_database_schema.svg.png

I think I understand the schema, but when I pick a sample image from a Wikipedia page, I can't find it anywhere in the dumps. For example:

http://en.wikipedia.org/wiki/File:Carrizo_2a.JPG

I ran greps over the "image", "imagelinks", and "page" dumps looking for "Carrizo_2a.JPG", and it was not found.

Are these dumps incomplete? Am I misunderstanding the structure? Is there a better way to do this?

Also, to take it one step further: once I have filtered my list and want to download a large set of images (thousands), I saw several mentions that this needs to be done from a mirror site to avoid overloading Wikipedia/Wikimedia. Any guidance on that would also be helpful.

1 answer

MediaWiki stores file data in two or three places, depending on how you count:

  • The actual metadata for current file versions is stored in the image table. This is probably what you primarily want; you will find the latest en.wikipedia dump of it here (see the parsing sketch after this list).

  • Data for old, superseded file revisions is moved to the oldimage table, which has essentially the same structure as the image table. This table is also dumped; the latest one is here.

  • Finally, each file also (usually) corresponds to a more or less ordinary wiki page in namespace 6 (File:). You will find those in the XML dumps, the same as any other pages.
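
For instance, here is a rough Python sketch for pulling file names straight out of the image table dump without loading it into a database. The dump file name (enwiki-latest-image.sql.gz) matches the current naming on dumps.wikimedia.org, but the quick-and-dirty regex parsing is my own assumption; importing the dump into MySQL/MariaDB and querying it properly is more robust.

```python
import gzip
import re

# First quoted field of each (...) tuple in the INSERT statements, i.e. img_name.
ROW_START = re.compile(r"\('((?:[^'\\]|\\.)*)',")

def image_names(dump_path):
    """Yield file names from an image-table SQL dump (crude regex parsing)."""
    with gzip.open(dump_path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            if not line.startswith("INSERT INTO"):
                continue
            for match in ROW_START.finditer(line):
                # Undo the escaping MySQL uses inside the dump.
                yield match.group(1).replace("\\'", "'").replace("\\\\", "\\")

if __name__ == "__main__":
    wanted = "Carrizo_2a.JPG"
    names = set(image_names("enwiki-latest-image.sql.gz"))
    print(wanted, "found" if wanted in names else "not found")
    print(len(names), "file names in total")
```

Note that a file like Carrizo_2a.JPG will still not show up here, for the reason explained below.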

Oh, and the reason you don't find the file you linked to in the English Wikipedia dumps is that it is hosted on the shared Wikimedia Commons repository. You will find it in the Commons data dumps instead.

As for downloading the actual files, here is the (apparently) official documentation. As far as I can tell, all that the note "Bulk download is currently (as of September 2012) available from mirrors but not offered directly from Wikimedia servers" means is that if you want all the images in a tarball, you will have to use a mirror. If you are only pulling a relatively small subset of the millions of images on Wikipedia and/or Commons, it should be fine to use the Wikimedia servers directly.

Just remember to exercise basic courtesy: send a user-agent string identifying yourself, and don't hit the servers too hard. In particular, I would recommend running the downloads sequentially, so that you only start fetching the next file after you have finished the previous one. Not only is that easier to implement than parallel downloading anyway, it also ensures that you do not take more than your share of bandwidth and lets the download speed more or less automatically adapt to server load.
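
A minimal sequential downloader along those lines might look like the sketch below. The user-agent string, the one-second delay, and the example URL are placeholders of my own; adjust them to identify yourself and to use whatever pause you consider polite.

```python
import time
import urllib.request

# Identify yourself; the name and contact address here are placeholders.
HEADERS = {"User-Agent": "WikiImageFetcher/0.1 (contact: you@example.com)"}

def fetch_sequentially(urls, delay_seconds=1.0):
    """Download each URL in turn, one at a time, pausing between requests."""
    for url in urls:
        request = urllib.request.Request(url, headers=HEADERS)
        local_name = url.rsplit("/", 1)[-1]
        with urllib.request.urlopen(request) as response, open(local_name, "wb") as out:
            out.write(response.read())      # finish this file before starting the next
        time.sleep(delay_seconds)           # brief pause so we don't hammer the servers

fetch_sequentially([
    "http://upload.wikimedia.org/wikipedia/en/a/ab/File_name.jpg",  # example URL from below
])
```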

Ps. Whether you download the files from a mirror or directly from the Wikimedia servers, you are going to need to work out which directory they live in. Typical Wikipedia file URLs look like this:

 http://upload.wikimedia.org/wikipedia/en/a/ab/File_name.jpg 

where the "wikipedia/en" part identifies the Wikimedia project and language (for historical reasons, Commons is listed as "wikipedia/commons") and the "a/ab" part is given by the first two hexadecimal digits of the MD5 hash of the file name in UTF-8 (as the names are encoded in the database dumps).
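
To illustrate, here is a short Python sketch that reconstructs an upload URL from a file name following that pattern (spaces normalised to underscores before hashing); the helper name and the example file names are just for illustration.

```python
import hashlib
import urllib.parse

def upload_url(file_name, project="wikipedia/commons"):
    """Build an upload.wikimedia.org URL from a file name.

    The two hash directories are the first one and two hex digits of the
    MD5 of the name (spaces replaced by underscores, UTF-8 encoded), as
    described above.
    """
    name = file_name.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return "http://upload.wikimedia.org/{0}/{1}/{2}/{3}".format(
        project, digest[0], digest[:2], urllib.parse.quote(name))

print(upload_url("Carrizo 2a.JPG"))                  # Commons-hosted file
print(upload_url("File name.jpg", "wikipedia/en"))   # en.wikipedia-hosted file
```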
