Creating different datasets from a DBpedia Live dump

I have played with the various datasets presented on the DBpedia download page and found that they are somewhat outdated.

I then downloaded the latest dump from DBpedia Live. When I extracted the file on June 30th, I got just one huge 37 GB .nt file.

I want to derive different datasets (for example, the separate .nt files available on the download page) from this latest dump. Is there a script or process for this?

1 answer

Solution 1:

You can use the DBpedia live extractor: https://github.com/dbpedia/extraction-framework . You need to set up the correct extractors (e.g. the infobox extractor, the abstract extractor, etc.). It will download the latest Wikipedia dumps and create the DBpedia datasets.

You may need to make some changes to the code to extract only the data you need; one of my colleagues did this for the German datasets. Even so, you will still need a lot of disk space.
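
As a rough sketch of what the extractor setup can look like (the key names and extractor class names below are assumptions from memory and may differ between framework versions, so treat the sample .properties files shipped with the repository as authoritative), you list only the extractors you care about in the extraction configuration:

base-dir=/data/wikipedia-dumps
languages=en
extractors=.InfoboxExtractor,.LabelExtractor
extractors.en=.AbstractExtractor

Running only a subset of extractors like this keeps both runtime and output size down; the extraction itself is launched from the framework's dump module (see the project README for the exact Maven/launcher command).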

Solution 2 (I am not sure whether this is really feasible):

Grep for the required properties in the dump. You need to know the exact URIs of the properties you want to extract.

Example, for all homepages: bzgrep 'http://xmlns.com/foaf/0.1/homepage' dbpedia_2013_03_04.nt.bz2 > homepages.nt

This will give you all the N-Triples containing homepages. You can then load them into an RDF store.
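
If you want several datasets out of the same dump, here is a minimal sketch that makes a single pass over the compressed file and writes one .nt file per predicate (the property URIs are just examples; substitute the ones you need):

bzcat dbpedia_2013_03_04.nt.bz2 | awk '
  / <http:\/\/xmlns.com\/foaf\/0.1\/homepage> /          { print > "homepages.nt" }
  / <http:\/\/dbpedia.org\/ontology\/abstract> /         { print > "abstracts.nt" }
  / <http:\/\/www.w3.org\/2000\/01\/rdf-schema#label> /  { print > "labels.nt" }
'

Because N-Triples is one triple per line, a line-oriented filter like this is enough for coarse splitting; no RDF parser is needed.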

