Querying Open Data Communities data using SPARQL

Question

I am trying to get some information about Lower Layer Super Output Areas (LSOAs) and UK postcodes.

I need the postcode and LSOA information as a data dump that I can use in Excel. Specifically:

The notation and label for each resource of type "Lower Layer Super Output Area", e.g. http://opendatacommunities.org/doc/geography/lsoa/E01009437

The 'lsoa' for each resource of type 'Postcode', e.g. http://opendatacommunities.org/resource?uri=http%3A%2F%2Fdata.ordnancesurvey.co.uk%2Fid%2Fpostcodeunit%2FB721NB

I have no idea how to use the SPARQL endpoint on the site to get this information, or how to extract it from the N-Triples file that I downloaded ...

rdf n-triples sparql

user1514580 May 17, '13 at 11:48

2 answers

Joshua Taylor · Answer 1 · 2013-05-17T13:46:15+0000

There are two main ways to obtain the required data. In some cases, you can query the data using the public SPARQL endpoint. This is probably the most convenient approach, and the one to take unless there is a specific reason why you need the data locally. However, this approach has limitations, and when you hit them it makes sense to download the dataset and query it locally. I will describe the remote endpoint solution first, and then the solution using local queries. Restrictions on the SPARQL endpoint (for example, hard timeouts) mean that the first approach is not enough for this specific task, so the actual answer to this question is the second approach.

I was not familiar with these specific datasets and ontologies before this question, so the first approach also walks through the process of "getting to know the data".

Using the SPARQL Endpoint

There is an Open Data Communities SPARQL endpoint where you can run queries and retrieve some data. I had not looked at this data before, so instead of just posting the final answer, I will go through the process that I used to work out which query to run.

One of the pages you linked to, B72 1NB, mentions that the resource is of type PostcodeUnit, which has the URI

http://data.ordnancesurvey.co.uk/ontology/postcode/PostcodeUnit

Based on this, the first thing I tried was a SPARQL query to pull out some postcode units, so I ran the following query at the endpoint above. (If you copy and paste it there, you will need to delete any leading space before SELECT. At least, I had to.)

    SELECT * WHERE {
      ?postcodeUnit a <http://data.ordnancesurvey.co.uk/ontology/postcode/PostcodeUnit>
    }
    LIMIT 10

LIMIT helps ensure that the results come back in a timely manner, and that we don't ask the server to do too much. This produces results such as

    --------------------------------------------------------------
    | postcodeUnit                                               |
    ==============================================================
    | <http://data.ordnancesurvey.co.uk/id/postcodeunit/TA219HB> |
    | <http://data.ordnancesurvey.co.uk/id/postcodeunit/TF109DS> |
    | <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY256SA> |
    | <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY147HR> |
    | <http://data.ordnancesurvey.co.uk/id/postcodeunit/TF107BZ> |
    | <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY134LH> |
    | <http://data.ordnancesurvey.co.uk/id/postcodeunit/TA202HF> |
    | <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY44QZ>  |
    | <http://data.ordnancesurvey.co.uk/id/postcodeunit/TA116SS> |
    | <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY209DR> |
    --------------------------------------------------------------

The page for B72 1NB shows its lsoa as Birmingham 006C. The IRI for the lsoa property (you can also see it in the data you downloaded) is

 http://opendatacommunities.org/def/geography#lsoa

therefore we extend the SPARQL query to

    SELECT * WHERE {
      ?postcodeUnit a <http://data.ordnancesurvey.co.uk/ontology/postcode/PostcodeUnit> ;
                    <http://opendatacommunities.org/def/geography#lsoa> ?lsoa .
    }
    LIMIT 10

The results look like this:

    -----------------------------------------------------------------------------------------------------------------------------
    | postcodeUnit                                               | lsoa                                                         |
    =============================================================================================================================
    | <http://data.ordnancesurvey.co.uk/id/postcodeunit/TA219HB> | <http://opendatacommunities.org/id/geography/lsoa/E01029309> |
    | <http://data.ordnancesurvey.co.uk/id/postcodeunit/TF109DS> | <http://opendatacommunities.org/id/geography/lsoa/E01029706> |
    | <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY147HR> | <http://opendatacommunities.org/id/geography/lsoa/E01018373> |
    | <http://data.ordnancesurvey.co.uk/id/postcodeunit/TF107BZ> | <http://opendatacommunities.org/id/geography/lsoa/E01014172> |
    | <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY134LH> | <http://opendatacommunities.org/id/geography/lsoa/E01018514> |
    | <http://data.ordnancesurvey.co.uk/id/postcodeunit/TA202HF> | <http://opendatacommunities.org/id/geography/lsoa/E01029175> |
    | <http://data.ordnancesurvey.co.uk/id/postcodeunit/SY44QZ>  | <http://opendatacommunities.org/id/geography/lsoa/E01014204> |
    | <http://data.ordnancesurvey.co.uk/id/postcodeunit/TA116SS> | <http://opendatacommunities.org/id/geography/lsoa/E01029225> |
    | <http://data.ordnancesurvey.co.uk/id/postcodeunit/SW65TP>  | <http://opendatacommunities.org/id/geography/lsoa/E01001950> |
    | <http://data.ordnancesurvey.co.uk/id/postcodeunit/TF15AX>  | <http://opendatacommunities.org/id/geography/lsoa/E01014155> |
    -----------------------------------------------------------------------------------------------------------------------------

You can use prefixes in your query if you want to make it more readable and concise:

    PREFIX pc: <http://data.ordnancesurvey.co.uk/ontology/postcode/>
    PREFIX geo: <http://opendatacommunities.org/def/geography#>

    SELECT * WHERE {
      ?postcodeUnit a pc:PostcodeUnit ;
                    geo:lsoa ?lsoa .
    }
    LIMIT 10

The results will be the same, of course. At the bottom of each of these results pages, you can download the results in a number of other formats. One of the formats is CSV, which you can probably import directly into a spreadsheet (you said you want to use the data in Excel).
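If you would rather script the request than paste the query into the web form, the SPARQL 1.1 Protocol lets you send the query as the `query` parameter of a GET request. Here is a minimal Python sketch that only builds such a request. The endpoint URL is a hypothetical placeholder (substitute the real one from the site), and I am assuming the endpoint honours an `Accept: text/csv` header, which I have not verified.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical placeholder, not the real endpoint address.
ENDPOINT = "http://example.org/sparql"

QUERY = """
PREFIX pc: <http://data.ordnancesurvey.co.uk/ontology/postcode/>
PREFIX geo: <http://opendatacommunities.org/def/geography#>

SELECT * WHERE {
  ?postcodeUnit a pc:PostcodeUnit ;
                geo:lsoa ?lsoa .
}
LIMIT 10
""".strip()


def build_csv_request(endpoint: str, query: str) -> Request:
    """Build (but do not send) a SPARQL Protocol GET request asking for CSV."""
    url = endpoint + "?" + urlencode({"query": query})
    # Assumption: the server supports CSV results via content negotiation.
    return Request(url, headers={"Accept": "text/csv"})


request = build_csv_request(ENDPOINT, QUERY)
print(request.full_url)
```

To actually run the query you would pass `request` to `urllib.request.urlopen` and read the response body. Keep the LIMIT in place while experimenting, for the same reason as above.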

A discussion in the comments pointed out that the sheer number of PostcodeUnits makes the full result set very large. The UK Postcodes dataset contains four types of resources, in order of increasing size: postcode units, postcode sectors, postcode districts, and postcode areas. There are 1686911, 10833, 2087 and 120 resources of these types, respectively. As far as I understand the explanation in the comments, the idea is to associate these with Lower Layer Super Output Areas (LSOAs), for example Birmingham 006C. Postcode units are associated with LSOAs, but none of the higher-level postcode regions are. Each postcode unit is, however, within its sector, district and area. For example, TA21 9HB is within TA, TA21 9 and TA21. Using this information, we can ask for postcode units and their respective district (or sector or area), along with their LSOA, and report only the district and LSOA, ignoring the unit itself. For example:

    PREFIX pc: <http://data.ordnancesurvey.co.uk/ontology/postcode/>
    PREFIX geo: <http://opendatacommunities.org/def/geography#>
    PREFIX sr: <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/>

    SELECT DISTINCT ?district ?lsoa WHERE {
      ?postcodeunit a pc:PostcodeUnit ;
                    geo:lsoa ?lsoa ;
                    sr:within ?district .
      ?district a pc:PostcodeDistrict .
    }
    LIMIT 10
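The nesting of unit, sector, district and area can also be read off the postcode string itself, which may be handy once the data is in a spreadsheet. The following Python sketch relies only on the usual layout of a UK postcode (a three-character inward part after the district); the function name is my own, and this is plain string splitting, not the sr:within relation that the dataset records.

```python
import re


def postcode_hierarchy(unit: str) -> dict:
    """Split a postcode unit (with or without its space) into its enclosing regions."""
    compact = unit.replace(" ", "").upper()
    if len(compact) < 5:
        raise ValueError(f"too short to be a postcode unit: {unit!r}")
    # The inward part is always the last three characters, e.g. "9HB".
    outward, inward = compact[:-3], compact[-3:]
    # The area is the leading run of letters of the outward part, e.g. "TA".
    area = re.match(r"[A-Z]+", outward)
    if area is None:
        raise ValueError(f"postcode does not start with letters: {unit!r}")
    return {
        "area": area.group(0),
        "district": outward,
        "sector": f"{outward} {inward[0]}",
        "unit": f"{outward} {inward}",
    }


print(postcode_hierarchy("TA219HB"))
```

For TA219HB this gives the area TA, the district TA21 and the sector TA21 9, matching the example above.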

There are 34378 LSOAs in the dataset, so there is still a lot of data to select, and trying to pull the text results for all the distinct LSOA/district pairs still leads to a timeout. In fact, since each LSOA is associated (I expect) with some district, there are probably at least as many results as there are LSOAs.

This seems to be the point at which we start to run into the response size limits and timeouts of the SPARQL endpoint, and need to start accessing the data locally. However, the postcode data alone is 5.6 GB, so that is not a great solution either.

But if you are willing to take one representative LSOA for each district, we can use SPARQL subqueries to pull those out, as in the next query, which first retrieves all the postcode districts and then finds, for each of them, one LSOA that has some postcode unit in the district. I don't know whether this is an acceptable result, but in the end you have an LSOA for each district, and the results are small enough (2087 rows, the same as the number of districts) that they can be delivered in any of the result formats (including CSV).

    PREFIX pc: <http://data.ordnancesurvey.co.uk/ontology/postcode/>
    PREFIX geo: <http://opendatacommunities.org/def/geography#>
    PREFIX sr: <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/>

    SELECT ?region ?lsoa WHERE {
      { SELECT ?region WHERE {
          ?region a pc:PostcodeDistrict .
        }
      }
      { SELECT ?lsoa WHERE {
          ?postcodeunit a pc:PostcodeUnit ;
                        geo:lsoa ?lsoa ;
                        sr:within ?region .
        }
        LIMIT 1
      }
    }

Using a local database

There are limitations to using the SPARQL endpoint, such as the timeouts encountered above. In these situations, it is not difficult to download the data, load it into a Jena TDB store, and query it using tdbquery. The dataset's page links to the zipped N-Triples. After downloading this data (and installing Apache Jena 2.10), I ran (on a Unix system):

 $ tdbloader2 --loc tdb dataset_data_postcodes_20130506183000.nt

where tdb is the local directory that I created to hold the TDB indexes. Loading the data takes some time (1125 seconds here), as does indexing. Once everything was loaded, I saved the query in a file called postcodes.sparql and ran it with

 $ tdbquery --loc tdb --results CSV --query postcodes.sparql > unit_lsoa.csv

to generate CSV results stored in the file unit_lsoa.csv. Here are the first few lines:

    $ head -5 unit_lsoa.csv
    postcodeUnit,lsoa
    http://data.ordnancesurvey.co.uk/id/postcodeunit/AL11AE,http://opendatacommunities.org/id/geography/lsoa/E01023667
    http://data.ordnancesurvey.co.uk/id/postcodeunit/AL11AG,http://opendatacommunities.org/id/geography/lsoa/E01023741
    http://data.ordnancesurvey.co.uk/id/postcodeunit/AL11AJ,http://opendatacommunities.org/id/geography/lsoa/E01023741
    http://data.ordnancesurvey.co.uk/id/postcodeunit/AL11AR,http://opendatacommunities.org/id/geography/lsoa/E01023684
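Since the goal is a spreadsheet, you may prefer bare postcodes and LSOA codes to full URIs. Here is a small Python sketch that strips the two URI prefixes seen in the rows above. The function names are mine, and it assumes the two-column layout shown (postcodeUnit, lsoa) with a single header line.

```python
import csv
import io

UNIT_PREFIX = "http://data.ordnancesurvey.co.uk/id/postcodeunit/"
LSOA_PREFIX = "http://opendatacommunities.org/id/geography/lsoa/"


def strip_prefix(value: str, prefix: str) -> str:
    """Return value without prefix, or unchanged if it does not start with it."""
    return value[len(prefix):] if value.startswith(prefix) else value


def simplify_csv(src, dst) -> None:
    """Copy a (postcodeUnit, lsoa) CSV from src to dst, shortening the URIs."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(next(reader))  # keep the header line as it is
    for unit_uri, lsoa_uri in reader:
        writer.writerow([
            strip_prefix(unit_uri, UNIT_PREFIX),
            strip_prefix(lsoa_uri, LSOA_PREFIX),
        ])


# Small in-memory demonstration using the first data row shown above.
sample = io.StringIO(
    "postcodeUnit,lsoa\n"
    "http://data.ordnancesurvey.co.uk/id/postcodeunit/AL11AE,"
    "http://opendatacommunities.org/id/geography/lsoa/E01023667\n"
)
out = io.StringIO()
simplify_csv(sample, out)
print(out.getvalue())
```

For the real file you would open unit_lsoa.csv and an output file with `newline=""` and pass the two file objects to `simplify_csv` in the same way.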

There are 1686911 postcode units defined, so I initially expected unit_lsoa.csv to have about the same number of lines. However, it has nearly 250,000 fewer. (wc -l prints the number of lines in a file.)

    $ wc -l unit_lsoa.csv
    1440143 unit_lsoa.csv

As it turns out, some of the postcode units are not associated with any LSOA. I checked this by running the query

    PREFIX pc: <http://data.ordnancesurvey.co.uk/ontology/postcode/>
    PREFIX geo: <http://opendatacommunities.org/def/geography#>

    SELECT * WHERE {
      ?postcodeUnit a pc:PostcodeUnit .
      FILTER NOT EXISTS { ?postcodeUnit geo:lsoa ?lsoa }
    }

stored in the file postcodes_without_lsoa.sparql:

    $ tdbquery --loc tdb \
               --results CSV \
               --query postcodes_without_lsoa.sparql > unit_without_lsoa.csv

Sure enough, unit_without_lsoa.csv has nearly 250,000 lines:

    $ wc -l unit_without_lsoa.csv
    246770 unit_without_lsoa.csv

The sum of 1440143 and 246770 is 1686913, which matches the number of postcode units, 1686911, plus 2 for the one header line in each CSV file. Mission accomplished!
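The same reconciliation, written out as a quick Python check. The three figures are the ones reported above; the only assumption is that each CSV file contributes exactly one header line.

```python
# Line counts reported by wc -l, and the number of postcode units in the dataset.
unit_lsoa_lines = 1440143
unit_without_lsoa_lines = 246770
total_units = 1686911

HEADER_LINES = 1  # assumption: one header line per CSV file

with_lsoa = unit_lsoa_lines - HEADER_LINES
without_lsoa = unit_without_lsoa_lines - HEADER_LINES

# Every postcode unit should appear in exactly one of the two files.
assert with_lsoa + without_lsoa == total_units
print(with_lsoa, without_lsoa, with_lsoa + without_lsoa)
```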

Jeryl Cook · Answer 2 · 2015-08-23T00:25:02+0000

You can use this web service to obtain this information. It can return the boundary for a UK postcode (e.g. ZE1 0AE), sector, district, city, or ward.

https://www.mashape.com/vanitysoft/uk-boundaries-io

Here is an example: a request for the TA2 postcode district returns a collection of polygons (GeoJSON) for the sectors that make up TA2.
