SPARQL: keys versus DISTINCT values

I have a SPARQL query that returns duplicates, and I want to eliminate them based on only one of the values (subjectID). DISTINCT does not help here, because it makes the combination of all selected values unique rather than just one of the variables. I saw someone here suggest GROUP BY, but that only seems to work if I list all the variables after GROUP BY (otherwise my SPARQL endpoint complains, for example: Non-group key variable in SELECT: ?occupation). I also tried a subquery (inner select), but it does not seem to work for this particular query. So perhaps the problem is in the query itself (the optional activeIn values apparently cause the duplication)?

I am comfortable with relational databases but at an early stage of learning SPARQL, so feel free to explain the obvious to the otherwise uninitiated! :)

    select distinct ?subjectID ?englishName ?sex ?locatedIn15Name
                    ?dob ?dod ?dom ?bornLocationName ?occupation
    where {
      ?person a hc:Person ;
              hc:englishName ?englishName ;
              hc:sex ?sex ;
              hc:subjectID ?subjectID .
      optional {
        ?person hc:livedIn11 ?livedIn11 .
        ?livedIn11 hc:englishName ?lived11LocationName .
        ?livedIn11 hc:locatedIn11 ?locatedIn11 .
        ?locatedIn11 hc:englishName ?locatedIn11Name .
        ?locatedIn11 hc:locatedIn15 ?locatedIn15 .
        ?locatedIn15 hc:englishName ?locatedIn15Name .
      }
      optional { ?person hc:born ?dob }
      optional { ?person hc:dateOfDeath ?dod }
      optional { ?person hc:dateOfMarriage ?dom }
      optional {
        ?person hc:bornIn ?bornIn .
        ?bornIn hc:englishName ?bornLocationName .
        ?bornIn hc:easting ?easting .
        ?bornIn hc:northing ?northing
      }
      optional { ?person hc:occupation ?occupation }
      FILTER regex(?englishName, "^FirstName LastName")
    }
    GROUP BY ?subjectID ?englishName ?sex ?locatedIn15Name
             ?dob ?dod ?dom ?bornLocationName ?occupation
+7
2 answers

Regarding the error message:

Non-group key variable in SELECT: ?occupation

You can avoid this by using the SAMPLE() aggregate. It lets you group by ?subjectID alone and still select values for the remaining variables, provided you only need one value for each of those other variables.

Here is a simple example:

    SELECT ?subjectID (SAMPLE(?dob) AS ?dateOfBirth)
    WHERE {
      ?person a hc:Person ;
              hc:subjectID ?subjectID .
      OPTIONAL { ?person hc:born ?dob }
    }
    GROUP BY ?subjectID
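
Extending the same idea to more of the variables from the question might look like the sketch below. It assumes the hc: prefix is declared as in the question's environment, and the alias names (?name, ?personSex, ?anOccupation) are made up for illustration. An alias has to differ from the variable it samples, because that variable is already in scope inside the WHERE clause.

```sparql
# Sketch only: assumes the hc: prefix is declared elsewhere, as in the question.
# Alias names (?name, ?personSex, ...) are illustrative, not from the original.
SELECT ?subjectID
       (SAMPLE(?englishName) AS ?name)
       (SAMPLE(?sex)         AS ?personSex)
       (SAMPLE(?dob)         AS ?dateOfBirth)
       (SAMPLE(?occupation)  AS ?anOccupation)
WHERE {
  ?person a hc:Person ;
          hc:subjectID ?subjectID ;
          hc:englishName ?englishName ;
          hc:sex ?sex .
  OPTIONAL { ?person hc:born ?dob }
  OPTIONAL { ?person hc:occupation ?occupation }
  FILTER regex(?englishName, "^FirstName LastName")
}
GROUP BY ?subjectID   # one result row per ?subjectID
```

Note that each SAMPLE picks its value independently, so the values in one result row are not guaranteed to come from the same original solution.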
+12

First of all, note that there is no such thing as a key in RDF/SPARQL. You are querying a graph, and a given ?subjectID may simply have several possible combinations of values for the other variables you select. This follows from the shape of the graph you are querying: your person may have more than one English name, or, conversely, the same English name may be used by more than one person.

The SPARQL SELECT query is a strange beast: it queries a graph structure but presents the result as a flat table (technically it is a sequence of sets of variable bindings, but that amounts to the same thing). Duplicates occur because different combinations of values for your variables can be found, essentially by following different paths through the graph.
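
To make the multiplication of rows concrete, here is a small self-contained sketch that uses inline VALUES data instead of a real graph. The names and occupations are invented for illustration. One subject with two names and two occupations yields four solutions, one per combination:

```sparql
# Illustrative data only: subject 1 has two names and two occupations.
SELECT ?subjectID ?name ?occupation
WHERE {
  VALUES (?subjectID ?name)       { (1 "Ann Smith") (1 "Anne Smith") }
  VALUES (?subjectID ?occupation) { (1 "farmer")    (1 "miller") }
}
# The two blocks join on ?subjectID, giving 2 x 2 = 4 result rows,
# all of them distinct, so DISTINCT would not remove any.
```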

Getting duplicate values for ?subjectID in your result is therefore unavoidable, simply because, from the point of view of the RDF graph, these are all distinct solutions to your query. You cannot filter the results without losing some information, so it is difficult to give you a general solution without knowing more about exactly which "duplicates" you want to discard: do you want just one possible English name per subject, or one possible date of birth (even though there may be more than one in your data)?

However, here are some tips for handling such results more conveniently:

First of all, you can use ORDER BY on your ?subjectID variable. This will still give you multiple rows with the same value for ?subjectID, but they will be adjacent and in order, so you can process the result more efficiently.
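
As a rough sketch of the ORDER BY approach (reduced to a few of the question's variables, and assuming the hc: prefix is declared elsewhere):

```sparql
# Sketch only: rows for the same ?subjectID come out next to each other.
SELECT ?subjectID ?englishName ?occupation
WHERE {
  ?person a hc:Person ;
          hc:subjectID ?subjectID ;
          hc:englishName ?englishName .
  OPTIONAL { ?person hc:occupation ?occupation }
}
ORDER BY ?subjectID
```

The client code can then merge consecutive rows that share a ?subjectID in a single pass over the result.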

Another solution is to split your query in two: run a first query that selects only the unique subjects (and possibly any other values that you know in advance will be unique for a given subject), then iterate over the result and run a separate query for each subject to get the other values you are interested in. This may sound like heresy (especially if you come from an SQL background), but in practice it can be faster and easier than trying to do everything in one huge query.
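
A possible shape for the two queries is sketched below. It assumes the hc: prefix is declared elsewhere, and the ID value "S123" is a hypothetical placeholder that the client code would replace on each iteration. The first query lists the subjects:

```sparql
# Query 1 (sketch): one row per distinct subject.
SELECT DISTINCT ?subjectID
WHERE {
  ?person a hc:Person ;
          hc:subjectID ?subjectID ;
          hc:englishName ?englishName .
  FILTER regex(?englishName, "^FirstName LastName")
}
```

The second query is then run once per subject, here fetching only the occupations:

```sparql
# Query 2 (sketch): "S123" is a hypothetical ID substituted by the client.
SELECT ?occupation
WHERE {
  VALUES ?subjectID { "S123" }
  ?person hc:subjectID ?subjectID ;
          hc:occupation ?occupation .
}
```

Fetching one multi-valued property per follow-up query keeps the combinations from multiplying.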

Another solution is the one RobV suggested: use the SAMPLE aggregate on a particular variable to pick just one (arbitrary) value. A variation on this is the GROUP_CONCAT aggregate, which combines all possible values into a single string.
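
A minimal sketch of the GROUP_CONCAT variant, assuming the hc: prefix is declared elsewhere and that ?occupation holds literals (if it holds IRIs, wrapping it in STR() may be needed). The alias ?occupations is made up for illustration:

```sparql
# Sketch only: collapses all occupations of a subject into one string.
SELECT ?subjectID
       (GROUP_CONCAT(DISTINCT ?occupation; SEPARATOR=", ") AS ?occupations)
WHERE {
  ?person a hc:Person ;
          hc:subjectID ?subjectID .
  OPTIONAL { ?person hc:occupation ?occupation }
}
GROUP BY ?subjectID
```

Unlike SAMPLE, this keeps every value, at the cost of having to split the string again on the client side.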

+9