Iteration over hierarchical documents: need to access fields outside the iteration #20
I leave here a link to an issue about join conditions in the rmlmapper-java implementation (RMLio/rmlmapper-java#28). My last comment there was about this same problem, which I find troublesome, specifically in the case of JSON files. However, in the older RML reference implementation (a.k.a. RML-Mapper) it seems to work the other way round, so it was possible to keep this hierarchical information. Maybe this would be the right place to discuss whether it is better to offer this implicitly or to have the user declare it explicitly. |
We need some systematic way to refer to parent and neighbor XML nodes, like what's available in XPath and XQuery. To appreciate the complexity, I'll show a bit (5-10%) of our conversion of clinicaltrials.gov XML (CT) to RDF (using a custom ontology). Here is an outline. On the left is the CT element/attribute hierarchy, and on the right a mapping to props, classes and literals:
The "turtle with embedded fields" below shows the nodes and the connectivity between them, with XPath expressions to illustrate where the data comes from:
<(nct_id)/baseline/measure/($n)> a cto:Measure;
puml:label "/clinical_study/clinical_results/baseline/\n measure_list/measure";
dc:title "(title)";
dc:description "(description)";
cto:population "(population)";
cto:units "(units)";
cto:unitsAnalyzed "(units_analyzed)";
cto:param "(param)";
cto:dispersion "(dispersion)";
cto:analyzed <(nct_id)/baseline/measure/($n)/analyzed/($m)>;
cto:class <(nct_id)/baseline/measure/($n)/class/($m)>.
<(nct_id)/baseline/measure/($n)/class/($m)> a cto:Class;
puml:label "/clinical_study/clinical_results/baseline/\n measure_list/measure/class_list/class";
dc:title "(title)";
cto:analyzed <(nct_id)/baseline/measure/($n)/class/($m)/analyzed/($p)>;
cto:category <(nct_id)/baseline/measure/($n)/class/($m)/category/($p)>.
<(nct_id)/baseline/measure/($n)/class/($m)/analyzed/($p)> a cto:Analyzed;
puml:label "/clinical_study/clinical_results/baseline/\n measure_list/measure/class_list/class/analyzed_list/analyzed";
cto:units "(units) # all are 'Participants'";
cto:scope "(scope)";
cto:participants <(nct_id)/baseline/measure/($n)/class/($m)/analyzed/($p)/participants/($m)>.
<(nct_id)/baseline/measure/($n)/class/($m)/analyzed/($p)/participants/($m)> a cto:ParticipantsCount;
puml:label "/clinical_study/clinical_results/baseline/measure_list/measure/\n class_list/class/analyzed_list/analyzed/count_list/count";
dc:title "(.) # most often missing";
cto:group <(nct_id)/group/(@group_id/substring(.,2))>;
cto:count "(@value)^^xsd:integer".
<(nct_id)/baseline/measure/($n)/class/($m)/category/($p)> a cto:MeasureCategory;
puml:label "/clinical_study/clinical_results/baseline/measure_list/measure/\n class_list/class/category_list/category";
dc:title "(title)";
cto:measurement <(nct_id)/baseline/measure/($n)/class/($m)/category/($p)/measurement/($q)>.
<(nct_id)/baseline/measure/($n)/class/($m)/category/($p)/measurement/($q)> a cto:Measurement;
puml:label "/clinical_study/clinical_results/baseline/measure_list/measure/\n class_list/class/category_list/category/measurement_list/measurement";
dc:title "(.)";
cto:group <(nct_id)/group/(@group_id/substring(.,2))>;
cto:value "(@value)^^xsd:decimal";
cto:spread "(@spread)^^xsd:decimal";
cto:lowerLimit "(@lower_limit)^^xsd:decimal";
cto:upperLimit "(@upper_limit)^^xsd:decimal".
In the URL
Let me know if you'd like to see a diagram of the complete model. We have implemented this conversion with XSPARQL, which is XQuery plus templates to emit triples:
declare function local:rdf_measure ($url as xs:string, $base as xs:string, $meas as xs:string, $measure) {
let $url := $url
construct {
<{$base}> cto:measure <{$meas}>.
<{$meas}> a cto:Measure;
dc:title {$measure/title/text()};
dc:description {$measure/description/text()};
cto:population {$measure/population/text()};
cto:units {$measure/units/text()};
cto:unitsAnalyzed {$measure/units_analyzed/text()};
cto:param {$measure/param/text()};
cto:dispersion {$measure/dispersion/text()}.
{
for $i at $n in $measure/analyzed_list/analyzed return
local:rdf_analyzed ($url, $meas, fn:concat($meas,"/analyzed/",$n), $i),
for $i at $n in $measure/class_list/class return
local:rdf_class ($url, $meas, fn:concat($meas,"/class/",$n), $i)
}
}
};
declare function local:rdf_class ($url as xs:string, $meas as xs:string, $cls as xs:string, $class) {
let $url := $url
construct {
<{$meas}> cto:class <{$cls}>.
<{$cls}> a cto:Class;
dc:title {$class/title/text()}.
{
for $i at $n in $class/analyzed_list/analyzed return
local:rdf_analyzed ($url, $cls, fn:concat($cls,"/analyzed/",$n), $i),
for $i at $n in $class/category_list/category return
local:rdf_measureCategory ($url, $cls, fn:concat($cls,"/category/",$n), $i)
}
}
};
declare function local:rdf_measureCategory ($url as xs:string, $meas as xs:string, $cat as xs:string, $category) {
let $url := $url
construct {
<{$meas}> cto:category <{$cat}>.
<{$cat}> a cto:MeasureCategory;
dc:title {$category/title/text()}.
{
for $i at $n in $category/measurement_list/measurement return
local:rdf_measurement($url, $cat, fn:concat($cat,"/measurement/",$n), $i)
}
}
};
declare function local:rdf_measurement ($url as xs:string, $cat as xs:string, $meas as xs:string, $measurement) {
let $url := $url
construct {
<{$cat}> cto:measurement <{$meas}>.
<{$meas}> a cto:Measurement;
cto:group <{fn:concat($url, "/group/", $measurement/@group_id/substring(.,2))}>;
dc:title {$measurement/text()};
cto:value {func:clean_number($measurement/@value/string())}^^xsd:decimal;
cto:spread {$measurement/@spread /string()};
cto:lowerLimit {func:clean_number($measurement/@lower_limit/string())}^^xsd:decimal;
cto:upperLimit {func:clean_number($measurement/@upper_limit/string())}^^xsd:decimal;
}
};
How would you express this in RML? |
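For readers less familiar with XSPARQL, the recursion above can be approximated in plain Python. This is only a sketch under assumptions: the sample XML is invented (it reuses the element names from the paths quoted above), `CHILDREN`, `walk` and `measure_urls` are hypothetical names, and only node URLs are produced, not full triples. It shows that each level needs both the URL of its parent and its own position counter, which is the context the issue says is lost once an iterator narrows the scope.

```python
import xml.etree.ElementTree as ET

# Invented sample data; element names follow the paths quoted in the thread.
SAMPLE = """
<measure_list>
  <measure>
    <title>Age</title>
    <class_list>
      <class>
        <category_list>
          <category>
            <measurement_list>
              <measurement group_id="B1" value="42"/>
              <measurement group_id="B2" value="40"/>
            </measurement_list>
          </category>
        </category_list>
      </class>
    </class_list>
  </measure>
</measure_list>
"""

# Hypothetical lookup table: for each element name, the
# (list element, item element, URL segment) of its nested children.
CHILDREN = {
    "measure": [("class_list", "class", "class")],
    "class": [("category_list", "category", "category")],
    "category": [("measurement_list", "measurement", "measurement")],
}


def walk(node, url):
    """Yield the URL of node and of all its descendants, passing the parent URL down."""
    yield url
    for list_tag, item_tag, segment in CHILDREN.get(node.tag, []):
        items = node.findall(f"{list_tag}/{item_tag}")
        # Counters start at 1, like the XQuery positional variable "at $n".
        for n, child in enumerate(items, start=1):
            yield from walk(child, f"{url}/{segment}/{n}")


def measure_urls(xml_text, study_id):
    """Return the URLs of all measures and their nested nodes for one study."""
    root = ET.fromstring(xml_text)
    urls = []
    for n, measure in enumerate(root.findall("measure"), start=1):
        urls.extend(walk(measure, f"{study_id}/baseline/measure/{n}"))
    return urls
```

Calling `measure_urls(SAMPLE, "NCT0")` returns one URL per node, each one extending the URL of its parent.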
let me quote you
Do you think this is an RML issue or an R2RML issue? I guess that "above" comes from hierarchical data, thus RML, but "outside the iteration" touches R2RML as well. |
Hi @andimou, indeed the solution we are seeking has to be generic enough to capture various use cases. But this need arises as soon as we deal with hierarchical data, hence the case of RML, where rml:iterator modifies the iteration model for the scope of a triples map. Even more so in xR2RML, where nestedTermMaps can change the iteration model for the scope of a single term map. Now, regarding the words we use, above/down or outside/inside, the latter is probably more general indeed. In this sense, the choice of the name xrr:pushDown was pragmatic but probably not so clever. I'm wondering whether we could have cases where the data would be neither tabular nor hierarchical. If we query a graph database (other than an RDF database of course), the iteration could be on some sets of nodes that match a certain query pattern within the graph. Then I'm not sure what the "outside" term would mean here, but it is certainly more generic than the "above" one. |
+1 @frmichel I'm trying to understand the
Given
{
"records": [
{
"id": "1",
"enteredBy": "Alice",
"cars": [
{
"make": "Mercedes"
},
{
"make": "Honda"
}
]
},
{
"id": "2",
"enteredBy": "Bob",
"cars": [
{
"make": "Mercedes"
},
{
"make": "Toyota"
}
]
}
]
}
and logical source:
What would the resulting iteration look like? Would it be
[
{
"recordId": "1",
"make": "Mercedes"
},
{
"recordId": "1",
"make": "Honda"
},
{
"recordId": "2",
"make": "Mercedes"
},
{
"recordId": "2",
"make": "Toyota"
}
]
Or
[
{
"recordId": [
"1",
"2"
],
"make": "Mercedes"
},
{
"recordId": [
"1",
"2"
],
"make": "Honda"
},
{
"recordId": [
"1",
"2"
],
"make": "Mercedes"
},
{
"recordId": [
"1",
"2"
],
"make": "Toyota"
}
]
I can see how
|
I think this can occur in the case of tabular data too, e.g., when you want to refer to the 'above' line of a CSV |
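To make the tabular case concrete, here is a small Python sketch. The CSV content, the convention that a blank cell means "same as the line above", and the `fill_from_above` helper are all invented for illustration. The point is only that a row-by-row iteration sees one row at a time, so any value from the previous row has to be carried along explicitly.

```python
import csv
import io

# Invented CSV; a blank "country" cell means "same as the line above".
DATA = "country,city\nFrance,Paris\n,Lyon\nJapan,Tokyo\n"


def fill_from_above(text, column):
    """Return the rows of the CSV, filling blank cells of `column` from the row above."""
    filled = []
    for row in csv.DictReader(io.StringIO(text)):
        row = dict(row)
        if not row[column] and filled:
            # The value comes from outside the current iteration item.
            row[column] = filled[-1][column]
        filled.append(row)
    return filled
```

Here the "above" value is reachable only because the loop keeps the previously processed rows around; a mapping that iterates strictly row by row has no such state.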
Hi @pmaria, The case you describe is interesting because it involves two iterations, not just one: one iteration on records, and one on cars of records. The way the pushDown feature works is quite simple: at each iteration, it evaluates the xrr:reference expression and creates additional fields with the result of this evaluation. So in your example:
But what you want to do here is to push the id field two iteration levels down. In xR2RML you can do that with an iterator and a nested term map.
That basically iterates on each individual record:
Then, in your predicate-object maps you can iterate on cars:
This takes you to case A and creates object terms: |
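The two-level push-down described above can be mimicked in plain Python to check which result it produces. This is not xR2RML, only a sketch: `push_down` and `iterate_cars` are hypothetical helpers, and `push_down` simply copies one field of the current item into each element of a nested list, which is how the comment above describes the feature. The data is the JSON document from the question.

```python
import json

DOC = json.loads("""
{"records": [
  {"id": "1", "enteredBy": "Alice",
   "cars": [{"make": "Mercedes"}, {"make": "Honda"}]},
  {"id": "2", "enteredBy": "Bob",
   "cars": [{"make": "Mercedes"}, {"make": "Toyota"}]}
]}
""")


def push_down(item, reference, as_name, nested_key):
    """Evaluate `reference` on the current item and add the result,
    under the name `as_name`, to every element of item[nested_key]."""
    value = item[reference]
    return [{as_name: value, **child} for child in item[nested_key]]


def iterate_cars(doc):
    """Iterate on individual records, then on the cars of each record."""
    rows = []
    for record in doc["records"]:  # outer iteration: one record at a time
        # Inner iteration: each car receives the id of its own record only.
        rows.extend(push_down(record, "id", "recordId", "cars"))
    return rows
```

Because the reference is evaluated once per record, each car receives a single `recordId` rather than the list of all ids, which corresponds to the first of the two candidate results in the question.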
@frmichel, Ah ✔️ , of course. |
A need that has been discussed a couple of times: during the iteration over hierarchical documents (e.g. with rml:iterator), it is no longer possible to access the fields above the iteration, or in other words outside of the iterated part.
Yet this is sometimes necessary, typically to build unique identifiers using ids at different hierarchical levels.
xR2RML proposes one solution to do that using the "pushDown" property.
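The problem can be reproduced outside any mapping language. The following Python sketch uses invented data and a hypothetical `iterate` helper that, like an iterator expression, hands back only the iterated nodes. Identifiers built from those nodes alone collide, because the id of the enclosing record is no longer reachable.

```python
# Invented data: two records, each with a nested list of cars.
RECORDS = [
    {"id": "1", "cars": [{"make": "Mercedes"}, {"make": "Honda"}]},
    {"id": "2", "cars": [{"make": "Mercedes"}, {"make": "Toyota"}]},
]


def iterate(records):
    """Yield only the nested car objects, without their enclosing record."""
    for record in records:
        yield from record["cars"]


def car_ids(records):
    """Build one identifier per car from the fields visible in the iteration."""
    # Only the fields of the iterated node are available here,
    # so the id of the parent record cannot be part of the identifier.
    return [f"car/{car['make']}" for car in iterate(records)]
```

With this data, `car_ids(RECORDS)` contains `car/Mercedes` twice: two distinct cars end up with the same identifier.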