Generalise SparqlDataGen #10

riccardotommasini · 2016-11-30T08:05:41Z

While implementing an alternative version of the SparqlDataGen I noticed that:

some methods can be generalised.
the description of the endpoint from which tw retrieves the data can be improved

I suggest to:

add in the .property file and/or in the GeneriDataGen constructor
-- the response format from the server (JSON-LD is quite standard, but other formats such as RDF/XML might be possible)
-- the files for the queries that define the select template and the construct template
abstract the following methods
-- send next
-- load file
make private the following methods
-- select indices (query file is the only exposed parameter)
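
To make the first two suggestions concrete, here is a minimal sketch of what I have in mind. All names in it (`GenericDataGen`, `InMemoryDataGen`, the option keys) are assumptions for illustration and are not taken from the current code base; the in-memory subclass only exists to show how a concrete datagen would fill in the abstract methods.

```javascript
// Hypothetical sketch: none of these names come from the TripleWave code base.
class GenericDataGen {
  constructor(options) {
    // Format the endpoint answers in; JSON-LD is assumed as the default.
    this.responseFormat = options.responseFormat || 'application/ld+json';
    // Files holding the select template and the construct template.
    this.selectQueryFile = options.selectQueryFile;
    this.constructQueryFile = options.constructQueryFile;
  }

  // Abstract: each concrete datagen decides how to load its data.
  loadFile() {
    throw new Error('loadFile() must be implemented by a subclass');
  }

  // Abstract: each concrete datagen decides how to emit the next element.
  sendNext() {
    throw new Error('sendNext() must be implemented by a subclass');
  }
}

// Toy subclass that serves elements from an array instead of an endpoint.
class InMemoryDataGen extends GenericDataGen {
  constructor(options, items) {
    super(options);
    this.items = items;
    this.position = 0;
  }

  // Returns the number of elements available.
  loadFile() {
    return this.items.length;
  }

  // Returns the next element, or null when nothing is left.
  sendNext() {
    return this.position < this.items.length ? this.items[this.position++] : null;
  }
}
```

With this shape, the values read from the .property file would simply be passed as the `options` object.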

dellaglio · 2016-11-30T09:14:38Z

At the moment, TripleWave has three datagen components:

sparqlDataGen: from a SPARQL endpoint to RDF stream (at the moment tested only with Fuseki)
rdfStreamDataGen: from file containing an RDF stream to RDF stream
wikiStream: from wikipedia update stream to RDF stream
Abstracting and defining a superclass for the above datagens may be interesting, but they work in completely different ways. Moreover, only sparqlDataGen has the sendNext and loadFile methods, and I would not change the others just to add them, as that would only introduce overhead.

So, in the rest of the comment I assume that "Generalise DataGen" means "Generalise SparqlDataGen".
As I said above, it should be possible to abstract it, but I'm missing the use cases you have in mind. Do you want to create RDF streams out of other Web services/DBs that are not SPARQL compliant? In that case, I think you need to write a dataGen from scratch, since you cannot reuse much of the current sparqlDataGen code: it is tailored to work on a SPARQL endpoint.

Regarding RDF/XML: TW may produce RDF/XML instead of JSON-LD, and I'd be happy to see that feature. However, dataGens should produce JSON-LD and the conversion should happen at the end. dataGens are the first components of the pipeline, and all the following components work because they expect an RDF graph with some time annotations in JSON-LD. Trying to generalise the data exchanged internally is not a good idea, since it would hurt the performance of TW (managing several types of data takes time) and manipulating XML is definitely harder to do in JS. That said, the best way I see to produce RDF/XML (or any other format) is to add a component at the end of the pipeline, before the primus-based one, that converts JSON-LD into the other format.
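
As a rough illustration of such a final component, the sketch below wraps a conversion function in a Node.js object-mode `Transform` stream. The names `createFormatConverter` and `toText` are made up for this example, and `toText` is only a placeholder: it emits plain JSON text where a real RDF/XML serialiser would go.

```javascript
const { Transform } = require('stream');

// Hypothetical final pipeline stage: receives JSON-LD objects and pushes
// whatever the supplied `convert` function returns for each of them.
function createFormatConverter(convert) {
  return new Transform({
    objectMode: true,
    transform(jsonld, _encoding, callback) {
      try {
        callback(null, convert(jsonld));
      } catch (err) {
        // Report conversion failures as stream errors.
        callback(err);
      }
    },
  });
}

// Placeholder converter: JSON text stands in for a real serialiser.
const toText = (jsonld) => JSON.stringify(jsonld);
```

The upstream components keep exchanging JSON-LD, and only this last stage knows about the output format, e.g. `previousStage.pipe(createFormatConverter(toText))`, where `previousStage` is whatever readable stream precedes it.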

riccardotommasini · 2016-11-30T12:16:54Z

SparqlDataGen currently assumes the following steps/methods to fully publish a stream

1. load the dataset
2. create the indices
3. retrieve the indices
4. send next

IMHO the first two steps are specific to the current implementation. Working on a different branch, I saw that how indices are described depends mainly on the capabilities/characteristics of the RDF source. My suggestion is to leave the user the task of choosing how the source is defined, if needed, and to separate the indices definition, steps (1) and (2), from the indices retrieval (3) and forwarding (4).
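
A minimal sketch of that separation, under the assumption that the user supplies the source-specific parts as plain functions. `IndexedStreamer`, `prepareSource`, `selectIndices` and `fetchElement` are hypothetical names, not existing code.

```javascript
// Hypothetical sketch: steps (1) and (2) live in the user-supplied
// `prepareSource`; the class itself only covers steps (3) and (4).
class IndexedStreamer {
  constructor({ prepareSource, selectIndices, fetchElement }) {
    // Optional: a source that is already loaded and indexed needs no preparation.
    this.prepareSource = prepareSource || (() => {});
    this.selectIndices = selectIndices; // () => array of index keys
    this.fetchElement = fetchElement;   // (key) => element to forward
    this.indices = [];
    this.cursor = 0;
  }

  // Runs the optional preparation, then step (3): retrieve the indices.
  start() {
    this.prepareSource();
    this.indices = this.selectIndices();
    this.cursor = 0;
  }

  // Step (4): forward the element for the next index, or null when done.
  sendNext() {
    if (this.cursor >= this.indices.length) return null;
    return this.fetchElement(this.indices[this.cursor++]);
  }
}
```

An endpoint that already exposes timestamps would pass no `prepareSource` at all and only describe how to select them.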

Regarding the different data formats, I was targeting the way the source SPARQL endpoint behind tw responds, not the way tw exposes the data. For instance, the version of Sesame I am using does not respond in JSON-LD, so it is necessary to convert the output from RDF/XML into JSON-LD manually.
Making this option configurable might simplify the usage of tw with any SPARQL endpoint.
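
For example, the configured format could drive the `Accept` header of the request and tell the datagen whether a conversion step is needed. The sketch below is only an assumption about how that might look: the helper names and the format keys are invented, while the media types are the registered ones for these serialisations.

```javascript
// Registered media types for some common RDF serialisations.
const MEDIA_TYPES = {
  'json-ld': 'application/ld+json',
  'rdf/xml': 'application/rdf+xml',
  turtle: 'text/turtle',
};

// Hypothetical helper: builds the request headers for a configured format.
function headersFor(format = 'json-ld') {
  const mediaType = MEDIA_TYPES[format.toLowerCase()];
  if (!mediaType) {
    throw new Error(`Unsupported response format: ${format}`);
  }
  return { Accept: mediaType };
}

// Hypothetical helper: anything other than JSON-LD must be converted
// before it is handed to the rest of the pipeline.
function needsConversion(format = 'json-ld') {
  return format.toLowerCase() !== 'json-ld';
}
```

With Sesame, the configuration would then say `rdf/xml`, and the datagen would convert the response to JSON-LD before forwarding it.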

dellaglio · 2016-11-30T13:09:57Z

Ok, so it was about SparqlDataGen - I updated the title.

Any improvement to this component is welcome, as long as SparqlDataGen still produces the JSON-LD described above, does not become even more complicated, and performance does not decrease too much.

To summarise, the two requirements should be:

1. sparqlDataGen may query SPARQL endpoints that do not support JSON-LD
2. sparqlDataGen may be more flexible in the range of RDF sources it manages

While 1. is clear, I have some doubts about 2. Can you provide some use cases where the current version fails? Having some use cases would make it easier to understand how to do it.

dellaglio changed the title ~~Generalise DataGen~~ Generalise SparqlDataGen Nov 30, 2016

dellaglio added the enhancement label Nov 30, 2016

dellaglio added the rsplab label Jul 13, 2017
