Add XML abbreviations and BNode elision #39

phillord · 2021-02-19T13:46:47Z

Currently the XML serialization is all in longhand. It would be good to have some of the shorthand syntaxes, for example typed nodes and the ability to remove explicit BNode IDs where possible.

For example:

    // top_sub --top_pred--> _:bnid
    f.format(&Triple {
        subject: NamedNode { iri: "http://top.level/top_sub" }.into(),
        predicate: NamedNode { iri: "http://top.level/top_pred" }.into(),
        object: BlankNode { id: &bnid }.into(),
    })?;

    // _:bnid --one_pred--> one_obj
    f.format(&Triple {
        subject: BlankNode { id: &bnid }.into(),
        predicate: NamedNode { iri: "http://one.deep/one_pred" }.into(),
        object: NamedNode { iri: "http://one.deep/one_obj" }.into(),
    })?;

currently produces

    <?xml version="1.0" encoding="UTF-8"?>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
        <rdf:Description rdf:about="http://top.level/top_sub">
            <top_pred xmlns="http://top.level/" rdf:nodeID="bn1"/>
        </rdf:Description>
        <rdf:Description rdf:nodeID="bn1">
            <one_pred xmlns="http://one.deep/" rdf:resource="http://one.deep/one_obj"/>
        </rdf:Description>
    </rdf:RDF>

whereas something like this:

    <?xml version="1.0" encoding="utf-8"?>
    <rdf:RDF
      xmlns="http://top.level/"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:default1="http://one.deep/"
    >
      <rdf:Description rdf:about="http://top.level/top_sub">
        <top_pred>
          <rdf:Description>
            <default1:one_pred rdf:resource="http://one.deep/one_obj"/>
          </rdf:Description>
        </top_pred>
      </rdf:Description>
    </rdf:RDF>

would be better. Combined with the other shortcut syntaxes, this will make a big difference to the eventual file size!

Tpt · 2021-02-19T19:58:06Z

Yes, having prefixes and removing unneeded blank node IDs would be great.
However, doing this well requires dropping the current streaming API: with the current API there is no way to know whether a blank node will appear in another triple later during writing, so you need the complete graph to know whether its ID can be omitted. Similarly, there is no way to know, when writing the opening rdf:RDF element, which prefixes should be used.
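
To make the "complete graph" requirement concrete, here is a minimal sketch of the decision an in-memory serializer would have to make. The `Term` and `Triple` types and the `elidable_blank_nodes` function are hypothetical stand-ins for illustration, not Rio's API. The sketch assumes a blank node can be nested without an ID only if it is used as an object exactly once, and it does not handle cycles among blank nodes.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical, simplified term types for illustration; not Rio's API.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub enum Term {
    Named(String),
    Blank(String),
}

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Triple {
    pub subject: Term,
    pub predicate: String,
    pub object: Term,
}

/// Returns the IDs of blank nodes that are used as an object exactly once
/// in the whole graph. Under the assumption stated above, only these can be
/// nested under their single parent without an explicit rdf:nodeID.
/// Cycles among blank nodes are not detected by this sketch.
pub fn elidable_blank_nodes(graph: &[Triple]) -> HashSet<String> {
    // Count how often each blank node ID occurs in object position.
    let mut object_uses: HashMap<&str, usize> = HashMap::new();
    for triple in graph {
        if let Term::Blank(id) = &triple.object {
            *object_uses.entry(id.as_str()).or_insert(0) += 1;
        }
    }
    // Keep only the IDs that were referenced exactly once.
    object_uses
        .into_iter()
        .filter(|(_, count)| *count == 1)
        .map(|(id, _)| id.to_owned())
        .collect()
}
```

The count is only final once every triple has been seen, which is why a purely streaming writer cannot make this decision safely.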

About URI prefixes, I am considering allowing the Rio user to provide them when building an RdfXmlSerializer (and similarly for Turtle/TriG).
For blank node IDs and automatically shared prefixes, it might be nice to provide another serializer for in-memory graphs.
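
As a rough illustration of user-provided prefixes, the sketch below splits an IRI into a prefix and a local name using the longest matching namespace. The `abbreviate` function and the `(prefix, namespace)` pair layout are assumptions made for this example, not an existing Rio interface, and the sketch does not check that the local name is a valid XML name.

```rust
/// Splits `iri` into (prefix, local name) using the longest matching
/// namespace from `prefixes`, a list of (prefix, namespace IRI) pairs.
/// Returns `None` when no namespace matches or the local name would be empty.
/// Hypothetical helper for illustration; not part of Rio.
pub fn abbreviate<'a>(
    iri: &'a str,
    prefixes: &[(&'a str, &'a str)],
) -> Option<(&'a str, &'a str)> {
    prefixes
        .iter()
        // Keep namespaces that are a proper prefix of the IRI.
        .filter(|(_, ns)| iri.starts_with(*ns) && iri.len() > ns.len())
        // Prefer the most specific (longest) namespace.
        .max_by_key(|(_, ns)| ns.len())
        .map(|(prefix, ns)| (*prefix, &iri[ns.len()..]))
}
```

With prefixes supplied up front like this, the opening rdf:RDF element can declare every namespace before any triple is written, so this part would stay compatible with streaming.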

phillord · 2021-02-20T11:18:22Z

You are correct about prefix support of course, and that would also be a great addition.

I have given it a little thought, and I think you could stream, although, of course, which bnodes get elided would depend entirely on the order in which you stream. Another possibility would be to have a semi-streaming interface. You might allow, for example, passing an OwnedBlankNode as the subject of a triple. The serializer would then cache these triples until the equivalent BlankNode appeared again, as an object. This would allow the client of the library to give strong hints about what they expected.
For my use case, this would be fairly memory efficient because I mostly know when a BNode is coming; obviously, in the worst case, every triple could be held in memory until the last.
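
A minimal sketch of the semi-streaming idea, under stated assumptions: `BufferingWriter`, `write_owned`, and the simplified `Term`/`Triple` types are all hypothetical names, not Rio's API, and the "output" is just the order in which triples would be handed to the XML writer. Triples whose blank-node subject the caller marks as owned are held back until that blank node arrives as an object, so they can be emitted directly after their parent and nested without an ID.

```rust
use std::collections::HashMap;

// Hypothetical, simplified types for illustration; not Rio's API.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Term {
    Named(String),
    Blank(String),
}

#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Triple {
    pub subject: Term,
    pub predicate: String,
    pub object: Term,
}

#[derive(Default)]
pub struct BufferingWriter {
    /// Held-back triples, keyed by the ID of their blank-node subject.
    pending: HashMap<String, Vec<Triple>>,
    /// Triples in the order they would reach the underlying XML writer.
    pub emitted: Vec<Triple>,
}

impl BufferingWriter {
    /// Caller's hint: the blank-node subject will appear later as an object.
    pub fn write_owned(&mut self, triple: Triple) {
        let key = match &triple.subject {
            Term::Blank(id) => Some(id.clone()),
            Term::Named(_) => None,
        };
        match key {
            Some(id) => self.pending.entry(id).or_default().push(triple),
            None => self.write(triple), // nothing to wait for
        }
    }

    /// Emits a triple, then any held-back triples about its blank-node object.
    pub fn write(&mut self, triple: Triple) {
        let children = match &triple.object {
            Term::Blank(id) => self.pending.remove(id),
            Term::Named(_) => None,
        };
        self.emitted.push(triple);
        for held in children.into_iter().flatten() {
            self.write(held); // recursion handles deeper nesting
        }
    }

    /// Flushes triples whose parent never arrived (the worst case).
    pub fn finish(mut self) -> Vec<Triple> {
        let mut leftovers: Vec<_> = self.pending.drain().collect();
        leftovers.sort_by(|a, b| a.0.cmp(&b.0)); // deterministic order
        for (_, triples) in leftovers {
            self.emitted.extend(triples);
        }
        self.emitted
    }
}
```

Memory use is bounded by the number of held-back triples, which matches the observation above that the worst case keeps every triple until the end.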

Perhaps the solution is to do as Jena has done, and have two serializers: RDF/XML and RDF/XML-ABBREV.

phillord · 2021-02-20T12:09:26Z

I realise, btw, that this is non-trivial to implement. I thought it worth opening the issue to get some idea of the API that would enable this to work, before I build against the current API.

phillord · 2021-02-21T18:54:37Z

Incidentally, this seems closely related to #25. At the moment, the XML renderer clones the subjects that go into it. In my case, I am creating the Triple objects from Rc<String> values that I have interned, so it's a shame to clone the raw &strs underlying them.

Tpt · 2021-02-21T19:20:47Z

> Perhaps the solution is to do as Jena has done, and have two serializers: RDF/XML and RDF/XML-ABBREV.

Yes, it seems like the best option: a fast serializer that outputs "ugly" data and a slower, more memory-hungry serializer that outputs a nicer format. This distinction would also be useful for the Turtle and TriG serializers, which could benefit from the same optimizations.

Indeed, the RDF/XML serializer (and parser) could be made much faster with some optimization work. So far we have mostly focused on having working parsers.

phillord mentioned this issue Feb 22, 2021

Feature/abbrev xml #40

Closed
