Add XML abbreviations and BNode elision #39

Open
phillord opened this issue Feb 19, 2021 · 5 comments

Comments

@phillord
Contributor

Currently the XML serialization is all in longhand. It would be good to support some of the shorthand syntaxes, for example typed nodes, and to omit explicit BNode IDs where possible.

For example:

    // `f` is the RDF/XML formatter; `bnid` holds the blank node id.
    // The blank node is the object of the first triple...
    f.format(&Triple {
        subject: NamedNode { iri: "http://top.level/top_sub" }.into(),
        predicate: NamedNode { iri: "http://top.level/top_pred" }.into(),
        object: BlankNode { id: &bnid }.into(),
    })?;

    // ...and the subject of the second.
    f.format(&Triple {
        subject: BlankNode { id: &bnid }.into(),
        predicate: NamedNode { iri: "http://one.deep/one_pred" }.into(),
        object: NamedNode { iri: "http://one.deep/one_obj" }.into(),
    })?;

currently produces

    <?xml version="1.0" encoding="UTF-8"?>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
        <rdf:Description rdf:about="http://top.level/top_sub">
            <top_pred xmlns="http://top.level/" rdf:nodeID="bn1"/>
        </rdf:Description>
        <rdf:Description rdf:nodeID="bn1">
            <one_pred xmlns="http://one.deep/" rdf:resource="http://one.deep/one_obj"/>
        </rdf:Description>
    </rdf:RDF>

whereas something like this:

    <?xml version="1.0" encoding="utf-8"?>
    <rdf:RDF
      xmlns="http://top.level/"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:default1="http://one.deep/"
    >
      <rdf:Description rdf:about="http://top.level/top_sub">
        <top_pred>
          <rdf:Description>
            <default1:one_pred rdf:resource="http://one.deep/one_obj"/>
          </rdf:Description>
        </top_pred>
      </rdf:Description>
    </rdf:RDF>

would be better. Combined with the other shortcut syntaxes, this will make a big difference to the eventual file size!

@Tpt
Collaborator

Tpt commented Feb 19, 2021

Yes, having prefixes and removing unneeded blank node IDs would be great.
However, doing this well requires dropping the current streaming API: with the current API there is no way to know whether a blank node is going to appear in another triple later in the output, so you need the complete graph to know whether you can omit a blank node ID. Similarly, there is no way to know, when writing the opening rdf:RDF node, which prefixes should be declared.
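
A minimal sketch of why the complete graph is needed, assuming simplified stand-in types (`Term`, `Triple`, and `elidable_blank_nodes` are hypothetical names, not part of the Rio API). It treats a blank node as a candidate for elision only if it occurs exactly once in object position, which can only be decided after every triple has been seen. Cycles between blank nodes are not handled here.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, simplified term type (not the Rio API).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum Term {
    Named(String),
    Blank(String),
}

/// Hypothetical, simplified triple: (subject, predicate IRI, object).
pub type Triple = (Term, String, Term);

/// Returns the ids of blank nodes that occur exactly once in object position.
/// Only those can be nested under their single parent property without an
/// rdf:nodeID. This needs a full pass over the graph before writing anything.
pub fn elidable_blank_nodes(graph: &[Triple]) -> HashSet<String> {
    let mut object_uses: HashMap<&str, usize> = HashMap::new();
    for (_, _, object) in graph {
        if let Term::Blank(id) = object {
            *object_uses.entry(id.as_str()).or_insert(0) += 1;
        }
    }
    object_uses
        .into_iter()
        .filter(|&(_, uses)| uses == 1)
        .map(|(id, _)| id.to_string())
        .collect()
}
```

For the two triples in the example at the top of this issue, this would report `bn1` as elidable; a blank node referenced as an object from two different triples would keep its ID.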

About URI prefixes, I am considering allowing the Rio user to provide them when building an RdfXmlSerializer (and similarly for Turtle/TriG).
For blank node ids and automated shared prefixes, it might be nice to provide another serializer for in-memory graphs.
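
A rough sketch of how caller-supplied prefixes could be applied, assuming a plain map from namespace IRI to prefix. The function names `split_iri` and `qname` are hypothetical and not an existing Rio API, and the sketch does not check that the local part is a valid XML name.

```rust
use std::collections::HashMap;

/// Splits an IRI after its last '#' or '/' into (namespace, local name).
/// Returns None if there is no separator or the local name would be empty.
pub fn split_iri(iri: &str) -> Option<(&str, &str)> {
    let idx = iri.rfind(|c: char| c == '#' || c == '/')?;
    // '#' and '/' are single-byte characters, so idx + 1 is a char boundary.
    let (namespace, local) = iri.split_at(idx + 1);
    if local.is_empty() {
        None
    } else {
        Some((namespace, local))
    }
}

/// Abbreviates `iri` to `prefix:local` using caller-supplied prefixes
/// (namespace IRI -> prefix). Returns None if the namespace is not declared,
/// in which case a serializer would fall back to the longhand form.
pub fn qname(iri: &str, prefixes: &HashMap<String, String>) -> Option<String> {
    let (namespace, local) = split_iri(iri)?;
    let prefix = prefixes.get(namespace)?;
    Some(format!("{}:{}", prefix, local))
}
```

Because the prefixes are known before the first triple arrives, the opening rdf:RDF element can declare them up front and the streaming API can stay as it is.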

@phillord
Contributor Author

You are correct about prefix support of course, and that would also be a great addition.

I have given a little thought to it, and I think you could stream, although, of course, which bnodes get elided would depend entirely on the order in which you stream. Another possibility would be to have a semi-streaming interface. You might allow, for example, passing an OwnedBlankNode as the subject of a triple. The serializer would then cache these triples until the equivalent BlankNode appeared again, as an object. This would allow the client of the library to give strong hints about what they expected.
For my use case, this would be fairly memory efficient because I mostly know when a BNode is coming; obviously, in the worst case, every triple could be held in memory until the end.
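
A minimal sketch of the caching idea, with stand-in types rather than the Rio ones. `Node`, `Triple`, and `DeferringBuffer` are hypothetical names, and the sketch only shows the bookkeeping, not the XML output.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified node type (not the Rio API).
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    Named(String),
    Blank(String),
}

/// Hypothetical, simplified triple.
#[derive(Debug, Clone, PartialEq)]
pub struct Triple {
    pub subject: Node,
    pub predicate: String,
    pub object: Node,
}

/// Holds triples with a blank subject until that blank node shows up as an object.
#[derive(Default)]
pub struct DeferringBuffer {
    pending: HashMap<String, Vec<Triple>>,
}

impl DeferringBuffer {
    /// Caches a triple whose blank subject the caller expects to see later as
    /// an object. Triples with a named subject are handed back unchanged.
    pub fn defer(&mut self, triple: Triple) -> Result<(), Triple> {
        let id = match &triple.subject {
            Node::Blank(id) => Some(id.clone()),
            Node::Named(_) => None,
        };
        match id {
            Some(id) => {
                self.pending.entry(id).or_default().push(triple);
                Ok(())
            }
            None => Err(triple),
        }
    }

    /// Accepts an ordinary triple and returns it together with any cached
    /// triples that can now be nested under its blank object.
    pub fn push(&mut self, triple: Triple) -> (Triple, Vec<Triple>) {
        let nested = match &triple.object {
            Node::Blank(id) => self.pending.remove(id).unwrap_or_default(),
            Node::Named(_) => Vec::new(),
        };
        (triple, nested)
    }

    /// Number of triples currently held in memory.
    pub fn pending_len(&self) -> usize {
        self.pending.values().map(Vec::len).sum()
    }
}
```

Memory use is bounded by how long deferred triples wait for their parent; anything still pending when the stream ends would have to be flushed in longhand with an explicit rdf:nodeID.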

Perhaps the solution is to do as Jena has done, and have two serializers: RDF/XML and RDF/XML-ABBREV.

@phillord
Contributor Author

I realise, by the way, that this is non-trivial to implement. I thought it worth opening the issue to get some idea of the API that would enable this to work, before I build against the current API.

@phillord
Contributor Author

Incidentally, this seems to be closely related to #25. At the moment, the XML renderer clones the subjects that go into it. In my case, I am creating the Triple objects from Rc<String> values that I have interned, so it's a shame to clone the raw &strs underlying them.
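
To illustrate the cost difference, here is a small interning sketch. `Interner` is a hypothetical type, and it uses `Rc<str>` rather than `Rc<String>` so that lookups by `&str` need no allocation; that choice is an assumption of the sketch, not what my code does today. Cloning an interned value only bumps a reference count, whereas copying the underlying `&str` into an owned `String` allocates every time.

```rust
use std::collections::HashSet;
use std::rc::Rc;

/// Hypothetical string interner: each distinct string is stored once.
#[derive(Default)]
pub struct Interner {
    seen: HashSet<Rc<str>>,
}

impl Interner {
    /// Returns a shared handle to `s`, allocating only the first time `s` is seen.
    pub fn intern(&mut self, s: &str) -> Rc<str> {
        if let Some(existing) = self.seen.get(s) {
            // Cheap: increments the reference count, copies no string data.
            return Rc::clone(existing);
        }
        let rc: Rc<str> = Rc::from(s);
        self.seen.insert(Rc::clone(&rc));
        rc
    }
}
```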

@Tpt
Collaborator

Tpt commented Feb 21, 2021

> Perhaps the solution is to do as Jena has done, and have two serializers: RDF/XML and RDF/XML-ABBREV.

Yes, it seems like the best option: a fast serializer that outputs "ugly" data, and a slower, more memory-hungry serializer that outputs a nicer format. This distinction would also be useful for the Turtle and TriG serializers, which could benefit from the same optimizations.

Indeed, the RDF/XML serializer (and parser) could be made much faster with some optimization work. We have mostly focused on having working parsers at the moment.
