AvroReferences

Avro schema references and schema repositories

Introduction

Avro is one of the many serialization formats created in the last 20 years. For a good introduction, and a comparison with probably the two most popular alternatives, see:

Why avro references

Since Avro does not have "field tags", the schema used to serialize a message (aka the writer schema) is necessary to read the message. Avro schemas can be fairly large and add significant overhead to the wire format. As such, "schema registries" have been created as a solution to this problem (see for an example). A schema registry basically maintains an ID <-> schema mapping that uniquely identifies a schema, which allows passing a relatively small ID instead of a larger schema definition. A communication participant will resolve the ID to the actual schema definition using the schema repository and a local cache. Additionally, schema repositories can index your schemas, validate backwards compatibility, or run other schema quality checks.

Let's look at how this can work:

{"type":"array",
 "items":
   {"type":"record",
    "name":"TestRecord",
    "fields":[{"name":"number","type":["long","null"],"default":0}],
     "id":"testId"}
 }

can become:

{"type":"array","items":{"$ref":"testId"}}

Where can this id come from?

One solution is to add it during the "release" phase of the schema. I prefer using "group:artifact:version:schemaId", which makes the schema easily identifiable in a maven repo (if you choose a maven repo as a schema repo, which I describe in more detail below).

Using maven as a schema repository

Maven is one piece of technology out there that basically does this for Java artifacts (binaries, sources, javadoc...). There is no reason why maven could not fit the bill for schemas, and I would argue that it is the best choice for a lot of use cases.

Here are some of the advantages I see:

  • No new piece of infra needed. (you most likely already have a maven instance)
  • Proven scalability. You will be able to share your data models with the entire world by leveraging existing CDN infra (bintray, etc...)
  • Dependency management that allows schema re-use.
  • Addressing + versioning model.
  • Plugin architecture that allows developing custom plugins. (avrodoc, avro quality checks....)

With certain maven repository implementations, like JFrog Artifactory, you can easily access individual files (schemas) from within packages without the need to download the entire package.

Schema development with maven

The format

Although Avro schemas can be written in JSON, most humans will prefer the Avro IDL.

The version control

Like any piece of software, schemas should be developed using version control. You can also benefit from a code review workflow like Gerrit or PRs. (see for a sample schema project)

The project structure

/pom.xml -- your maven project file
/src/main/avro -- your avro schema files.

your pom.xml can be as simple as:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.spf4j.avro</groupId>
  <artifactId>core-schema</artifactId>
  <packaging>jar</packaging>
  <version>1.0</version>
  <name>${project.artifactId}-${project.version}</name>
  <description>An example schema project</description>
  <parent>
    <groupId>org.spf4j.avro</groupId>
    <artifactId>schema-parent-pom</artifactId>
    <version>LATEST</version>
  </parent>

  <properties>
    <scm.url>https://github.com/zolyfarkas/core-schema</scm.url>
  </properties>

  <scm>
    <connection>${scm.connection}</connection>
    <developerConnection>${scm.connection}</developerConnection>
    <url>${scm.url}</url>
    <tag>core-schema-0.10</tag>
  </scm>
  
  <!-- add any other schema projects you want to re-use
   as normal maven dependencies here -->
</project>

The project build lifecycle

In addition to the standard JAR lifecycle, the following is executed:

<phases>
 <!-- make all dependent avro idl, avsc, avpr files available in target/dependencies -->
 <initialize>org.spf4j:maven-avro-schema-plugin:avro-dependencies</initialize>
 <!-- generate the java classes (/target/generated-sources/avro),
      the avsc files (/target/generated-sources/avsc) for all named schemas,
      attach a mvnId property to all schemas to uniquely identify them -->
 <generate-sources>org.spf4j:maven-avro-schema-plugin:avro-compile</generate-sources>
 <!-- generate avrodoc (https://zolyfarkas.github.io/core-schema/avrodoc.html#/) -->
 <process-classes>maven-antrun-plugin...</process-classes>
 <!-- validate the schemas:
      for forwards and backwards compatibility
      for naming conventions, documentation
      custom validators -->
 <test>org.spf4j:maven-avro-schema-plugin:avro-validate</test>
 <!-- in addition to the java classes being packaged in the jar,
      the avro sources (idl + avsc) will be added to the jar and also published
      separately -->
 <prepare-package>org.spf4j:maven-avro-schema-plugin:avro-package</prepare-package>
 <!-- publish avrodoc to maven and/or scm (gh-pages) -->
</phases>

Versioning

Versioning is identical to the versioning of any other maven project. During the build process every named schema is "stamped" with a unique identifier. The format of the unique identifier is [groupId]:[artifactId]:[version]:[localId]. An example id is "org.spf4j.demo:jaxrs-spf4j-demo-schema:0.3:2".

org.spf4j.demo:jaxrs-spf4j-demo-schema uniquely identifies the schema package, and the localId can be resolved from the schema_index.properties file that is added to the package:

#the package coordinates
_pkg=org.spf4j.demo:jaxrs-spf4j-demo-schema:0.3
#local id to schema name mapping.
0=org.spf4j.demo.avro.DemoRecord
1=org.spf4j.demo.avro.MetaData
2=org.spf4j.demo.avro.DemoRecordInfo
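
A resolver can therefore split a mvnId into its maven coordinates plus the local id, fetch the package, and look the schema name up in the index. Below is a minimal sketch of that lookup; loadIndexFromPackage is a hypothetical helper standing in for however you obtain schema_index.properties from the downloaded package:

import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

final class SchemaIndexLookup {

  // e.g. "org.spf4j.demo:jaxrs-spf4j-demo-schema:0.3:2" -> "org.spf4j.demo.avro.DemoRecordInfo"
  static String schemaNameFor(String mvnId) throws IOException {
    String[] parts = mvnId.split(":");   // [groupId, artifactId, version, localId]
    Properties index = new Properties();
    try (InputStream in = loadIndexFromPackage(parts[0], parts[1], parts[2])) {
      index.load(in);                    // the schema_index.properties shown above
    }
    return index.getProperty(parts[3]);
  }

  // hypothetical helper: fetch schema_index.properties from the maven package
  private static InputStream loadIndexFromPackage(String groupId, String artifactId, String version)
      throws IOException {
    throw new UnsupportedOperationException("depends on how you access the repository");
  }
}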

Avro schema references in avro schemas

Currently avro avsc does not have the concept of schema references, but I really feel this is something that will eventually need to be added to the avro spec.

Let's say we have the following schema:

{"type":"array","items":
{
  "type": "record",
  "name": "DemoRecordInfo",
  "namespace": "org.spf4j.demo.avro",
  "doc": "A record with metadata",
  "fields": [{
      "name": "demoRecord",
      "type": {
        "type": "record",
        "name": "DemoRecord",
        "doc": "A demo record",
        "fields": [{
            "name": "id",
            "type": "string",
            "doc": "id",
            "default": ""
          }, {
            "name": "name",
            "type": "string",
            "doc": "record name",
            "default": ""
          }, {
            "name": "description",
            "type": "string",
            "doc": "record description",
            "default": ""
          }],
        "sourceIdl": "target/avro-sources/demo.avdl:6:61",
        "beta": "",
        "mvnId": "org.spf4j.demo:jaxrs-spf4j-demo-schema:0.3:0"
      },
      "doc": "demo record"
    }, {
      "name": "metaData",
      "type": {
        "type": "record",
        "name": "MetaData",
        "doc": "meta data",
        "fields": [{
            "name": "lastAccessed",
            "type": {
              "type": "string",
              "logicalType": "instant"
            },
            "doc": "last accessed"
          }, {
            "name": "lastAccessedBy",
            "type": "string",
            "doc": "user that last accessed record"
          }, {
            "name": "lastModified",
            "type": {
              "type": "string",
              "logicalType": "instant"
            },
            "doc": "last modified"
          }, {
            "name": "lastModifiedBy",
            "type": "string",
            "doc": "user that last modified record"
          }, {
            "name": "asOf",
            "type": {
              "type": "string",
              "logicalType": "instant"
            },
            "doc": "information time"
          }],
        "sourceIdl": "target/avro-sources/demo.avdl:18:61",
        "beta": "",
        "mvnId": "org.spf4j.demo:jaxrs-spf4j-demo-schema:0.3:1"
      },
      "doc": "record metaData"
    }],
  "sourceIdl": "target/avro-sources/demo.avdl:33:61",
  "beta": "",
  "mvnId": "org.spf4j.demo:jaxrs-spf4j-demo-schema:0.3:2"
}}

As you can see, the array element type has been stamped by the build process with "org.spf4j.demo:jaxrs-spf4j-demo-schema:0.3:2", so one way we could describe the above schema would be:

{"type":"array",
 "items":
  {"$ref":"org.spf4j.demo:jaxrs-spf4j-demo-schema:0.3:2"}
}

This makes the schema JSON small; however, the schema parser will need to be able to resolve the "$ref":"org.spf4j.demo:jaxrs-spf4j-demo-schema:0.3:2" reference.

This size reduction makes it possible to use schemas in HTTP headers to describe the avro content schema, like:

Content-Length: 220
Content-Type: application/avro;avsc={"type":"array","items":{"$ref":"org.spf4j.demo:jaxrs-spf4j-demo-schema:0.3:2"}}
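
On the producer side the header value can be derived from the mvnId property that the build stamps onto every named schema. A minimal sketch, assuming the writer schema carries the mvnId property shown earlier and the payload is an array of that record type:

import org.apache.avro.Schema;

final class AvroContentType {

  // e.g. application/avro;avsc={"type":"array","items":{"$ref":"org.spf4j.demo:jaxrs-spf4j-demo-schema:0.3:2"}}
  static String contentTypeForArrayOf(Schema elementSchema) {
    String mvnId = elementSchema.getProp("mvnId"); // stamped by the build
    return "application/avro;avsc={\"type\":\"array\",\"items\":{\"$ref\":\"" + mvnId + "\"}}";
  }
}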

This also makes it more efficient to implement an "any" logical type, like:

@logicalType("any")
record {
  /** the object schema */
  string schema;
  /** the avro biinary serialized object*/
  bytes object
}
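
A minimal sketch of filling such a wrapper with standard Avro APIs: serialize the payload to Avro binary and keep a compact $ref to its schema alongside (assuming the writer schema was stamped with a mvnId during the build):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

final class AnyPacker {

  /** Returns the two wrapper field values: the schema reference json and the serialized object. */
  static Object[] pack(GenericRecord payload) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(bos, null);
    new GenericDatumWriter<GenericRecord>(payload.getSchema()).write(payload, enc);
    enc.flush();
    // a small $ref instead of the full schema json
    String schemaRef = "{\"$ref\":\"" + payload.getSchema().getProp("mvnId") + "\"}";
    return new Object[] { schemaRef, ByteBuffer.wrap(bos.toByteArray()) };
  }
}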

In the avro fork there is an implementation for schema references. These references are resolved by pluggable "SchemaResolvers".

Since we use maven as a schema repository, it is pretty easy to implement a resolver using maven aether: spf4j-maven-schema-resolver

which is as simple to use as:

// the local maven repository
File localRepo = new File(System.getProperty("user.home"), ".m2/repository");
// the remote repository the schema packages are published to
RemoteRepository bintray = new RemoteRepository.Builder("central", "default",
            "https://dl.bintray.com/zolyfarkas/core")
            .build();

// resolve schema references against the remote repository, using the local repository
MavenSchemaResolver resolver = new MavenSchemaResolver(Collections.singletonList(bintray),
            localRepo, null, "jar");
SchemaResolvers.registerDefault(resolver);

Where maven aether is not practical to have in your dependency tree, there is also a JAX-RS client based implementation: spf4j-jaxrs-client

which is as simple to use as:

SchemaClient resolver = new SchemaClient(new URI("https://dl.bintray.com/zolyfarkas/core"));
SchemaResolvers.registerDefault(resolver);

Both implementations will resolve schema references pretty much the same way.
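
With a default resolver registered, a compact schema like the one from the Content-Type example above can be parsed back into its full definition. A sketch, assuming the fork's Schema.Parser consults the registered default resolver when it encounters a $ref:

Schema schema = new Schema.Parser().parse(
    "{\"type\":\"array\",\"items\":{\"$ref\":\"org.spf4j.demo:jaxrs-spf4j-demo-schema:0.3:2\"}}");
// the reference has been resolved to the full record definition
System.out.println(schema.getElementType().getFullName()); // org.spf4j.demo.avro.DemoRecordInfo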