LeveragingAvro
Avro is one of the many serialization formats created in the last 20 years. For a good introduction, and for a comparison with probably the two most popular alternatives, see:
In this demo project we use Avro for:

- the wire format.
- the log format.

Why Avro:

- Multiple encodings are supported:
  - binary for efficiency.
  - JSON for interoperability and debugging.
  - CSV for interoperability and debugging.
- Extensible: you can add your own metadata to the schema (@beta, @displayName, ...), as sketched below.
- Avro schemas have a JSON representation.
- Multiple language support.
- Open source.
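As an illustration of the extensibility and JSON-representation points above, here is a minimal sketch that attaches custom metadata to a schema via the standard Avro SchemaBuilder API (the record name and the property values are made up for illustration):

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public final class SchemaMetadataDemo {

  public static void main(String[] args) {
    // Build a record schema carrying custom metadata properties;
    // "beta" and "displayName" are attached as plain schema attributes.
    Schema schema = SchemaBuilder.record("DemoRecord")
        .namespace("org.spf4j.demo")
        .prop("beta", "true")                            // record-level metadata
        .fields()
          .name("id").prop("displayName", "Record Id")   // field-level metadata
            .type().stringType().noDefault()
        .endRecord();
    // Avro schemas have a JSON representation:
    System.out.println(schema.toString(true));
  }
}
```

The custom properties are preserved in the schema's JSON representation, so downstream tooling (UIs, deprecation checks) can act on them.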
Start up the demo app as described at.
Let's try to get some data from:
As you can observe, the writer schema info is provided via the avsc parameter of the Content-Type HTTP header:
```
Content-Length: 505
Content-Type: application/json;avsc={"type":"array","items":{"$ref":"org.spf4j.demo:jaxrs-spf4j-demo-schema:0.3:2"}}
```
Removing `?_Accept=application/json` will yield the more efficient binary response:
```
Content-Length: 220
Content-Type: application/avro;avsc={"type":"array","items":{"$ref":"org.spf4j.demo:jaxrs-spf4j-demo-schema:0.3:2"}}
```
If we want the data in CSV format, since this endpoint supports it, we can use `?_Accept=text/csv`:
```
Content-Length: 376
Content-Type: text/csv;avsc={"type":"array","items":{"$ref":"org.spf4j.demo:jaxrs-spf4j-demo-schema:0.3:2"}}
```
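Here is a minimal JAX-RS client sketch (the base URL is hypothetical) selecting the CSV representation via the `_Accept` query parameter:

```java
import javax.ws.rs.client.ClientBuilder;

public final class CsvFetchDemo {

  public static void main(String[] args) {
    // select the CSV representation with the _Accept query parameter
    String csv = ClientBuilder.newClient()
        .target("http://localhost:8080/demo/records")   // hypothetical base URL
        .queryParam("_Accept", "text/csv")
        .request()
        .get(String.class);
    System.out.println(csv);
  }
}
```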
Additionally, HTTP content type negotiation is supported by the server, and a client can ask for a specific record version using the Accept header:
```
Accept: application/json;avsc={"$ref":"org.spf4j.demo:jaxrs-spf4j-demo-schema:0.4:b"}
```
This way a client can ask for a previous version of the record, or for a projection of the data. The service implementor needs to be careful when removing fields in the future: removing fields should be done using the deprecation workflow (the @deprecated Avro property). The service will notify the client via an HTTP Warning header when it accesses a deprecated property/object.
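For example, here is a minimal JAX-RS client sketch asking for the 0.4 version of the record schema; the base URL is hypothetical, the package of DemoRecordInfo is assumed, and the spf4j Avro message body readers are assumed to be registered with the client:

```java
import java.util.List;
import javax.ws.rs.client.Client;
import javax.ws.rs.client.ClientBuilder;
import javax.ws.rs.core.GenericType;

import org.spf4j.demo.DemoRecordInfo;  // assumed package of the demo's record class

public final class VersionedFetchDemo {

  public static void main(String[] args) {
    Client client = ClientBuilder.newClient();  // spf4j Avro providers assumed registered
    List<DemoRecordInfo> records = client
        .target("http://localhost:8080/demo/records")   // hypothetical base URL
        .request()
        // ask for a specific schema version, exactly as in the Accept header above
        .header("Accept",
            "application/json;avsc={\"$ref\":\"org.spf4j.demo:jaxrs-spf4j-demo-schema:0.4:b\"}")
        .get(new GenericType<List<DemoRecordInfo>>() { });
    records.forEach(System.out::println);
  }
}
```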
All the above is magically served by an endpoint definition like:
```java
@GET
@Produces(value = {"application/avro", "application/avro-x+json", "application/octet-stream",
    "application/json", "text/csv"})
List<DemoRecordInfo> getRecords();
```
Having your OpenAPI descriptor and UI also works out of the box:
This functionality is implemented by the spf4j Avro feature, and leverages Avro references.
Using Avro as the log format gives you:

- structure: no need to write custom parsers (see the reading sketch after this list). See for the record structure.
- efficiency: smaller size due to the binary format, plus built-in compression.
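Because the writer schema is embedded in every Avro data file, the log files can be read back with the stock Avro file reader; here is a minimal sketch (the log file path is hypothetical):

```java
import java.io.File;
import java.io.IOException;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public final class LogReadDemo {

  public static void main(String[] args) throws IOException {
    // the writer schema travels in the file header, so no custom parsing is needed
    try (DataFileReader<GenericRecord> reader = new DataFileReader<>(
            new File("/var/log/demo/app.logs.avro"),   // hypothetical log file path
            new GenericDatumReader<>())) {
      for (GenericRecord logRecord : reader) {
        System.out.println(logRecord);
      }
    }
  }
}
```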
An example of how to use Avro for logs (leveraging spf4j-logback and spf4j-jaxrs-actuator) is at.
As you might observe, logs are written to the console, and that is on purpose. Although logging to the console is what most literature recommends, it has disadvantages: console output is limited to text format, which leads to inefficiency (large size), compounded by JSON wrapping and loss of structure (various libraries will write to it in various formats).
Here is a stdout log line example from a Kubernetes node:
{"log":"SLF4J: A number (2) of logging calls during the initialization phase have been intercepted and are\n","stream":"stderr","time":"2019-05-29T01:34:59.1306243Z"}
{"log":"SLF4J: now being replayed. These are subject to the filtering rules of the underlying logging system.\n","stream":"stderr","time":"2019-05-29T01:34:59.1307042Z"}
As you can see, every stdout/stderr log line is wrapped into a JSON object, which not only adds extra overhead to your messages, it also obscures their structure.
To overcome these limitations, your logging backend can be configured to log to the Kubernetes host log folder; your logs can take 5-10 times less disk space, which should increase your logging efficiency significantly.
A good example for this is at. In this example, the service can serve its own logs (at cluster level), which reduces the need for a log aggregator like Splunk. Actually, I think that deploying a log service (aggregator) that serves the logs from where they are, avoiding data movement, will result in a significantly more scalable system.
Here are some examples of what you can do:
- Show latest logs in text format:
- Show latest logs in JSON format:
- Show request logs where exec time exceeds a value:
- Browse cluster log files:
- Show all log files from a particular node:
- Download a log file:
For more capabilities, like profiling and metrics, see the actuator and profiling writeups.