Skip to content

maximveksler/awesome-serialization

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

35 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

awesome-serialization Awesome

So many formats, so much wow. See wesm comment before making up your mind.

Contents

API

Serialization suitable for API RPC networked services.

  • CSV - Comma Separated Values. Textual.
  • JSON - Lightweight document data-interchange format. Textual.
  • JSONL - Schemeless "multiple JSON documents in 1 file" container data format. Textual.
  • JSON5 - JSON with added support for comments and relaxed syntax. Textual.
  • Thrift - Scalable code generation, schema evolution binary format. Binary.
  • Protocol Buffers - Google's data interchange format. Binary.
  • Message Pack - Efficient JSON-like binary serialization format. Binary.
  • bson - Binary schemeless JSON encoding. Binary.
  • XML - Extensible Markup Language. Genuinely Horrible. Textual.
  • Plist - Property List representation. Apple. Textual.
  • YAML - Indentation-based data serialization standard. Textual.
  • TOML - Tom's Obvious, Minimal Language. Textual.
  • CBOR - Concise Binary Object Representation. Schema-free. Binary.
  • ION - Amazon's advanced JSON-compatible serialization. Textual/Binary.

RPC

  • gRPC - A high-performance, open source universal RPC framework. Binary, ISO Layer 7.
  • RSocket - Application protocol providing Reactive Streams semantics. Binary, ISO Layer 5 (or 6).
  • Cap’n Proto - High-performance, schema-based data interchange format. Binary.
  • FlatBuffers - Suitable for zero-copy deserialization. Binary.

Big Data

Serialization suitable for big data at rest systems, from Hadoop family of solutions.

  • Parquet - Columnar storage for Hadoop workloads. Binary.
  • FlatBuffers - Protocol Buffers suitable for larger datasets. Binary.
  • ORC - Columnar storage for Hadoop workloads. Binary.
  • Avro - Scheme embedded, dynamic rich data structures. Textual/Binary.
  • Ion - Row storage with skip scan parsing. Structured, schema embedded. Amazon. Textual/Binary.
  • Arrow - Cross-language columnar data format optimized for analytics workloads. Binary.
  • Delta Lake - Transactional storage layer for big data workflows. Binary.
  • Iceberg - Open table format for large datasets. Binary.

Scientific

Large-scale sparse arrays used in physical, mathematics, and statistics research.

  • HDF5® - n-dimensional datasets, complex objects, with schema. Efficient I/O. Binary.
  • npy - Numpy arrays, cell sparse metadata. Binary.
  • NetCDF - Self-describing, machine-independent data format for scientific data. Binary.
  • Zarr - Scalable storage of n-dimensional arrays. Binary.
  • ASDF - Advanced Scientific Data Format for astronomy and beyond. Binary/Textual.

Machine Learning

Serialization of deep learning networks and weights.

  • SavedModel - TensorFlow package, weights, graph, executable code. Binary.
  • CoreML - Apple's on-device ML model format. Binary.
  • ONNX - Open Neural Network Exchange. Interoperability focused. Binary.
  • MLIR - Intermediate representation for machine learning computations. Textual/Binary.
  • TorchScript - Serialization for PyTorch models. Binary.
Deprecated in Machine Learning
  • GraphDef - TensorFlow graphs. Binary.
  • PMML - Predictive Model Markup Language for exchanging ML models.

Graph

Serialization formats for representing graph-oriented data structures.

  • json-ld - JSON for Linking Data. Textual.
  • Turtle - Terse RDF Triple Language. Textual.
  • GraphML - XML-based graph serialization format. Textual.
  • DOT - Graph description language, developed as a part of the Graphviz project. Textual.
  • ParquetGraph - Integration of Parquet with graph data structures. Binary.
  • GraphSON - JSON-based graph serialization. Textual.
Deprecated in Graph
  • GraphML - XML-based graph serialization format. Textual.- GML - Graph Modeling Language. Hierarchical ASCII-based. Textual.

Workflow

  • common-workflow-language - Specification for describing analysis workflows and tools in a way that makes them portable and scalable across a variety of software and hardware environments.
  • Apache Airflow DAGs - Python-based Directed Acyclic Graphs for workflows.
  • WDL - Workflow Description Language for genomics and scientific workflows.
  • Relational Algebra and Datalog for Graphs - Coursera course on graph data manipulation.
  • Cromwell - Scientific workflow management, compatible with WDL and CWL.
  • Nextflow - Scalable and reproducible scientific workflows.

Programming

Language-native serialization formats for in-transit data (aka memory-based "live" objects).

JavaScript

  • BSON.js - BSON serializer. Binary.
  • avsc - JavaScript implementation of Apache Avro. Textual.

Dart

Python

  • pickle - RAM to Disk serialization. Binary.
  • msgpack-python - MessagePack serializer implementation for Python.
  • srsly - Modern high-performance serialization utilities for Python.

Swift

Java

Rust

  • Serde - Rust's serialization framework for multiple formats like JSON, CBOR, and MessagePack.
  • bincode - High-performance binary serialization for Rust.

Go

  • GOB - Go's built-in serialization format for arbitrary data structures. Binary.

Streaming

Serialization formats optimized for real-time streaming data.

  • Kafka Streams - Real-time stream processing framework with built-in serialization.
  • Protobuf-Lite - Lightweight Protocol Buffers for constrained environments.

Security-Focused

Serialization formats designed with security and robustness in mind.

Academic

Research papers discussing types, category theory, benchmarks & co.

Contribute

Contributions welcome! Read the contribution guidelines first.

License

CC0

To the extent possible under law, Maxim Veksler has waived all copyright and related or neighboring rights to this work.

About

Data formats useful for API, Big Data, ML, Graph & co

Topics

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published