BetterThanJson

I want to take a brief look at various data serialization formats and compare them. Basically the goal is to answer the question, “can we find something better than JSON?” However, note that we are looking at these things for DATA SERIALIZATION, not for config files and stuff, so that’s the goal by which these will be judged.

There are two orthogonal axes to look at these things along:

  • Self-describing vs. schema-defined formats
  • Human readable vs. machine-readable formats

The first axis is whether the type information for a structure is defined in a separate file (a schema) that a receiving program checks against, or whether the message itself contains type information. It’s almost exactly the difference between statically and dynamically typed programming languages. Like programming languages, both have pros and cons, and neither is always better than the other. The goal of this is to compare apples to apples, so we’re gonna note which category these things fall into but not make value judgements based on them. There are also fuzzy edges; many self-describing formats optionally have a schema layer too. Similarly, we will not really compare tooling quality; the goal is to look at the intrinsic properties of the formats. The culture surrounding them may be considered though.
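
To make the self-describing vs. schema-defined split concrete, here is a rough sketch using only Python's standard library. JSON stands in for the self-describing side, and a fixed `struct` layout stands in for a schema. The record, its field names and the `SCHEMA` layout are all made up for illustration; no real format in this list works exactly like this.

```python
import json
import struct

# Hypothetical record used only for this illustration.
reading = {"sensor_id": 7, "temperature": 21.5}

# Self-describing: field names and value kinds travel inside the message,
# so a receiver can decode it with no prior knowledge of its shape.
self_describing = json.dumps(reading).encode("utf-8")
decoded_blind = json.loads(self_describing)

# Schema-defined: both sides agree out of band that a message is
# "unsigned 16-bit sensor_id, then 64-bit float temperature", little-endian.
# The bytes alone carry no names and no type tags.
SCHEMA = struct.Struct("<Hd")
schema_defined = SCHEMA.pack(reading["sensor_id"], reading["temperature"])
sensor_id, temperature = SCHEMA.unpack(schema_defined)
```

The schema-defined bytes are much shorter, but they are meaningless to a receiver that does not already hold the same layout.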

It’s also important not to conflate a serialization format with an RPC protocol, though many of these things are used IN RPC protocols. Keep in mind that HTTP/REST interfaces are often just a type of RPC protocol, whether or not their designers realize it.

Up to date as of October 2020. Doesn’t try to include myriad minor things, ’cause there’s only so much time in the world.

Human-readable languages

JSON

http://json.org/

What everything gets currently compared against. We all know JSON, we all agree it’s Sorta Good Enough but really is kinda crap.

Category: Human-readable, self-describing. (https://json-schema.org/ exists but does not seem very widely used.) Has an RPC protocol but it also seems lightly used, this might be more general.

Users: Everyone

Pros:

Cons:

  • Type system is pretty shit – no date/time, no real integers, no real structs, no unions/tuples/etc
  • Tends to discourage schemas – “So simple it doesn’t need it”, until it becomes less simple.
  • No normalized form – fields may be reordered, duplicated, etc. Makes hashing it hard, gotta read whole message to begin verifying it, etc.
  • No comments – harder to write well than you might think!
  • No good way to contain binary data
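
A few of these cons are easy to reproduce with Python's standard `json` module. This is a small sketch, not a survey of every parser: the `canonical` helper is an ad hoc convention invented here, and what happens to duplicate keys varies between implementations.

```python
import base64
import hashlib
import json

# Two encodings of the same logical object; only the key order differs.
a = '{"id": 1, "name": "x"}'
b = '{"name": "x", "id": 1}'
same_object = json.loads(a) == json.loads(b)
same_hash = hashlib.sha256(a.encode()).digest() == hashlib.sha256(b.encode()).digest()

def canonical(text):
    """Re-serialize with sorted keys and fixed separators (an ad hoc convention)."""
    return json.dumps(json.loads(text), sort_keys=True, separators=(",", ":")).encode("utf-8")

# Hashes only agree once both sides apply the same normalization step.
canonical_match = canonical(a) == canonical(b)

# Duplicate keys are accepted; Python's parser silently keeps the last one.
duplicate = json.loads('{"k": 1, "k": 2}')

# Binary data has to be smuggled through a string, for example as base64.
blob = bytes([0, 255, 16])
wrapped = json.dumps({"blob": base64.b64encode(blob).decode("ascii")})
unwrapped = base64.b64decode(json.loads(wrapped)["blob"])
```

Note that the receiver has to parse the entire message before it can normalize and hash it.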

YAML

https://yaml.org/

Started out as a simpler alternative to XML.

Category: Human-readable, self-describing.

Users: Lots of people

Pros:

  • Vaguely simple to read and write, in its basic form
  • Low visual noise

Cons:

  • Way too complicated – they made it a strict superset of JSON for some damn reason, and nobody uses that form, so it’s just a pile of wasted effort
  • Reference impl incomplete, other impls disagree with each other and the spec

XML

https://en.wikipedia.org/wiki/XML

Not sure anyone really knows how XML happened. It’s basically the W3C’s fault, I think? It’s okay for some things but in the end I’m not sure it’s something anyone actually wants to use, it’s just going to be one more of those mistakes of the past.

Category: Human-readable, self-describing with common schema usage. Has an RPC protocol and many other complicated things.

Users: Everyone who can’t avoid it.

Pros:

  • Promotes schemas and validation
  • Simple to use for simple things
  • Actually pretty decent for documents

Cons:

  • I’ve never gotten schemas and validation to actually work in practice
  • Everything is string-ly typed
  • No real arrays
  • Complicated as frig
  • Very verbose
  • There’s like 3-4 different ways to do everything
  • Still no good way to contain binary data
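
The "string-ly typed" and "no real arrays" complaints can be seen with Python's built-in `xml.etree.ElementTree`. The document below is a made-up example; without a schema, nothing in it says that `id` is a number or that `item` is a list.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, invented for this illustration.
doc = ET.fromstring(
    "<order>"
    "<id>42</id>"
    "<paid>true</paid>"
    "<item>apple</item>"
    "<item>pear</item>"
    "</order>"
)

# Every value comes back as text; the application has to know what to convert.
raw_id = doc.find("id").text
raw_paid = doc.find("paid").text
order_id = int(raw_id)

# An "array" is just an element name that happens to repeat.
items = [element.text for element in doc.findall("item")]
```

If the order had only one `item`, the document would look the same as one where `item` is a single scalar field, which is part of why schemas matter so much here.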

Machine-readable languages

Protobuf

https://developers.google.com/protocol-buffers/

aka Protocol Buffers, but that’s a pretty dumb name. Google’s common, fast on-the-wire serialization format.

Category: Machine-readable, schema-defined. Has an RPC protocol built around it.

Users: Google, basically everyone

Pros:

  • Backed by Google, so it’s going to be good at the things Google values
  • Basically reasonable
  • Now has some support for versioning schemas, though it’s a hard problem in general

Cons:

  • Backed by Google, so it’s going to be good at the things Google values
  • Not particularly simple
  • Wire protocol may be more work than it needs to be
  • Its type system could maybe be better

Cap’n Proto

https://capnproto.org/

The Other Binary Serialization Protocol.

Category: Machine-readable, schema-defined. Designed primarily for RPC, which is built in to the reference implementation.

Users: sandstorm.io, Cloudflare?, various other people but it doesn’t seem like that many

Pros:

  • Designed to be fast
  • Made by one of the people who worked heavily on Protobuf at Google, so there’s lots of experience behind it. That said, doesn’t mean this cat’s always right, but there’s certainly opinions that are trying to be expressed.
  • Sophisticated RPC comes as part of the standard package
  • Designed for zero-copy deserialization
  • Designed for schema to evolve
  • Adorable name
  • Very explicit about correctness and conformance things such as field ordering and layout

Cons:

  • Very explicit about correctness and conformance things such as field ordering and layout
  • Lots of the docs and concepts are pretty low level, you usually ain’t gonna need it
  • Seems more complicated than protobuf – this might be one reason there are fewer 3rd-party implementations
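
The zero-copy idea can be sketched in plain Python: if every field sits at a known, fixed offset, a reader can pull one value straight out of the received buffer without parsing the rest. The layout below is invented for illustration and is NOT Cap'n Proto's actual encoding; `RECORD` and `read_score` are hypothetical names.

```python
import struct

# Hypothetical fixed layout: uint32 id at offset 0, 4 padding bytes,
# float64 score at offset 8. Each record is 16 bytes.
RECORD = struct.Struct("<I4xd")

# Pretend this buffer arrived from the network or an mmap'd file.
buffer = bytearray(RECORD.size * 3)
for i in range(3):
    RECORD.pack_into(buffer, i * RECORD.size, i + 1, i * 1.5)

view = memoryview(buffer)  # a window onto the bytes, no copy made

def read_score(view, index):
    """Read one field of one record directly from the buffer."""
    return struct.unpack_from("<d", view, index * RECORD.size + 8)[0]

third_score = read_score(view, 2)
```

Nothing here decodes records 0 and 1 to get at record 2, which is the property these formats are designed around. The price is being very explicit about layout, alignment and padding.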

Thrift

https://thrift.apache.org/

Apache’s version of Protobuf. Does anyone actually use this? Facebook, apparently, since they invented it and then gave it to Apache. Anyone else?

Category: Machine-readable, schema-defined. Designed primarily for RPC.

Users: Basically mostly Facebook? Twitter and Airbnb also apparently use it, so apparently it’s not UNpopular.

Pros:

  • It works?

Cons:

  • Docs suck
  • Apache is the tragic junkyard of open source projects
  • Apparently still not as good as flatbuffers, see below

Flatbuffers

https://google.github.io/flatbuffers/

Feels a little like Google’s answer to Cap’n Proto, as it has some of the same design goals – zero-copy serialization and layouts that are more amenable to versioning.

Category: Machine-readable, schema-defined. Includes RPC protocol.

Users: Google, Cocos2D, Facebook’s mobile client

Pros:

  • Designed for zero-copy deserialization
  • Designed for schema to evolve

Cons:

  • Kinda feels like the problem is already solved by capnp
  • Includes a JSON parser for some reason?
  • Type system is kinda anemic with regards to unions

CBOR

https://cbor.io/

Basically a binary re-imagining of JSON.

Category: Machine-readable, self-describing.

Users: ???

Pros:

  • Pretty good type system – there’s things like fixnums, datetimes, blobs, etc
  • Compact
  • Built-in extensibility
  • Designed to be a drop-in replacement for JSON
  • IETF standard

Cons:

  • Kinda more complicated than it needs to be, though this is for the sake of compactness and comprehensive types. Numbers are densely packed into fewer bits when possible, for example.
  • Doesn’t actually seem that widely adopted for some reason?
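
The dense number packing mentioned above looks roughly like this for unsigned integers (CBOR major type 0). This is a hand-written sketch of that one rule, with a made-up function name; it is not a CBOR library and ignores every other type.

```python
import struct

def cbor_encode_uint(n):
    """Encode a non-negative integer using the shortest CBOR form."""
    if n < 0:
        raise ValueError("this sketch only handles unsigned integers")
    if n < 24:
        return bytes([n])                      # value fits in the initial byte
    if n < 2**8:
        return bytes([0x18, n])                # 1-byte argument follows
    if n < 2**16:
        return b"\x19" + struct.pack(">H", n)  # 2-byte big-endian argument
    if n < 2**32:
        return b"\x1a" + struct.pack(">I", n)  # 4-byte big-endian argument
    if n < 2**64:
        return b"\x1b" + struct.pack(">Q", n)  # 8-byte big-endian argument
    raise ValueError("needs a bignum tag, which this sketch does not cover")

small = cbor_encode_uint(10)         # one byte: 0a
medium = cbor_encode_uint(1000)      # three bytes: 19 03 e8
large = cbor_encode_uint(1000000)    # five bytes: 1a 00 0f 42 40
```

So small numbers cost a single byte, at the price of a decoder that has to branch on the initial byte for every value.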

Msgpack

https://msgpack.org/

The Other CBOR, or rather, CBOR is derived from this. Designed to be simple and compact. Kinda a lot like a slightly chopped down CBOR, actually, their integer specification stuff looks nearly identical.

Category: Machine-readable, self-describing.

Users: Redis, a few others?

Pros:

  • Simple
  • Compact

Cons:

  • Specification is kinda weak
  • No real tuple or enum types
  • Why not just CBOR?
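
To show what "compact" means in practice, here is a sketch that encodes a tiny subset of msgpack by hand (small positive integers, short strings, small maps) and compares the size with JSON. `msgpack_encode_small` is a hypothetical helper written for this page, not the real msgpack library.

```python
import json

def msgpack_encode_small(value):
    """Encode only positive fixint, fixstr and fixmap values."""
    if isinstance(value, bool):
        raise TypeError("bool is not covered by this sketch")
    if isinstance(value, int) and 0 <= value <= 127:
        return bytes([value])                        # positive fixint
    if isinstance(value, str):
        raw = value.encode("utf-8")
        if len(raw) <= 31:
            return bytes([0xA0 | len(raw)]) + raw    # fixstr: length in the tag byte
    if isinstance(value, dict) and len(value) <= 15:
        out = bytearray([0x80 | len(value)])         # fixmap: entry count in the tag byte
        for key, item in value.items():
            out += msgpack_encode_small(key) + msgpack_encode_small(item)
        return bytes(out)
    raise TypeError("value is outside the subset this sketch handles")

packed = msgpack_encode_small({"a": 1})
as_json = json.dumps({"a": 1}, separators=(",", ":")).encode("utf-8")
```

Here `packed` is 4 bytes against 7 for the tightest JSON form. The gap narrows for string-heavy data, since the string bytes themselves are stored as-is in both.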

BSON

http://bsonspec.org/

As the name implies, a binary-ification of JSON. Created by MongoDB as its internal data format.

Category: Machine-readable, self-describing.

Users: MongoDB

Pros:

  • Type system is full of deprecated and MongoDB-specific shit but is reasonably pragmatic

Cons:

  • Type system is reasonably pragmatic but is full of deprecated and MongoDB-specific shit
  • C strings – though there’s random non-C strings in places as well.
  • Its arrays are a travesty against serialization
  • Basically an implementation detail of MongoDB, and it looks like it
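
The complaints about arrays and C strings are easier to see in bytes. In BSON an array is stored as an ordinary document whose keys are the decimal strings "0", "1", "2" and so on, each written as a NUL-terminated C string. The sketch below hand-encodes just int32 values; the helper names are made up and this is not the MongoDB driver.

```python
import struct

def bson_int32_element(name, value):
    """Type byte 0x10, then the key as a C string, then a little-endian int32."""
    return b"\x10" + name.encode("utf-8") + b"\x00" + struct.pack("<i", value)

def bson_document(elements):
    """A document is: int32 total length, the elements, then a trailing NUL."""
    body = b"".join(elements) + b"\x00"
    return struct.pack("<i", len(body) + 4) + body

def bson_int32_array(values):
    """An array is a document keyed by the stringified indices."""
    return bson_document(bson_int32_element(str(i), v) for i, v in enumerate(values))

encoded = bson_int32_array([1, 2])
```

Two 4-byte integers end up taking 19 bytes, and every element spends bytes spelling out its own index as text.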

Honorable mentions

Things that are interesting but not actually in the scope of serialization languages, or are otherwise irrelevant.

TOML

https://github.com/toml-lang/toml

Invalid, it’s designed as a config language, not a serialization format. It’s basically an attempt to make something as simple and ubiquitous as windows .INI files that is an actual specification rather than a fashion.

Category: Human-readable, sorta self-describing though usually you have a specific data structure you’re trying to fit it into.

Users: Various, notably Cargo (Rust’s build tool)

Pros:

  • Works well as a config language without deeply nested structures

Cons:

  • Works poorly when you try to make deeply nested structures

RON

https://github.com/ron-rs/ron

Rusty Object Notation. Because shoehorning Rust’s ML-y type system into JSON isn’t very much fun. Works startlingly well for this purpose but is basically untried elsewhere.

Category: Human-readable, sorta self-describing though usually you have a specific data structure you’re trying to fit it into.

Users: A few, notably Amethyst.

Pros:

  • Good type system for sophisticated functional-style languages
  • Simple and reasonably compact
  • Actually very good at what it does

Cons:

  • Young, underspecified, Rust-centric

Bincode

https://github.com/servo/bincode

Included mainly for completeness. It’s not standardized outside of a single particular implementation which doesn’t promise stability, so not intended for general-purpose use. It’s intended as a fast and easy RPC/IPC format for Servo, and the actual format is basically an implementation detail of that goal.

Users: Servo, programs written by introverts who don’t care about being able to talk to each other. (Turns out this is a useful niche though, who knew.)

Pros:

  • Compact, fast, simple.
  • Works basically transparently for IPC with Rust code.

Cons:

  • Anything other than that specific version of that specific library is undefined. If you’re OK with that though, it’s great.

ASN.1

https://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One

Some stupid telecom standards body’s attempt at doing what protobuf would do later. The standards body in question is related to the one that created the willful illusion of reality called the OSI networking model.

Actually has some up sides though. If it wasn’t willfully complicated and overdesigned, it might be pretty good.

Category: Machine-readable, schema-defined.

Users: Hopefully the only places you’ve seen this are in LDAP and in SSL certificates.

Pros:

  • Strong and precise type system
  • Schemas EVERYWHERE
  • Binary and text forms, with methods for it to be shoved into just about any other data format ever

Cons:

  • There’s like eleventy billion data variant formats
  • Super verbose and kinda Ada-ish
  • Way too complicated to actually use, let alone implement
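
For a taste of one of those variant formats, here is a sketch of how DER (the encoding used in certificates) writes an INTEGER as tag, length, value. It only handles non-negative numbers with short lengths, and `der_integer` is a name invented for this example rather than any library's API.

```python
def der_integer(n):
    """Encode a non-negative integer as a DER INTEGER (tag 0x02)."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    # The value is two's complement, so leave room for a zero sign bit.
    length = n.bit_length() // 8 + 1
    if length > 127:
        raise ValueError("long-form lengths are not handled here")
    content = n.to_bytes(length, "big")
    return bytes([0x02, length]) + content

five = der_integer(5)        # 02 01 05
tricky = der_integer(128)    # 02 02 00 80, the extra 00 keeps it positive
```

Even this tiny case has a gotcha (the leading zero byte), which gives some idea of why full implementations are hard to get right.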

XDR

https://tools.ietf.org/html/rfc4506

Included mainly for hysterical raisins. Sun Microsystems’s attempt at doing what protobuf would do later.

Basically what happens when you’re a very good C coder and want to transmit structured data over the network. Pretty reasonable as far as that goes, though.

Category: Machine-readable, schema-defined.

Users: Still used in some places like ZFS, NFS, etc

Pros:

  • Pretty good for what it does

Cons:

  • Doesn’t necessarily do much unless you’re a C program from the early 1990’s
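
The C heritage shows in the encoding rules: everything is big-endian and padded to 4-byte boundaries. A minimal sketch of two XDR types, with made-up helper names (Python's old `xdrlib` module is not used here):

```python
import struct

def xdr_int(n):
    """A signed 32-bit integer, big-endian."""
    return struct.pack(">i", n)

def xdr_string(text):
    """A 4-byte length, the bytes, then zero padding up to a multiple of 4."""
    raw = text.encode("ascii")
    padding = (-len(raw)) % 4
    return struct.pack(">I", len(raw)) + raw + b"\x00" * padding

number = xdr_int(1)          # 00 00 00 01
greeting = xdr_string("hello")  # length 5, "hello", then 3 padding bytes
```

As with the other schema-defined formats, nothing in the bytes says which of these two types you are looking at; the receiver has to know.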

S-Expressions

What Lisp code is made of, an elegant notation from a more civilized time. Like lots of Lisp solutions, it works really well until you need to get two Lisp implementations to use the same kind of thing. Has steadfastly not managed to catch on outside of Lisp despite trying since at least the 1970’s.

Does not have an actual universal spec, let alone implementations. EDN is a pretty nice start though.

Category: Human-readable, self-describing

Users: Any Lisp-like language, primary Real Examples are Scheme, Racket, Clojure and theoretically Common Lisp.

Pros:

  • Lisp people will love it, non-Lisp people will hate it.
  • Great for representing trees
  • Reasonably simple and nice

Cons:

  • (CAR CDR)
  • Lisp people will love it, non-Lisp people will hate it.
  • Does not actually have well agreed-upon syntax for compound data types other than lists.
  • Anyone out there with a Lisp interpreter will try to read it with READ, despite it already having been proven that’s a terrible idea.
  • No matter what form of S-expression you use, somewhere out there someone will be annoyed that their particular form of Lisp can’t load it with READ.
  • People will try to write Lisp code in it.
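
Part of the appeal is how little code a reader takes. Below is a toy parser for one possible S-expression dialect (parentheses, integers and bare symbols only). It is a sketch with hypothetical function names, has no string or comment support, and does no real error handling.

```python
def tokenize(text):
    """Split on whitespace after padding the parentheses."""
    return text.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    """Parse one expression, consuming tokens from the front of the list."""
    token = tokens.pop(0)
    if token == "(":
        items = []
        while tokens[0] != ")":
            items.append(parse(tokens))
        tokens.pop(0)  # drop the closing parenthesis
        return items
    if token == ")":
        raise ValueError("unexpected closing parenthesis")
    try:
        return int(token)
    except ValueError:
        return token  # anything else is treated as a symbol

tree = parse(tokenize("(config (port 8080) (hosts a b))"))
```

The result is nested lists all the way down, which is great for trees but also shows the problem: whether `(port 8080)` is a pair, a map entry or a function call is purely a matter of convention.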

Conclusions

Draw your own.

No? Fine.

Good Enough:

  • JSON?
  • Protobuf
  • Cap’n Proto
  • Flatbuffers
  • CBOR
  • msgpack

Avoid:

  • YAML
  • XML
  • Thrift?
  • BSON

Appendix A: Lineage

This is actually kinda interesting ’cause it’s easy to trace each format as a reaction to ones before it. ASN.1, XDR and a zoo of even stranger stuff predate the current internet age. The Modern Age starts with XML. XML has a long lineage of its own, but it forms a kinda bottleneck. It’s one of those technological ontology changes, like a mass extinction. Most of the things people actually care about formed in reaction to XML, so that’s where I’m going to start.

So, the family tree of the most widespread things would be (apologies for those on mobile):

                                                                        /--> CBOR
XML---(XML is too verbose)-+---> JSON --(JSON but binary and compact)---+--> msgpack
 |                         \---> YAML                                   \--> BSON
 |
 |
 \---(XML but binary)------+---> Protobuf --(Protobuf but faster)---+---> Cap'n Proto
                           \---> Thrift                             \---> Flatbuffers

Appendix B: Thoughts

So when you actually look at this list, one thing stands out: There isn’t actually a replacement for JSON. Nothing better than it in the “human readable” column. Oh, there’s been many that have tried, such as:

…But few of those seem kept up to date, let alone used widely. JSON5 probably comes closest, by virtue of being closest to its predecessor. This seems an area ripe for innovation though.

Example of said innovation: Dhall. This might actually be the real way to go.

Appendix C: Honorable Mentions

That said, please stop suggesting more unless they actually get used by more than one organization.

  • BEncode – Bittorrent. Not terribly efficient or capable, but simple and self-describing.
  • Avro – Hadoop/Apache/Yahoo. Does anyone actually heckin’ use this? Anything using Kafka, apparently.
  • Ion – Amazon
  • CDDL – A schema system for CBOR
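
BEncode is small enough that a whole encoder fits in a few lines, which shows what "simple and self-describing" buys you. This is a sketch written for this page (the `bencode` function is not from any particular library); text strings are encoded as UTF-8 here, which is an assumption of the example.

```python
def bencode(value):
    """Encode ints, strings/bytes, lists and dicts in BitTorrent's bencode."""
    if isinstance(value, bool):
        raise TypeError("bencode has no boolean type")
    if isinstance(value, int):
        return b"i%de" % value                       # i<digits>e
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)        # <length>:<bytes>
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Keys are byte strings and must appear in sorted order.
        items = [(k.encode("utf-8") if isinstance(k, str) else k, v) for k, v in value.items()]
        items.sort(key=lambda pair: pair[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError("unsupported type for bencode")

encoded = bencode({"b": 1, "a": [2, "x"]})
```

Because keys must be sorted and integers have exactly one spelling, each value has a single valid encoding, so hashing the bytes is safe in a way it isn't for JSON.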

Also see BetterThanJsonIdeas.