BetterThanJson
I want to take a brief look at various data serialization formats and compare them. Basically the goal is to answer the question, “can we find something better than JSON?” However, note that we are looking at these things for DATA SERIALIZATION, not for config files and stuff, so that’s the goal by which these will be judged.
There’s two orthogonal axes to look at these things under:
- Self-describing vs. schema-defined formats
- Human readable vs. machine-readable formats
The first axis is whether the type information for a structure is defined in a separate file (a schema) that a receiving program checks against, or whether the message itself contains type information. It’s almost exactly the difference between statically and dynamically typed programming languages. Like programming languages, both have pros and cons, and neither is always better than the other. The goal of this is to compare apples to apples, so we’re gonna note which category these things fall into but not make value judgements based on them. There’s also fuzzy edges; many self-describing formats optionally have a schema layer too. Similarly, we will not really compare tooling quality; the goal is to look at the intrinsic properties of the formats. The culture surrounding them may be considered though.
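To make the self-describing vs. schema-defined split concrete, here is a rough sketch that encodes the same record both ways using only Python's standard library. The record, its field names and the `<Hd` layout are made up for illustration; the packed form stands in for schema-defined formats in general, not for any specific one discussed here.

```python
import json
import struct

# Hypothetical record, used only for illustration.
reading = {"sensor_id": 7, "temperature": 21.5}

# Self-describing: field names and value types travel inside the message.
self_describing = json.dumps(reading).encode("utf-8")

# Schema-defined: both sides agree out of band that a message is
# "little-endian uint16 sensor_id, then float64 temperature".
SCHEMA = struct.Struct("<Hd")
schema_defined = SCHEMA.pack(reading["sensor_id"], reading["temperature"])

# The JSON receiver can discover the structure from the bytes alone.
decoded_json = json.loads(self_describing)

# The receiver of the packed bytes needs the same schema to interpret them.
sensor_id, temperature = SCHEMA.unpack(schema_defined)
```

The packed message is much smaller, but without `SCHEMA` it is just ten opaque bytes. That is the trade-off in miniature.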
It’s also important not to conflate these formats with RPC protocols, though many of them are used IN RPC protocols. Keep in mind that HTTP/REST interfaces are often just a type of RPC protocol, whether realized that way or not.
Up to date as of October 2020. Doesn’t try to include myriad minor things, ’cause there’s only so much time in the world.
Human-readable languages
JSON
What everything gets currently compared against. We all know JSON, we all agree it’s Sorta Good Enough but really is kinda crap.
Category: Human-readable, self-describing. (https://json-schema.org/ exists but does not seem very widely used.) Has an RPC protocol but it also seems lightly used, this might be more general.
Users: Everyone
Pros:
- Similar to major programming languages – Easy to understand and debug
- Simple – Easy to read, write, and understand… at least for simple things. Turns out there’s a lot of gotchas though.
- Pretty compact if minified
Cons:
- Type system is pretty shit – no date/time, no real integers, no real structs, no unions/tuples/etc
- Tends to discourage schemas – “So simple it doesn’t need it”, until it becomes less simple.
- No normalized form – fields may be reordered, duplicated, etc. Makes hashing it hard, gotta read whole message to begin verifying it, etc.
- No comments – harder to write well than you might think!
- No good way to contain binary data
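A few of these cons are easy to poke at with Python's standard `json` module. This is only a sketch of the symptoms; the `canonical` helper is an ad-hoc workaround invented for this example, not a standardized canonical form.

```python
import base64
import hashlib
import json

# Duplicate keys: Python's parser accepts them and silently keeps the last one.
dupes = json.loads('{"a": 1, "a": 2}')

# No normalized form: equal data, different bytes, different hashes.
m1 = '{"x": 1, "y": 2}'
m2 = '{"y": 2, "x": 1}'
same_data = json.loads(m1) == json.loads(m2)
same_hash = hashlib.sha256(m1.encode()).digest() == hashlib.sha256(m2.encode()).digest()

def canonical(text: str) -> str:
    """Ad-hoc normalization: sorted keys, no optional whitespace."""
    return json.dumps(json.loads(text), sort_keys=True, separators=(",", ":"))

# No real integers: a parser that stores numbers as float64 cannot
# represent every integer above 2**53.
big = 2**53 + 1
as_float = float(big)  # what a float64-based parser would end up holding

# No binary data: bytes have to be smuggled through a string, e.g. base64.
blob = bytes([0, 255, 16])
wrapped = json.dumps({"blob": base64.b64encode(blob).decode("ascii")})
unwrapped = base64.b64decode(json.loads(wrapped)["blob"])
```

Python itself keeps big integers exact, so the precision loss is simulated with `float()` here; it shows up for real in parsers that only have doubles.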
YAML
Started out as a simpler alternative to XML.
Category: Human-readable, self-describing.
Users: Lots of people
Pros:
- Vaguely simple to read and write, in its basic form
- Low visual noise
Cons:
- Way too complicated – they made it a strict superset of JSON for some damn reason, and nobody uses that form, so it’s just a pile of wasted effort
- Reference impl incomplete, other impl’s disagree with each other and the spec
XML
https://en.wikipedia.org/wiki/XML
Not sure anyone really knows how XML happened. It’s basically the W3C’s fault, I think? It’s okay for some things but in the end I’m not sure it’s something anyone actually wants to use, it’s just going to be one more of those mistakes of the past.
Category: Human-readable, self-describing with common schema usage. Has an RPC protocol and many other complicated things.
Users: Everyone who can’t avoid it.
Pros:
- Promotes schemas and validation
- Simple to use for simple things
- Actually pretty decent for documents
Cons:
- I’ve never gotten schemas and validation to actually work in practice
- Everything is string-ly typed
- No real arrays
- Complicated as frig
- Very verbose
- There’s like 3-4 different ways to do everything
- Still no good way to contain binary data
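The “string-ly typed” and “no real arrays” complaints look like this in practice. A small sketch with Python's built-in `xml.etree.ElementTree`; the document and its element names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, made up for this example.
doc = ET.fromstring(
    "<reading><id>7</id><ok>true</ok>"
    "<tag>a</tag><tag>b</tag></reading>"
)

# Every value comes back as a string; without a schema the parser has no
# idea that <id> is a number or that <ok> is a boolean.
raw_id = doc.find("id").text
raw_ok = doc.find("ok").text

# An "array" is just repeated sibling elements. A list with one item looks
# exactly like a single scalar child.
tags = [t.text for t in doc.findall("tag")]

# Any typing is up to the application.
parsed_id = int(raw_id)
parsed_ok = raw_ok == "true"
```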
Machine-readable languages
Protobuf
https://developers.google.com/protocol-buffers/
aka Protocol Buffers, but that’s a pretty dumb name. Google’s common, fast on-the-wire serialization format.
Category: Machine-readable, schema-defined. Has an RPC protocol built around it.
Users: Google, basically everyone
Pros:
- Backed by Google, so it’s going to be good at the things Google values
- Basically reasonable
- Now has some support for versioning schemas, though it’s a hard problem in general
Cons:
- Backed by Google, so it’s going to be good at the things Google values
- Not particularly simple
- Wire protocol may be more work than it needs to be
- Its type system could maybe be better
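For a taste of the wire protocol, here is a minimal sketch of how Protobuf encodes a single unsigned integer field: a key built from the field number and wire type, followed by the value, both as base-128 varints. The function names are mine, and this covers only the varint wire type, nothing like the full format.

```python
def encode_varint(n: int) -> bytes:
    """Base-128 varint: 7 bits per byte, low group first, high bit = 'more'."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)         # last byte, high bit clear
            return bytes(out)

def encode_varint_field(field_number: int, value: int) -> bytes:
    """Key is (field_number << 3) | wire_type, where wire type 0 means varint."""
    WIRE_TYPE_VARINT = 0
    key = (field_number << 3) | WIRE_TYPE_VARINT
    return encode_varint(key) + encode_varint(value)

# Field 1 set to 150 comes out as the three bytes 08 96 01.
message = encode_varint_field(1, 150)
```

Note that the field name never appears on the wire, only its number, which is why the receiver needs the schema.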
Cap’n Proto
The Other Binary Serialization Protocol.
Category: Machine-readable, schema-defined. Designed primarily for RPC, which is built in to the reference implementation.
Users: sandstorm.io, Cloudflare?, various other people but it doesn’t seem like that many
Pros:
- Designed to be fast
- Made by one of the people who worked heavily on Protobuf at Google, so there’s lots of experience behind it. That said, doesn’t mean this cat’s always right, but there’s certainly opinions that are trying to be expressed.
- Sophisticated RPC comes as part of the standard package
- Designed for zero-copy deserialization
- Designed for schema to evolve
- Adorable name
- Very explicit about correctness and conformance things such as field ordering and layout
Cons:
- Very explicit about correctness and conformance things such as field ordering and layout
- Lots of the docs and concepts are pretty low level, you usually ain’t gonna need it
- Seems more complicated than protobuf – this might be one reason there’s fewer 3rd-party implementations
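“Zero-copy deserialization” roughly means fields are read straight out of the received buffer at known offsets, rather than first parsing the whole message into fresh objects. The sketch below shows the general idea with a made-up fixed layout; it is NOT Cap’n Proto’s actual encoding, and `RecordView` and its offsets are hypothetical.

```python
import struct

# Hypothetical fixed layout, invented for this example:
#   offset 0: uint32 id, offset 4: uint32 flags, offset 8: float64 score
ID_FIELD = struct.Struct("<I")
SCORE_FIELD = struct.Struct("<d")

class RecordView:
    """Accessor over a buffer; nothing is decoded until a field is read."""

    def __init__(self, buf):
        self._buf = memoryview(buf)  # refers to the original bytes, no copy

    @property
    def id(self) -> int:
        return ID_FIELD.unpack_from(self._buf, 0)[0]

    @property
    def score(self) -> float:
        return SCORE_FIELD.unpack_from(self._buf, 8)[0]

# Pretend these 16 bytes just arrived from the network or an mmap'd file.
wire = struct.pack("<IId", 42, 0, 0.75)
view = RecordView(wire)
```

Reading `view.score` touches only bytes 8 to 15. This is also why such formats have to be so explicit about field ordering and layout.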
Thrift
Apache’s version of Protobuf. Does anyone actually use this? Facebook, apparently, since they invented it and then gave it to Apache. Anyone else?
Category: Machine-readable, schema-defined. Designed primarily for RPC.
Users: Basically mostly Facebook? Twitter and Airbnb also apparently use it, so apparently it’s not UNpopular.
Pros:
- It works?
Cons:
- Docs suck
- Apache is the tragic junkyard of open source projects
- Apparently still not as good as flatbuffers, see below
Flatbuffers
https://google.github.io/flatbuffers/
Feels a little like Google’s answer to Cap’n Proto, as it has some of the same design goals – zero-copy serialization and layouts that are more amenable to versioning.
Category: Machine-readable, schema-defined. Includes RPC protocol.
Users: Google, Cocos2D, Facebook’s mobile client
Pros:
- Designed for zero-copy deserialization
- Designed for schema to evolve
Cons:
- Kinda feels like the problem is already solved by capnp
- Includes a JSON parser for some reason?
- Type system is kinda anemic with regards to unions
CBOR
Basically a binary re-imagining of JSON.
Category: Machine-readable, self-describing.
Users: ???
Pros:
- Pretty good type system – there’s things like fixnums, datetimes, blobs, etc
- Compact
- Built-in extensibility
- Designed to be a drop-in replacement for JSON
- IETF standard
Cons:
- Kinda more complicated than it needs to be, though this is for the sake of compactness and comprehensive types. Numbers are densely packed into fewer bits when possible, for example.
- Doesn’t actually seem that widely adopted for some reason?
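The dense number packing mentioned above works like this for unsigned integers: small values live in the initial byte itself, and larger ones get a marker byte followed by 1, 2, 4 or 8 big-endian bytes. A minimal sketch covering only CBOR major type 0 (unsigned integers); the function name is mine.

```python
import struct

def cbor_encode_uint(n: int) -> bytes:
    """Encode a non-negative integer as a CBOR unsigned int (major type 0)."""
    if n < 0:
        raise ValueError("major type 0 only covers non-negative integers")
    if n < 24:
        return bytes([n])                     # value fits in the initial byte
    if n < 2**8:
        return bytes([0x18, n])               # 1-byte argument
    if n < 2**16:
        return b"\x19" + struct.pack(">H", n)  # 2-byte argument
    if n < 2**32:
        return b"\x1a" + struct.pack(">I", n)  # 4-byte argument
    if n < 2**64:
        return b"\x1b" + struct.pack(">Q", n)  # 8-byte argument
    raise ValueError("too large for a plain unsigned int")

small = cbor_encode_uint(10)       # a single byte
medium = cbor_encode_uint(1000)    # marker byte plus two bytes
```

Compact, but you can see where the “more complicated than it needs to be” feeling comes from: five code paths just for one integer type.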
Msgpack
The Other CBOR, or rather, CBOR is derived from this. Designed to be simple and compact. Kinda a lot like a slightly chopped down CBOR, actually, their integer specification stuff looks nearly identical.
Category: Machine-readable, self-describing.
Users: Redis, a few others?
Pros:
- Simple
- Compact
Cons:
- Specification is kinda weak
- No real tuple or enum types
- Why not just CBOR?
BSON
As the name implies, a binary-ification of JSON. Created by MongoDB as its internal data format.
Category: Machine-readable, self-describing.
Users: MongoDB
Pros:
- Type system is full of deprecated and MongoDB-specific shit but is reasonably pragmatic
Cons:
- Type system is reasonably pragmatic but is full of deprecated and MongoDB-specific shit
- C strings – though there’s random non-C strings in places as well.
- Its arrays are a travesty against serialization
- Basically an implementation detail of MongoDB, and it looks like it
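To see what the fuss about arrays and C strings is, here is a hand-rolled sketch of how BSON lays out `{"a": [1, 2]}`. An array is stored as a nested document whose keys are the decimal indices "0", "1", … written out as NUL-terminated strings. The helper names are mine, and only int32 elements are handled.

```python
import struct

def bson_int32_element(name: str, value: int) -> bytes:
    # type byte 0x10 (int32), key as a NUL-terminated C string, then the value
    return b"\x10" + name.encode("utf-8") + b"\x00" + struct.pack("<i", value)

def bson_document(elements: bytes) -> bytes:
    # int32 total length (including itself), the elements, a trailing NUL
    total = 4 + len(elements) + 1
    return struct.pack("<i", total) + elements + b"\x00"

def bson_int32_array(name: str, values) -> bytes:
    # type byte 0x04 (array); the payload is a document keyed by "0", "1", ...
    body = b"".join(
        bson_int32_element(str(i), v) for i, v in enumerate(values)
    )
    return b"\x04" + name.encode("utf-8") + b"\x00" + bson_document(body)

encoded = bson_document(bson_int32_array("a", [1, 2]))
```

So every array element pays for its own index spelled out as text, and a reader has to trust that those keys really are in order.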
Honorable mentions
Things that are interesting but not actually in the scope of serialization languages, or are otherwise irrelevant.
TOML
https://github.com/toml-lang/toml
Invalid, it’s designed as a config language, not a serialization format. It’s basically an attempt to make something as simple and ubiquitous as Windows .INI files that is an actual specification rather than a fashion.
Category: Human-readable, sorta self-describing though usually you have a specific data structure you’re trying to fit it into.
Users: Various, notably Cargo (Rust’s build tool)
Pros:
- Works well as a config language without deeply nested structures
Cons:
- Works poorly when you try to make deeply nested structures
RON
Rusty Object Notation. Because shoehorning Rust’s ML-y type system into JSON isn’t very much fun. Works startlingly well for this purpose but is basically untried elsewhere.
Category: Human-readable, sorta self-describing though usually you have a specific data structure you’re trying to fit it into.
Users: A few, notably Amethyst.
Pros:
- Good type system for sophisticated functional-style languages
- Simple and reasonably compact
- Actually very good at what it does
Cons:
- Young, underspecified, Rust-centric
Bincode
https://github.com/servo/bincode
Included mainly for completeness. It’s not standardized outside of a single particular implementation which doesn’t promise stability, so not intended for general-purpose use. It’s intended as a fast and easy RPC/IPC format for Servo, and the actual format is basically an implementation detail of that goal.
Users: Servo, programs written by introverts who don’t care about being able to talk to each other. (Turns out this is a useful niche though, who knew.)
Pros:
- Compact, fast, simple.
- Works basically transparently for IPC with Rust code.
Cons:
- Anything other than that specific version of that specific library is undefined. If you’re OK with that though, it’s great.
ASN.1
https://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One
Some stupid telecom standards body’s attempt at doing what protobuf would do later. The standards body in question is related to the one that created the willful illusion of reality called the OSI networking model.
Actually has some up sides though. If it wasn’t willfully complicated and overdesigned, it might be pretty good.
Category: Machine-readable, schema-defined.
Users: Hopefully the only places you’ve seen this are in LDAP and in SSL certificates.
Pros:
- Strong and precise type system
- Schemas EVERYWHERE
- Binary and text forms, with methods for it to be shoved into just about any other data format ever
Cons:
- There’s like eleventy billion data variant formats
- Super verbose and kinda Ada-ish
- Way too complicated to actually use, let alone implement
XDR
https://tools.ietf.org/html/rfc4506
Included mainly for hysterical raisins. Sun Microsystems’s attempt at doing what protobuf would do later.
Basically what happens when you’re a very good C coder and want to transmit structured data over the network. Pretty reasonable as far as that goes, though.
Category: Machine-readable, schema-defined.
Users: Still used in some places like ZFS, NFS, etc
Pros:
- Pretty good for what it does
Cons:
- Doesn’t necessarily do much unless you’re a C program from the early 1990’s
S-Expressions
What Lisp code is made of, an elegant notation from a more civilized time. Like lots of Lisp solutions, it works really well until you need to get two Lisp implementations to use the same kind of thing. Has steadfastly not managed to catch on outside of Lisp despite trying since at least the 1970’s.
Does not have an actual universal spec, let alone implementations. EDN is a pretty nice start though.
Category: Human-readable, self-describing
Users: Any Lisp-like language, primary Real Examples are Scheme, Racket, Clojure and theoretically Common Lisp.
Pros:
- Lisp people will love it, non-Lisp people will hate it.
- Great for representing trees
- Reasonably simple and nice
Cons:
(CAR CDR)
- Lisp people will love it, non-Lisp people will hate it.
- Does not actually have well agreed-upon syntax for compound data types other than lists.
- Anyone out there with a Lisp interpreter will try to read it with READ, despite it already having been proven that’s a terrible idea.
- No matter what form of S-expression you use, somewhere out there someone will be annoyed that their particular form of Lisp can’t load it with READ.
- People will try to write Lisp code in it.
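For what it’s worth, treating S-expressions as plain data (instead of handing them to a Lisp reader) does not take much. Below is a toy parser sketch that only understands lists, integers and bare symbols; the syntax it accepts is my own minimal assumption, which is rather the point, since there is no universal spec to follow. No string literals, comments or quoting.

```python
def parse_sexpr(text: str):
    """Parse one S-expression into nested Python lists, ints and strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while pos < len(tokens) and tokens[pos] != ")":
                items.append(read())
            if pos >= len(tokens):
                raise ValueError("missing ')'")
            pos += 1  # consume the closing ")"
            return items
        if tok == ")":
            raise ValueError("unexpected ')'")
        try:
            return int(tok)
        except ValueError:
            return tok  # symbols stay plain strings; nothing is evaluated

    result = read()
    if pos != len(tokens):
        raise ValueError("trailing input after the first expression")
    return result

tree = parse_sexpr("(config (port 8080) (hosts alpha beta))")
```

Trees fall out naturally, but note that a key/value map has to be faked with nested lists, which is the “no agreed-upon syntax for compound data types” problem again.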
Conclusions
Draw your own.
No? Fine.
Good Enough:
- JSON?
- Protobuf
- Cap’n Proto
- Flatbuffers
- CBOR
- msgpack
Avoid:
- YAML
- XML
- Thrift?
- BSON
Appendix A: Lineage
This is actually kinda interesting ’cause it’s easy to trace each format as a reaction to ones before it. ASN.1, XDR and a zoo of even stranger stuff predate the current internet age. The Modern Age starts with XML. XML has a long lineage of its own, but it forms a kinda bottleneck. It’s one of those technological ontology changes, like a mass extinction. Most of the things people actually care about formed in reaction to XML, so that’s where I’m going to start.
So, the family tree of the most widespread things would be (apologies for those on mobile):
                                                                         /--> CBOR
XML---(XML is too verbose)-+---> JSON --(JSON but binary and compact)---+--> msgpack
 |                          \---> YAML                                   \--> BSON
 |
 |
 \---(XML but binary)------+---> Protobuf --(Protobuf but faster)---+---> Cap'n Proto
                            \---> Thrift                             \---> Flatbuffers
Appendix B: Thoughts
So when you actually look at this list, one thing stands out: There isn’t actually a replacement for JSON. Nothing better than it in the “human readable” column. Oh, there’s been many that have tried, such as:
…But few of those seem kept up to date, let alone used widely. JSON5 probably comes closest, by virtue of being closest to its predecessor. This seems an area ripe for innovation though.
Example of said innovation: Dhall. This might actually be the real way to go.
Appendix C: Honorable Mentions
That said, please stop suggesting more unless they get actually used by more than one organization
- BEncode – Bittorrent. Not terribly efficient or capable, but simple and self-describing.
- Avro – Hadoop/Apache/Yahoo. Does anyone actually heckin’ use this? Anything using Kafka, apparently.
- Ion – Amazon
- CDDL – A schema system for CBOR
- HCL – Hashicorp Config Language, presumably, used in their products. Needs investigation.
Also see BetterThanJsonIdeas.
Revisiting this in late 2023 adds:
- BetterThanYaml, which I like but seems to have sparked joy in pretty few other people on lobste.rs
- Lua
- Nickel which is described as “nix-ish”. idk nix at all, so can’t judge, but it appears to have functions which is a little un-thrilling.
- jsonnet which has mixed reviews
- CUE
- UCL which is apparently used for FreeBSD tooling now
- StrictYAML
- NestedText
- Preserves
- HJSON
- Hay – Used by oil
- G-expressions – Used by guix
- Text protobuf, which I didn’t know even existed.
- rkyv
These all come from lots of discussion and argument on lobste.rs:
- https://lobste.rs/s/sq5sss/yaml_config_file_pain_try_lua
- https://lobste.rs/s/rxiytz/s_lot_yaml
- https://github.com/oilshell/oil/wiki/Survey-of-Config-Languages
- A slightly older thread about XML that I can’t find right now