BetterThanYaml
See BetterThanJson for earlier work on this.
Inspired by some conversation on lobste.rs and my own damn brain going after shiny objects instead of doing something useful.
What are the things about yaml we want to preserve? Why do people use it in the first place?
- easy, non-delimited-ish lists and key-value structures
- low noise – easy to read in places where the values matter more than the structure – that’s why people use it instead of JSON or XML
- Good for deeply nested tree-structures, like JSON or XML, unlike TOML
So what are the problems with yaml?
- Ambiguous and inconsistent
- Lots of legacy bad decisions
- Lack of composition – nesting things actually has lots of sharp edges and is easy to screw up
- Inline lists and structs are actually pretty nice sometimes, and yaml doesn’t do them too well.
- String delimiters, and lack of them, are always pain
- This results in parsing being messy and that makes bugs
I personally think we should just use EDN and have sexprs everywhere, but that’s apparently still marketing suicide, so let’s try something else.
Things we need:
- Nice, easy nesting of lists and maps
- Consistent, easy-to-parse and easy-to-validate syntax
- Make the simple things simple and the complex things possible
Smaller goals:
- “A YAML file is almost always still ‘valid’ even if it is trunca” – ooh, nice callout
So here’s a first pass.
Primitive types
These are things that can’t contain other types inside of them.
Numbers
Integers are integers. A 0x prefix signifies hex numbers, 0b binary, and we’ll do 0o for octal if anyone is ever silly enough to ask for it.
1
0x34F
0b101
-3
-0xFF
Floats are numbers that contain a period. Their allowed precision is implementation defined for now, even though that’s a bad idea.
1.
1.0
1.01
It’s probably also a good idea to allow [number]e[integer] syntax for powers of 10.
1000.0
1e3
1.0e3
1.159e-37
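To check that the number rules above hang together, here’s a minimal sketch of a literal parser in Python. The name parse_number and the exact regexes are my assumptions, reverse-engineered from the examples; nothing here is a spec. In particular it assumes anything with a period or an exponent is a float, and that a single leading minus sign is allowed.

```python
import re

# Assumed grammar, reverse-engineered from the examples above.
_PREFIXED = re.compile(r"-?0(x[0-9a-fA-F]+|b[01]+|o[0-7]+)")
_DECIMAL = re.compile(r"-?[0-9]+")
_FLOAT = re.compile(r"-?[0-9]+(\.[0-9]*)?(e-?[0-9]+)?")


def parse_number(text):
    """Turn one number literal into an int or a float (hypothetical helper)."""
    if _PREFIXED.fullmatch(text):
        # Base 0 makes int() honour the 0x / 0b / 0o prefix and the sign.
        return int(text, 0)
    if _DECIMAL.fullmatch(text):
        return int(text, 10)
    if _FLOAT.fullmatch(text):
        # Reached only when a period or an exponent is present.
        return float(text)
    raise ValueError(f"not a number: {text!r}")
```

With this sketch, parse_number("1.") comes back as a float and parse_number("-0xFF") as a negative int, which matches how the examples read.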
Booleans
Booleans are booleans. True or false. No yes or no bullshit. We might want a nil someday but not right now.
true
false
Strings
Strings are strings. They are enclosed by double quotes. \x can be used to escape characters, where x can be one of \nrt. You can also do \u{1234} or something to make a unicode code point I suppose; research this more.
Strings can contain newlines I suppose, ’cause I’m not sure there’s a reason they shouldn’t.
We don’t have a separate type for characters, they’re just strings of length 1.
"foo"
"hello\tworld"
"hi there
world"
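A rough sketch of what decoding such a string could look like, in Python. The function name parse_string is hypothetical, and two things here are assumptions on my part: that \" is also a valid escape (otherwise you can’t put a quote in a string at all), and that \u{...} takes hex digits. Raw newlines inside the quotes are passed through untouched.

```python
# Assumed escape table: the notes list backslash, n, r and t; \" is my addition.
_ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "\\": "\\", '"': '"'}


def parse_string(text):
    """Decode a double-quoted literal (quotes included) into a str."""
    if len(text) < 2 or text[0] != '"' or text[-1] != '"':
        raise ValueError("string must be enclosed in double quotes")
    body = text[1:-1]
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == '"':
            raise ValueError("unescaped quote inside string")
        if ch != "\\":
            out.append(ch)  # ordinary character, raw newlines included
            i += 1
            continue
        nxt = body[i + 1:i + 2]  # empty if the backslash is the last character
        if nxt in _ESCAPES:
            out.append(_ESCAPES[nxt])
            i += 2
        elif nxt == "u" and body[i + 2:i + 3] == "{":
            end = body.find("}", i + 3)
            if end == -1:
                raise ValueError("unterminated \\u{...} escape")
            # int() and chr() both raise ValueError on bad input.
            out.append(chr(int(body[i + 3:end], 16)))
            i = end + 1
        else:
            raise ValueError(f"unknown escape at offset {i}")
    return "".join(out)
```

Since there is no separate character type, a one-character string such as "\u{41}" is just a str of length 1 after decoding.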
Possible elaborations
There’s a number of ways to make strings nicer to write, at the cost of making them harder to parse. We aren’t gonna have bare strings no matter what, but maybe having > foo or | foo as a synonym for "foo" would be nice.
Less for what it looks like to write, and more for how easy it is to
read. Make ’em terminated at the end of the line though, so you don’t
have to think about parsing multi-line strings with indentation or some
such shit. So:
> foo
> bar
would be two strings, “foo” and “bar”, not one multi-line string.
Sigil types
I’m just gonna steal this whole hog from Elixir’s sigils ’cause that works pretty well, actually. So a custom data type can be written with a ~ followed by an identifier ([a-zA-Z0-9_]+) followed by a delimited string. This string is then parsed into some type based on the identifier, if the parser understands what type it is.
So you can write a regex like this:
~r/foo|bar/
Or a date like this:
~Date[2023-01-01]
Right now we don’t try to describe what the valid identifiers are. We can do that later, and implementations are free to add their own. But it’s useful to have some standard representations for common things, so for now we’ll just steal Elixir’s list and have:
- ~U – UTC datetime
- ~D – date
- ~r – regex
- ~s – string
For now we will follow Elixir’s example and allow the following delimiters:
~r/hello/
~r|hello|
~r"hello"
~r'hello'
~r(hello)
~r[hello]
~r{hello}
~r<hello>
All delimiter pairs are treated the same way. Per Elixir’s docs, “The reason behind supporting different delimiters is to provide a way to write literals without escaped delimiters.” So if you want your regex to include / you can write ~r"thing|/". No characters inside the delimiters are escaped.
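Because nothing inside the delimiters is escaped, splitting a sigil into its identifier and body is pleasantly dumb. Here’s a sketch in Python; parse_sigil is a made-up name, and it only separates the two parts. Turning the body into an actual regex or date based on the identifier is left out, since the notes don’t pin that down yet.

```python
import re

# Opening delimiter -> closing delimiter, per the list above.
_CLOSERS = {"/": "/", "|": "|", '"': '"', "'": "'",
            "(": ")", "[": "]", "{": "}", "<": ">"}
_HEAD = re.compile(r"~([a-zA-Z0-9_]+)")


def parse_sigil(text):
    """Split one sigil literal into (identifier, raw body). Hypothetical helper."""
    head = _HEAD.match(text)
    if head is None or head.end() == len(text):
        raise ValueError(f"not a sigil: {text!r}")
    opener = text[head.end()]
    if opener not in _CLOSERS:
        raise ValueError(f"unknown delimiter {opener!r}")
    body_start = head.end() + 1
    # No escapes, so the body ends at the first closing delimiter.
    end = text.find(_CLOSERS[opener], body_start)
    if end == -1:
        raise ValueError("unterminated sigil")
    if end != len(text) - 1:
        raise ValueError("trailing text after closing delimiter")
    return head.group(1), text[body_start:end]
```

So parse_sigil('~r"thing|/"') yields the identifier r and the body thing|/ with no unescaping step at all.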
Keywords/symbols?
Not yet.
nil/none?
Not yet.
Compound types
Lists
This is where yaml is relatively nice, tbh. At least to read. If things aren’t nested too deeply.
What if instead of yaml we do something like this?
- 1
- 2
- 3
# -> [1, 2, 3]
- 1
-- 2
--- 3
# -> [1, [2, [3]]]
So then we could write nested lists pretty easily and unambiguously without having to fucking eyeball any indentation:
- 1
-- 1.1
--- 1.11
-- 1.2
-- 1.3
- 2
- 3
-- 3.1
# -> [
#   1, [
#     1.1, [
#       1.11
#     ],
#     1.2,
#     1.3,
#   ],
#   2,
#   3, [
#     3.1
#   ]
# ]
Considering how much fiddly annoyance it was to translate the first version into the second version, I think we might be on to something.
This is not indentation sensitive. The number of dashes signifies nesting of lists.
- 1
- 2
- 3
# -> [1, 2, 3]
-- 3
- 4
-- 5
# -> [[3], 4, [5]]
Not sure this is a good idea but I guess we will see how cursed it gets when life becomes more complex.
Maybe we allow _ or such as well, so if shit gets really deeply nested you can use it as a separator, like commas or underscores in numbers.
-------- 1
-------- 2
-------- 3
# can become
----_---- 1
----_---- 2
----_---- 3
Do we treat it as nothing, or as another - character? Not sure. Other possible candidate characters for this are | or , I suppose.
Empty elements are invalid. So you cannot do:
- 1
-
- 3
To be unambiguous with negative numbers, you may NEVER omit the space after the last dash in a line.
- 1
- 2
-3 # Always parses as the number negative three
- "a"
- "b"
-"c" # Not allowed
This syntax does fail the truncation thing, though: a truncated list of this style is often still valid. Since we require closing delimiters for strings and don’t allow empty list elements, though, it’s a lot more fail-safe than YAML is.
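The dash-counting rule is simple enough that a parser fits in a few lines. Below is a sketch in Python, with assumptions flagged: parse_dash_list and _scalar are hypothetical names, values are only converted to int or float (anything else stays as raw text), full-line # comments are skipped, the _ separator idea is not implemented, and any line that isn’t a list item is rejected outright.

```python
import re

# One or more dashes, exactly one mandatory space, then a non-empty value.
_ITEM = re.compile(r"(-+) (\S.*)")


def _scalar(text):
    """Best-effort conversion; anything that isn't a number stays as text."""
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            pass
    return text


def parse_dash_list(source):
    root = []
    stack = [root]  # stack[i] is the list currently open at depth i + 1
    for raw in source.splitlines():
        line = raw.strip()  # leading whitespace is ignored: not indentation sensitive
        if not line or line.startswith("#"):
            continue
        match = _ITEM.fullmatch(line)
        if match is None:
            # Covers empty elements ("-") and a missing space ('-"c"').
            raise ValueError(f"not a list item: {raw!r}")
        depth = len(match.group(1))
        del stack[depth:]  # close lists deeper than this item
        while len(stack) < depth:  # open as many new lists as needed
            child = []
            stack[-1].append(child)
            stack.append(child)
        stack[-1].append(_scalar(match.group(2)))
    return root
```

Feeding it the three lines -- 3, - 4, -- 5 gives [[3], 4, [5]], the same as the hand-translated example above. A file cut off in the middle of a line ending in a bare dash is rejected, but one cut off between lines still parses, which is the truncation caveat in action.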
Delimited lists
Because sometimes you just want to write
[1, 2, 3]
instead of
- 1
- 2
- 3
Let’s not make it possible to mix the two though. If you have a delimited list, all lists inside it must also be delimited.
Let’s just use the [1, 2, 3] syntax itself. It works.
Trailing commas are allowed.
Maps
idfk.
Honestly this might be a case where just staying with the traditional {key: val} syntax is a good idea. Nesting lists and maps in yaml always ends up kinda horrible.
But it’s also pervasive as hell and not having such an option gets realllll noisy. What happens if we just have a line-based syntax that makes struct-nesting explicit with a prefix, like we do with lists?
{
  foo: 1,
  bar: 2,
  bop: 3
}
. foo: 1
. bar: 2
. bop: 3
{
  foo: 1,
  bar: {
    inner: "something",
    another: "something else",
  },
  bop: 3
}
. foo: 1
. bar:
.. inner: "something"
.. another: "something else"
. bop: 3
It’s… kinda cursed but also kinda works. It’s kinda still an indentation-based syntax, but we’re making the indentation visible and resetting it whenever heterogeneous structures are nested. It lets us get rid of the commas between members, which is also nice; we just use newlines instead. Not sure whether . is the right sigil for it or not but that’s easy to change; I tried @ but it was way worse. Maybe need something between the two in terms of visual impact.
So I think that again like lists, we want both a delimited and a line-based syntax.
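The line-based map syntax parses almost the same way as the dash lists. A sketch in Python, with the usual caveats: parse_dot_map is a hypothetical name, keys are assumed to be plain identifiers, values are kept as raw text, and a key with nothing after the colon is assumed to open a nested map (which ends up empty if no deeper lines follow). Mixing in dash lists or {} maps is not handled here.

```python
import re

# Assumption: keys are plain identifiers; the notes above never pin this down.
_ENTRY = re.compile(r"(\.+) ([A-Za-z0-9_]+):(?: (\S.*))?")


def parse_dot_map(source):
    root = {}
    stack = [root]  # stack[i] is the map currently open at depth i + 1
    for raw in source.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = _ENTRY.fullmatch(line)
        if match is None:
            raise ValueError(f"not a map entry: {raw!r}")
        depth = len(match.group(1))
        key, value = match.group(2), match.group(3)
        if depth > len(stack):
            # e.g. a ".." line with no "key:" line above it to nest under
            raise ValueError(f"nothing to nest under: {raw!r}")
        del stack[depth:]  # close maps deeper than this entry
        if value is None:
            child = {}  # "key:" with no value opens a nested map
            stack[-1][key] = child
            stack.append(child)
        else:
            stack[-1][key] = value  # raw text; scalar parsing is out of scope
    return root
```

Run on the foo/bar/bop example above, this produces a dict with foo and bop at the top level and inner and another nested under bar, so the dots really do carry all the structure.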
Comments
# to the end of the line. Multiline comments can get bent for now.
Exercise: translate some Ansible nonsense to this
- hosts: all
  remote_user: ansible
  become: yes
  tasks:
    - name: test connection
      ping:
    - name: SSH started (a bit of a tautology, I know)
      service:
        name: sshd
        state: started
        enabled: true
    - name: Install rsyslog
      apt:
        name: rsyslog
        state: present
    - name: Start rsyslog
      service:
        name: rsyslog
        state: started
        enabled: true
    - name: Install logrotate
      apt: name=logrotate state=present
    - name: Setup cron job to clean apt cache
      copy:
        src: conf/etc/cron.monthly/apt-clean
        dest: /etc/cron.monthly/apt-clean
        owner: root
        group: root
        mode: 0700
If we only have the {} struct syntax, this becomes:
{
  hosts: "all",
  remote_user: "ansible",
  become: true,
  tasks:
    - {
      name: "test connection",
      ping: "",
    }
    - {
      name: "SSH started (a bit of a tautology, I know)",
      service: {
        name: "sshd",
        state: "started",
        enabled: true
      }
    }
    - {
      name: "Install rsyslog",
      apt: {
        name: "rsyslog",
        state: "present",
      }
    }
    - {
      name: "Start rsyslog",
      service: {
        name: "rsyslog",
        state: "started",
        enabled: true,
      }
    }
    - {
      name: "Install logrotate",
      apt: { name: "logrotate", state: "present", }
    }
    - {
      name: "Setup cron job to clean apt cache",
      copy: {
        src: "conf/etc/cron.monthly/apt-clean",
        dest: "/etc/cron.monthly/apt-clean",
        owner: "root",
        group: "root",
        mode: 0o700,
      }
    }
}
hmm hmm hmm, interesting. Good example of the fact that we want something other than {} delimited maps. The commas also add a lot of noise. Very interesting. Not what I would call very desirable.
Let’s try out the prefixed struct syntax:
. hosts: "all"
. remote_user: "ansible"
. become: true
. tasks:
- . name: "test connection"
. ping: ""
- . name: "SSH started (a bit of a tautology, I know)"
. service:
.. name: "sshd"
.. state: "started"
.. enabled: true
- . name: "Install rsyslog"
. apt:
.. name: "rsyslog"
.. state: "present"
- . name: "Start rsyslog"
. service:
.. name: "rsyslog"
.. state: "started"
.. enabled: true
- . name: "Install logrotate"
. apt: { name: "logrotate", state: "present", }
- . name: "Setup cron job to clean apt cache"
. copy:
.. src: "conf/etc/cron.monthly/apt-clean"
.. dest: "/etc/cron.monthly/apt-clean"
.. owner: "root"
.. group: "root"
.. mode: 0o700
I… don’t think that I hate it???