BetterThanJsonIdeas
So, what do we want out of a human readable, self describing language?
- Familiar syntax may be helpful, but…
- …not a strict superset of JSON; it’s just not a priority and may present a false equivalence
- tuple and struct types
- better defined numbers
- decimals – a number with no lossy representation, unlike floats
- datetimes
- A well defined normalization format
- Comments, even if they’re allowed to be stripped
- Well defined binary fields, even if they’re bloat-y
- Lossless translation to/from (a specific format of) a binary repr, like CBOR or msgpack
- Lossless translation to/from a canonical form that can be hashed nicely
- fucking trailing commas
- Strict reference impl and test suite
- Simple and optional schema
- We know unambiguously what each type is w/o context, just by parsing it
Let’s just see where this goes
RON seems to check MOST of these boxes; investigate it more for quality and such, and maybe contribute to it.
First the HARD question
What to name this?
Someone called it Icefox Markup Language, which isn’t what I want, but IML is a nice abbreviation. Could just call it Immel? I don’t hate that.
Value
May be atom or compound value
Atoms
Integer
64-bit signed integer value.
No hex or anything like that just yet.
123
Float
64-bit floating point value.
Must include a decimal point, may include a trailing zero.
123.5, 123.0, 123.
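A minimal sketch of how a parser might tell the two numeric atoms apart purely by the presence of a decimal point. The names `parse_number`, `INTEGER_RE` and `FLOAT_RE` are made up for illustration, and allowing a leading `-` is an assumption, since the notes only show `123`.

```python
import re

# Assumption: an optional leading "-" is allowed; the notes do not say.
INTEGER_RE = re.compile(r"-?[0-9]+")
# A float must contain a decimal point; digits after it are optional ("123.").
FLOAT_RE = re.compile(r"-?[0-9]+\.[0-9]*")

I64_MIN, I64_MAX = -(2**63), 2**63 - 1


def parse_number(token: str):
    """Return an int or a float for a numeric token, or raise ValueError."""
    if INTEGER_RE.fullmatch(token):
        value = int(token)
        if not I64_MIN <= value <= I64_MAX:
            raise ValueError(f"integer out of 64-bit signed range: {token}")
        return value
    if FLOAT_RE.fullmatch(token):
        return float(token)
    # No hex or anything like that just yet, so everything else is rejected.
    raise ValueError(f"not a number: {token}")
```

With this, `123` comes back as an integer and `123.` as a float, so the type is known without any context.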
Decimal
A signed number with a fractional part which is encoded losslessly. Intended for things like money.
For now we just suffix them with a d: 123.64d
How? Not sure yet. Bounds? Not sure yet, but must be specified. Can we make it fit reasonably into a 32-bit float? 64-bit float? Two 32-bit integers?
For now, let’s just say it must be large enough to contain a whole part at least as large as a 32-bit signed integer and a decimal part with at least 9 decimal digits of precision (must be able to losslessly represent billionths).
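A rough sketch of parsing the `d` suffix on top of Python's `decimal.Decimal`. It treats the minimum bounds above as if they were the hard limits, which is an assumption, as is requiring a decimal point with at least one digit after it. `parse_decimal` and `DECIMAL_RE` are hypothetical names.

```python
import re
from decimal import Decimal

# Assumption: whole part, a point, at least one fractional digit, then "d".
DECIMAL_RE = re.compile(r"(-?[0-9]+)\.([0-9]+)d")

I32_MIN, I32_MAX = -(2**31), 2**31 - 1
MAX_FRACTION_DIGITS = 9  # billionths


def parse_decimal(token: str) -> Decimal:
    """Parse a token like 123.64d into an exact Decimal, or raise ValueError."""
    match = DECIMAL_RE.fullmatch(token)
    if match is None:
        raise ValueError(f"not a decimal: {token}")
    whole, fraction = match.group(1), match.group(2)
    if not I32_MIN <= int(whole) <= I32_MAX:
        raise ValueError(f"whole part out of range: {token}")
    if len(fraction) > MAX_FRACTION_DIGITS:
        raise ValueError(f"too many fractional digits: {token}")
    # Build from the text, never from a float, so nothing is rounded.
    return Decimal(f"{whole}.{fraction}")
```

The point of the type shows up in arithmetic: `parse_decimal("0.1d") + parse_decimal("0.2d")` is exactly `Decimal("0.3")`, which is not true of the float sum `0.1 + 0.2`.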
String
"foo"
UTF-8 encoded; may contain any Unicode character. Must contain valid UTF-8. If it might not, it’s not a string, it’s a binary string.
Binary string
A string starting with b" and ending with b containing base64 encoded text.
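One way a writer could decide between the two string types, sketched with the standard library only: if the bytes decode as UTF-8 they are a string, otherwise they become a base64 payload for a binary string. `classify_bytes` is a made-up helper, and it deliberately returns only the payload without delimiters, since the delimiter syntax is not settled here.

```python
import base64


def classify_bytes(raw: bytes):
    """Return ("string", text) for valid UTF-8, else ("binary", base64 payload)."""
    try:
        return ("string", raw.decode("utf-8"))
    except UnicodeDecodeError:
        # Not valid UTF-8, so it is not a string; emit base64 text instead.
        return ("binary", base64.b64encode(raw).decode("ascii"))
```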
Compound types
Array
Ordered sequences of items, all of which have the same type. Sequence may be any length.
[1, 2, 3]
Tuple
Ordered sequences of items, which may have different types. Sequence may be any length.
(1, "two", 3.0)
Dict
Unordered set of (key, value) pairs. Keys are literal tokens, not any kind of value. Keys must start with a lowercase letter and may contain a-zA-Z0-9_.
~~May contain duplicate keys, against my better judgement. If it does, the last value for a key is the one you use.
{foo: 1, bar: "two", baz: 3.0}
…which is equivalent to…
{baz: 91.3, bar: "two", foo: 1, baz: 3.0,}
~~
May not contain duplicate keys.
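A small sketch of the dict rules as they stand now: keys match a lowercase-first identifier pattern, and a duplicate key is an error rather than last-one-wins. `build_dict` and `KEY_RE` are illustrative names, and the input is assumed to be (key, value) pairs already produced by a parser.

```python
import re

# Starts with a lowercase letter, then any of a-zA-Z0-9_.
KEY_RE = re.compile(r"[a-z][a-zA-Z0-9_]*")


def build_dict(pairs):
    """Build a dict from (key, value) pairs, rejecting bad or duplicate keys."""
    result = {}
    for key, value in pairs:
        if not KEY_RE.fullmatch(key):
            raise ValueError(f"invalid key: {key!r}")
        if key in result:
            raise ValueError(f"duplicate key: {key!r}")
        result[key] = value
    return result
```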
Enum
A string-like identifier, optionally with a single (non-enum?) compound type attached to it. Serves the purpose of enums and of sum types.
The identifier is a bare string and the compound type follows it. The identifier must start with a capital letter and can only contain the characters a-zA-Z0-9_ for now, for the sake of simplicity. The goal is to be able to have labels and such that are easy to intern, unlike full strings.
The contents may be nothing, or may be an array, tuple or dict.
For the purposes of arrays, where “all items have the same type”, all enums are equivalent to each other.
JustAnId
Foo(1, "two", 3.0)
Bar[1,2,3,4,5]
Baz{foo: 1, bar: "two"}
You COULD make this just be a tuple of ("Foo", (1, "two", 3.0)) or such and it would be semantically equivalent… but the goal is to make this easier for humans.
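A sketch of one possible in-memory model for enums, plus the array rule that all enums count as the same type. `EnumValue`, `kind` and `is_valid_array` are hypothetical names, and comparing only the top-level kind of each item (not the element types of nested arrays) is an assumption.

```python
import re
from dataclasses import dataclass
from decimal import Decimal

# Starts with a capital letter, then any of a-zA-Z0-9_.
ENUM_ID_RE = re.compile(r"[A-Z][a-zA-Z0-9_]*")


@dataclass
class EnumValue:
    name: str
    payload: object = None  # nothing, or an array (list), tuple or dict

    def __post_init__(self):
        if not ENUM_ID_RE.fullmatch(self.name):
            raise ValueError(f"invalid enum identifier: {self.name!r}")
        if self.payload is not None and not isinstance(self.payload, (list, tuple, dict)):
            raise ValueError("enum payload must be an array, tuple or dict")


KINDS = [
    (EnumValue, "enum"), (int, "integer"), (float, "float"),
    (Decimal, "decimal"), (str, "string"), (bytes, "binary"),
    (list, "array"), (tuple, "tuple"), (dict, "dict"),
]


def kind(value) -> str:
    """Top-level type name of a value; every enum maps to the same kind."""
    for py_type, name in KINDS:
        if isinstance(value, py_type):
            return name
    raise TypeError(f"unsupported value: {value!r}")


def is_valid_array(items: list) -> bool:
    """True if all items share one top-level kind (an empty array is fine)."""
    return len({kind(item) for item in items}) <= 1
```

So `[JustAnId, Foo(1, "two", 3.0), Bar[1,2,3]]` is a valid array under this reading, while `[1, "two"]` is not.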
Other
Comments
// for line comments, /* */ for block comments. Block comments may be nested.
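Nested block comments need a depth counter rather than a "find the next */" search. A minimal sketch of a comment stripper follows; `strip_comments` is an illustrative name, and treating a backslash inside a string as an escape character is an assumption, since string escapes are not defined above.

```python
def strip_comments(text: str) -> str:
    """Remove // line comments and nested /* */ block comments."""
    out = []
    i, n = 0, len(text)
    depth = 0          # current nesting depth of block comments
    in_string = False  # inside a "..." literal, where comment markers are data
    while i < n:
        two = text[i:i + 2]
        if depth > 0:
            if two == "/*":
                depth += 1
                i += 2
            elif two == "*/":
                depth -= 1
                i += 2
            else:
                i += 1
        elif in_string:
            if text[i] == "\\" and i + 1 < n:
                # Assumed escape: copy the backslash and the next character.
                out.append(two)
                i += 2
            else:
                if text[i] == '"':
                    in_string = False
                out.append(text[i])
                i += 1
        elif two == "//":
            # Skip to the end of the line, keeping the newline itself.
            while i < n and text[i] != "\n":
                i += 1
        elif two == "/*":
            depth = 1
            i += 2
        else:
            if text[i] == '"':
                in_string = True
            out.append(text[i])
            i += 1
    if depth > 0:
        raise ValueError("unterminated block comment")
    return "".join(out)
```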
Derived types
Datetime
Heck. Well, it’s basically a tuple or enum.
Year, month, day
Year, month, day, hour, minute, second, timezone. Microsecond? Nanosecond?
Might be nicer if we had a string representation though?
Duration
64-bit seconds value, 32-bit nanoseconds value
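A sketch of that layout as a data type. The notes do not say whether the fields are signed, so this assumes signed seconds and a nanoseconds field that is always in 0..999999999 (which fits in 32 bits). `Duration` and `from_nanos` are hypothetical names.

```python
from dataclasses import dataclass

NANOS_PER_SECOND = 1_000_000_000


@dataclass(frozen=True)
class Duration:
    seconds: int  # assumed signed 64-bit
    nanos: int    # assumed 0..999_999_999

    def __post_init__(self):
        if not -(2**63) <= self.seconds < 2**63:
            raise ValueError("seconds out of 64-bit range")
        if not 0 <= self.nanos < NANOS_PER_SECOND:
            raise ValueError("nanos must be in 0..999_999_999")

    @classmethod
    def from_nanos(cls, total_nanos: int) -> "Duration":
        # divmod floors, so nanos stays non-negative for negative durations.
        seconds, nanos = divmod(total_nanos, NANOS_PER_SECOND)
        return cls(seconds, nanos)
```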
Canonical form
- Keys in dicts are all sorted in ASCII alphabetical order
- No comments
- A specific indentation and comma layout
Canonical indentation style
There must be one, and a pretty-printer program to turn arbitrary documents into it. Should it be geared for human readability or machine readability (minified)? Human readability; if size matters you can turn it into CBOR.
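A sketch of what a canonical printer plus hash could look like. The layout here (two-space indent, one item per line, trailing commas) is purely an assumption, since the actual style is undecided. `canonical` and `canonical_hash` are made-up names, and only integers, floats, strings, arrays, tuples and dicts are covered.

```python
import hashlib


def canonical(value, indent: int = 0) -> str:
    """Render a value in one fixed layout with sorted dict keys."""
    pad = "  " * indent
    inner = "  " * (indent + 1)
    if isinstance(value, dict):
        if not value:
            return "{}"
        # sorted() on ASCII keys gives ASCII order.
        lines = [f"{inner}{key}: {canonical(value[key], indent + 1)},"
                 for key in sorted(value)]
        return "{\n" + "\n".join(lines) + "\n" + pad + "}"
    if isinstance(value, (list, tuple)):
        opener, closer = ("[", "]") if isinstance(value, list) else ("(", ")")
        if not value:
            return opener + closer
        lines = [f"{inner}{canonical(item, indent + 1)}," for item in value]
        return opener + "\n" + "\n".join(lines) + "\n" + pad + closer
    if isinstance(value, str):
        # String escaping is not defined yet, so this assumes plain text.
        return '"' + value + '"'
    if isinstance(value, float):
        # repr() switches to exponent notation for extreme values,
        # which these notes do not define.
        return repr(value)
    if isinstance(value, int):
        return str(value)
    raise TypeError(f"unsupported value: {value!r}")


def canonical_hash(value) -> str:
    """SHA-256 of the canonical text, as a hex string."""
    return hashlib.sha256(canonical(value).encode("utf-8")).hexdigest()
```

Because keys are sorted before printing, two dicts written in different key orders hash to the same value.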
Lossless conversion to a subset of CBOR
We use canonical CBOR as described in section 3.9 of RFC 7049
Type conversions for how to turn this into CBOR:
- Integer -> unsigned or negative CBOR integer (major type 0 or 1)
- Float -> IEEE 754 64-bit precision floating point (major type 7, subtype 27)
- Decimal -> CBOR decimal fraction (tag 4)
- String -> Text string, major type 3
- Binary string -> Byte string, major type 2
- Array -> Array, major type 4
- Tuple -> Array, major type 4. How do we distinguish them going the other way???
- Dict -> Map, major type 5, with text strings as keys. Not byte strings, since maybe someday we want to support non-7-bit characters in identifiers.
- Enum -> ???
- Datetime -> CBOR date/time, basically a string with tag type 0
Symbols and enums are tricky. Omit them? One-item map with a particular key??? Use tagged items? Text strings, and make text strings actually containing text add a mangling character? Ugh.
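The settled parts of the mapping above can be sketched as a tiny hand-rolled encoder using only the standard library. `to_cbor` and `_head` are illustrative names; enums and datetimes are left out because their encoding is still open, and map keys are ordered shortest-first, then bytewise, following the canonical rules in RFC 7049 section 3.9.

```python
import struct
from decimal import Decimal


def _head(major: int, n: int) -> bytes:
    """Major type plus unsigned argument, in the shortest encoding."""
    if n < 24:
        return bytes([(major << 5) | n])
    if n < 2**8:
        return bytes([(major << 5) | 24, n])
    if n < 2**16:
        return bytes([(major << 5) | 25]) + struct.pack(">H", n)
    if n < 2**32:
        return bytes([(major << 5) | 26]) + struct.pack(">I", n)
    return bytes([(major << 5) | 27]) + struct.pack(">Q", n)


def to_cbor(value) -> bytes:
    if isinstance(value, bool):
        raise TypeError("no boolean atom is defined")
    if isinstance(value, int):
        # Major type 0 for non-negative, major type 1 encodes -1 - n.
        return _head(0, value) if value >= 0 else _head(1, -1 - value)
    if isinstance(value, float):
        return b"\xfb" + struct.pack(">d", value)  # major 7, 64-bit float
    if isinstance(value, Decimal):
        # Finite values only. Tag 4 wraps [exponent, mantissa].
        sign, digits, exponent = value.as_tuple()
        mantissa = int("".join(str(d) for d in digits))
        if sign:
            mantissa = -mantissa
        return _head(6, 4) + _head(4, 2) + to_cbor(exponent) + to_cbor(mantissa)
    if isinstance(value, str):
        data = value.encode("utf-8")
        return _head(3, len(data)) + data
    if isinstance(value, bytes):
        return _head(2, len(value)) + value
    if isinstance(value, (list, tuple)):
        # Arrays and tuples both become major type 4.
        return _head(4, len(value)) + b"".join(to_cbor(v) for v in value)
    if isinstance(value, dict):
        items = [(to_cbor(k), to_cbor(v)) for k, v in value.items()]
        items.sort(key=lambda kv: (len(kv[0]), kv[0]))
        return _head(5, len(items)) + b"".join(k + v for k, v in items)
    raise TypeError(f"no CBOR mapping decided for: {value!r}")
```

Note that `to_cbor([1, 2, 3])` and `to_cbor((1, 2, 3))` produce identical bytes, which is exactly the array-versus-tuple round-trip problem flagged in the list.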
Canonical CBOR form
Basically specified just by taking the canonical text form of this and translating it into CBOR via the above rules, I think.