I know JSON is the standard now, but are there “better” serialization formats out there? Especially since JSON doesn’t know what an integer is in the spec.
JSON does a couple things really well, and most other things terribly.
But the things it does well are pretty valuable. So in the "strengths" category I'd put the two following points:
1. JSON is very easy to read and understand as a human
2. JSON stuck to the basics. No comments, no references, no clever tricks, and not much space to let folks try to hammer in cleverness (see - no comments).
Neither of those is all that much related to JSON itself as a format - the semantics are basically an accident of timing around JS syntax from the 2000s.
But it's very, very useful to be able to get the raw text for a network message and know exactly what's getting sent without having to have a whole specialized tool framework to parse and understand the message.
It's also useful to not let the spec get so complex that I never want to do that, even if I could (see: XML).
So with JSON - I can easily read the actual network request and understand it, even with essentially zero additional tooling AND I have a very good chance of literally being able to open a text editor and create a new message with valid syntax without any other tools or references.
Further - this holds true even if I'm not an industry expert with 20 years of experience. Most random people off the street can do it with only a couple minutes of coaching.
Not many other serialization formats can do that.
Imagine taking your 8-year-old, sitting them down in front of the computer, and legitimately saying "JSON doesn't know what an integer is in the spec"!
It's true... but it's absolutely not the point. For normal people "number" is complex enough. And if you need an int and not a float... you can do that processing just fine after getting a JSON payload if you'd like. It won't be as fast as a specialized format (e.g. protobuf), or as flexible as other formats (e.g. XML) - but that's a far distant concern to "Can I hold the hammer".
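For instance, a minimal sketch of that post-parse check in Python (the field name is hypothetical):

```python
import json

payload = json.loads('{"retry_count": 3.0}')

# JSON only promises "number"; enforce integer-ness yourself after parsing.
value = payload["retry_count"]
if not (isinstance(value, int) or (isinstance(value, float) and value.is_integer())):
    raise ValueError(f"retry_count must be an integer, got {value!r}")
retry_count = int(value)  # 3
```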
JSON is really easy. "Easy" as a strength is wildly discounted, but man is it a winner when you get it. I also think it's surprisingly hard to do.
Yeah, and if people would solely use them as comments for humans to read... I'm with you.
But they won't. A big part of the reason comments weren't included in JSON is that people tried to get clever with them.
Directly quoting Crockford:
> I removed comments from JSON because I saw people were using them to hold parsing directives, a practice which would have destroyed interoperability.
And while I'd also love to occasionally throw a comment in a json file, I don't want to have to deal with any of the headaches they would have created in the ecosystem.
And to be fair to Crockford here - it's not like he wasn't aware this was a downside. He even released a tool as a preprocessor for JSON if you wanted to put comments in: https://www.crockford.com/jsmin.html
JSON intentionally chose to stay as simple and compatible as possible, and personally - I think that constraint was the right call.
If I'm writing files I want to throw a lot of comments in... It usually means I should move to something like YAML instead.
Again - JSON is terrible at a lot of things, but really hammered on simple and easy as focus points. If you give devs a place to store data outside the structure of the protocol... they will use it for all sorts of complicated craziness... which devolves into either multiple protocols or one really complicated protocol.
I know his reasoning for it, I just disagree with him. People added JSON parsers that allow comments and can _still_ get tricky with them. The only thing the standard not adding them did was make sure we can't rely on them being there. And, for ANY file format that is used for config (and similar) that is supposed to be human readable, being able to add comments is pretty much table stakes imo.
> The only thing the standard not adding them did was make sure we can't rely on them being there.
I mean... that's basically the entire point of the decision. If you can't rely on them... you can't rely on them to ship metadata.
If you could rely on them to ship metadata... you start seeing parsers diverge wildly in features and scope - to the point that you've really got several different protocols all pretending to be "JSON". You end up with JSON_V1_UTF8_RTL, and JSON_V4_UCS2 and JSON_V3_EXT_UNICODE_LE, etc... All of which will be subtly (or not so subtly) incompatible, and then you're right back at XML.
No one is stopping you from writing config files in a superset of JSON that supports comments (yaml being the most common).
But the HUGE win here is that those formats are actually supersets - not alternative protocols. They all parse plain JSON. They might also happen to do tricky things - but if you give it standard JSON it works a-ok.
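For example (a minimal sketch, assuming PyYAML is installed - pip install pyyaml):

```python
import yaml

# A plain JSON document fed straight to a YAML parser: it just works,
# because the superset accepts the base format as-is.
doc = '{"name": "example", "retries": 3}'
print(yaml.safe_load(doc))  # {'name': 'example', 'retries': 3}
```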
---
I think that there's an argument to be made that given how successful JSON is NOW - adding the ability to insert comments might be valuable (See: JSONC, or JSON5)
But I think as far as the initial stages went, not having comments was pretty clearly the right call.
Like - it's just a small step from comments to processing instructions (à la XML: https://en.wikipedia.org/wiki/Processing_Instruction) and while I don't always agree with Crockford... I'm with him 1000% that the second you let that kind of metadata live in the format... the complexity EXPLODES.
Better to keep the standard format intentionally clean of it, and let people declare their own supersets.
As the other poster said, you could use XML, which is more powerful but, as a result, a lot more complex. For most tasks I'd prefer JSON because, while it is lacking, all the real-world parsers I've seen are much easier to work with, and I rarely need more complexity. If someone did a JSON++ (I have no doubt many people have, but I'm not aware of them!) that added things like integers without the complexity of XML, that might be even better. In the real world, if something should be an integer it isn't hard to check that and error out - you need to support parse errors in any data format anyway.
Protobuf is sometimes better for data serialization. It isn't human readable, but you rarely need that, and saving data bytes is often useful even today. Protobuf does have the integer type that you are missing, but it has other limitations that might or might not apply to you. (I don't use protobuf enough myself to know what they are.)
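To make the integer point concrete, a hypothetical .proto sketch (message and field names made up):

```proto
syntax = "proto3";

// Hypothetical schema for illustration only.
message Measurement {
  int32 sensor_id = 1;  // real integer types, not JSON's generic "number"
  int64 timestamp = 2;
  double value = 3;
}
```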
SQLite has more than once suggested that their database file is a great serialization format. You get a lot of power here, and for complex things a database is often easier to work with than an XML file. There are various NoSQL databases as well that can sometimes work for this.
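A minimal sketch of that idea in Python, using the standard-library sqlite3 module (table and key names made up):

```python
import sqlite3

# Writer: the database file itself is the serialized payload.
con = sqlite3.connect("payload.db")
con.execute("CREATE TABLE IF NOT EXISTS settings (key TEXT PRIMARY KEY, value TEXT)")
con.execute("INSERT OR REPLACE INTO settings VALUES (?, ?)", ("retries", "3"))
con.commit()
con.close()

# Reader: just open the same file and query it.
con = sqlite3.connect("payload.db")
print(con.execute("SELECT value FROM settings WHERE key = ?", ("retries",)).fetchone())
con.close()
```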
I've handwritten my own serialization format in the past. The only hard part is designing in enough flexibility to add whatever the future needs turn out to be (note that I've never had to read my serialization on a different CPU family - things like little- vs big-endian byte order can, I'm told, be a pain).
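As a toy illustration of that endianness concern (a Python sketch - the takeaway is to pin the byte order in the format itself instead of relying on whatever the CPU does natively):

```python
import struct

value = 1024

little = struct.pack("<I", value)  # byte order pinned: little-endian everywhere
native = struct.pack("=I", value)  # native byte order: varies by CPU family

print(little.hex())  # "00040000" on every machine
# native.hex() happens to match on x86/ARM-LE, but not on a big-endian CPU,
# which is exactly the cross-CPU-family pain mentioned above.
```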
There might be something else I didn't cover... Everything has pros and cons.
Protobuf does support JSON encoding[0], which I like: the .proto definition is quite readable, and then you can encode/decode either human-readably or efficiently. It's even quite easy to have your consumer support both, since the two encodings are pretty easy to tell apart; if you know it's either one or the other, you can just fail over from trying one to trying the other, possibly at some small cost. The guide also points out some significant downsides to relying on the JSON version, but it can be useful in development and/or debugging, especially if you control both the sending and receiving sides and can just toggle it on temporarily when you want.
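A rough sketch of that failover in Python, assuming a generated class from a hypothetical measurement.proto (measurement_pb2 and Measurement are made-up names):

```python
from google.protobuf import json_format
from measurement_pb2 import Measurement  # hypothetical generated module

def decode(raw: bytes) -> Measurement:
    msg = Measurement()
    try:
        # The JSON encoding is UTF-8 text, so try that first.
        json_format.Parse(raw.decode("utf-8"), msg)
    except (UnicodeDecodeError, json_format.ParseError):
        # Otherwise assume the compact binary wire format.
        msg.ParseFromString(raw)
    return msg
```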
JSON has one glaring flaw: nested JSON encoded in strings becomes awful to read. I encounter it too often in reality, where individual layers use JSON but want to support arbitrary strings in their API. Length-prefixed encodings don't suffer from this, which ironically includes most binary formats.
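A small demonstration of how the escaping compounds at each layer (Python, toy values):

```python
import json

inner = json.dumps({"msg": 'say "hi"'})   # one layer of escaping
outer = json.dumps({"body": inner})       # escape the escapes

print(outer)
# {"body": "{\"msg\": \"say \\\"hi\\\"\"}"}  <- two layers deep and already awful
```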
Back to my main point though: normally I don't need the complexity that things like nested JSON imply. When you do, though, JSON is a bad format. (Actually, I would go so far as to say you never need something that complex - but the problems you are trying to solve with nested JSON are still complex enough that you should use a more powerful/complex framework, and better design of your data store would avoid the need for nested JSON in the first place.)
If you have the correct version available, that is. All too often when debugging problems, the person in the field doesn't have the correct tools, or doesn't know how to use them (and in this case you may not want to share the proto config with that person...). As such, the fewer tools needed to understand something, the better.
Different formats are good for different things, but I think DER is much better. No character escaping is necessary, Unicode is not required (although it can be used if you want), arbitrary binary data can be stored, integers can be arbitrarily big (although implementations might only support integers as big as the specific application requires), you can skip past any block without needing to know how to interpret it, and there are many other advantages. (I made up a variant with a few additional types, such as key/value list, BCD string, TRON string, etc., which makes it strictly a superset of the types of data that can be stored in JSON - if the types you use are sequence, key/value list, real number, null, boolean, and UTF-8 string.) I use DER in some of my programs because I think it is generally much better than JSON. Also, DER is a binary format, although I did make up a text format (called TER) which can be converted to DER (but TER is not really meant for other uses, since it is more complicated to handle).
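That "skip past any block" property falls out of DER's tag-length-value layout; here is a minimal Python sketch (handling single-byte tags and definite lengths only):

```python
def skip_element(data: bytes, offset: int) -> int:
    """Return the offset just past the DER element starting at `offset`."""
    offset += 1                       # skip the tag byte
    length = data[offset]
    offset += 1
    if length & 0x80:                 # long form: low 7 bits = number of length bytes
        num_bytes = length & 0x7F
        length = int.from_bytes(data[offset:offset + num_bytes], "big")
        offset += num_bytes
    return offset + length            # jump over the value without interpreting it

# INTEGER 5 (02 01 05) followed by BOOLEAN false (01 01 00):
print(skip_element(bytes.fromhex("020105010100"), 0))  # 3 - start of the BOOLEAN
```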
If you care a lot you can use Protobufs. Downside is now everything has to speak protobuf + you can't read it in your network tab. Upside is (mostly) smaller payloads and a lot more type safety.
I can't imagine why. XML is still fundamentally, well, a markup language, not a serialization format designed as such. But the "extensible" part isn't so accurate - attributes aren't extensible. GP complains that JSON doesn't know what an integer is (as distinct from a generic number), but at least it does know more than just strings. And needing to repeat a tag name when closing it just adds useless complexity.
It’s not any more useless than a closing } or ] - except it has the tag name in it, so when I’m reading a highly nested object I’m not stuck in my text editor looking at a bunch of }’s at random indentation levels that I have to scroll all the way back up from to regain any context. Tags are text, which is visual structure I can choose to read, or choose to gloss over and use as bulk to shape the data in my head.