Parsing JSON Is a Minefield (2016)

q3k · on Oct 11, 2021

Some other fun facts about JSON, its mainstream implementations and using it reliably:

1. json.dump(s) in Python by default emits non-standards-compliant JSON, ie. will happily serialize NaN/Inf/-Inf. You want to set allow_nan=False to be compliant. Otherwise this _will_ annoy someone who has to consume your shoddy pseudo-JSON from a standards-compliant library.

2. JSON allows for duplicate/repeated keys, and allows for the parser to basically do anything when that happens. Do you know how the parser implementation you use handles this? Are you sure there's no differences between that implementation and other implementations used in your system (eg. between execution and validation)? What about other undefined behaviour, like permitted number ranges?

3. Do you pass around user-provided JSON data accross your system? How many JSON nesting levels does your implementation allow? What happens if it's exceeded? What happens if different parts of your processing system have different limits? What about other unspecified limits like serialized size, string length?

My general opinion is that it's extremely hard to reliably use JSON as an interchange format reliably when multiple systems and/or parser implementations are involved. It's based on a set of underdefined specifications that leaves critical behaviour undefined, effectively making it impossible to have 100% interoperable implementations. It doesn't help that one of the mainstream implementations (in Python) is just non-compliant by default.

I highly encourage any greenfield project to look into well designed and better specified alternatives.

makeitdouble · on Oct 11, 2021

It’s a pyramid.

At the bottom you have CSV which is popular beyond belief, and has no real specification, with common cases wildly handled differently across libraries.

In the middle you have JSON which isn’t 100% interoperable, but goes 98% of the way.

And you have XML and protobufs at the top tip, who have strong mechanisms available for interoperability but at an operational cost that rarely justifies the upgrade from JSON.

I suspect it will take a lot more that “well designed” and “better specified” to justify moving away from JSON as the default stepup from chaotic CSV like formats.

maple3142 · on Oct 11, 2021

I am not sure if parsing XML is better than parsing JSON. Many languages or libraries' XML parser are dangerous by default. You usually need to manually configure your XML parser to be secure from XML-related attacks. Fortunately, some languages and libraries are going to make XML have a securer defaults, this is a good change. IMO, I think XML shouldn't have include many questionable features from security perspective.

ievans · on Oct 11, 2021

For those unfamiliar with these attack vectors, there code injection and denial-of-service issues that in previous version of Python, were exploitable by default. Projects like https://pypi.org/project/defusedxml/ were designed to be secure against these issues by default, rather than requiring the library user to opt in.

The defusedxml project has an excellent matrix showing viability of the attack types against various python XML implementations: https://pypi.org/project/defusedxml/#python-xml-libraries

pdimitar · on Oct 11, 2021

Agreed, and I like the libraries I saw in the past that deliberately only support a small subset of all XML extensions (sadly now I can't remember the names). Reducing attack surface and increasing sanity in one stroke is a policy that much more open-source software has to adopt.

Sohcahtoa82 · on Oct 11, 2021

> Many languages or libraries' XML parser are dangerous by default.

Seriously, XML External Entities is an incredibly dumb feature. To have it enabled by default makes it even worse.

makeitdouble · on Oct 11, 2021

You are right, and XML parsers can have a very large attack surface due to the sheer amount of specs to adhere to.

I see XML as better in the expressiveness it has, and more mature out of the box options to validate and transform it. Security and bugs remain an issue, but at the scale it can be used, there is a fighting chance to have experts dealing with the hardening of it all.

Swagger like format definitions are still pretty lax in my option in comparison. Now I wouldn't want to get back to XML land, I just think it occupies a pretty solid niche that is hard to match with anything more simple.

tannhaeuser · on Oct 11, 2021

To be fair, XML wasn't intended as a data exchange format but as simplified SGML subset for use as delivery format on the web. While that largely hasn't happened, XML with XSD (sans rarely used feats) remains a strong exchange format for coarsely-grained inter-party traffic such as payment systems, taxes and other public/private data, etc.

I'm guessing the security deficits you mention are XML entity attacks. Well, SGML has CAPACITY ENTLVL in the SGML declaration to limit expansion depth. And a markup authoring or delivery format without entities/text macros is quite useless, even though HTML, when seen as a stand-alone markup language rather than SGML vocabulary, lacks it.

goodpoint · on Oct 11, 2021

> XML wasn't intended as a data exchange format but as simplified SGML subset for use as delivery format on the web

You cannot deliver web content without... exchanging data.

And you cannot trust servers not to attack browsers.

HWR_14 · on Oct 11, 2021

> And you cannot trust servers not to attack browsers.

Interesting. I normally see it expressed the other way (trusting the server and not the client). Obviously, both are important.

josefx · on Oct 12, 2021

There was a time when every browser hack made headlines, now it has to involve something big like specter to get even noticed. Browsers and browser based applications still feature heavily at Pwn2Own style competitions, but YOLO lets add more pointless features with bug ridden APIs with every browser update.

HWR_14 · on Oct 12, 2021

This is true and I remember when they were big headlines. I guess since I'm not in browser development, I'm used to the idea of contacting a known good server from my code, but my known good server being accessible from anyone on the internet.

Both are important, it's just interesting to see it written only from the other point of view.

tannhaeuser · on Oct 12, 2021

Not wanting to disturb your browser sec discussion, but tbh I don't know what that has to with my GP post. If anything, SGML/XML allows hard validation of document data like basically no other mainstream tech.

mcv · on Oct 11, 2021

I guess trust needs to be a two-way street. Even between computers.

HWR_14 · on Oct 12, 2021

Or lack-of-trust. Sanitize your inputs, because the other side is always an evil actor.

mumblemumble · on Oct 11, 2021

I would argue that, in the long run, gRPC/protobuf has a lower operational cost than JSON as long as you don't need to talk to it from a browser.

(Consuming it from a browser is a hassle because client-side JavaScript code is unable to speak the full gRPC API, so you need to fuss with reverse proxies to get everything working.)

What it doesn't have is a short learning curve. In order to get started, you need to learn the *.proto format, and how to use the code generator, and all the design implementations of the different data types it supports, and all of that.

But, once you get over that hump, it makes a lot of the hard stuff much, much easier to get right.

What I keep wishing for is some sort of "gRPC-lite" that doesn't include quite as many questionable micro-optimizations as protobuf/gRPC, but does include all of the really good ideas like specification-first API development, service reflection, and a clean logical separation between HTTP semantics and the semantics of the API that's being implemented on top of it.

makeitdouble · on Oct 12, 2021

> gRPC/protobuf has a lower operational cost than JSON

I'm a bit torn on this point.

We had a project where we needed to talk to an external grpc service, and protobuf was a technical given. Outside of the bootstrapping and typesetting, an issue we hit was the lack of tooling: Google officially supports protobuf libraries for 4 or 5 languages. There are other library providers, in particular for JS/Typescript, but from the get go there needs to be a strong library support or it's out of question. For instance looking at ruby, protobuf libs seem to be young at cursory glance.

It's true to some extent for JSON as well (though writing a parser for a limited set wouldn't be impossible), but I'd be hard press to come with a language that doesn't have a good JSON parser.

Then debugging also means relying on your library to encode/decode the payloads. If you need to replay a slightly modified query to check how an entry point reacts, you'll need to debug from your application, or build another test app just to emulate the requests. This issue gets compounded if you have doubt in your libraries' reliability. Development, maintenance, logging, monitoring etc. get that much heavier to deal with.

All in all I don't see protobuf as aiming for low operational cost. Or at least you need to be ready to bear the weight, and might be lucky if it ends up lighter than expected.

JSON can be scaled upward in reliability, with heavy testing and validation layers, while protobuf is hard to lighten.

mumblemumble · on Oct 12, 2021

That is a fair consideration. And part of what I was thinking about when I suggested that gRPC needs a slimmed-down alternative.

I think that part of the reason for weaker support on languages that Google doesn't personally maintain is that the gRPC protocol and tooling itself seems to be so technically complex that it's not really amenable to community support.

And I really don't think all that complexity is necessary. For example, in my own testing I found that the streaming protocol is generally not worth the extra hassle. Just getting rid of that would probably reduce the implementation cost considerably, and also eliminate the reason why client-side code can't speak gRPC.

throwaway894345 · on Oct 11, 2021

I tried to get into gRPC/protobuf in Go on a Mac for a little hobby project, but man the effort just to get protoc up and running and then generate the stubs was insane. I'm sure somehow or another, it's user error, but when the barrier of entry is so high, it's hard to justify the effort when JSON-slinging is so rarely the bottleneck.

Cloudef · on Oct 11, 2021

protobuf incurs lots of codebloat (especially with google's runtime / compilers) and the the serialization format is not really that nice IMO. I don't think it's possible to come up with 100% ideal format for all the use cases.

tiew9Vii · on Oct 11, 2021

If you never need to hit a browser Avro, Thrift, Protobuff etc., are way better and more straightforward options to deal with than JSON.

You have a strong schema, not a retrofitted on top JSON schema spec.

The schema is enforced as the data is serialized to binary format. They have schema evolution functionality.

You get the advantages of binary format, faster de/serialization speed, less bandwidth.

You are not writing your own decoders/encoders anymore. They all have code gen to generate domain objects in your language of choice. You have the schemas in a central repo and a CI job that reads them publishing a package for your language, i.e. JAR's, NPM's, Crates etc. etc. Users just pull the artifact. This part makes dev easier. You just import the artifact and can read/write the data with no effort/code.

I've had success using Protobuff in the browser without the GRPC using protobufjs. My API is a typical HTTP Rest service, but instead of sending/receiving JSON it uses Protobuff as the message format. It has made life really easy as I just import the generated Protobuf NPM, and when I make a fetch call, I deserialize the result to an object by using the auto-generated decode function and then I can call `.getFiled()`. As I'm using Typescript on the client-side, I get type checking and type hints in the IDE. If there's no `getField` function, I know that field doesn't exist.

The advantage I now have is I don't need to do OpenAPI stuff. I get that for free. I have the request objects and response objects as protobuf so there's a strongly enforced schema file acting as the contract, and I can produce client packages in any language for free.

The other benefit is it has reduced response sizes and latency. The response size was the main reason I did a POC with protobuf as working on a mapping application and sending polylines to be displayed on a map meant I was sending 10Mb+ of JSON to the browser. Protobuff significantly reduced this.

The caveat is Protobuff are hard to debug in the browser. You can't just use the net inspector as you have binary data, not plain text.

For any personal project going forward, I'll use Protobuf instead of JSON. The biggest annoyance I have is writing JSON de/serialisers in my serverside language then having to do the same for the client-side language/app, them getting out of sync when the API evolves or making typos. Using Protobuf removes all those pain points, import the package, use the package, the specification has schematics around schema evolution so API changes is just updating to the latest package. The only downside is if you have a public API no one else does it so everyone expects JSON, a non issue for private projects.

nawgz · on Oct 11, 2021

> gRPC/protobuf has a lower operational cost than JSON as long as you don't need to talk to it from a browser.

So, reading this the other way (browsers are king)... you claim gRPC is so useless as to not be able to power your entire system and requires you to standup duplicate interchange systems for different use cases?

Yikes.

mumblemumble · on Oct 11, 2021

Browsers are king for some people, not others. For the stuff I'm working on, the browser is just the tip of the iceberg. For everything below the waterline, the operational benefits (strong static typing, well-defined backward- and forward-compatibility semantics, better throughput and latency characteristics) greatly outweigh the, "but we have to use a lightweight Envoy reverse proxy to expose some things to the browser," problem.

I also have a tendency to consider that Envoy proxy to be more of a feature than a bug, anyway. It's pretty easy to set up, all told. We want to gatekeep the edge, anyway, for various reasons, so it's not like there was ever a reality in which we weren't going to be fussing with a reverse proxy. And it serves as a nice opportunity to stop and be thoughtful about exactly what we're choosing to expose to the Internet.

Speaking purely as a developer, I do find it to be an annoyance. But I also acknowledge that inconveniencing developers for the sake of the greater good can be a wise move.

bob_roberts · on Oct 11, 2021

For a cloud-based app, that might literally just be at the application gateway. Beyond that, everything could be whatever protocol.

heresie-dabord · on Oct 11, 2021

I would describe it as a _pyramid of bounded viability_ , from the minimally viable to the feature-burdened maximum.

CSV excels (ha) as a minimally viable exchange format for data. Combined with awk, grep, sed, bash, and Perl, and some simple SVG or D3 with SVG, the analytical solution is fast, scalable, and automatable.

But CSV has limits. The column headers are the schema. Beyond these bounds, we have the other formats.

JSON is messier. Its strength is in network/browser encapsulations and operations. As I have seen it used, people insert an array as a kind of schema, and they stay away from complex nesting where scalability starts failing and other difficult tooling must be summoned to compensate (parsing tools such as jq).

Beyond JSON's bounds, we have XML and associated tooling. XML is versatile and expressive.

XML and JSON can be written simply but both can be abused by programmers who aren't thinking beyond their own cursor.

This is a rich set of tooling for data representation.

In the end, one of the main advantages of CSV is that it remains a format that brings little tooling baggage ("ecosystem") to the task.

jedimastert · on Oct 11, 2021

>CSV...has no real specification

Bite your tongue sir! It has a GLORIOUS specification!

https://datatracker.ietf.org/doc/html/rfc4180

makeitdouble · on Oct 12, 2021

I started reading with a light of hope I thought was long extinct, expecting this long awaited standardization of some kind, then on the first page:

> Status of This Memo

> This memo provides information for the Internet community. > It does not specify an Internet standard of any kind.

My disappointment is immeasurable and my day is ruined.

petschge · on Oct 11, 2021

I'd argue the top of the pyramid is actually formats such as HDF5. That format was started to store voyager data and we can still read it after more than 40 years. It makes the format of entries extremely clear ("this is an 3d array of floating point numbers in IEEE755 format, with 64, 2 and 17 entries per dimension") and encourages further meat data ("it is in statV per centimeter and came out of channel of of the intrument") in addition to naming the data set "electric field". Compared to horrible piles of binary data that used to be common (and still are!), it's a breeze of fresh air to work with.

mst · on Oct 11, 2021

When faced with a case where SV is a natural fit, I've long been in the habit of (ass-u-ming I get to make that call, of course) specifying PostgreSQL COPY style TSV as the interchange format, and using more 'normal' TSV to make it easy to get data exports into Excel and friends.

That's turned out to be rather less annoying than any other approach to SV I've tried over the years.

recursive · on Oct 11, 2021

> you have CSV which is popular beyond belief, and has no real specification

What about RFC 4180? Works for me.

fomine3 · on Oct 12, 2021

Not work for rest of the world (called Excel)

jerf · on Oct 11, 2021

"My general opinion is that it's extremely hard to reliably use JSON as an interchange format reliably when multiple systems and/or parser implementations are involved."

I suspect one of the reasons that JSON has been so successful is precisely this fuzziness, though. Every language can do something a little slightly different and it'll work at first when you send it to somebody else. You get up and off the ground really quickly, and can fix up issues as you go.

If you try to specify something with a stronger schema right off the bat, I find a number of problems immediately emerge that tend to slow the process down. It may be foreign to programmers on HN who have embraced a strong static type mindset, or dynamic programmers who have learned the hard way that sometimes you need to be more precise about your types, but there's still a lot of programmers out there who will wonder why you're asking them whether this is an int or a float is relevant. I came in to work this morning to an alert system telling me that a field that a particular system has been sending as an integer for a couple of months now over many thousands of pushes, "number of bytes transferred", is apparently capable of being a float once every several thousand times for some reason. There's a lot of programmers who will send a string, or a null, or maybe a float, or maybe it's always an integer, and deeply don't understand why you care what it's getting serialized as.

And that's just an example of some of the issues, not a complete list. Trying to specify with some stronger system moves a lot of these issues up front.

(If your organization has internalized that's just how it has to be done, great! I bet you encountered a lot of these bumps on the way, though.)

This isn't a celebration of JSON per se... this is really a rather cynical take. I don't know that we need to type everything to the n'th degree in the first meeting, but "why can't we just let our dynamically-typed language send this number as a string sometimes?" is definitely something I've had to discuss. (Now, I don't get a lot of resistance per se, but it's something I have to bring up.) I'm not presenting this as a good thing, but as a theory that JSON's success is actually in large part because of its loosey-gooseyness, and not despite it, regardless of how we may feel about it.

dec0dedab0de · on Oct 11, 2021

I suspect one of the reasons that JSON has been so successful is precisely this fuzziness, though. Every language can do something a little slightly different and it'll work at first when you send it to somebody else. You get up and off the ground really quickly, and can fix up issues as you go.

I agree. Sort of how xhtml never really caught on because it was too strict. I never understood the desire to make things break when it's often less effort to make them work.

Though I think the biggest benefit of JSON is that it is so simple, at least compared to XML. It makes it harder to just dump your internal data structures as is. Which forced people to actually serialize their data. Though with time people have overcomplicated it with objects that have "type" and "value" fields, basically designing their own standard.

* There's a lot of programmers who will send a string, or a null, or maybe a float, or maybe it's always an integer, and deeply don't understand why you care what it's getting serialized as.*

As far as changing the type depending on the situation, I kind of wish that was more common. I like the idea of conveying meaning based on type, but for it to work well it would need more standard types, plus anyone using a static language would be mad at you.

q3k · on Oct 11, 2021

> Though I think the biggest benefit of JSON is that it is so simple, at least compared to XML.

Or more precisely, that it appears simple at first glance, and that it is very easy to get started with. TFA (or just practical experience trying to build an interoperable JSON-based API) should convince anyone that it is not simple in the long term :).

mbeex · on Oct 11, 2021

> Though I think the biggest benefit of JSON is that it is so simple

Still, I wish there was an option to insert comments.

dwaite · on Oct 11, 2021

> Though I think the biggest benefit of JSON is that it is so simple, at least compared to XML. It makes it harder to just dump your internal data structures as is. Which forced people to actually serialize their data. Though with time people have overcomplicated it with objects that have "type" and "value" fields, basically designing their own standard.

XML is a document language with features like mixed content to represent concepts like subsections of formatted text. IMHO quite a few of XML's failings were in the "data format" crowd being a separate camp, and the two never really pushing for good middle ground.

For the crowd that wanted a common scaffolding for document formats, having the rules between say namespace usage in XHTML vs Docbook-XML would not be a problem. For instance, HTML states you should ignore unrecognized tags and instead just show the text contents.

That all came back to bite hard when the data model people started to try to do canonicalization and document signing.

A "strict" variant of JSON fits on a napkin - basically, reject documents with multiple identical keys in an object, represent native numbers using IEEE double-precision floating point, reject documents which do not meet the grammar.

lifthrasiir · on Oct 11, 2021

I'm not convinced. There are a lot less JSON implementatations than JSON users, so we should have been possible to guide implementations with a means of proper specification and test suites. Note that the OP is possibly the first ever complete test suite for JSON after 15 full years. It is not like seeding initial implementations (that can serve as models for future implementors) is particularly hard either; Douglas Crockford himself wrote two implementations in C and JavaScript.

throw_m239339 · on Oct 11, 2021

Most of your problems aren't problems

> 3. Do you pass around user-provided JSON data accross your system? How many JSON nesting levels does your implementation allow? What happens if it's exceeded? What happens if different parts of your processing system have different limits? What about other unspecified limits like serialized size, string length?

XML has the same issue, that's why SAX exists, it works the same way with JSON.

> 2. JSON allows for duplicate/repeated keys, and allows for the parser to basically do anything when that happens. Do you know how the parser implementation you use handles this? Are you sure there's no differences between that implementation and other implementations used in your system (eg. between execution and validation)? What about other undefined behaviour, like permitted number ranges?

A parser should... parse and not interpret data or it isn't a parser. it's a deserializer. Well how many languages allow duplicate keys for maps anyway? this isn't an issue in practice.

Basically, the answer to all your problems is to use an evented parser instead of a deserializer.

kortex · on Oct 11, 2021

> this isn't an issue in practice.

It absolutely is an issue in practice. If system A handles dupes by accepting the first and ignoring the rest, and system B implements last-key-wins, then that's a potential source of bugs. The system might not fully parse to a map.

It may, for example, do string-level modification of json strings. Is that disgusting and wrong? Yes. Have I seen it in prod? Also yes.

throw_m239339 · on Oct 11, 2021

> It absolutely is an issue in practice. If system A handles dupes by accepting the first and ignoring the rest, and system B implements last-key-wins, then that's a potential source of bugs. The system might not fully parse to a map.

But the system shouldn't be automatically be parsing a "json map" to a map at first place:

    {"foo":"bar","foo":"baz","foo":"qix","fiz":"buzz"}

Shouldn't be deserialized into a map. but a Array<Map<string,string>> like structure.

A SAX style parser for JSON can help do that.

Thus the issue is the choice of parser indeed. Not JSON.

q3k · on Oct 11, 2021

> Shouldn't be deserialized into a map. but a Array<Map<string,string>> like structure.

But that's the thing: you might actually expect/want a Map<string,string>, but a malicious/broken system might emit something that cannot be deserialized into a Map<string,string>. It's then the JSON parser's/deserializer's job to figure out what to do, as the standards say to do whatever. That in turn causes different parsers/deserializers to behave differently (whatever the implementer thought makes sense), which is a source of interoperability bugs.

dragonwriter · on Oct 11, 2021

> But that's the thing: you might actually expect/want a Map<string,string>

Yes, but that's not the semantics of a bare JSON object; if you want the ability to commubicate that you intend that, then you use a schema language like JSON schema, which lets you say that the JSON map in this element doesn't allow duplicate keys and requires the values to be strings, at which point tools that read the schema language no it is safe to deserialize as Map<string, string>.

throw_m239339 · on Oct 11, 2021

> But that's the thing: you might actually expect/want a Map<string,string>, but a malicious/broken system might emit something that cannot be deserialized into a Map<string,string>. It's then the JSON parser's/deserializer's job to figure out what to do, as the standards say to do whatever. That in turn causes different parsers/deserializers to behave differently (whatever the implementer thought makes sense), which is a source of interoperability bugs.

I disagree, people are mixing up parsing and deserializing. The JSON spec isn't at fault here. The JSON spec is only concerned with defining the parsing, not the deserialization, because obviously, a JSON array isn't a PHP array or a Ruby array, a JSON map isn't a PHP object or a Go map at first place.

The problem isn't with JSON but how some JSON deserializers work. Again, a deserializer isn't a parser.

q3k · on Oct 11, 2021

> The problem isn't with JSON but how some JSON deserializers work.

That makes no observable difference to the end-user of JSON wishing to use it as an interchange format. The standard might as well be perfect, but if nearly all of its implementations (yes, extending that into deserialization, not just parsing - because that's how most people use JSON!) are problematic, then the standard is effectively also problematic. This is why I also always include Python's broken implementation in my JSON rant - it's not indicative of the standard(s) being bad, but the ecosystem being bad.

throw_m239339 · on Oct 11, 2021

> That makes no observable difference to the end-user of JSON wishing to use it as an interchange format. The standard might as well be perfect, but if nearly all of its implementations (yes, extending that into deserialization, not just parsing - because that's how most people use JSON!) are problematic, then the standard is effectively also problematic. This is why I also always include Python's broken implementation in my JSON rant - it's not indicative of the standard(s) being bad, but the ecosystem being bad.

Yes it does makes a difference to the end user. Otherwise why single out JSON? XML or YAML would suffer from the exact same issue.

Deserializers are an anti-pattern if they don't follow a strict schema. The problem again isn't the JSON spec, it's some deserializers making assumptions about JSON types.

In practice data have specs and schemas so JSON/XML/... payloads should also have schemas.

nuerow · on Oct 12, 2021

> It absolutely is an issue in practice.

It really isn't. At most, it's a problem caused by picking a broken implementation that doesn't meet your needs, but that's a self-inflicted problem, not a JSON problem.

anonymoushn · on Oct 14, 2021

> Basically, the answer to all your problems is to use an evented parser instead of a deserializer.

This seems a bit unreasonable? People face the same issue where different pieces of software treat duplicate HTTP headers differently, and usually they haven't written each piece of software they use to parse HTTP headers

detaro · on Oct 11, 2021

> Basically, the answer to all your problems is to use an evented parser instead of a deserializer.

Which "nobody" does, so it is a problem in practice.

throw_m239339 · on Oct 11, 2021

> Which "nobody" does, so it is a problem in practice.

who's nobody? if developers care about performances, they obviously do. What if the json file is 500MB of logs? Furthermore, all these JSON deserialization lib tricks might work in some languages that are dynamic or support runtime reflection, it doesn't for other languages where using a proper evented parser is mandatory.

recursive · on Oct 11, 2021

> use an evented parser

I've never heard of this. A google search isn't particularly illuminating. What is an "evented parser"?

dragonwriter · on Oct 11, 2021

> What is an "evented parser"?

Also knowns as a “streaming parser”, its a parser that takes in a data stream and produces a stream of events which client code can handle; it allows more flexible handling than deserializers, including ability to handle arbitrarily large input. SAX is a streaming/evented parser API for XML, and there are similar ones for other formats.

rovolo · on Oct 12, 2021

The sibling comments are good descriptions, but here are example interfaces:

- Normal parser https://www.javadoc.io/static/com.google.code.gson/gson/2.8....

- Streaming parser https://www.javadoc.io/doc/com.google.code.gson/gson/latest/...

Smaug123 · on Oct 11, 2021

Just a parser which fires events you can listen on when its internal state machine changes state.

throw_m239339 · on Oct 11, 2021

Google "event-based parser"

35fbe7d3d5b9 · on Oct 11, 2021

> I highly encourage any greenfield project to look into well designed and better specified alternatives.

By way of recommendation: I reach for protobufs to do data interchange between polyglot systems and have yet to be disappointed. Even if you aren't getting into gRPC, having data interchange backed by codegen and an IDL removes a lot of the risk you get with data interchange.

theamk · on Oct 11, 2021

In my experience, protobuf has a minimum project complexity threshold before it starts to make sense.

Yes, if both sides of your interchange are systems which have build infra setup, it provides a better experience. But if you need to access data from outside of your usual projects, or from shell, or from random data analysis notebooks, It becomes a major pain.

Recent example: we’ve had an orchestrator script which was written in “python with stdlib only” - no build step, download an archive, extract and run. This script had to talk to third-party program which would export protobuf only. This was a major pain as yon can imagine.

avmich · on Oct 11, 2021

In my experience JSON allowed absence of codegen and superior schema definition capabilities to protobuf, and also nice transformations with parts of jq built into JSON libraries. Try to limit structure complexity to something which can be verified before usage, yes. YMMV.

magicalhippo · on Oct 11, 2021

> My general opinion is that it's extremely hard to reliably use JSON as an interchange format reliably when multiple systems and/or parser implementations are involved.

XML is very precisely defined in comparison to JSON. Yet we've had one customer who had a system that couldn't handle XML files with newlines in them at all, and several which sometimes sends ISO 8859-1 (Latin 1) encoded data in some fields of a XML file with encoding="UTF-8" in the header...

We of course also have some nice fixed-field integrations, based on customer's specs, where the system suddenly sends multiple mangled characters if any non-ASCII character is present, causing the fields to suddenly not be so fixed anymore... It behaves very much like UTF-8 interpreted as Latin-1, except with something else than Latin-1.

Anyway, I've given up trying to be strict at this point. We will have to wash incoming data, it's apparently inevitable.

spookthesunset · on Oct 11, 2021

I mean even if it is well defined that doesn’t mean the devs are using the languages native parser library. I’ve encounter at least two projects where the devs rolled their own XML “parser” using regex and “substring” functions. Why? “The xml library was too bloated… much easier to write it ourself”. Suffice to say, they had tons to problems.

stinos · on Oct 11, 2021

You want to set allow_nan=False to be compliant. Otherwise this _will_ annoy someone who has to consume your shoddy pseudo-JSON from a standards-compliant library

Funny (well, not really) thing is NaN and Inf are perfectly valid floating point numbers acoording to most (?) standards used on computers. To the point that I don't understand why it was left out of JSON. So unless you're 100% sure you won't encounter these numbers the choice is between not being able to use JSON, or finding hacks around (and using null isn't one of them since you have 3 numbers to represent), or just using non-compliant-yet-often-accepted JSON and possibly annoying someone whos parser doesn't handle it.

And for me there have been quite a lot of cases were I just quickly needed something simple to interface between components so when finding out they all support JSON+Nan/Inf then the choice is usually made quickly.

MathMonkeyMan · on Oct 11, 2021

From a practical standpoint, defining numbers in JSON to be "whatever double precision binary floating point does, or optionally something more precise" would have been good enough, and capture what we end up having anyway.

Still, I prefer Crockford's choice: that JSON numbers are defined to be numbers. Infinity and the flavors of NaN are... not numbers.

In an extensible data interchange format, like [edn][1], people could define conventions about more specific interpretations of numbers, e.g.

    #ieee754/b64 45.6653 ; this is a double

We could build such a format on top of JSON (there are probably multiple), but I again agree with Crockford that this sort of thing does not belong in JSON.

Makes for a bunch of headaches, though, for sure.

One example is a data scientist I used to work with. He was working with lots of machine learning libraries that liked to use NaN to mean "nothing to see here." A fellow developer ended up writing code that used some sort of convention to work around it, e.g. number := decimal | {"magic-uuid": "NaN"}. I can see why some people are of the opinion "this is stupid, just allow NaNs." I disagree.

[1]: https://github.com/edn-format/edn

josefx · on Oct 12, 2021

> We could build such a format on top of JSON (there are probably multiple)

Wouldn't a perfectly valid JSON pretty print nuke your values since it could parse a 128 bit value to double before writing it out again? I wouldn't trust your theoretical format unless it ensured that normal JSON parsers, which are not aware of its requirements, fail to parse it.

MathMonkeyMan · on Oct 12, 2021

The theoretical format would have to represent the more-specific numbers using strings, e.g. if we want to mimic the edn:

    {"#ieee754/b64": "123.45"}

Nevermind that this takes up way more space, and whatever other problems.

If an intermediate parser encounters this, we don't have to worry about its representation of JSON numbers, because our numbers don't involve JSON's.

dragonwriter · on Oct 11, 2021

> Funny (well, not really) thing is NaN and Inf are perfectly valid floating point numbers acoording to most (?) standards used on computers. To the point that I don’t understand why it was left out of JSON.

There are all kinds of ways to encode that in JSON, but (contrary to JS, where “numbers” or IEEE doubles, which include various things which are either not numbers or not finite), JSON numbers are generic finite (both in size or decimal representation) numbers, so “as JSON numbers” is not one of them. (And there’s no explicit way defined in JSON, so if you want it to be unambiguous, you need externally defined semantics, but you need that for most real uses anyway.)

nomel · on Oct 11, 2021

> To the point that I don't understand why it was left out of JSON

I think you're forgetting the birthplace of JSON. Who deals with the concept of infinity and NaN in the context of web front ends?

lifthrasiir · on Oct 11, 2021

Ranges are pretty common in APIs and both -Infinity and Infinity can naturally arise from one-sided ranges. Since they are absent in JSON, they are frequently replaced with null, ad-hoc sentinel values with uncoded assumptions (e.g. timestamps should be always positive) and missing fields.

stinos · on Oct 11, 2021

I get that, but to go from "oh this won't be very common" to willingly "let's just leave this out" is something else. At least in my mind :) Or was it an oversight?

mst · on Oct 11, 2021

I suspect it was a bet on worse is better.

Whether it was a good bet is debatable, but given Crockford's focus on "try and leave out as much as possible" I can certainly see it making sense at the time.

josefx · on Oct 11, 2021

> To the point that I don't understand why it was left out of JSON

Because JSON has generic numbers that just happen to be able to represent every numeric IEEE floating point double value. In theory you could have an implementation that uses a BigDecimal class or something similar to represent numeric values. Which is of course completely incompatible with every other JSON implementation and just asks for badly tested edge cases to rear their ugly head.

stinos · on Oct 12, 2021

every numeric IEEE floating point double value

Well, but not NaN and Inf? Or which IEEE are you referring to?

josefx · on Oct 12, 2021

NaN and Inf are not numbers.

zelphirkalt · on Oct 11, 2021

Some good points there. And now imagine people wanting to needlessly use YAML for configuration, which adds loads of edge cases on top of that.

EdwardDiego · on Oct 11, 2021

> How many JSON nesting levels does your implementation allow? What happens if it's exceeded

Haha, I've met a few stack overflows in this area.

thechao · on Oct 11, 2021

I had a little non-recursive JSON parser hanging around. When you have "nested" levels you've really only got two choices: object, or array. That implies that to track nesting, you just need an array of 1b values. In order to shave the yak properly, I built "nesting compressor" that detected runs of array/object and represented them using a 64b RLE; or, it bailed out, and then just used on-the-fly compression with zstd.

Obviously, any sort of JSON file that fit on a disk I can afford can be parsed into memory in a tiny fraction of its on-disk representation. I modified `yes` to just stream `[` out; the JSON parser handled it just fine — it takes a while to roll a 64b counter.

tehbeard · on Oct 11, 2021

While there's a lot of issues with JSON, this one also applies to any other interchange format that supports nesting, including the much beloved XML. Protobuf might also have this, idk if it does any static analysis for infinite depth.

q3k · on Oct 11, 2021

The problem doesn't really exist in Protobuf, as protobuf (de)serialization is performed based on an IDL definition of the message type. Whatever that IDL specifies, a corresponding typed definition and (de)serialization function will be generated for your programming language, and that implementation will ignore any fields that weren't part of the IDL. The (de)serializing code is statically generated ahead of time, and is treated like any other code that operates on potentially nested data structures.

What this means is that if your IDL specifies deep nesting (or recursive nesting), then it means your application is expected to handle this “by contract”, and attempts to deserialize will rightfully fail in case of out-of-memory / stack overflow errors. There's no danger of an implementation 'accidentally' deserializing something nested that was passsed from the outside, as anything unknown to the IDL is simply ignored.

Finally, there's no XML-like self-references in Protobuf, so it's not possible to have an infinitely deep structure, or a combinatorial explosion like with billion laughs - just a very deeply nested one, and only if allowed in the IDL, and only up to whatever message size limit you're allowing.

tehbeard · on Oct 11, 2021

Thank you for the 2nd + 3rd paragraphs, those were parts of protobufs design I wasn't really aware of from a cursory glance.

I'm a little suprised to learn there's no self-reference support in protobuf, as I wouldn't have assumed parsing that would be an issue (as all it really is is a pointer to an existing object in the message to say, put a ref. to it here), though I guess it might be a problem in supporting certain languages.

q3k · on Oct 11, 2021

> I'm a little suprised to learn there's no self-reference support in protobuf, as I wouldn't have assumed parsing that would be an issue (as all it really is is a pointer to an existing object in the message to say, put a ref. to it here), though I guess it might be a problem in supporting certain languages.

That's a tradeoff more designs should have, IMO: reduce the feature set as much as possible, but in return make the implementation vastly simpler. :)

I assume it's not only about support in programming languages, but also exactly to eliminate the entire class of bugs that stems from back/forward-references in serialized data, and to generally keep the wire format as simple (to parse and to implement a parser for) as possible. The few usecases that could make use of references are not worth the pain inflicted on everyone if they were implemented.

charleslmunger · on Oct 11, 2021

To clarify, tags unknown to the IDL are propagated without parsing, not ignored.

ChrisMarshallNY · on Oct 11, 2021

> the much beloved XML

I can't quite resolve "beloved" and "XML" in the same sentence...

That said, I have used XML a lot, pretty much because of XML Schema.

I don't like it. No sir. Not one bit. Uh-uh...

But there's really no viable substitute.

When I design an API, I generally start with an object model, and use native converters to create XML and JSON from it.

I will provide an XML Schema with the XML variant. I often have to do this by hand, which sucks. There are tools to create Schema from dumps, but these are pretty limited. I may use them to "get me in the ballpark," but there's always lots of elbow grease.

I'll use the XML for testing, but I will usually use the JSON format for the actual implementation.

nocman · on Oct 11, 2021

> I can't quite resolve "beloved" and "XML" in the same sentence...

You mean, it's possible to take that as being NON-sarcastic??? If so, I share your lack of resolution.

Cloudef · on Oct 11, 2021

Any JSON parser that tries to handle numbers without big number support is broken. This is why I always raise eyebrows if I see json library that doesn't allow me to deal with the number myself by retrieving it as a string.

JoBrad · on Oct 11, 2021

Agreed. Even worse, one that corrupts data it doesn’t know how to handle.

kortex · on Oct 11, 2021

> I highly encourage any greenfield project to look into well designed and better specified alternatives.

Like what?

jerf · on Oct 11, 2021

Part of the problem is that there's at least half-a-dozen high quality answers out of the gate (gRPC, FlatBuffers, Protocol Buffers, XML in some cases, Thrift), and an even-longer long tail after that. It's made harder when four different teams who deeply loathe JSON and independently decide to use something "better" can legitimately use four completely different technologies if they don't communicate with each other.

35fbe7d3d5b9 · on Oct 11, 2021

To your comment above – you can bodge around interop problems with JSON in ways that you cannot with some of these other technologies.

I like to joke that I invented ndjson over a decade ago when I accidentally forgot to put things in an array before `json.dumps`, I just wasn't smart enough to call it a standard. But when you do end up with ndjson when you wanted an array of results, or vice versa, JSON makes it easy to munge things to where you need.

Compare that to something like protobuf: it's not a self-synchronizing stream, so if you send someone multiple messages without framing them (prefix by length or delimited are popular approaches), they're going to decode a single message that doesn't make much sense on the other end. And they won't be able to fix it at all.

So I guess JSON is New Jersey style design[1].

[1]: https://dreamsongs.com/RiseOfWorseIsBetter.html

q3k · on Oct 11, 2021

> Compare that to something like protobuf: it's not a self-synchronizing stream, so if you send someone multiple messages without framing them (prefix by length or delimited are popular approaches), they're going to decode a single message that doesn't make much sense on the other end. And they won't be able to fix it at all.

FWIW, this is a conscious design decision with Protobuf: it allows for easy upsert operations on serialized messages by appending another message with the updated field values. This is very useful for middleware that wants to either just add its own context to a message it doesn't even parse [1], or for middleware that might handle protobuf messages serialized with unknown fields.

On the other hand, 'newline delimited protobuf' is much less useful day-to-day than ndjson, as gRPC provides message streaming, which solves the issue of wanting to stream small elements of a long response (which is the general usecase of ndjson from my experience). For on-disk storage of sequential protobufs (or any other data, really), you should be using something like riegeli [2], as it provides critical features like seek offsets, compression and corruption resiliency.

[1] - eg. passing a Request message from some web server frontend, through request routers, logging, ACL and ratelimit systems up to the actual service handling the request.

[2] - https://github.com/google/riegeli

kortex · on Oct 11, 2021

Well, you invented one of the best things since sliced bread! I love NDjson, being able to parse a sequence of {} objects as an array is just frankly more natural. A coworker got some absurd speedup going from some massive json array to ndjson.

Honestly if json had as part of its spec line-delimited arrays, and accepting NaN, it'd be close to perfect. Oh and native ints, but that is JS's problem.

Well, and a single, canonical spec. And a hard limit (however high) on nesting depth. And some other things. Ok, maybe it's far from perfect.

syncsynchalt · on Oct 11, 2021

> teams who deeply loathe JSON

In the current world this seems like a lifestyle choice that sets yourself up for constant self-punishment.

I might be a curmudgeon but I'll take JSON for data interop any day over anything that _requires_ tooling (protobuf, gRPC). And I'll take it over the XML ecosystem too.

The faults of JSON seem, in practice, to be less harmful than the faults of other formats.

q3k · on Oct 11, 2021

My preference is Protobuf, but really anything that's not JSON and which also comes with some IDL gets my approval.

kortex · on Oct 11, 2021

I like protobuf for some use-casess (namely grpc) but a) it's a binary format and sometimes (often times) it's nice to have a text protocol

b) protobuf libraries and protoc have given me way more grief overall than json (python, js, c++)

If your workflow already supports it, I can see it being useful, but it's got a pretty steep learning curve to be honest, certainly more than json, despite the ill-implemented libs out there. If I wanted a binary format, IMHO I'd go for msgpack first, and reach for protobuf if that didn't work for me.

elteto · on Oct 11, 2021

> I like protobuf for some use-casess (namely grpc) but a) it's a binary format and sometimes (often times) it's nice to have a text protocol

Protobuf (and flatbuffers) supports parsing messages from JSON instead of a binary blob. Best of both worlds IMO.

avmich · on Oct 11, 2021

Can you use JSON Schema? Generating classes from it, if you want native objects?

q3k · on Oct 11, 2021

You can use whatever you want :).

I personally would rather still go with Protobuf if I'm going to put in the effort to add a schema and codegen. It gives me other nice-to-have features (faster [de]serialization, smaller messages, field numbers and schema evolution, nicer IDL [not JSON!], gRPC, ...) and does away with some problems intrinsic to JSON that no schema system will fix (terrible number type, lack of binary type, slow parsing). It also has some interop with JSON in the rare case you absolutely positively need to convert to/from it (which is IMO the only upside of using JSON Schema in case you need that interop).

NavinF · on Oct 11, 2021

https://capnproto.org/

0xbadcafebee · on Oct 11, 2021

YAML. Of course implementations of this go all over the place too, but you could say the same of XML parsers to a certain extent.

I still pine for binary-only data formats. They're easier to program, and nobody makes the mistake of trying to edit them manually or compose them in a shell script. Parsing data shouldn't be hard, but it also shouldn't be so easy that people hang themselves by accident.

Of course, the reason why we largely have text data formats is because it's insanely simpler to troubleshoot systems that use them. Some things should just be easier to manipulate. But for general purpose work, I miss binary data formats.

Zip is probably my favorite general-purpose binary data format. It's old, well defined, works with any kind of data, and you can immediately seek to data in very large archives rather than having to parse the entire thing first. And then there's that whole compression thing. If you wanted to distribute a thousand tiny blobs of CSV, JSON, YAML, and XML, all in one container, you could do much worse than Zip.

rjh29 · on Oct 11, 2021

I've had a number of negative experiences with yaml, enough to put me off using it. For example the implicit parsing of 'yes' and 'no' into bools rather than strings (including the NO country code for Norway) <https://hitchdev.com/strictyaml/why/implicit-typing-removed/>, the no-quote rules allowing accidental creation of inline hashes/arrays <https://hitchdev.com/strictyaml/why/flow-style-removed/>, multiline string syntax so complex that it needs a helper tool <http://yaml-multiline.info/>, and powerful extensions that invite your program to be exploited <https://www.sitepoint.com/anatomy-of-an-exploit-an-in-depth-...>

It manages to be both a poor data interchange language compared to JSON, and also a bad human-friendly langage due to the above ambiguities.

Unfortunately it's still the best human-friendly configuration language in wide use, so I use strictyaml (https://hitchdev.com/strictyaml/) instead.

kortex · on Oct 11, 2021

NO is not even the worst of it. `on`, like in github actions, is interpreted as True by PyYaml by default. You have to either quote it, "on", or set certain configs I haven't bothered with just yet.

I fully agree YAML is just...not good as a transport/interchange serde.

Personally, I actually really like HCL as a human-friendly config language, but it's got challenges in writing it, and thus support in most languages, if even present, is read-only.

Will look into strictyaml!

0xbadcafebee · on Oct 12, 2021

Sorry, I should have been more explicit: YAML should never, ever, be edited by a human. Neither should JSON. But they can be read by a human, and copy+pasted, which makes them easier to troubleshoot.

All of the problems you list of YAML are due to humans making assumptions, rather than having a program do all of the serialization/deserialization. YAML has a ton of useful data structures and they do not cause problems when you use a real parser to populate or interpret them. It's also a superset of JSON, so it's clearly not poor in comparison to JSON.

It's also not a configuration language. It's a data serialization format.

rjh29 · on Oct 15, 2021

That might be what the authors want (the 1.0 spec mentioned configuration, the 1.2 spec mostly mentions data serialization) but in terms of actual use, it's pretty widely used as a configuration language. There would be no need for optional quote rules, inline hashes/lists or multiple flow styles (all of which YAML supports) if it was purely a serialization language.

If you make something human-readable, humans will edit it. The only reliable way to prevent that is to make the serialisation format binary, IMO.

YAML also supports a number of really powerful features like you point out (tags, anchors etc.) but I don't see them widely used due to security risks with untrusted input, and interlanguage compatibilty problems. So personally I don't think it's a super good data serialisation language either, despite its power, at least compared to something like Protobuf.

richardwhiuk · on Oct 11, 2021

YAML has all of the problems of JSON with some of the problems of XML, and some new ones thrown in. Avoid.

kortex · on Oct 11, 2021

Zip isn't a binary data structure protocol though, it just provides a compression protocol. I'd argue that while zip is technically older than gzip (3 years, 89 vs 92), it was proprietary for much of its history, and thus gz is an older "standard".

0xbadcafebee · on Oct 12, 2021

None of these are protocols. Zip is an archive format for files (including binaries), not a 'compression protocol'. It doesn't even have to use compression.

As far as 'standards' go, literally nothing uses Gzip/DEFLATE outside of Linux machines and HTTP clients/servers. Gzip isn't even an archive format.

prionassembly · on Oct 11, 2021

Is anyone sending sqlite binary blobs over the wire? Foreign keys as a replacement for recursive arrays sounds like a win...

BerislavLopac · on Oct 11, 2021

TOML?

lifthrasiir · on Oct 11, 2021

And all these problems trace back to Douglas Crockford. He didn't know how to make a proper serialization format [1] and also an interoperable standard (for the latter, Tim Bray tried very hard to make it slightly better [2]). He just noticed that a (supposed) subset of JavaScript can be easily turned into a serialization format with `eval` and went to publicize it, only noticing the issues later and still pursuring its standardization as is. I hate him.

[1] My additional complaints: https://news.ycombinator.com/item?id=24953981

[2] https://www.tbray.org/ongoing/When/201x/2014/03/05/RFC7159-J...

gmac · on Oct 11, 2021

I was using JSON before it was 'invented', as was basically anyone sending data to the browser in JS format.

Holding specific people responsible is pretty absurd.

lifthrasiir · on Oct 11, 2021

JSON before the standardization had an obvious data model and specification: ECMAScript. (I don't think JSON was widely used outside of JavaScript back then.) ECMAScript is particularly strictly defined even compared to other language standards, so it should have been possible to extract the relevant portions of ECMAScript into a proper standard. Crockford didn't. JSON as specified by Crockford was not even a proper subset of ECMAScript until ECMAScript itself retrofitted its syntax.

_bz2r · on Oct 12, 2021

> look into well designed and better specified alternatives

From an end user who knows little about these parsing issues, what alternatives are you happy with?

nuerow · on Oct 12, 2021

> My general opinion is that it's extremely hard to reliably use JSON as an interchange format reliably when multiple systems and/or parser implementations are involved.

None of the points you made were regarding JSON, and we're focused on either hypothetical or very specific implementation problems.

Even though it's true that you can experience problems if you pick a broken noncompliant implementation and fail to test, that's hardly a JSON issue. I mean, are we supposed to think that XML has a problem if you pick an implementation that fails to handle multiple entities or outputs broken XML?

SavantIdiot · on Oct 11, 2021

I deleted my snarky comment because I want to be more serious.

You never know when you hack something quick if it may become a future standard for trillions of handshakes!

JSON looks so clean and tidy on that business card, but when you look at both RFCs you realize: ZOMG there is a lot of stuff that needs to be thought out!!

There are some purists in this thread who claim it is the parsers' fault. I can almost get behind that, but not 100% because you really do need to be more clear about several things (different types of numbers, different character sets, minimum requirements -and- maximum requirements...)

I agree with OP that not including implicit version numbers was an oversight: it looks ugly, but if you're not going to put in all the thought in at the start, at least make a version number required so you can ignore mistakes.

Let this document be a lesson to anyone who writes a data schema/grammar that eventually replaces JSON.

admax88qqq · on Oct 11, 2021

I dunno man, serialization in general is fraught with peril. At least the JSON grammar is short and simple. I can't believe people in here are genuinely recommending XML as an alternative. Try standardizinh XML in a 14 page RFC.

Yes some things need to be though out, but I can sit down and read the JSON RFC start to finish easily.

SavantIdiot · on Oct 11, 2021

> I dunno man, serialization in general is fraught with peril.

I completely agree.

And I'm certain I'm using JSON in an unsafe way... somewhere. :)

Ginden · on Oct 11, 2021

And many people choose to ignore that number parsing issues can happen in XML too.

onion2k · on Oct 11, 2021

Most (all?) the complaints here appear to be that specific libraries fail to implement the JSON spec in the way that the author has interpreted it. Some libraries try to 'help' by parsing things that they shouldn't, and some fail to parse things they probably should.

This is why we end up with so many JSON parsing libraries I guess, but it's not really a problem with the format itself, beyond the fact that clearer specs might disambiguate things and lead to less deviation.

q3k · on Oct 11, 2021

> but it's not really a problem with the format itself, beyond the fact that clearer specs might disambiguate things and lead to less deviation.

It is a problem, because it's not a spec that can be implemented reliably. Different parsers behave differently on various corner cases not only because of implementation blunders, but also because the standard(s) just let them do whatever. This spectacularly breaks systems that use more than one parser implementation, each slightly implementing the standard slightly differently. One part of some processing/parsing pipeline will let some payload through, while another one will reject it, or even parse it differently.

horsawlarway · on Oct 11, 2021

I disagree (at least mostly).

This is a case where the spec is intentionally loose to allow compatibility with a much larger number of machines and use cases.

You'll notice most of the cases where the behavior is implementation defined have resource requirements. example: how deep you want to allow nesting depends a LOT on the capabilities of the machine running the code. A sane value for a modern browser is going to be unworkable on an arduino/ESP32/embedded other.

Also... if these ambiguities bother you, you probably haven't read the full http spec either. It's riddled with cases where behavior is implementation defined, for exactly the same reasons (resources are required, and you can't assume everyone has the same amount available). Want to take a guess at the maximum length for a url?

q3k · on Oct 11, 2021

> This is a case where the spec is intentionally loose to allow compatibility with a much larger number of machines and use cases.

There's plenty that could've been specified with little detriment to small systems: strings are UTF-8 with a well-defined escape sequence set, numbers are always IEEE-754 doubles, messages cannot be nested by more than 128 levels (or some other arbitrary number in this range), repeated fields are not permitted, everything non-compliant must fail the entire parse. Then the only thing left to handle is a maximum serialized size (which can be explicitly implementation or user defined). Set the maximum string length to maximum payload length defined earlier and you're golden. That is then your only difference between implementations.

This will work on your Ryzen server and on your ESP8266 or ESP32, and can even be handled on your washing machine microcontroller^W^W^WArduino (with a slowdown for dealing with floating point numbers, but you already have to deal with that).

Finally, the spec isn't loose because of some design choice to allow interoperability with more machines: it's loose because it was historically loose (see: JSON business card 'specification' chutzpah, which itself is based on a mess of a programming language that is/was JS), and before it could be formalized to something sensible it got implemented haphazardly by different languages. That doomed the format to forever be underdefined, as anything more strict would render existing implementations non-compliant.

horsawlarway · on Oct 11, 2021

But that attempt at strictness harms implementation value.

Even your own requirement set that you've claimed will work on everything is... bad - Sure I can parse every number as a double, if I'm willing to spend at least 64bits on every number in the payload.

I just finished building a PH Autodoser for a hydroponics system I run - it sends JSON payloads with sensor data and receives JSON commands to do things like dispense PHDown/PHUp solution, toggle on water cooling. I have very little spare working memory on the device doing the actual monitoring. having to hold 64 bits per number would push me into having to buy a more expensive microcontroller.

Instead - I have an informal contract that almost all fields are plain unsigned bytes (0 to 255) which works fine for my use-case, requiring just 1/8th the space.

And to go the other direction - I have a desktop running some financial software, I pass around json payloads there, but a double is NOT ENOUGH. I want a BigInt field for numbers there instead, because rounding errors that would be a-ok for a ph sensor are absolutely not ok for calculating financial data.

----

Basically - I want the flexibility to chose the correct interpretation for my data.

And this: "everything non-compliant must fail the entire parse." Is just fucking insanity. It's the literal antithesis of the robustness principle:

"be conservative in what you send, be liberal in what you accept"

q3k · on Oct 11, 2021

> Instead - I have an informal contract that almost all fields are plain unsigned bytes (0 to 255) which works fine for my use-case, requiring just 1/8th the space.

Right, but that informal contract is at the detriment of everyone else having to also specify the expected limits of numbers they work with. It makes your particular usecase easier, but it doesn't make the standard better in the grand scheme of things.

> And to go the other direction - I have a desktop running some financial software, I pass around json payloads there, but a double is NOT ENOUGH. I want a BigInt field for numbers there instead, because rounding errors that would be a-ok for a ph sensor are absolutely not ok for calculating financial data.

And JSON doesn't guarantee you that, you have to shop around for languages and implementations that permit this. If you then have to make work with an implementation that always deserializes to doubles (which is a compliant behaviour) or bytes (which is a compliant behaviour), you're screwed. Again, this might work for the simple case of you controlling both ends of the serialization, but it's terrible for trying to work with an end that you don't control (ie. when actually using JSON as an interchange format).

> And this: "everything non-compliant must fail the entire parse." Is just fucking insanity. It's the literal antithesis of the robustness principle: "be conservative in what you send, be liberal in what you accept"

The Robustness Principle followed blindly is known to be harmful when dealing with long-term standards, evolving implementations and the human element of software engineering [1]. My opinion is that an interchange format's job is to transfer some data reliably and atomically: the deserialized data should be either be 100% correct or the deserialization should be rejected. Anything else can and will lead to bugs, and bugs that are then difficult to solve (as at that point it's difficult to agree whether the serialization was not conservative enough, or the deserialization not liberal enough).

[1] - https://datatracker.ietf.org/doc/html/draft-iab-protocol-mai...

horsawlarway · on Oct 11, 2021

Ok - so now you have a very strict protocol, that never gains traction because the strictness you value hampers utility.

And yes - I'm aware of the "Bug for bug compatibility" problems that draft tries to highlight, but it's fairly clear that utility is paramount:

> As [SUCCESS] demonstrates, success or failure of a protocol depends far more on factors like usefulness than on on technical excellence. Timely publication of protocol specifications, even with the potential for flaws, likely contributed significantly to the eventual success of the Internet.

TeMPOraL · on Oct 12, 2021

> Ok - so now you have a very strict protocol, that never gains traction because the strictness you value hampers utility.

Well, things shouldn't be like that. This is the collective insanity of our industry - standardizing on tools that are easiest to work with when you don't understand them and are under time pressure to cut corners. It's the opposite of good engineering.

horsawlarway · on Oct 12, 2021

> Well, things shouldn't be like that.

Ain't that the story of this planet.

That said - I'm not really sure you can call it bad engineering. It turns out that people will use whatever is available to suit their needs (and god help you if you try to predetermine all the strange edge-cases they'll use it for... https://xkcd.com/1172/). For every one developer that's had to debug a json parser compatibility issue, there are probably a million other developers happily solving real world problems with JSON with no issues at all.

karmakaze · on Oct 11, 2021

What you're describing here is a schema-specific parser. Even if the parser succeeded, you would reject the input as the values are out of range. Making a custom parser for this is fine, but call it what it is a parser for a subset of JSON--it would fail for a value of 0.1 or -1.

horsawlarway · on Oct 11, 2021

Sure. The problem is that many things that a very strict spec might require have real resource requirements.

There's a reason the URL length in http is undefined, it's because the machine accepting the request doesn't have infinite memory. Even the latest spec is a simple recommendation to accept a request line of at least 8k octets.

You can say "We must support nesting depth of N" in json, but the reality of the situation is that parsers can and will just ignore you. Are they non-compliant? Sure. Are they useful? Sure.

Will people still use them? Fuck yes they will. Because utility trumps strictness in most cases.

josefx · on Oct 12, 2021

> , be liberal in what you accept

While some people are fine with garbage in / garbage out some of us prefer to filter out the garbage before we have to spend days debugging why something doesn't work as it should.

horsawlarway · on Oct 12, 2021

One man's garbage is another man's treasure.

If you don't like how the current parsers work, feel free to go write your own, because frankly "doesn't work as it should" is presumptuous bullshit. They work exactly as their current authors like, if that happens to not be correct for you... either put up or shut up.

josefx · on Oct 13, 2021

1) I wasn't trying to imply that the parsers themselves aren't working but that you might have to debug programs using these parsers more often as its not apparent that the issue lies in the input instead of the program having to handle it.

2) Pretentious would be if I liked the color blue and then claimed that "walls should be blue" is a robustness principle.

ChrisArchitect · on Oct 11, 2021

Surely something newer on this since 2016

Plenty of previous discussion:

2 years ago https://news.ycombinator.com/item?id=20724672

3 years ago https://news.ycombinator.com/item?id=16897061

5 years ago https://news.ycombinator.com/item?id=12796556

kstenerud · on Oct 11, 2021

Safety and security are two big reasons why I developed Concise Encoding [1]. The computing and networking landscape today is MUCH more hostile compared to the JSON and XML heyday (with state actors and organized crime now getting in on the action), and it's time to retire them in favor of more secure and predictable formats that are also human-friendly.

[1] https://concise-encoding.org

eatonphil · on Oct 11, 2021

On the topic of JSON and minefields, what is your experience using JSON5? I'm considered moving to it for configuration files in an application I'm building.

lifthrasiir · on Oct 11, 2021

JSON5 mostly extends JSON's syntax, not its data model (it still doesn't outlaw duplicate object keys for example).

eatonphil · on Oct 11, 2021

This article is about parsing though so I am mostly asking about that.

lifthrasiir · on Oct 11, 2021

"Parsing" can mean wildly different things indeed. In this case though the article does check duplicate keys and numeric range & precision, so the data model is definitely in question.

AnthonBerg · on Oct 11, 2021

I find it much more pleasant to work with.

bob1029 · on Oct 11, 2021

For JSON contracts that are of any reasonable level of complexity (many levels of nesting), I prefer to have the same serializer & strong type system on both ends. A common use case here is serializing dynamic business types as JSON in and out of blob columns.

For what its worth, I have had maybe 2 hours total worth of struggles with JSON serialization over the last as many years, and we use it for pretty much everything. The biggest pain point for us is implementation-specific. Refactors of namespaces & dependent assembly names can cause trouble with polymorphic serialization (which can absolutely be secure if used responsibly). The only other pain point experienced is with regard to nullable vs non-nullable fields - again only a problem after a change takes place relative to pre-existing JSON documents.

cryptica · on Oct 11, 2021

It seems like all the 'problematic' edge cases mentioned can easily be dealt with using runtime type validation and are not the concern of an interchange format like JSON which is (and should be) optimized for maximum flexibility/interoperability. The server should not trust the data inside JSON objects sent by remote clients; there should be some kind of runtime type validation; it's expected that different programming languages might interpret the content of the same JSON object slightly differently for certain unusual edge cases. IMO, as an interchange format, JSON should be allowed to evolve over time; JavaScript has already proven this model to be effective; you can always add features and add flexibility but cannot take away features or remove flexibility.

jmull · on Oct 11, 2021

This is a big problem for people writing general JSON processors/parsers.

But it's not too bad an issue for specific applications/systems using JSON...

They need their JSON to be in the correct form to represent their "business objects" (or whatever you want to call your application or system-specific data types), which is already a very restricted subset of JSON that a standard can't help with, and only rarely need to bump up against the oddness JSON has around the edges.

(Not that people won't bump up against these issues more than they really need to -- e.g, I recently saw someone trying to rely on multiple keys to mean something specific, which is a fun/interesting idea but is crazy to want to put into production... but good specs won't stop people from wanting to do crazy things.)

jillesvangurp · on Oct 12, 2021

Use a decent parser and most of those issues go away. If you do go down the path of writing your own parser, do your homework, obviously. Probably not worth the trouble unless you know what you are getting into. But if you do, maybe run some of the tests that people have produced over the years to produce reports like this.

tinus_hn · on Oct 11, 2021

It isn’t too bad, considering how many people on this site are still pining for the worst format in the world: csv

EdwardDiego · on Oct 11, 2021

Far easier than parsing Markdown at least.

benibela · on Oct 11, 2021

That is why I maintain my own JSON parser. First I started with the parse from FreePascal's standard library. Then I ran test cases on it, and there were lots of issues I had to patch.

First it was accepting all kinds of numbers, so I rewrote it to only accept the numbers from the spec

Then it was removing invalid \u escapes, while I needed it to replace them with U+FFFD.

Then I needed the unchanged input. Besides the test cases from the article, I ran test cases from the W3C XPath test suite. The W3C has a very odd understanding of JSON. Besides the normal numbers and Unicode U+FFFD replacement, the JSON parser must be able to parse it unchanged. That means, if the input number is like 100 or 1e2, the parser must be able to return that as string "100" or "1e2". Those are different numbers. And there must be a user defined replacement of invalid \u, like you set the replacement to identity and the input is "\uDEAD\u002D\udead", then the parser must parse that as "\uDEAD-\udead" while keeping the case.

Decabytes · on Oct 11, 2021

I’m a data scientist so I work with JSON and csv all the time. It’s amazing how the back bone of data serialization and reporting are so ambiguous.

But I wonder if I’m part of the probably. Know one notices all the inconsistencies because so much of my job is ironing it out.

Ginden · on Oct 12, 2021

How many of these inputs can be generated by JSON-generating libraries?

How many of these inputs can be generated by JSON-generating libraries on accident? 1? 2?

How many of these inputs can lead to exploitable vulnerabilities? 1 (duplicated keys handling inconsistencies)?

dlsa · on Oct 11, 2021

So many standards, for sure. But... parsing json is actually simple enough. You require those who send you data to comply with specific libraries during export and import. If they send a file which can't be imported then they sent a corrupted file. Bonus if you lock the version. Be as specific as you need to be.

There are people who will quibble around "there are thousands of libraries". No there aren't. There's just the N you support.

We specify all sorts of details for other aspects of computing. Why wouldn't we specify the data format as well? Change control / configuration management are very useful.

This is how you reduce pointless complexity. Nip it in the bud as early as possible.

EDIT: Not sure why people disagree with this comment. This is basic data management. Are people really asserting that we are NOT allowed to set a minimum standard? This is also called "setting boundaries".

belter · on Oct 11, 2021

Previous discussion:

2016: https://news.ycombinator.com/item?id=12796556

2018: https://news.ycombinator.com/item?id=16897061

2019: https://news.ycombinator.com/item?id=20724672

qualudeheart · on Oct 11, 2021

Can copilot parse json?

Ygg2 · on Oct 11, 2021

[flagged]

GuB-42 · on Oct 11, 2021

Sometimes, articles are reposted because nothing changed significantly.

We sometimes get articles from the 19th century that are still relevant today, and it is interesting to see our ancestors perspective on the problem, and an old article on a problem we still have today is a good indication that there is no easy fix and it won't go away anytime soon.

Ygg2 · on Oct 11, 2021

Ok, but nothing points this is still relevant. Were tests re-run or something? Last update was 3 years ago.

coldtea · on Oct 11, 2021

We don't need special tests and metrics to point us that this is still relevant. We know it is, and no, nothing has changed since.

Beyond this particular case, this is a social link-voting website. If people submit and vote for an older article, it will be in the front page, end of story. Doesn't matter if it still holds or not - it's enough that people found it still interesting to submit and upvote. Some of the better discussions here happen the nth time the same post is on the front page (and some posts get on the top page 5-10 times in a decade). There's also a handy link on HN to show previous submissions of the same post, and the discussions that ensued.

Ygg2 · on Oct 11, 2021

Ok, but what is the proof nothing has changed? I just see a repost, of a really good article.

No test-suit runs, not even a glib message saying "It's year 2021, and nothing in test suite has changed".

kergonath · on Oct 11, 2021

It’s here because someone posted it, and enough people upvoted it, and not enough flagged it. There is no conspiracy, things just show up on the front page depending on what we collectively want to read.

And it is an interesting article, and I hadn’t read it before, so I upvoted it as well.

The rule of thumb is that de-posting is acceptable after ~1 year.

coldtea · on Oct 11, 2021

>Ok, but what is the proof nothing has changed?

It's the so-called experimental proof. We see it every day in practice.

(This is not a research lab).

IggleSniggle · on Oct 11, 2021

Reposting doesn’t have the same negative connotation here as it might other places. Sometimes valuable insights come from older works. Sometimes a conversation is worth having on HN with a contemporary context. If it ends up on front page, folks are finding it valuable to discuss.

Ygg2 · on Oct 11, 2021

I don't understand the reasoning behind it, and my genuine question has been flagged.

Why is it posted now, rather than a year or two years ago?

peterkelly · on Oct 11, 2021

Only the person who posted it knows exactly why they chose this particular day to do so. Probably they came across it in the course of their work and thought it might be useful/interesting to have a discussion about it. Apparently a lot of other people agreed because they upvoted it.

Revisiting old articles every few years can be useful, because the set of people participating in the discussion is likely to be substantially different from those who commented on it the last time it appeared. Those people may have insights or information to share that weren't discussed previously. Maybe some of the people commenting here hadn't even started working in the industry when the original post was made. And even people who were part of the first discussion may have new thoughts on the topic.

As an example, while I was certainly aware of and using JSON at the time this article was written, and recall reading it at the time, it is actually much more relevant to me now because I am working on a project that uses JSON in a different way than what I'd done previously. Specifically, we rely on the fact that the same piece of data will always serialize to the exact same string, which we hash and use for later comparisons. We've run into issues relating to interoperability problems between different implementations exactly because of the issues discussed in the article (and yes I'd prefer to use an alternative format for these reasons). This is something I wouldn't have been concerned with at all on previous projects where it didn't matter if there were slight differences. That's just a datapoint on why I personally have renewed interest in this discussion.

If you look at a lot of other sites like Reddit (where reposts of articles are often discouraged), you'll often find that on question-based subs, many different people ask the same few kinds of questions with extreme regularity. Subs like /r/askreddit and /r/relationships are full of examples. HN is as much about discussion (if not more) than the actual articles themselves, and as mentioned above such discussions can offer something new each time. So as long as they're not repeated too often, reposts can still have value.

IggleSniggle · on Oct 11, 2021

I suspect you were flagged for seeming to complain that the submission was inappropriate. See the HN guidelines: https://news.ycombinator.com/newsguidelines.html

Posted now because it is still relevant. Previous posting/discussion:

5 years ago: https://news.ycombinator.com/item?id=12796556

3 years ago: https://news.ycombinator.com/item?id=16897061

2 years ago: https://news.ycombinator.com/item?id=20724672

MrBuddyCasino · on Oct 11, 2021

Old articles that consistently do well are periodically re-submitted by accounts that want to farm karma points. Why, I don't know.

petee · on Oct 11, 2021

Ive resubmitted a post that i knew was on HN a few years prior, but forgot just how eye opening it was at the time and thought that surely some missed it and would appreciate it again. You're probably right some people do it for points, but more likely its just more people fascinated by a specific topic, so more submissions.

My repost in particular was the ASCII/binary 4-column representation rather than the typical 3-col, which makes a big difference in understanding

MrBuddyCasino · on Oct 11, 2021

I should have been more elaborate. Your case is of course fine. But I noticed accounts with a very high karma count that submit a huge volume of articles, but hardly any comments. I don’t know exactly what’s going on, but I find it slightly weird.

account-5 · on Oct 11, 2021

Cynical and for some maybe true. Or it could be people are genuinely posting something they have just read for the first time and thought others might find it interesting. I count myself in this group; posting and finding this interesting. I do search before posting though others may not.

kergonath · on Oct 11, 2021

Resubmitting is one thing, but if it is upvoted, it means that at least some people find or interesting or valuable. If some people do, then resubmitting it was useful.

MrBuddyCasino · on Oct 11, 2021

This erodes the quality over time, as the platform becomes less useful to regulars and long-time members, and thus dis-incentives investment and care. Every open platform without rules devolves into a porn distributor, so strictly going by „what’s popular“ is problematic.

kergonath · on Oct 11, 2021

I hear you, but that’s the whole point of HN. Nobody takes editorial decisions.

It’s great if your interests are aligned with the community and as long as the noise is manageable by the voting system.

Besides, there is quite a bit of turnover on the front page. A post that you find useless today will probably be gone from the front page before tomorrow.

dec0dedab0de · on Oct 11, 2021

Edit: After seeing the comments, I checked my REPL history, and the bad data was still there. luckily with the spaces displayed as \040. Turns out the offending space was \240, which makes more sense. Please disregard this comment.

If you're curious, the problem stems from stuffing JSON in the description field of an external system to do our own tagging. Someone (me) must have copy/pasted from a screwy source. We were pulling out our hair trying to figure out what was wrong with it, and I just stuck it in the REPL, and saw the offending character was a space. Manually deleted the extra space and it was fine. A quick google showed space was part of the convention, and we were like "woah that's weird how did we never stumble across that before." I am embarrassed that I posted this now. My only excuse is that I just got over Covid, so I'm going with that :-).

This was my original comment:

After about a decade of using JSON I just discovered the hard way that you can only have one space after the colon between a key and value. Atleast with the python JSON library.

tyingq · on Oct 11, 2021

  $ python -c 'import json,sys;print(json.dumps(json.loads(sys.argv[1])))' '{"a":     "b"}'
  {"a": "b"}

Maybe it's been fixed? That's Python 3.8.10. What version is it broken in?

lcrz · on Oct 11, 2021

    Python 2.7.16 (default, May  8 2021, 11:48:02) 
    >>> import json
    >>> json.loads('{"a"  :     "b"    }')
    {u'a': u'b'}

dec0dedab0de · on Oct 11, 2021

In case either of you check your threads page, it was just a stupid mistake on my part. See my edit. Thank you for correcting me.

lifthrasiir · on Oct 11, 2021

Which version of Python and/or JSON library? Since Python's built-in `json` module was first introduced in 2.6 and I can't see any evidence of this bug throughout the relevant code (either pure Python or C implementations).

dec0dedab0de · on Oct 11, 2021

See my edit, it was just a stupid mistake on my part. Thanks for pointing this out.

pythonthecware · on Oct 11, 2021

[flagged]

Eikon · on Oct 11, 2021

What is “a language for the web”?

pythonthecware · on Oct 11, 2021

Any language that decodes json without issue. I mean it’s _the_ data format of the web ain’t it?

lcrz · on Oct 11, 2021

[flagged]

pythonthecware · on Oct 11, 2021

[flagged]

sebzim4500 · on Oct 11, 2021

I don't see any evidence that python has issues decoding json.