Have you worked with Go codebases before? Standard practice is to wrap all your ...

zimpenfish · on April 24, 2023

> Have you worked with Go codebases before?

Several but ...

> Standard practice is to wrap all your entrypoints [...] in a defer recover()

... I've never seen that. Is there some literature pointing to this as best practice?

lawrjone · on April 24, 2023

You’ll find some libraries do this for you, such as HTTP servers.

They do this because if your server code makes a mistake such as accessing a nil pointer, a segfault or panic would bring down the entire process. That’s why you want to recover(): to avoid your process dying.
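A minimal sketch of the pattern described above, assuming a hypothetical helper named `safely` and a made-up `user` type (neither comes from the thread). It wraps an entrypoint in a deferred `recover()` so that a nil pointer dereference surfaces as an error instead of ending the process:

```go
package main

import "fmt"

// user is a made-up type used only to trigger a nil pointer dereference.
type user struct{ name string }

// safely (hypothetical name) runs fn and converts a panic into an error
// instead of letting it terminate the process.
func safely(fn func()) (err error) {
	defer func() {
		// recover returns non-nil only if fn panicked on this goroutine.
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered from panic: %v", r)
		}
	}()
	fn()
	return nil
}

func main() {
	var u *user // nil: reading u.name panics at runtime
	err := safely(func() { fmt.Println(u.name) })
	fmt.Println("process still alive, err =", err)
}
```

The wrapper only keeps the process up; the work inside `fn` is still lost, and any state it half-modified stays half-modified.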

jrockway · on April 24, 2023

I mean, net/http does it. That's the standard library.

A Go convention that I've made up, or that is perhaps a real one, is to always look at the standard library for guidance on how to write Go. net/http suffers a little bit from being a very early library and thus you might not want to emulate its API surface. But in general, the Go team thought "you know what every HTTP handler in Go needs? recovery from panics" and that is worth some weight when considering your own design.

I would personally recommend fuzz testing your code, including HTTP handlers. The more panics you find in development, the fewer customers you lose from panics in production. Remember that recovering from a panic in an HTTP handler still means your user's request didn't get processed. They are not happy about that, even if your program can still service other users.
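The net/http behaviour mentioned above can be observed with a small sketch. The routes `/boom` and `/ok` and the helpers `newServer` and `get` are assumptions made up for illustration. The server logs the panic to stderr and drops that one request, then keeps serving others:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// newServer (hypothetical) starts a test server with one handler that
// panics and one that works.
func newServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/boom", func(w http.ResponseWriter, r *http.Request) {
		panic("handler bug")
	})
	mux.HandleFunc("/ok", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok")
	})
	return httptest.NewServer(mux)
}

// get fetches url and returns the response body as a string.
func get(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

func main() {
	srv := newServer()
	defer srv.Close()

	// net/http recovers the panic, so only this request is affected.
	_, err := get(srv.URL + "/boom")
	fmt.Println("panicking request failed:", err != nil)

	// The process is still up and answers the next request.
	body, err := get(srv.URL + "/ok")
	fmt.Println(body, err)
}
```

As the comment says, the user who hit `/boom` still got nothing useful back; the recovery only protects everyone else.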

kubanczyk · on April 24, 2023

> They do this because if your server code makes a mistake

This is neither good nor best practice.

My take:

- Know that there are simple-mistake panics, and internal-state-just-went-bonkers panics. For the latter, you can guess the boundary of impact (one request, one connection, one job, one userID, one process, ...) but exit(1) is much more reliable than guessing.

- Tests can easily catch simple mistakes like accessing a nil pointer.

- Know that the kind of programmer who does not care about simple tests will also not care about concurrency bugs, which introduce the more dangerous types of state corruption. This corruption would likely not be limited to a single request.

Good luck to frameworks who assume that nothing bad would happen if they ignore a panic and continue serving more requests.

heywhatupboys · on April 24, 2023

> - Tests can easily catch simple mistakes like accessing a nil pointer.

No, they can't. This is exactly what is hard to test for: complex state that can occur through some combination of many variables and many values.

lawrjone · on April 24, 2023

Respectfully, this is a really bad take. And one that flies in the face of the Go stdlib, given that the stdlib HTTP server catches panics for you by default.

cle · on April 24, 2023

To me, it's unclear what the best solution is here. Other languages solve this differently with tradeoffs, e.g. Java defaults to threads silently dying when an exception isn't caught. Your program will continue to run, but it's probably in some undefined state at that point. There are mechanisms for propagating exceptions elsewhere, but they have to be explicitly set up (like in Go). You can set a default uncaught exception handler, but that's effectively a global variable with all the subsequent "fun", and the uncaught exception handler has to know how to clean up and restore state if the exception was thrown from anywhere, which seems generally difficult to do correctly.

brentjanderson · on April 24, 2023

Erlang/Elixir have a great story here: “let it crash”. Because each slice of activity in an application is wrapped in its own process (think single threaded loop but you can run a million at a time, almost free to create and destroy), if it crashes it only takes down that web request/process. Recovery mechanisms are built in to get back to a known good state.

cle · on April 24, 2023

I haven't used Erlang extensively, what happens if you crash in the middle of e.g. holding a lock, or during a coordinated dance with other processes?

My concern isn't really "does the program keep running?", it's "does the program keep running correctly?".

brentjanderson · on April 24, 2023

That sort of problem is beyond the scope of the runtime in any case, isn't it? In either of the examples you offered (holding a lock, coordinating with other processes), there must be timeouts enforced by the lock or the other processes so that, if something goes wrong, the system isn't waiting for a crashed process to continue the work.

Erlang/Elixir do make this pretty easy to manage, including the scenario where the process does recover by reverting back to a known good state. It won't do it for you automatically, but it exposes enough surface area to make problems like that solvable without reaching for a lot of extra tools - it's built into the runtime.

cle · on April 29, 2023

> That sort of problem is beyond the scope of the runtime in any case, isn't it?

Yes, which is why Go's outright crashing also makes sense to me...both Go and Erlang's behavior seem conceptually the same, with some architectural tradeoffs. It's not really that different for a process to die and restart. If some shared resource reaches an undefined state, then you have to kill everything and reset your state anyway. I suppose Go's behavior lends itself better to "microservices", whereas Erlang's behavior is better suited for "monolith" processes that do a lot of different things.

IMO either of these is better than Java's default behavior of silently swallowing the exception and allowing the thread to quietly die.

iudqnolq · on April 24, 2023

The key to Erlang error handling is that crashes should bubble up to a high level which then restarts everything below it in a known good state.

If you're in a coordinated dance with another process you link to that process. If a process you're linked to crashes then you crash too. There's no way to block yourself in Erlang such that you can't be told to crash.

After you crash your supervisor might restart you, if that's what you configured. Or you might give up on your specific task.
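For readers who think in Go, the restart-from-a-known-good-state idea can be loosely sketched as below. This is not how Erlang supervisors are implemented; `runOnce`, `supervise`, and `maxRestarts` are hypothetical names, and Go goroutines share memory, so they do not get the isolation that Erlang processes have.

```go
package main

import "fmt"

// runOnce (hypothetical) runs work once and reports a panic as an error.
func runOnce(work func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("worker crashed: %v", r)
		}
	}()
	return work()
}

// supervise (hypothetical) reruns work from scratch until it succeeds or
// maxRestarts restarts have been used, then gives up with the last error.
func supervise(maxRestarts int, work func() error) (restarts int, err error) {
	for {
		err = runOnce(work)
		if err == nil || restarts == maxRestarts {
			return restarts, err
		}
		restarts++
	}
}

func main() {
	attempts := 0
	restarts, err := supervise(3, func() error {
		attempts++
		if attempts < 3 {
			panic("transient failure")
		}
		return nil
	})
	fmt.Println(restarts, err) // prints "2 <nil>"
}
```

Giving up after a bounded number of restarts mirrors the "or you might give up on your specific task" branch; the limit of 3 is arbitrary.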

revskill · on April 24, 2023

PM2 for Node.js is the same.

dboreham · on April 24, 2023

> when an exception isn't caught

Not catching all exceptions is a glaring P0 bug.

funcDropShadow · on April 25, 2023

You should almost never catch all exceptions, i.e. Throwable on the JVM. That is one of the few things that Scala really got right. The `catch NonFatal(e) =>` idiom does that nicely. It catches all throwables except a selected set of special cases, e.g. OutOfMemoryError and all the other VirtualMachineErrors. Catching those in a framework only extends the time until a crash follows on a serious issue. Crashing early is often beneficial in such a situation. Together with a process watchdog (systemd, Kubernetes, dockerd, whatever), crashing early increases the uptime.

paulddraper · on April 25, 2023

Node.js changed behaviors over time.

Gys · on April 24, 2023

I probably misunderstand what you wrote. Because I think a (wrapped) panic will only result in crashing the one request that caused it?

For example Gin provides a middleware wrapper for handling panics: CustomRecoveryWithWriter returns a middleware for a given writer that recovers from any panics and calls the provided handle func to handle it.

(https://github.com/gin-gonic/gin/blob/master/recovery.go)

lawrjone · on April 24, 2023

The post-mortem we published about this outage explains the details of how this crashed the binary.

It gives some code examples and explanations that should clear this up: https://incident.io/blog/intermittent-downtime#mitigation-1-...

^ link should go to the relevant section

Gys · on April 24, 2023

Will have a look, thanks!