
Have you worked with Go codebases before?

Standard practice is to wrap all your entrypoints - the start of a web request, the moment you begin to run a job - in a defer recover() which will catch panics.

Sadly, recover won’t apply to work that is subsequently run in its own goroutine. That means even if your entrypoints recover, if anything further down the call stack does a go func() and that function panics, it will bring down the entire process.

We were aware of this but ended up including one by accident anyway. It’s very sad to me that you can’t apply a global process handler to ensure this type of thing doesn’t happen, but to my knowledge that isn’t possible.

Worth mentioning Go doesn’t really encourage ‘frameworks’, and most Go apps compose together various libraries to make their app, rather than using something that packages everything together. Failures like this are an obvious downside to not having the well reviewed nuts-and-bolts-included framework to handle this for you.



> Have you worked with Go codebases before?

Several but ...

> Standard practice is to wrap all your entrypoints [...] in a defer recover()

... I've never seen that. Is there some literature pointing to this as best practice?


You’ll find some libraries do this for you, such as HTTP servers.

They do this because if your server code makes a mistake such as accessing a nil pointer, a segfault or panic would bring down the entire process. That’s why you want to recover(), to avoid your process dying.


I mean, net/http does it. That's the standard library.

A Go convention that I've made up, or that is perhaps a real one, is to always look at the standard library for guidance on how to write Go. net/http suffers a little bit from being a very early library and thus you might not want to emulate its API surface. But in general, the Go team thought "you know what every HTTP handler in Go needs? recovery from panics" and that is worth some weight when considering your own design.

I would personally recommend fuzz testing your code, including HTTP handlers. The more panics you find in development, the fewer customers you lose from panics in production. Remember that recovering from a panic in an HTTP handler still means your user's request didn't get processed. They are not happy about that, even if your program can still service other users.


> They do this because if your server code makes a mistake

This is neither good nor best practice.

My take:

- Know that there are simple-mistake panics, and internal-state-just-went-bonkers panics. For the latter, you can guess the boundary of impact (one request, one connection, one job, one userID, one process, ...) but exit(1) is much more reliable than guessing.

- Tests can easily catch simple mistakes like accessing a nil pointer.

- Know that the kind of programmer who does not care about simple tests, will also not care about concurrency bugs which introduce the more dangerous types of state corruption. This corruption would likely be not limited to a single request.

Good luck to frameworks who assume that nothing bad would happen if they ignore a panic and continue serving more requests.


> - Tests can easily catch simple mistakes like accessing a nil pointer.

No, they can't. This is exactly what is hard to test for: complex state that can occur by some combination of many variables on many values.


Respectfully, this is a really bad take. And one that flies in the face of the Go stdlib, given the default HTTP server will catch panics for you.


To me, it's unclear what the best solution is here. Other languages solve this differently with tradeoffs, e.g. Java defaults to threads silently dying when an exception isn't caught. Your program will continue to run, but it's probably in some undefined state at that point. There are mechanisms for propagating exceptions elsewhere, but they have to be explicitly set up (like in Go). You can set a default uncaught exception handler, but that's effectively a global variable with all the subsequent "fun", and the uncaught exception handler has to know how to clean up and restore state if the exception was thrown from anywhere, which seems generally difficult to do correctly.


Erlang/Elixir have a great story here: “let it crash”. Because each slice of activity in an application is wrapped in its own process (think single threaded loop but you can run a million at a time, almost free to create and destroy), if it crashes it only takes down that web request/process. Recovery mechanisms are built in to get back to a known good state.


I haven't used Erlang extensively, what happens if you crash in the middle of e.g. holding a lock, or during a coordinated dance with other processes?

My concern isn't really "does the program keep running?", it's "does the program keep running correctly?".


That sort of problem is beyond the scope of the runtime in any case, isn't it? In either of the examples you offered (holding a lock, coordinating with other processes), there must be timeouts enforced by the lock or the other processes so that, if something goes wrong, the system isn't waiting for a crashed process to continue the work.

Erlang/Elixir do make this pretty easy to manage, including the scenario where the process does recover by reverting back to a known good state. It won't do it for you automatically, but it exposes enough surface area to make problems like that solvable without reaching for a lot of extra tools - it's built into the runtime.


> That sort of problem is beyond the scope of the runtime in any case, isn't it?

Yes, which is why Go's outright crashing also makes sense to me...both Go and Erlang's behavior seem conceptually the same, with some architectural tradeoffs. It's not really that different for a process to die and restart. If some shared resource reaches an undefined state, then you have to kill everything and reset your state anyway. I suppose Go's behavior lends itself better to "microservices", whereas Erlang's behavior is better suited for "monolith" processes that do a lot of different things.

IMO either of these is better than Java's default behavior of silently swallowing the exception and allowing the thread to quietly die.


The key to Erlang error handling is that crashes should bubble up to a high level which then restarts everything below it in a known good state.

If you're in a coordinated dance with another process you link to that process. If a process you're linked to crashes then you crash too. There's no way to block yourself in Erlang such that you can't be told to crash.

After you crash your supervisor might restart you, if that's what you configured. Or you might give up on your specific task.


Pm2 for nodejs is the same.


> when an exception isn't caught

Not catching all exceptions is a glaring P0 bug.


You should almost never catch all exceptions, i.e. Throwable on the JVM. That is one of the few things that Scala really got right. The `catch NonFatal(e) =>` idiom does that nicely. It will catch all throwables except for a selected set of special cases, e.g. OutOfMemoryError and all the other VirtualMachineErrors. Catching those in a framework only extends the time until the crash that follows a serious issue. Crashing early is often beneficial in such a situation. Together with a process watchdog (systemd, Kubernetes, dockerd, whatever), crashing early increases the uptime.


Node.js changed behaviors over time.


I probably misunderstand what you wrote. Because I think a (wrapped) panic will only result in crashing the one request that caused it?

For example Gin provides a middleware wrapper for handling panics: CustomRecoveryWithWriter returns a middleware for a given writer that recovers from any panics and calls the provided handle func to handle it.

(https://github.com/gin-gonic/gin/blob/master/recovery.go)


The post-mortem we published about this outage explains the details of how this crashed the binary.

It gives some code examples and explanations that should clear this up: https://incident.io/blog/intermittent-downtime#mitigation-1-...

^ link should go to the relevant section


Will have a look, thanks!



