
Yeah, I was wondering this as well. At the very least, this appears to be an Element requirement that was just enabled by a Matrix protocol update, so moving would be possible, but afaik Element is extremely popular as far as Matrix goes.

What exactly does this entail? I'm willing to be charitable in assuming that their use of "verify" isn't the modern usage of "give us your ID!" but I'm not enmeshed enough in the ecosystem anymore to know.

Respectfully, not even close. Verification means that when I sign in from a new device, I use an existing device or a second passphrase (either-or) to confirm that yes, it is me on both devices. I never have to reveal my ID, name, phone number, or email address to anyone. Not to Element, the Matrix Foundation, or the person running my home server where all my [encrypted] messages live.

My understanding is that there are two different types of verification.

Self-verification means that any new secondary devices you log into your account with will need to be verified by an existing login by way of an automatic popup that asks if you trust the device. It used to just be a Yes/No button but I think now they've added QR codes and/or emoji matching.

The other kind is verification between two different people: when starting a direct message conversation, for example, you might get the same emoji-matching window so you can verify each other.
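
(For the curious: the emoji matching is just a human-friendly rendering of a short authentication string both devices derive from their key exchange. A rough Rust sketch that follows the general shape of Matrix's SAS emoji method, but is not the exact spec, and with the key agreement itself elided:)

```rust
/// Both devices end up with the same few bytes after key agreement (not
/// shown here). Take the first 42 bits as seven 6-bit numbers; each one
/// indexes a fixed table of 64 emoji, and the humans just check that both
/// screens show the same sequence.
fn emoji_indices(sas_bytes: [u8; 6]) -> [usize; 7] {
    let mut acc: u64 = 0;
    for b in sas_bytes {
        acc = (acc << 8) | u64::from(b); // pack 48 bits
    }
    let mut out = [0usize; 7];
    for (i, slot) in out.iter_mut().enumerate() {
        // take 6-bit groups from the top; the last 6 bits go unused
        *slot = ((acc >> (42 - 6 * i)) & 0x3f) as usize;
    }
    out
}
```

If someone is sitting in the middle, the two sides derive different bytes and the emoji simply won't match.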


Yeah, IMO "verify" was a poor choice of wording for what this is. It has nothing to do with remote attestation or any other form of Treacherous Computing, and it has nothing to do with your real-life identity. It's just "go on your old device and confirm that the new device is really yours."

If you don't mind reading an essay, here is mine from the same discussion: https://news.ycombinator.com/item?id=45989744

What happens to it up the callstack? Say they propagated it up the stack with `?`. It has to get handled somewhere. If you don't introduce any logic to handle the duplicate databases, what else are you going to do when the types don't match up besides `unwrap`ping, or maybe emitting a slightly better error message? You could maybe ignore that module's error for that request, but if it were a service more critical than bot mitigation you'd still have the same symptom of getting 500'd.
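
To make that concrete, here's a rough Rust sketch of the dilemma; the names and the feature cap are invented, not Cloudflare's actual code:

```rust
#[derive(Debug)]
enum ConfigError {
    Malformed,
    TooManyFeatures(usize),
}

struct RuleSet;

const MAX_FEATURES: usize = 200; // illustrative limit

fn parse_features(raw: &str) -> Result<Vec<f64>, ConfigError> {
    let feats: Vec<f64> = raw
        .split_whitespace()
        .map(|s| s.parse::<f64>())
        .collect::<Result<_, _>>()
        .map_err(|_| ConfigError::Malformed)?;
    if feats.len() > MAX_FEATURES {
        return Err(ConfigError::TooManyFeatures(feats.len()));
    }
    Ok(feats)
}

fn load_rules(raw: &str) -> Result<RuleSet, ConfigError> {
    let _features = parse_features(raw)?; // bubbled up with `?` ...
    Ok(RuleSet)
}

fn handle_request(raw_definitions: &str) -> Result<String, u16> {
    // ... but somebody still has to decide what a failed load means.
    // Mapping it to a 500 gives the same user-visible symptom as an
    // unwrap, just with a nicer log line.
    let _rules = load_rules(raw_definitions).map_err(|e| {
        eprintln!("bot module unavailable: {e:?}");
        500u16
    })?;
    Ok("request scored".to_string())
}
```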

> What happens to it up the callstack?

as they say in the post, these files get generated every 5 minutes and rolled out across their fleet.

so in this case, the thing farther up the callstack is a "watch for updated files and ingest them" component.

that component, when it receives the error, can simply continue using the existing file it loaded 5 minutes earlier.

and then it can increment a Prometheus metric (or similar) representing "count of errors from attempting to load the definition file". that metric should be zero in normal conditions, so it's easy to write an alert rule to notify the appropriate team that the definitions are broken in some way.

that's not a complete solution - in particular it doesn't necessarily solve the problem of needing to scale up the fleet, because freshly-started instances won't have a "previous good" definition file loaded. but it does allow for the existing instances to fail gracefully into a degraded state.

in my experience, on a large enough system, "this could never happen, so if it does it's fine to just crash" is almost always better served by a metric for "count of how many times a thing that could never happen has happened" and a corresponding "that should happen zero times" alert rule.
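
concretely, a rough Rust sketch of that shape (types invented, and a plain atomic counter standing in for the Prometheus metric):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// stand-in for a real Prometheus counter; alert when it is ever non-zero
static DEFINITION_LOAD_ERRORS: AtomicU64 = AtomicU64::new(0);

struct Definitions {/* parsed feature data */}

fn parse_definitions(path: &str) -> Result<Definitions, String> {
    let _raw = std::fs::read_to_string(path).map_err(|e| e.to_string())?;
    // ... validate feature count etc., returning Err instead of panicking ...
    Ok(Definitions {})
}

// called every 5 minutes by the "watch for updated files" component
fn refresh(current: &mut Definitions, path: &str) {
    match parse_definitions(path) {
        Ok(fresh) => *current = fresh,
        Err(err) => {
            // degrade gracefully: keep serving the file loaded 5 minutes ago
            DEFINITION_LOAD_ERRORS.fetch_add(1, Ordering::Relaxed);
            eprintln!("definition reload failed, keeping previous: {err}");
        }
    }
}
```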


Given that the bug was elsewhere in the system (the config file parser spuriously failed), it’s hard to justify much of what you suggested.

Panics should be logged, and probably grouped by stack trace for things like Prometheus (outside of the process). That handles all sorts of panic scenarios, including kernel bugs and hardware errors, which are common at Cloudflare scale.

Similarly, mitigating by having rapid restart with backoff outside the process covers far more failure scenarios with far less complexity.

One important scenario your approach misses is “the watch config file endpoint fell over”, which probably would have happened in this outage if 100% of servers went back to watching all of a sudden.

Sure, you could add an error handler for that too, and for Prometheus being slow, and an infinite number of other things. Or, you could just move process management and reporting out of process.
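
For what it’s worth, the out-of-process piece can stay tiny. A rough sketch, with the worker binary name and backoff constants as placeholders:

```rust
use std::process::Command;
use std::thread::sleep;
use std::time::Duration;

// minimal supervisor: restart the worker with exponential backoff rather
// than teaching the worker to handle every possible failure internally
fn main() {
    let base = Duration::from_millis(100);
    let mut backoff = base;
    loop {
        match Command::new("./bot-management-worker").status() {
            Ok(status) if status.success() => backoff = base,
            result => {
                eprintln!("worker died ({result:?}); restarting in {backoff:?}");
                sleep(backoff);
                backoff = (backoff * 2).min(Duration::from_secs(30));
            }
        }
    }
}
```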


Writing bad code that doesn’t handle errors and doesn’t correctly model your actual runtime invariants doesn’t simplify anything other than the amount of thought you have to put into writing the code — because you’re writing broken code.

The solution to this problem wasn’t restarting the failing process. It was correctly modeling the failure case, so that the type system forces you to handle it correctly.


The way I’ve seen this handled on a few older systems is that they always keep the previous configuration around so the service can switch back. The logic is something like this (sketched in code below):

1. At startup, load the last known good config.

2. When signaled, load the new config.

3. When that passes validation, update the last-known-good pointer to the new version.

That way a failure like this one becomes recoverable, on the theory that stale config is better than the service staying down. One variant also recorded the last-tried config version so it wouldn’t even attempt to parse the latest one until it was changed again.

For Cloudflare, it’d be tempting to have step #3 be after 5 minutes or so to catch stuff which crashes soon but not instantly.
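
Roughly, as a sketch in Rust (all the types here are invented for illustration):

```rust
struct Config {/* parsed settings */}

fn load_and_validate(path: &str) -> Result<Config, String> {
    let _raw = std::fs::read(path).map_err(|e| e.to_string())?;
    // ... parse and sanity-check before returning Ok ...
    Ok(Config {})
}

struct ConfigManager {
    active: Config,    // what the service is running on right now
    last_good: String, // path of the last config that passed validation
}

impl ConfigManager {
    // step 1: at startup, come up on the last known good config
    fn start(last_good: String) -> Result<Self, String> {
        let active = load_and_validate(&last_good)?;
        Ok(ConfigManager { active, last_good })
    }

    // step 2: when signaled, try the new config
    fn reload(&mut self, candidate: &str) {
        match load_and_validate(candidate) {
            // step 3: only after validation does the candidate become the
            // new last-known-good (a variant delays this promotion by a few
            // minutes to catch configs that crash soon but not instantly)
            Ok(cfg) => {
                self.active = cfg;
                self.last_good = candidate.to_string();
            }
            // stale config beats a dead service: keep running on `active`
            Err(e) => eprintln!("rejected config {candidate}: {e}"),
        }
    }
}
```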


Presumably you kick up the error to a level that says “if parsing new config fails, keep the old config”

The config file subsystem was where the bug lived, not the code with the unwrap, so this sort of change is a special case of “make the unwrap never fail and then fix the API so it is not needed”.

Yeah, see, that's what I mean.

This is incredibly annoying. I've been trying to fix a deployment action on GitHub for the past bit, so my entire workflow for today has been push, wait, check... push, wait, check... et cetera.

You should really check out (pun intended) `act` https://github.com/nektos/act

I’ve tried! The most I’ve ever gotten was an inefficient way to fill some disk space and an ‘act’ that didn’t work :-)

A friend of mine was able to get through a few minutes ago, apparently. Everyone else I know is still fatal'ing.

The figures in this article are really great. How were they made? If I were to try and recreate them I might render things individually and then lay them out in Illustrator to get that 3D isometric look, but I assume there's a better way.

Completely unrelated, I trust.


I'm bound to get downvoted here, but I ran this by my own local model.

> No One Understands Software Because No One Understands Time

> All Programming Languages Converge to English Eventually

> The Best Database Is Just Two People Talking

> Stop Writing Code. Start Legislating Software

And my personal favorite:

> AI Safety Is Just the New Gluten-Free


It’s depressing how good these would be at getting clicks. Perhaps all article titles should be banned!


I would totally believe all of these to be real headlines if I saw them on Hacker News.


I know! Tempted to write them…


Please do!


This is awful. I work on epidemiological simulation software for a living, and while we've been running tons of simulations on a national/statewide scale, I had no idea it was that bad in Canada.


I really, really hope data can be recovered from this. I’ve read a bunch of the original sources, and such ancient C code would be especially interesting to study.

Very proud to have had this found at my University :-)

