> Well-managed companies blame processes rather than people I feel like this jus...

kryogen1c · on Oct 4, 2021

> this just obfuscates the fact that individuals are ultimately responsible

in critical systems, you design for failure. if your organizational plan for personnel failure is that no one ever makes a mistake, that's a bad organization that will forever have problems.

this goes by many names, like the swiss cheese model[0]. its not that workers get to be irresponsible, but that individuals are responsible only for themselves, and the organization is the one responsible for itself.

[0] https://en.wikipedia.org/wiki/Swiss_cheese_model

Ansil849 · on Oct 4, 2021

> is that no one ever makes a mistake

This isn't what I'm saying, though. The thought I'm trying to express is that if no individual accountability is done, it allows employees who are not as good at their job (read: sloppy) to continue to exist in positions which could be better occupied by employees who are better at their job (read: more diligent).

The difference between having someone who always triple-checks every parameter they input, versus someone who never double-checks and just wings it. Sure, the person who triple-checks will make mistakes, but less than the other person. This is the issue I'm trying to get at.

zaat · on Oct 4, 2021

If someone is sloppy and not willing to change he should be shown the door, but not because he caused outage but because he is sloppy.

People who operate systems under fear tend to do stupid things like covering up innocent actions (deleting logs), keep information instead of sharing it etc. Very few can operate complex systems for long time without doing mistake. Organization where the spirit is "oh, outage, someone is going to pay for that" wiil never be attractive to good people, will have hard time adapting to changes and to adopt new tech.

tonfa · on Oct 4, 2021

> The difference between having someone who always triple-checks every parameter they input, versus someone who never double-checks and just wings it. Sure, the person who triple-checks will make mistakes, but less than the other person. This is the issue I'm trying to get at.

If you rely on someone triple-checking, you should improve your processes. You need better automation/rollback/automated testing to catch things. Eventually only intentional failure should be the issue (or you'll discover interesting new patterns that should be protected against)

whermans · on Oct 4, 2021

If there is an incident because an employee was sloppy, the fault lies with the hiring process, the evaluation process for this employee, or with the process that put four eyes on each implementation. The employee fucked up, they should be removed if they are not up to standards, but putting the blame on them does not prevent the same thing from happening in the future.

zaat · on Oct 4, 2021

If you'd think about it, it isn't very useful to find a person who is responsible. Suppose someone cause outage or harm, due to neglect or even bad intentions, either the system will be setup in a way that the person couldn't cause the outage or that in time it will be down. To build truly resilient system, especially on global scale, there should never be an option for a single person to bring down the whole system.

nerdawson · on Oct 4, 2021

By focusing on the process, lessons are learned and systems are put in place which leads to a cycle of improvement.

When individuals are blamed instead, a culture of fear sets in and people hide / cover up their mistakes. Everybody loses as a result.

alex_sf · on Oct 4, 2021

I don't think the comment you're replying to applies to your concern about subpar employees.

We blame processes instead of people because people are fallible. We've spent millenia trying to correct people, and it rarely works to a sufficient level. It's better to create a process that makes it harder for humans to screw up.

Ansil849 · on Oct 4, 2021

Yes, absolutely, people make mistakes. But the thought I was trying to convey is that some people make a lot more mistakes than others, and by not attributing individual fault these people are allowed to thrive at the cost of having less error-prone people in their position. For example, someone who triple-checks every parameter that they input, versus someone who has a habit of just skimming or not checking at all. Yes the triple-checker will make mistakes too, but way less than the person who puts less effort in.

mynameisvlad · on Oct 4, 2021

But that has nothing to do with blaming processes vs people.

If the process in place means that someone has to triple check their numbers to make sure they’re correct, then it’s a broken process. Because even that person who triple checks is one time going to be woken up at 2:30am and won’t triple check because they want sleep.

If the process lets you do something, then someone at some point in time, whether accidentally or maliciously, will cause that to happen. You can discipline that person, and they certainly won’t make the same mistake again, but what about their other 10 coworkers? Or the people on the 5 sister teams with similar access who didn’t even know the full details of what happened?

If you blame the process and make improvements to ensure that triple checking isn’t required, then nobody will get into the situation in the first place.

That is why you blame the process.

samhw · on Oct 4, 2021

Yeah, I've heard this view a hundred times on Twitter, and I wish it were true.

But sadly, there is no company which doesn't rely, at least at one point or another, on a human being typing an arbitrary command or value into a box.

You're really coming up against P=NP here. If you can build a system which can auto-validate or auto-generate everything, then that system doesn't really need humans to run at all. We just haven't reached that point yet.

Edit: Sorry, I just realised my wording might imply that P does actually equal NP. I have not in fact made that discovery. I meant it loosely to refer to the problem, and to suggest that auto-validating these things is at least not much harder than auto-executing them.

mynameisvlad · on Oct 4, 2021

I don’t think anyone ever claimed the process itself is perfect. If it were, we obviously would never have any issues.

To be explicit here, by blaming the process, you are discovering and fixing a known weakness in the process. What someone would need to triple check for now, wouldn’t be an issue once fixed. That isn’t to say that there aren’t any other problems, but it ensures that one issue won’t happen again, regardless of who the operator is.

If you have to triple check that value X is within some range, then that can easily be automated to ensure X can’t be outside of said range. Same for calculations between inputs.

To take the overly simplistic triple check example from before, said inputs that need to be triple checked are likely checked based on some rule set (otherwise the person themselves wouldn’t know if it was correct or not). Generally speaking, those rules can be encoded as part of the process.

What was before potentially “arbitrary input” now becomes an explicit set of inputs with safeguards in place for this case. The process became more robust, but is not infallible.

But if you were to blame people, the process still takes arbitrary input, the person who messed up will probably validate their inputs better but that speaks nothing of anyone else on the team, and two years down the line where nobody remembers the incident, the issue happens again because nothing really has changed.

samhw · on Oct 4, 2021

The issue is that this view always relies on stuff like "make people triple check everything".

- How does that relate to making a config change?

- How do you practically implement a system where someone has to triple check everything they do?

- How do you stop them just clicking 'confirm' three times?

- Why do you assume they will notice on the 2nd or 3rd check, rather than just thinking "well, I know I wrote it correctly, so I'll just click confirm"?

I don't think rules can always be encoded in the process, and I don't see how such rules will always be able to detect all errors, rather than only a subset of very obvious errors.

And that's only dealing with the simplest class of issues. What about a complex distributed systems problem? What about the engineer who doesn't make their system tolerant of Byzantine faults? How is any realistic 'process' going to prevent that?

This entire trope relies on the fundamental axiom that "for any individual action A, there is a process P which can prevent human error". I just don't see how that's true.

(If the statement were something like "good processes can eliminate whole classes of error, and reduce the likelihood of incidents", I'd be with you all the way. It's this Twitter trope of "if you have an incident, it's a priori your company's fault for not having a process to prevent it" which I find to be silly and not even nearly proven.)

ric2b · on Oct 4, 2021

> and allows subpar employees to continue existing at an organization when their position could be filled by a more qualified employee.

Not really, their incompetence is just noticed earlier at the review/testing stages instead of in production incidents.

If something reaches production that's no longer the fault of one person, it's the fault of the process and that's what you focus on.