In the year I used it, I never personally noticed it going down. That said, their SLA only promises that 99.9% of notifications in any calendar month are delivered within 5 minutes, and the penalty for missing that SLA is only a credit of 10% of that month's bill.
> Once an Incident is triggered, PagerDuty will deliver the First Responder Alert within the Notification Delivery Period for 99.9% of the notifications sent by PagerDuty for the Customer during any calendar month. The “Notification Delivery Period” is five (5) minutes and it is measured as the time it takes PagerDuty to deliver a First Responder Alert to telecommunication providers in accordance with the Service configuration and Contact Information.
> ...
> If PagerDuty fails to meet the SLA set forth herein, Customer may receive a service credit. Customer will be eligible for a credit toward future fees owed to PagerDuty for the PagerDuty Service. The Service Credit is calculated as ten percent (10%) of the fees paid for or attributable to the month when the alleged SLA breach occurred.

https://www.pagerduty.com/standard-service-level-agreement/
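To put those numbers in perspective, here's a quick back-of-the-envelope sketch. The alert volume and monthly bill below are made-up inputs; only the 99.9%, 5-minute, and 10% figures come from the SLA text.

```python
# Rough SLA math. The 99.9% / 5-minute / 10% figures are from the SLA above;
# the alert volume and monthly bill are hypothetical inputs.
monthly_alerts = 1_000          # hypothetical number of alerts in a month
monthly_bill = 5_000            # hypothetical monthly bill in dollars

sla_target = 0.999              # 99.9% of alerts delivered within 5 minutes
allowed_late = monthly_alerts * (1 - sla_target)
max_credit = 0.10 * monthly_bill

print(f"Alerts that can be >5 min late without breaching the SLA: {allowed_late:.0f}")
print(f"Maximum credit if the SLA is breached anyway: ${max_credit:.0f}")
# => roughly 1 late alert allowed, and at most a $500 credit for the month.
```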
Zillow is a publicly traded company and has been since July 2011. I don't know how they raised capital before then, but regardless, they're well past the VC stage.
This is a great point: Zillow has a P/E of ~170 and an ROE of 2.9%. It's not inaccurate to say that directly lending money to individual buyers is a better deal for investors.
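As a rough sanity check of that comparison (the direct-lending rate below is a made-up assumption; only the P/E and ROE come from the comment above):

```python
# Back-of-the-envelope comparison. P/E and ROE are the figures quoted above;
# the mortgage rate is a hypothetical stand-in for lending to buyers directly.
pe_ratio = 170
roe = 0.029

earnings_yield = 1 / pe_ratio            # earnings per dollar of share price
hypothetical_mortgage_rate = 0.04        # assumed yield from direct lending

print(f"Earnings yield at P/E of {pe_ratio}: {earnings_yield:.2%}")   # ~0.59%
print(f"Return on equity: {roe:.1%}")                                 # 2.9%
print(f"Hypothetical direct-lending yield: {hypothetical_mortgage_rate:.0%}")
```

By that crude measure, each dollar of Zillow stock "earns" well under a percent a year, while lending it out directly would earn a few percent.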
Sometimes there are limits to how high you can scale without overloading dependencies. For example, the database that the service accesses might only support n connections, and the service is already using approximately n connections.
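As a crude sketch of why that caps horizontal scaling (all of the numbers here are hypothetical):

```python
# Hypothetical numbers: if the database accepts at most n connections and each
# service instance holds its own pool, the instance count is capped no matter
# how much CPU you can throw at the service.
db_max_connections = 500        # the "n" the database supports (hypothetical)
pool_size_per_instance = 25     # connections held by each service instance

max_instances = db_max_connections // pool_size_per_instance
print(f"Scaling stops at ~{max_instances} instances")   # 20, regardless of load
```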
If the initial cost is x engineer hours and each hour costs $y, that implies the cost of the project is `x * $y + maintenance costs [of this system over the simpler, original approach]`
Is the increased speed worth `x * $y + maintenance costs`?
Could it pay off? Potentially, yes. Will it actually pay off? I don't know; there are far too many variables to do more than wildly speculate. But I hope Netflix did the accounting at some point along the way to find out.
Not every improvement is worth building, since the initial investment might grossly overshadow any potential gains (as a purely hypothetical example, you shouldn't spend $1 million up front to save $1/day for the next 30 years).
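Plugging the purely hypothetical numbers above into that `x * $y` framing (ignoring discounting and maintenance costs to keep it simple):

```python
# The $1 million / $1 per day / 30 years figures are from the hypothetical
# above; discounting and maintenance are ignored to keep the sketch simple.
upfront_cost = 1_000_000
savings_per_day = 1
years = 30

total_savings = savings_per_day * 365 * years
print(f"Total saved over {years} years: ${total_savings:,}")                  # $10,950
print(f"Shortfall vs. the upfront cost: ${upfront_cost - total_savings:,}")   # $989,050
```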
At $PRIOR_JOB, the full E2E tests always felt close to useless: for every bug they successfully caught, it felt like there were ~20 false positives. At that point, everyone (myself included) blamed the tests and just repeatedly reran them until they eventually passed. Every single failure would halt the pipeline for anywhere from 5 minutes (when rerunning the failed test showed it was just flaky) up to multiple hours, since everyone would rather try to diagnose/hotfix the issue than revert their code to unblock the pipeline.
With that being said, a full run of the E2E suite at $PRIOR_JOB took very, very low double-digit minutes, so it wasn't that expensive. Rerunning a handful of failed tests took single-digit minutes, so it wasn't too terrible.
I was in a similar situation, and the VP of engineering banned the practice of rerunning failed tests, so flaky tests caused everybody pain. In less than 8 weeks, the false positive rate dropped by about 3 orders of magnitude. There's a strong tendency to treat tests as a hurdle to get over rather than as a first-class part of the development process.
I imagine this would just turn into everyone inserting 10-second pauses into the tests that fail. Which works, but now your suite's run time doubles. Actually turning nondeterministic tests into deterministic ones is... hard. Really hard in some cases. Many devs don't even understand how to get there, even after years of E2E experience.
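For what it's worth, one common step in that direction is to poll for the condition the test actually cares about instead of sleeping for a fixed time. A minimal sketch, where `wait_until` is a hypothetical helper and `page_has_loaded` is a stand-in for whatever the test is really waiting on:

```python
import time

# Flaky version: hope 10 seconds is always enough (and always pay for it).
# time.sleep(10)
# assert page_has_loaded()

def wait_until(condition, timeout=10.0, interval=0.2):
    """Poll `condition` until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False

# Usage: finish as soon as the condition holds, fail with a clear reason otherwise.
# assert wait_until(page_has_loaded), "page never finished loading"
```

It's still a timeout under the hood, but the test stops waiting the moment the condition is true, and a failure says what was actually being waited for.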
One place I worked, the E2E suite took a full hour to run. Everyone reran the tests. Merges took a full day in many cases. Management tried to force people to fix broken tests. But they also required new tests on new features. So it was a constant treadmill. There was basically a full mutiny by the end and the company killed off their entire E2E suite.
If people just started throwing random sleeps into tests, I think management would shit a brick. Do people throw random sleeps into production code to fix bugs where you work as well?
Not GP, and fortunately not often, but I have seen that done to overcome race conditions. I pushed for it to be corrected by using a proper design. That was a stupidly hard fight, though.
My pet peeve is people sprinkling C's "volatile" keyword in places. Since doing so inhibits many optimizations, it changes the timing and can make race conditions appear to go away.
Yep. Lots of effective ways to paper over issues without actually resolving them, and often disguising them so that resolution becomes nearly impossible later.
Worse, things like the introduced sleeps in some of these systems look legit. There are reasonable times to introduce a timed delay into your program (a 3rd party API has a rate limit of 1 request per second or 10 per 30 seconds or whatever). Depending on how you introduce these extra sleeps, they can look like they satisfy a valid requirement, when the reality is that they exist to cover up the absence of things like proper use of locks/mutexes or other synchronization.
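A tiny illustration in plain Python (hypothetical code, not from any of the systems being discussed): the sleep version usually "works" by changing the timing, while the event version actually removes the race.

```python
import threading
import time

resource = {}                        # shared state filled in by another thread
resource_ready = threading.Event()

def initializer():
    time.sleep(0.5)                  # stand-in for slow setup work
    resource["conn"] = "connected"
    resource_ready.set()

def consumer_papered_over():
    # Papering over the race: hope the initializer is always done in 2 seconds.
    time.sleep(2)
    print(resource["conn"])          # KeyError whenever setup takes longer

def consumer_fixed():
    # Actually synchronizing: wait on the event the initializer signals.
    if not resource_ready.wait(timeout=10):
        raise TimeoutError("resource never became ready")
    print(resource["conn"])

threading.Thread(target=initializer).start()
consumer_fixed()                     # prints "connected" without guessing at timing
```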
Right, if you have flaky tests there are 3 acceptable responses:
1. Fix the test
2. Fix the code that is being tested
3. Say "well, we don't need this software to be reliable anyway, so let's just stop running tests"
But many places seem to adopt hidden option #4: "run the tests and ignore failures"
A related issue is dialing the tunables for warnings up to 11 and then not reading any of the warnings. I once saw a case where the build generated 1000s of warnings. I found a bug, figured "this would be flagged as a warning even with relatively low warning settings," and sure enough it was.
Obviously fixing warnings is good, but if they had just lowered the warning setting to something reasonable, they would have had maybe 10 warnings total, one of which was a bug. That is a lot more useful than 1000s of warnings, at least one of which was a bug.
Option #4 is just option #3 but keeping the costs of running tests you ignore.
You're right about excessive warnings, though it's not always that simple. Running `gcc -Wall` used to be considered madness, and if you turned it on now on a codebase that has been around a while and not been kept clean, you'd drown in messages. The key is to turn it on from the very start and fix things when there are 10 warnings instead of 1000.
This decay happens with test suites, too. One or two tests start to fail, and instead of fixing them, people ignore the failures. A bit later it's five tests, then 10, and pretty soon the programmers see the tests as broken instead of questioning the habit of ignoring failures that let things get to the point where so many tests are failing.
The fix for both situations is similar, though: dial down the {warning strictness|number of tests run} until you get a clean {build|test run}, then enable them one by one in order of how easy they are to fix.
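For the test half of that, assuming a pytest suite (the test names below are hypothetical), one way to run the ratchet is to mark the known-bad tests so the suite goes green immediately, then un-mark them one at a time as they're fixed:

```python
import pytest

# Known-bad tests get an explicit marker with a reason, so the suite is green
# today and the remaining debt stays visible instead of being silently rerun.
@pytest.mark.xfail(reason="flaky: depends on external service timing")
def test_checkout_flow():            # hypothetical test name
    ...

# As each test is fixed, its marker is deleted; the ratchet only moves one way.
def test_login_flow():               # hypothetical test name, already fixed
    assert True
```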
Obviously the E2E tests were really badly implemented. Implementing solid E2E tests is a skill that needs to be learned like any other software development skill. Most developers don’t know how to do it well.
It's still the case in most (all?) of the US. However, it's generally pretty well accepted that the documentation requirements are lax and that parents can opt out for a variety of reasons.
When I was in school (both K-12 and college), my schools just accepted a paper document (trivially forgeable) that stated what vaccines I had received and when.
Checked exceptions are horrible to work with, but they require the programmer to do something about them at the immediate call site, whereas unchecked exceptions can just bubble up to an unexpected point in the stack, which is arguably worse.
On the other hand, it teaches rapid prototyping and just trying out theories since each theory is fairly cheap to test out.
There's a somewhat popular TED talk on how kindergarteners frequently outcompete adults in the marshmallow challenge (tl;dr: build the tallest structure possible using assorted materials, with a marshmallow on top) because, among other reasons, they're willing to just try out different approaches and fail fast rather than take a long time to think up an approach that ultimately fails.