> The AFR for 2020 dropped below 1% down to 0.93%. In 2019, it stood at 1.89%. That’s over a 50% drop year over year... In other words, whether a drive was old or new, or big or small, they performed well in our environment in 2020.
If every drive type, new and old, big and small, did better this year, maybe they changed something in their environment this year? Better cooling, different access patterns, etc.
If this change doesn't have an obvious root cause, I'd be interested in finding out what it is if I were Backblaze. It could be something they could optimize around even more.
I was wondering if the state of the world in 2020 might have dramatically changed their business / throughput / access patterns in a meaningful enough way to cause this dip.
I'm not sure if they have a measure of the disk utilization or read/write load along with the failure rate.
Disclaimer: I work at Backblaze, but mostly on the client that runs on desktops and laptops.
> I'm not sure if Backblaze has a measure of the disk utilization or read/write load along with the failure rate.
We publish the complete hard drive SMART stats for anybody to attempt these analyses. Most of us in Backblaze engineering get giddy with excitement when a new article comes out that looks at correlating SMART stats and failures. :-) For example, this article circulated widely at Backblaze a few days ago: https://datto.engineering/post/predicting-hard-drive-failure...
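If anyone wants to poke at the raw data themselves, here's a rough sketch of computing per-model AFR from the daily snapshot CSVs. This is not the official report methodology, and the column names assumed below (model, failure) should be double-checked against whichever quarter's files you download:

```python
# Rough sketch: estimate annualized failure rate (AFR) per drive model from
# Backblaze-style daily snapshot CSVs (one file per day, one row per drive).
# Assumed columns: "model" and "failure" (1 on the day a drive fails, else 0).
import glob

import pandas as pd

frames = [
    pd.read_csv(path, usecols=["model", "failure"])
    for path in glob.glob("data_Q4_2020/*.csv")  # hypothetical folder of daily CSVs
]
df = pd.concat(frames, ignore_index=True)

per_model = df.groupby("model").agg(
    drive_days=("failure", "size"),  # each row = one drive observed for one day
    failures=("failure", "sum"),
)
# AFR = failures per drive-year, expressed as a percentage
per_model["afr_pct"] = per_model["failures"] / (per_model["drive_days"] / 365) * 100
print(per_model.sort_values("drive_days", ascending=False).head(15))
```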
Wow- crazy to see people at Backblaze actually read that article!!
Thank you guys for putting out so much data and writing so much about your findings- it was HUGE in helping me come to conclusions about what's realistic to assume from SMART stats. Y'all are doing some really really cool stuff.
I anticipate these reports every year and have strong trust in the data - I want to make that clear - Backblaze has done a massive service to the entire industry by collecting and aggregating this kind of data.
I'm really super curious about the dip in errors over the past year :)
Whether intentional or not, it's also great word-of-mouth advertising. My preexisting experience with Backblaze's hard drive stats reporting definitely worked positively in their favor when I was looking for a new backup service.
> Whether intentional or not, it's also great word-of-mouth advertising.
Oh, this has really worked out for Backblaze and we know it.
The first time we published our drive failure rates (I think January of 2014?) a few people said, "Uh oh, now Backblaze will get sued by the drive manufacturers." And we cringed and waited. :-) But the lawsuit never came, in fact there were NO repercussions, only increased visibility. People who have never heard of our company before find the data interesting, and then they ask "hey, what does this company do to own this many drives?" And a few (like yourself) sign up for the service.
Existing customers seem to stick with us for a long time, and even recommend us to other friends and family from time to time. So one tech person who stumbles across these stats might ACTUALLY bring us 3 or 4 more customers over the next 5 years. That's real money to us.
Not only have the drive manufacturers not sued us, they are actually NICE to us beyond the scale of our actual drive purchases! In one amusing example, our drive stats were used in a lawsuit as evidence. To be clear Backblaze was not the plaintiff or the defendant in the court case, we had no skin in the game at all and didn't want to be involved, but our drive data (and internal emails) were subpoenaed to be entered into evidence. Before we were served, the drive manufacturer called us and apologized for the inconvenience and made it clear they had no beef with us. Yes, a multi-national company that makes BILLIONS of dollars per year called a 40 person company (at the time) that could barely make payroll each month to apologize for the inconvenience. :-) We thought it was very considerate of them, and a little amusing. I'm proud to be the one that "signed" the papers indicating Backblaze had been "served".
> noticed that you are a fellow Beaver. Go Beavs! ;-)
Ha! This is a silly novelty "Beaver Baseball Cap" I was given during my internship in 1988 at Hewlett-Packard in Corvallis: https://i.imgur.com/G0rPHGP.jpg For 32 years it has sat on my computer monitor or on a shelf nearby. Nobody really asks about it anymore.
I got pretty lucky on timing at OSU. The year before I was there, they taught beginning programming on punch cards on a mainframe computer called a "CDC Cyber". In my freshman year I took Pascal on the brand new 1984 Macintoshes in the Computer Science department, while the engineering students still learned Fortran on the mainframe.
In 1992 I got a job as a Software Engineer at Apple in Cupertino, and you can trace that straight back to my blind luck of starting my programming education at OSU on the Mac in the exact correct year. Well, I'd rather be lucky than good.
If they’ve hit on a different access pattern that is more gentle, that might be something useful for posterity and I hope they dig into that possibility.
There’s also just the possibility that failure rates are bimodal and so they’ve hit the valley of stability.
Are they tracking wall clock time or activity for their failure data?
> If they've hit on a different access pattern that is more gentle, that might be something useful for posterity and I hope they dig into that possibility.
Internally at Backblaze, we're WAY more likely to be spending time trying to figure out why drives (or something else, like power supplies or power strips or any other pain point) are failing at a higher rate than looking into why something is going well. I'm totally serious, if something is going "the same as always or getting better" it just isn't going to get much of any attention.
You have to understand that with these stats, we're just reporting on what happened in our datacenter - the outcome of our operations. We don't really have much time to do more research and there isn't much more info than what you have. And if we stumbled upon something useful we would most likely blog about it. :-)
So we read all of YOUR comments looking for the insightful gems. We're all in this together, desperate for the same information.
Seems to me that every drive failure causes read/write amplification, so a small decrease in failure rates would compound. Have you folks done any other work to reduce write amplification this year?
The bottleneck for HDDs in this scenario is bandwidth. What you do is split & spread files as much as possible, so your HDDs are all serving the same amount of bandwidth. A disk doing nothing is wasted potential bandwidth (unless it's turned off).
But do they actively move around files to spread bandwidth after the initial write? If they don't, and if I am right that older files tend to be rarely accessed, I would expect entire disks to become unaccessed over time.
If they allow that to happen, they are leaving a ton of money on the table. It's typical in the industry to move hot and cold files around to take advantage of the IOPS you already paid for. See, for example, pages 22-23 of http://www.pdsw.org/pdsw-discs17/slides/PDSW-DISCS-Google-Ke...
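To make that concrete, here's a toy sketch of the kind of greedy "heat"-balancing placement those slides describe; it's purely illustrative (made-up objects and loads), not Backblaze's or Google's actual placement logic:

```python
# Toy sketch of hot/cold balancing: greedily place the "hottest" objects on the
# drive with the least accumulated read load, so no single disk serves most of
# the IOPS while others sit idle. Purely illustrative, not any vendor's logic.
import heapq

def balance(objects, n_drives):
    """objects: list of (name, expected_reads_per_day). Returns drive id -> names."""
    drives = [(0.0, i, []) for i in range(n_drives)]         # (load, drive id, contents)
    heapq.heapify(drives)
    for name, heat in sorted(objects, key=lambda o: -o[1]):  # hottest objects first
        load, i, contents = heapq.heappop(drives)            # least-loaded drive
        contents.append(name)
        heapq.heappush(drives, (load + heat, i, contents))
    return {i: contents for _, i, contents in drives}

print(balance([("a", 90), ("b", 50), ("c", 40), ("d", 5), ("e", 5)], n_drives=2))
# {0: ['a', 'd'], 1: ['b', 'c', 'e']}  -- both drives end up serving ~95 reads/day
```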
I’m just assuming that folks doing archival storage aren’t using these kinds of spinning disks as it would be super expensive compared to other media, right?
I do think access patterns in general should contribute to the numbers so that kind of thing can be determined.
Compared to what, exactly? Tape is cheaper per GB, but the drives and libraries tip that over the other way. Blu-Ray discs are now more expensive per GB than hard drives, thanks to SMR and He offerings.
Also note that Backblaze does backups -- by definition, these are infrequently accessed, usually write-once-read-never. I've personally been a customer for three years and performed a restore exactly once.
Despite claims to the contrary, tape isn't dead just yet. Tapes are still considerably cheaper than drives. An LTO-8 tape (12TB uncompressed capacity) can be had for about $100, while a 12TB HDD goes for some $300. Tape drives/libraries are quite expensive though, but that just shifts the break-even point out. For the largest sites, it's still economical. Not sure if Backblaze is big enough (I'm sure they did their numbers). backglacier anyone?
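To put rough numbers on that break-even point, here's a quick back-of-the-envelope using the media prices above, a completely made-up $100k figure for the library plus drives, and ignoring the disk-side chassis/server/power costs:

```python
# Back-of-the-envelope tape vs. HDD break-even. Media prices are from the
# comment above; the $100k library+drives cost is a made-up placeholder, and
# disk-side fixed costs (servers, power, racks) are ignored for simplicity.
tape_per_tb = 100 / 12       # ~$8.33/TB for LTO-8 media (12 TB for ~$100)
hdd_per_tb = 300 / 12        # $25/TB for a 12 TB HDD at ~$300
library_cost = 100_000       # hypothetical fixed cost for library + tape drives

# Tape is cheaper overall once the fixed cost is amortized over enough capacity:
break_even_tb = library_cost / (hdd_per_tb - tape_per_tb)
print(f"Break-even at roughly {break_even_tb:,.0f} TB (~{break_even_tb / 1000:.0f} PB)")
```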
And a number of the library vendors' libraries last for decades with only drive/tape swaps along the way. The SL8500 is on its second decade of sales, for example. Usually what kills them is the vendor deciding not to release firmware updates to support the newer drives. The stock half-inch cartridge form factor dates from 1984 with DLT & 3480, and there have been libraries with grippers capable of moving a wide assortment of DLT/LTO/TXX/etc cartridges at the same time, so it's doubtful that will change anytime in the future. If you buy one of the big libraries today it will likely last another decade or two, maybe three. There aren't many pieces of IT technology you can utilize that long.
I bought a pair of 12TB drives for $199 the other day and they often go cheaper. Now admittedly if you shuck externals you lose the warranty, but we are keeping them in the enclosures as these are for backups, and thus the ease of taking them off site is great for us.
Are you asking for service for the drive itself or for the drive within an enclosure?
Let's say you buy a Ford Explorer SUV. You remove the engine and use it somewhere else for a few years. If that engine breaks within the warranty period of the SUV, can you take just the engine to your local Ford dealer and ask them to fix it? Probably not.
Arguably, you can put the engine back into the SUV and take the entire car to a Ford dealer for repair. But would that constitute fraud on your part?
I was specifically thinking of the SKUs - I assumed they were using faster disks rather than high volume disks that make trade-offs for costs. Just assumptions on my part - and I am mostly curious for more data, but given the historical trends, I'm not terribly suspicious of the actual results here.
Drive enclosures, RAID/etc interfaces, and motherboards burning electricity make it a lot more complex than raw HDDs vs raw tape. Tape libraries cost a fortune, but so do the 10+ racks of cases + power supplies + servers needed to maintain the disks of equal capacity.
Tape suffers from "enterprise" which means the major vendors price it so that its just a bit cheaper than disk, and they lower their prices to keep that equation balanced because fundamentally coated mylar/etc wrapped around a spindle in an injection molded case is super cheap.
That seems like a misleading aggregation. Their total AFR could have been affected just by a mix shift from early-death to mid-life drives. It looks that way to me from their tables.
> If every drive type, new and old, big and small, did better this year, maybe they changed something in their environment this year?
It could also be that newer drives this year are better than newer drives last year, while older drives are over a "hill" in the failure statistics; e.g., there could be more 1st-year failures than 2nd-year failures (for a fixed number of drives starting the year).
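As a toy illustration of how a mix shift alone can move the aggregate number (all figures below are made up, not Backblaze's):

```python
# Toy mix-shift example: the fleet-wide AFR drops even though every age cohort
# fails at exactly the same rate as before, purely because a smaller fraction
# of the fleet is in its (worse) first year. All numbers are hypothetical.
afr_by_cohort = {"year_1": 0.020, "year_2_plus": 0.008}

mix_2019 = {"year_1": 0.50, "year_2_plus": 0.50}   # lots of freshly deployed drives
mix_2020 = {"year_1": 0.20, "year_2_plus": 0.80}   # fleet mostly past the early bump

for year, mix in (("2019", mix_2019), ("2020", mix_2020)):
    fleet_afr = sum(mix[c] * afr_by_cohort[c] for c in mix)
    print(f"{year}: fleet AFR = {fleet_afr:.2%}")
# 2019: fleet AFR = 1.40%
# 2020: fleet AFR = 1.04%   <- lower, with zero per-cohort improvement
```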
What about air quality? There are actually air filters in the form of a patch or a package, similar to miniature mustard packets, through which drives breathe. Supposedly those are super fine filters, but toxic gas molecules might still pass through them.
I guess these Hard Drive Stats posts cover disks used for their B2 service as well? Maybe the service mix is changing (a larger percentage being used for B2 versus their traditional backup service).
I'm not sure how a more B2-like access pattern would improve the stats, though.
> I guess these Hard Drive Stats posts cover disks used for their B2 service as well?
Yes. The storage layer is storing both Backblaze Personal Backup files and B2 files. It's COMPLETELY interleaved, every other file might be one or the other. Same storage. And we are reporting the failure rates of drives in that storage layer.
We THINK (but don't know for certain) that the access patterns are relatively similar. For example, many of the 3rd party integrations that store files in B2 are backup programs and those will definitely have similar access patterns. However, B2 is used in some profoundly different applications, like the origin store for a Cloudflare fronted website. So that implies more "reads" than the average backup, and that could be changing the profile over time as that part of our business grows.
Perhaps the margin of error should be raised to accommodate this change of about 1%, although the set of drives under test is likely not the same between years.
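One way to sanity-check that would be to put an exact (Poisson) confidence interval on the headline AFR. A sketch below, with placeholder failure and drive-day counts rather than the real 2020 totals:

```python
# Sketch: exact Poisson ("Garwood") confidence interval for an AFR figure.
# The failure and drive-day counts are placeholders, not the real 2020 totals;
# plug in the per-model or fleet-wide numbers from the actual report.
from scipy.stats import chi2

def afr_confidence_interval(failures, drive_days, conf=0.95):
    """Return (low, high) annualized failure rate, in percent."""
    drive_years = drive_days / 365
    alpha = 1 - conf
    low = chi2.ppf(alpha / 2, 2 * failures) / 2 if failures > 0 else 0.0
    high = chi2.ppf(1 - alpha / 2, 2 * (failures + 1)) / 2
    return 100 * low / drive_years, 100 * high / drive_years

print(afr_confidence_interval(failures=1_500, drive_days=58_000_000))
# With counts of this order of magnitude, the purely statistical error bars
# come out to only a few hundredths of a percent either side of the estimate.
```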