> The AFR for 2020 dropped below 1% down to 0.93%. In 2019, it stood at 1.89%. That’s over a 50% drop year over year... In other words, whether a drive was old or new, or big or small, they performed well in our environment in 2020.
If every drive type, new and old, big and small, did better this year, maybe they changed something in their environment this year? Better cooling, different access patterns, etc.
If this change doesn't have an obvious root cause, I'd be interested in finding out what it is if I were Backblaze. It could be something they could optimize around even more.
I was wondering if the state of the world in 2020 might have dramatically changed their business / throughput / access patterns in a meaningful enough way to cause this dip.
I'm not sure if they have a measure of the disk utilization or read/write load along with the failure rate.
Disclaimer: I work at Backblaze, but mostly on the client that runs on desktops and laptops.
> I'm not sure if Backblaze has a measure of the disk utilization or read/write load along with the failure rate.
We publish the complete hard drive SMART stats for anybody to attempt these analyses. Most of us in Backblaze engineering get giddy with excitement when a new article comes out that looks at correlating SMART stats and failures. :-) For example, this article circulated widely at Backblaze a few days ago: https://datto.engineering/post/predicting-hard-drive-failure...
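If anyone wants to poke at the raw data themselves, here's a rough sketch of computing per-model AFR from the daily snapshot CSVs. This is not the official report methodology, and the column names assumed below (model, failure) should be double-checked against whichever quarter's files you download:

```python
# Rough sketch: estimate annualized failure rate (AFR) per drive model from
# Backblaze-style daily snapshot CSVs (one file per day, one row per drive).
# Assumed columns: "model" and "failure" (1 on the day a drive fails, else 0).
import glob

import pandas as pd

frames = [
    pd.read_csv(path, usecols=["model", "failure"])
    for path in glob.glob("data_Q4_2020/*.csv")  # hypothetical folder of daily CSVs
]
df = pd.concat(frames, ignore_index=True)

per_model = df.groupby("model").agg(
    drive_days=("failure", "size"),  # each row = one drive observed for one day
    failures=("failure", "sum"),
)
# AFR = failures per drive-year, expressed as a percentage
per_model["afr_pct"] = per_model["failures"] / (per_model["drive_days"] / 365) * 100
print(per_model.sort_values("drive_days", ascending=False).head(15))
```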
Wow- crazy to see people at Backblaze actually read that article!!
Thank you guys for putting out so much data and writing so much about your findings- it was HUGE in helping me come to conclusions about what's realistic to assume from SMART stats. Y'all are doing some really really cool stuff.
I anticipate these reports every year and have strong trust in the data - I want to make that clear - Backblaze has done a massive service to the entire industry by collecting and aggregating this kind of data.
I'm really super curious about the dip in errors over the past year :)
Whether intentional or not, it's also great word-of-mouth advertising. My preexisting experience with Backblaze's hard drive stats reporting definitely worked positively in their favor when I was looking for a new backup service.
> Whether intentional or not, it's also great word-of-mouth advertising.
Oh, this has really worked out for Backblaze and we know it.
The first time we published our drive failure rates (I think January of 2014?) a few people said, "Uh oh, now Backblaze will get sued by the drive manufacturers." And we cringed and waited. :-) But the lawsuit never came, in fact there were NO repercussions, only increased visibility. People who have never heard of our company before find the data interesting, and then they ask "hey, what does this company do to own this many drives?" And a few (like yourself) sign up for the service.
Existing customers seem to stick with us for a long time, and even recommend us to other friends and family from time to time. So one tech person who stumbles across these stats might ACTUALLY bring us 3 or 4 more customers over the next 5 years. That's real money to us.
Not only have the drive manufacturers not sued us, they are actually NICE to us beyond the scale of our actual drive purchases! In one amusing example, our drive stats were used in a lawsuit as evidence. To be clear Backblaze was not the plaintiff or the defendant in the court case, we had no skin in the game at all and didn't want to be involved, but our drive data (and internal emails) were subpoenaed to be entered into evidence. Before we were served, the drive manufacturer called us and apologized for the inconvenience and made it clear they had no beef with us. Yes, a multi-national company that makes BILLIONS of dollars per year called a 40 person company (at the time) that could barely make payroll each month to apologize for the inconvenience. :-) We thought it was very considerate of them, and a little amusing. I'm proud to be the one that "signed" the papers indicating Backblaze had been "served".
> noticed that you are a fellow Beaver. Go Beavs! ;-)
Ha! This is a silly novelty "Beaver Baseball Cap" I was given during my internship in 1988 at Hewlett-Packard in Corvallis: https://i.imgur.com/G0rPHGP.jpg For 32 years it has sat on my computer monitor or on a shelf nearby. Nobody really asks about it anymore.
I got pretty lucky on timing at OSU. The year before I was there, they taught beginning programming on punch cards on a mainframe computer called a "CDC Cyber". In my freshman year I took Pascal on the brand new 1984 Macintoshes in the Computer Science department, while the engineering students still learned Fortran on the mainframe.
In 1992 I got a job as a Software Engineer at Apple in Cupertino, and you can trace that straight back to my blind luck of starting my programming education at OSU on the Mac in the exact correct year. Well, I'd rather be lucky than good.
If they’ve hit on a different access pattern that is more gentle, that might be something useful for posterity and I hope they dig into that possibility.
There’s also just the possibility that failure rates are bimodal and so they’ve hit the valley of stability.
Are they tracking wall clock time or activity for their failure data?
> If they've hit on a different access pattern that is more gentle, that might be something useful for posterity and I hope they dig into that possibility.
Internally at Backblaze, we're WAY more likely to be spending time trying to figure out why drives (or something else, like power supplies or power strips or any other pain point) are failing at a higher rate than looking into why something is going well. I'm totally serious, if something is going "the same as always or getting better" it just isn't going to get much of any attention.
You have to understand that with these stats, we're just reporting on what happened in our datacenter - the outcome of our operations. We don't really have much time to do more research and there isn't much more info than what you have. And if we stumbled upon something useful we would most likely blog about it. :-)
So we read all of YOUR comments looking for the insightful gems. We're all in this together, desperate for the same information.
Seems to me that every drive failure causes read/write amplification, so a small decrease in failure rates would compound. Have you folks done any other work to reduce write amplification this year?
The bottleneck for HDDs in this scenario is bandwidth. What you do is split & spread files as much as possible, so your HDDs are all serving the same amount of bandwidth. A disk doing nothing is wasted potential bandwidth (unless it's turned off).
But do they actively move around files to spread bandwidth after the initial write? If they don't, and if I am right that older files tend to be rarely accessed, I would expect entire disks to become unaccessed over time.
If they allow that to happen, they are leaving a ton of money on the table. It's typical in the industry to move hot and cold files around to take advantage of the IOPS you already paid for. See, for example, pages 22-23 of http://www.pdsw.org/pdsw-discs17/slides/PDSW-DISCS-Google-Ke...
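To make that concrete, here's a toy sketch of the kind of greedy "heat"-balancing placement those slides describe; it's purely illustrative (made-up objects and loads), not Backblaze's or Google's actual placement logic:

```python
# Toy sketch of hot/cold balancing: greedily place the "hottest" objects on the
# drive with the least accumulated read load, so no single disk serves most of
# the IOPS while others sit idle. Purely illustrative, not any vendor's logic.
import heapq

def balance(objects, n_drives):
    """objects: list of (name, expected_reads_per_day). Returns drive id -> names."""
    drives = [(0.0, i, []) for i in range(n_drives)]         # (load, drive id, contents)
    heapq.heapify(drives)
    for name, heat in sorted(objects, key=lambda o: -o[1]):  # hottest objects first
        load, i, contents = heapq.heappop(drives)            # least-loaded drive
        contents.append(name)
        heapq.heappush(drives, (load + heat, i, contents))
    return {i: contents for _, i, contents in drives}

print(balance([("a", 90), ("b", 50), ("c", 40), ("d", 5), ("e", 5)], n_drives=2))
# {0: ['a', 'd'], 1: ['b', 'c', 'e']}  -- both drives end up serving ~95 reads/day
```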
I’m just assuming that folks doing archival storage aren’t using these kinds of spinning disks as it would be super expensive compared to other media, right?
I do think access patterns in general should contribute to the numbers so that kind of thing can be determined.
Compared to what, exactly? Tape is cheaper per GB, but the drives and libraries tip that over the other way. Blu-Ray discs are now more expensive per GB than hard drives, thanks to SMR and He offerings.
Also note that Backblaze does backups -- by definition, these are infrequently accessed, usually write-once-read-never. I've personally been a customer for three years and performed a restore exactly once.
Despite claims to the contrary, tape isn't dead just yet. Tapes are still considerably cheaper than drives. An LTO-8 tape (12TB uncompressed capacity) can be had for about $100, while a 12TB HDD goes for some $300. Tape drives/libraries are quite expensive though, but that just shifts the break-even point out. For the largest sites, it's still economical. Not sure if Backblaze is big enough (I'm sure they did their numbers). backglacier anyone?
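To put rough numbers on that break-even point, here's a quick back-of-the-envelope using the media prices above, a completely made-up $100k figure for the library plus drives, and ignoring the disk-side chassis/server/power costs:

```python
# Back-of-the-envelope tape vs. HDD break-even. Media prices are from the
# comment above; the $100k library+drives cost is a made-up placeholder, and
# disk-side fixed costs (servers, power, racks) are ignored for simplicity.
tape_per_tb = 100 / 12       # ~$8.33/TB for LTO-8 media (12 TB for ~$100)
hdd_per_tb = 300 / 12        # $25/TB for a 12 TB HDD at ~$300
library_cost = 100_000       # hypothetical fixed cost for library + tape drives

# Tape is cheaper overall once the fixed cost is amortized over enough capacity:
break_even_tb = library_cost / (hdd_per_tb - tape_per_tb)
print(f"Break-even at roughly {break_even_tb:,.0f} TB (~{break_even_tb / 1000:.0f} PB)")
```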
And a number of the library vendors' libraries last for decades with only drive/tape swaps along the way. The SL8500 is on its second decade of sales, for example. Usually what kills them is the vendor deciding not to release firmware updates to support the newer drives. The stock half-inch cartridge form factor dates from 1984 with DLT & 3480, and there have been libraries with grippers capable of moving a wide assortment of DLT/LTO/TXX/etc cartridges at the same time, so it's doubtful that will change anytime in the future. If you buy one of the big libraries today it will likely last another decade or two, maybe three. There aren't many pieces of IT technology you can utilize that long.
I bought a pair of 12TB drives for $199 the other day and they often go cheaper. Now admittedly if you shuck externals you lose the warranty, but we are keeping them in the enclosures as these are for backups, and thus the ease of taking them off site is great for us.
Are you asking for service for the drive itself or for the drive within an enclosure?
Let's say you buy a Ford Explorer SUV. You remove the engine and use it somewhere else for a few years. If that engine breaks within the warranty period of the SUV, can you take just the engine to your local Ford dealer and ask them to fix it? Probably not.
Arguably, you can put the engine back into the SUV and take the entire car to a Ford dealer for repair. But would that constitute fraud on your part?
I was specifically thinking of the SKUs - I assumed they were using faster disks rather than high volume disks that make trade-offs for costs. Just assumptions on my part - and I am mostly curious for more data, but given the historical trends, I'm not terribly suspicious of the actual results here.
Drive enclosures, RAID/etc interfaces, and motherboards burning electricity make it a lot more complex than raw HDDs vs raw tape. Tape libraries cost a fortune, but so do the 10+ racks of cases + power supplies + servers needed to maintain the disks of equal capacity.
Tape suffers from "enterprise" which means the major vendors price it so that its just a bit cheaper than disk, and they lower their prices to keep that equation balanced because fundamentally coated mylar/etc wrapped around a spindle in an injection molded case is super cheap.
That seems like a misleading aggregation. Their total AFR could have been affected just by a mix shift from early-death to mid-life drives. It looks that way to me from their tables.
> If every drive type, new and old, big and small, did better this year, maybe they changed something in their environment this year?
It could also be that newer drives this year are better than newer drives last year, while older drives are over a "hill" in the failure statistics; e.g., there could be more 1st-year failures than 2nd-year failures (for a fixed number of drives starting the year).
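As a toy illustration of how a mix shift alone can move the aggregate number (all figures below are made up, not Backblaze's):

```python
# Toy mix-shift example: the fleet-wide AFR drops even though every age cohort
# fails at exactly the same rate as before, purely because a smaller fraction
# of the fleet is in its (worse) first year. All numbers are hypothetical.
afr_by_cohort = {"year_1": 0.020, "year_2_plus": 0.008}

mix_2019 = {"year_1": 0.50, "year_2_plus": 0.50}   # lots of freshly deployed drives
mix_2020 = {"year_1": 0.20, "year_2_plus": 0.80}   # fleet mostly past the early bump

for year, mix in (("2019", mix_2019), ("2020", mix_2020)):
    fleet_afr = sum(mix[c] * afr_by_cohort[c] for c in mix)
    print(f"{year}: fleet AFR = {fleet_afr:.2%}")
# 2019: fleet AFR = 1.40%
# 2020: fleet AFR = 1.04%   <- lower, with zero per-cohort improvement
```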
What about air quality? There are actually air filters in the form of a patch or a package, similar to miniature mustard packets, through which drives breathe. Supposedly those are super fine filters, but toxic gas molecules might still pass through them.
I guess these Hard Drive Stats posts cover disks used for their B2 service as well? Maybe the service mix is changing (a larger percentage being used for B2 versus their traditional backup service).
I'm not sure how a more B2-like access pattern would improve the stats, though.
> I guess these Hard Drive Stats posts cover disks used for their B2 service as well?
Yes. The storage layer is storing both Backblaze Personal Backup files and B2 files. It's COMPLETELY interleaved, every other file might be one or the other. Same storage. And we are reporting the failure rates of drives in that storage layer.
We THINK (but don't know for certain) that the access patterns are relatively similar. For example, many of the 3rd party integrations that store files in B2 are backup programs and those will definitely have similar access patterns. However, B2 is used in some profoundly different applications, like the origin store for a Cloudflare fronted website. So that implies more "reads" than the average backup, and that could be changing the profile over time as that part of our business grows.
Perhaps the margin of error should be raised to accommodate this change of about 1%, although the set of drives under test is likely not the same between years.
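One way to sanity-check that would be to put an exact (Poisson) confidence interval on the headline AFR. A sketch below, with placeholder failure and drive-day counts rather than the real 2020 totals:

```python
# Sketch: exact Poisson ("Garwood") confidence interval for an AFR figure.
# The failure and drive-day counts are placeholders, not the real 2020 totals;
# plug in the per-model or fleet-wide numbers from the actual report.
from scipy.stats import chi2

def afr_confidence_interval(failures, drive_days, conf=0.95):
    """Return (low, high) annualized failure rate, in percent."""
    drive_years = drive_days / 365
    alpha = 1 - conf
    low = chi2.ppf(alpha / 2, 2 * failures) / 2 if failures > 0 else 0.0
    high = chi2.ppf(1 - alpha / 2, 2 * (failures + 1)) / 2
    return 100 * low / drive_years, 100 * high / drive_years

print(afr_confidence_interval(failures=1_500, drive_days=58_000_000))
# With counts of this order of magnitude, the purely statistical error bars
# come out to only a few hundredths of a percent either side of the estimate.
```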