North American Object Storage Service Impact (meraki.com)
60 points by mlosapio on Aug 6, 2017 | 58 comments


I designed/deployed a decent-sized Meraki network about 4 years ago - at the time it was one of the larger full-stack Meraki networks in existence: an 11-site school district with the edge, all IDFs, APs, phones, and at a later point some of their cameras.

Meraki still thinks of themselves as a startup, but they have these "uh-ohs" all the time. Random bad firmware that turns off the 5 GHz channel on the MR42s. A DPI "upgrade" that blocked ALL SSL traffic (which at this point is basically all traffic). Their solution was always to try "beta" firmware... in production... in the middle of state-mandated online testing.

I was a huge advocate for them, but at some point it's gonna be hard for me to keep recommending them. They're so excited about new features but really fall short on 1) fixing bugs and 2) ensuring robustness. The "fail fast, fail often" mentality really shouldn't be applied to critical infrastructure.


Ubiquiti (UniFi) was very much like this back then too, and to this day still breaks stuff every update. They're getting better now, but running a large site or multiple large sites was a constant game of whack-a-mole trying to figure out which firmware works best on which equipment.

I have heard similar stories to yours about Meraki, and that's what swayed the decision to just go Ubiquiti, since it's less expensive.


Isn't Ubiquiti supposed to be the high-quality "do one thing really really well" prosumer brand?

That's terrible to hear their crap blows up. Everybody says to upgrade away from D-Link and Asus to the Ubiquiti stuff to get a rock-solid, pro-quality home network.

Sounds like I'm just as bad off with the consumer stuff.


Anything that receives 8 firmware updates per year is a risky thing to put into mission critical service.

It is getting better; I can say the last few UniFi updates I did involved much less butt pucker than previously, and they were done to get new features, not out of necessity.


We removed almost every piece of Ubiquiti equipment from our network because of the quantity and severity of bugs in their products. For example: disabling a port on EdgeRouters will grey out the port in the UI, but it doesn't actually stop traffic from passing through the port.

I also have no love for how some features are only available via command line while others are only available in the UI. This also differs depending on what product line you're using. Pick one strategy and stick to it.


What are you using instead now? Genuinely curious.


We're nearly 100% Juniper on the network side. The PtP and PtMP equipment we were using from Ubiquiti is now 95% Mimosa. The only Ubiquiti products I still like are the AirFiber 24s, and that's only because no one else offers an alternative.


Whenever people say they don't need backups because they are 'cloud based', I always wonder what they'll do when their precious cloud provider messes up. The chances of this happening to Amazon, Google, or Microsoft are small, but they're not zero; if it can happen to Cisco, it could happen anywhere.


I don't need backups of my (S3) backup, because it's more likely that my personal backup process has a flaw that will backfire and destroy my data than that it will one day save it from the inadequacy of Dynamo's ~RAID17.

Consider: each time you introduce a new device that has local, physical access to the place your data lives, that's one more thing that could Halt and Catch Fire at just the wrong time, or be replaced with a USB Killer or a DMA cryptolocker device by social engineering. If it involves data center operators you don't know, that's more people you have to trust not to break whatever they touch or have been paid off to steal your corporate secrets. Etc.

Sure, the probabilities are small—but so is the probability of the great data fortresses crumbling to ash and you being the Last Best Hope for your data. Hypothetical ameliorations of sub-lightning-strike probabilities often have failure modes more likely than their use.


In that case I hope you have your S3 under a different account than your main stuff. There are more reasons why stuff goes missing than just hardware failure.

Note that a backup need not make things worse, but should only make things better.


> Consider: each time you introduce a new device that has local, physical access to the place your data lives

Right. So don't do that. Put it somewhere else, and configure the original device to push to it rather than give the new device access to the original. You can use a service that implements the S3 API; then you don't even need to install new stuff on the original, just configure an extra endpoint. Also, encrypt before pushing (that counts for S3 too).
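
Something like this, roughly, if you're scripting it yourself with boto3 plus the cryptography package; the bucket, endpoint, and file names below are just placeholders, not anything specific:

    # Minimal sketch: encrypt locally, then push to any S3-compatible endpoint.
    # Assumes boto3 and the "cryptography" package are installed; bucket,
    # endpoint, and key-file names are placeholders.
    import boto3
    from cryptography.fernet import Fernet

    key = open("backup.key", "rb").read()  # pre-generated with Fernet.generate_key()
    ciphertext = Fernet(key).encrypt(open("dump.tar", "rb").read())

    s3 = boto3.client(
        "s3",
        endpoint_url="https://s3.example-provider.com",  # any S3 API implementation
    )
    s3.put_object(Bucket="offsite-backups", Key="dump.tar.enc", Body=ciphertext)

The original device only ever needs the push credentials and the ciphertext ever leaves the box, which is the whole point of the "push, don't pull" arrangement.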


Do you think you will do a better job of backing up stuff compared to Google, Microsoft, etc.? They have dedicated engineers and spend lots of money on this stuff.

Think of it from a statistical perspective: what is the probability of you setting up this backup system correctly vs. them?


Reasons why data can go missing:

- account compromised, wiped out

- operator error

- malicious employee

All of these have happened to companies that I have worked with, so no, I won't do a better job of backing stuff up compared to Google, Microsoft, etc., BUT I would rather have some get-out-of-jail-free card if any of the above should happen and suddenly where there used to be data there is nothing.

You should approach this from a cost-benefits perspective, not from a skills perspective.


They may have better engineering, but they also have extra risks. My home server will never ban me because it thinks I've violated its TOS, for example.


Nor will your own storage lock you out because you've annoyed a state actor, while a cloud provider will roll over.


I actually really doubt that Google, Amazon et al have proper backups of every client's storage - I've never come across details or even an idea of such a system. They just have enough redundancy and, more importantly, a "never-delete" architecture - data is merely tagged for deletion for a significant amount of time before it's ever deleted, and various systems check consistency on an ongoing basis.

Of course, even that doesn't prevent you from fucking up - your datastore will do exactly what you tell it to. Nobody can prevent you from doing the equivalent of rm -rf on your S3 store, or accidentally deleting the only copy of that movie your client's been working on for the last four years, and nothing can protect you from it except a decent backup.
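
To make the "rm -rf equivalent" concrete, here's a purely hypothetical sketch (made-up bucket and prefix, not from any real incident) of how a one-character prefix mistake in a bulk delete takes everything under it, with no confirmation prompt:

    # Hypothetical illustration only: a bulk delete with a mistyped prefix.
    # S3 will happily remove everything that matches; there is no "are you sure?".
    import boto3

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    # Intended "projects/old/", typed "projects/" -- goodbye, every project.
    for page in paginator.paginate(Bucket="client-assets", Prefix="projects/"):
        objects = [{"Key": o["Key"]} for o in page.get("Contents", [])]
        if objects:
            s3.delete_objects(Bucket="client-assets", Delete={"Objects": objects})

The datastore did exactly what it was told; only a separate backup (or versioning you set up yourself) gets that data back.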


Not sure about GCP, but Google certainly has back-ups for GMail. I was affected by an outage where only a few accounts (maybe millions but at least not a lot by Google standards) had emails deleted due to a software issue. They explained that recovery would take a few hours because data had to be restored from tape. At least that's the message they showed when I tried to login. Note that this was the free GMail product, no business support.


Even though there is some reward for expertise, backups are not difficult. What exactly do the Big 4 bring to the backup table that none of us with Amanda, rsync, or BackupExec could do?

Cost of resources aside, a person could run hourly full backups all day every day and have just as good a backup regime as a billion-dollar company. Time-to-restore is something that the aforementioned expertise factors into, but a good backup is the linchpin, and it can still be restored by whatever means.
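
For what it's worth, a rough sketch of that kind of hourly regime, here as a Python wrapper around rsync hard-link snapshots (paths are placeholders; assumes rsync is installed and this runs from cron every hour):

    # Rough sketch: hourly snapshots via rsync --link-dest, so unchanged files
    # become hard links and each snapshot still looks like a full backup.
    import subprocess, datetime, pathlib

    src = "/srv/data/"
    dest_root = pathlib.Path("/backups")
    stamp = datetime.datetime.now().strftime("%Y-%m-%dT%H")
    latest = dest_root / "latest"
    target = dest_root / stamp

    cmd = ["rsync", "-a", "--delete"]
    if latest.exists():
        cmd += ["--link-dest", str(latest)]   # dedupe against the previous snapshot
    cmd += [src, str(target)]
    subprocess.run(cmd, check=True)

    # Point "latest" at the snapshot we just made.
    if latest.is_symlink() or latest.exists():
        latest.unlink()
    latest.symlink_to(target)

Nothing exotic; the expertise mostly shows up in pruning old snapshots, monitoring, and how fast you can restore.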


Nobody said that you shouldn't have any data in the cloud. The argument is that you shouldn't have your data only in some cloud.

If you have your data in some cloud (either directly or as backup) as well as in your really crappy backup solution that has a 10% failure rate, you are still ten times less likely to lose your data than by just keeping your data in the cloud.
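
Back-of-envelope for that factor of ten, assuming the cloud copy and the home-grown backup fail independently (that independence is the big assumption; the cloud-loss number is made up for illustration):

    # Illustrative arithmetic: independent failures multiply.
    p_cloud_loss = 0.001     # made-up probability the cloud copy is lost
    p_backup_fail = 0.10     # the "really crappy" 10% backup failure rate

    p_cloud_only = p_cloud_loss                     # cloud alone
    p_both_lost = p_cloud_loss * p_backup_fail      # cloud + crappy backup

    print(p_cloud_only / p_both_lost)               # -> 10.0, i.e. ten times less likely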


It seems like storing your backups using two cloud providers is much more reliable than using just one cloud provider + local NAS.


As long as - if this is a company - different people have access to those accounts, yes.


It's a good point, but I can't help but think of all the mistakes, disclosures, privacy violations, poor design, gratuitous change, etc. that have happened at the hands of "dedicated engineers." In this case specifically, are the engineers who caused this incident not dedicated and well paid?


Additional backups only add to, not subtract from, reliability.


> Amazon, Google or Microsoft

Just to be clear, if you're using AWS, GCP, or Azure to host your own applications, it's on you to manage disaster recovery. Those companies make doing that much easier than managing your own DC, and yes, the reliability is going to be better than DIY (but the risk is still never zero). I think you mean this more towards SaaS applications, or anything that "phones home" data to back it up, right?

We're going to start seeing more business continuity audits of SaaS players, akin to a BBB rating for a company's ability to maintain service levels. I thought I came across a website that has actually started doing this, but I can't recall which one it was.


I think it's more about shifting blame than actually providing more reliable services. I regularly see S3 throw an error when reading and writing tens of thousands of files (Spark w/ Parquet). It's not that it's MORE reliable (although it is very reliable); it's just that when it isn't, it's somebody else's problem and responsibility to fix it.
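
Not that it changes the blame-shifting point, but the usual client-side mitigation for those sporadic errors is just to crank up the retry policy. A sketch assuming boto3 and direct S3 access (bucket and key are placeholders; Spark's own S3 connector has its own retry settings):

    # Sketch: more aggressive retries for transient S3 errors.
    import boto3
    from botocore.config import Config

    s3 = boto3.client(
        "s3",
        config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
    )
    s3.put_object(Bucket="analytics-output", Key="part-00000.parquet", Body=b"...")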


One's cloud is someone else's hardware sitting in a basement.


Centralization decreases the frequency of failure but increases its cost. A really large-scale, nasty incident at a major cloud provider like Amazon could be a national state of emergency.


It really is only a matter of time. The problem with low-frequency events is that you never have any idea how realistic your modeling is, and the only way you'll find out is when that once-in-a-1000-year event happens tomorrow morning.


> once in a 1000 year event happens tomorrow morning

It doesn't quite sound like a once-in-a-thousand-year event any more.

Rather, it sounds like a once-in-a-thousand-year event for a single device; multiplied across ten thousand devices, that means it happens about ten times every year.

There are whole branches of statistics for failure rates.
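
Roughly the arithmetic being gestured at, with illustrative numbers and a Poisson assumption:

    # Fleet-wide failure arithmetic: a 1-in-1000-year event per device,
    # across 10,000 devices (illustrative numbers).
    import math

    rate_per_device = 1 / 1000          # events per device per year
    devices = 10_000

    fleet_rate = rate_per_device * devices      # expected events/year = 10
    p_none_this_year = math.exp(-fleet_rate)    # Poisson: P(no event) ~ 4.5e-5
    print(fleet_rate, 1 - p_none_this_year)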


> It doesn't quite sound like a once-in-a-thousand-year event any more.

It still could be. Or do you expect such events to happen in year 500 only?


Perhaps... Until it happens again next week.


Remember the AWS S3 outage a few months back? There is a lesson to learn from that.


Thankfully, they didn't lose/delete existing data during that outage.


Agreed, there was no data loss, but those who relied solely on US-Standard with no replication elsewhere were dead in the water for the duration of the outage.


Sure, but that shows up in the risk calculations when you're choosing a cloud provider. I imagine for just about everyone it was cheaper to eat the loss on the day of the outage than to spend the time/effort/resources to do it right. Especially when it made national news that it was Amazon's fault so nobody blamed the sites that were down.


> Especially when it made national news that it was Amazon's fault so nobody blamed the sites that were down.

That's an interesting viewpoint. I really don't agree with it though. When your service is down that is your responsibility, never Amazon's. And when you lose data that is your responsibility too, not your cloud provider's.


What was the lesson? That no service has 100% uptime?


And that redundancy of data won't help you if the control server crashes.


S3 has had a data loss too, sadly; a console UI bug led to the wrong files getting deleted, if I recall correctly. Yet another reason for frontend web tech to improve.


Factually incorrect: there was no data loss; only availability was affected. https://aws.amazon.com/message/41926/


Not sure why you're citing an unrelated incident; I think you're jumping to the conclusion that I'm talking about the one you cited, but I'm not. This one was not made public.


When was that? Do you have the post mortem for it?


There's no public postmortem; it was in 2015 as I recall, and it was handled internally and just with affected customers.


There is some evidence of this; see the comment by Scott Bonds of Mixbook:

https://www.quora.com/Has-Amazon-S3-ever-lost-data-permanent...


It was a bit surprising when I found out, but I think the really interesting part is that low-quality web tech is the weak link in the chain. That was an eye opener.


Not surprised to see this, I've run into many issues with Meraki devices over the past 2+ years.

Their support team is amateur at best; at one point I had 6 Meraki engineers working on a DHCP problem (yeah... DHCP), and their recommendation after several weeks of troubleshooting was to do a factory reset.

I have dozens of stories...don't even get me started.


>"The issue has since been remediated and is no longer occurring."

I would think that if you lost your data, then unless they have restored your deleted data, the "issue" is still very much occurring for you as a customer.

Wouldn't remediation be that they have recovered your lost data?


That would certainly be my understanding of remediation.


FTA: "Your network configuration data is not lost or impacted - this issue is limited to user-uploaded data."

Errrrrrr... so the issue was limited only to data I would actually care about then? Or did I misread?

That is a frankly extraordinary use of weasel words.


Yeah, you misread. Configuration data is what makes the network actually work and that was preserved.


Fair enough. And I suppose that release was meant for people with an understanding of the product, in which case that makes more sense. Still... interesting choice of words. "Oh, it's only the user-uploaded data that we've lost."


I imagine that from Cisco's perspective the network configuration data is what's actually important. If your priorities are different then I suppose you should be more worried about this incident.


Only losing user-uploaded data seems pretty mild, in this case. Meraki sells cloud-managed network hardware, so you don't interact with it that often. Network configurations, logs, traffic data -- any of those would've been much worse to lose. The custom bits like voicemail greetings and IVR will likely be the hardest to replace.


Does anyone familiar with their marketing language know how many "nines" they had on their resiliency number?


As many extra as were punched into that command that permanently wiped out a bunch of AWS EBS data a while back.

It haunts me to think about how many people are using these services as their single source of data. Fat fingers melt through 9s.


And this is why outsourcing 100% of your stuff isn't the greatest idea. Sure, the 'managing servers is not your core business' story holds up most of the time, but when you no longer have control of anything, you no longer control your services.


I would think "business continuity" is a core function of, you know, business.


As do I, but that's not what marketing teams advertise ;-) I suppose offloading specific workloads makes for a great cloud use case, but being able to run the baseline at home will at least guarantee a degree of control that will keep you going.



