LinkedIn shelved plan to migrate to Microsoft Azure cloud (cnbc.com)
187 points by helsinkiandrew on Dec 15, 2023 | 270 comments


I'm obviously not going to comment on anything internal, and I'm obviously speaking for myself and not the company, but it's worth bearing in mind that this migration was not from "on-prem" in the traditional sense. LinkedIn has its own internal cloud, complete with all the abstractions you'd expect from a public cloud provider, except developed contemporaneously with all the rest of the "clouds" everyone is familiar with. It was designed for, and is tightly coupled to, LinkedIn's particular view on how to build a flexible infrastructure (for an extreme example, using Rest.Li[1], which includes client-side load balancing).
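(For readers who haven't run into client-side load balancing: the rough idea, sketched below in Python with made-up hostnames, is that each client keeps its own view of the backend pool and picks an instance per request, rather than routing everything through a central load balancer. Rest.li's actual mechanism, D2, is far more elaborate - service discovery, weighted strategies, health tracking - so treat this only as an illustration of the concept.)

```python
import itertools
import random

class ClientSideBalancer:
    """Toy illustration: the *client* owns the endpoint list and chooses a
    backend per request, instead of sending everything to a central LB/VIP."""

    def __init__(self, endpoints):
        # In a real system this list would come from a service registry
        # and be refreshed as instances come and go.
        self.endpoints = list(endpoints)
        self._round_robin = itertools.cycle(self.endpoints)

    def pick(self, strategy="round_robin"):
        if strategy == "round_robin":
            return next(self._round_robin)
        return random.choice(self.endpoints)  # naive random fallback

balancer = ClientSideBalancer(["host-a:7070", "host-b:7070", "host-c:7070"])
for _ in range(4):
    print("sending request to", balancer.pick())
```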

There was no attempt to "lift-and-shift" anything. There are technologies that overlap and technologies that conflict and technologies that complement one another. As with any huge layered stack, you have to figure out which from the "LinkedIn" column marry well with those in the "Azure" column.

I personally appreciate LI management's ability to be clear-eyed about whether the ROI was there.

[1] https://linkedin.github.io/rest.li/


Oof, I'm twitching just reading that because we're in exactly the same boat. The problem with the ROI is that any kind of not-self-run cloud is guaranteed to be more expensive in direct costs. This has been shown time and time again for any reasonably large enterprise. However, there is a long list of things that are hard to express in money that support a cloud move, mostly to do with keeping up with modern tech, hiring, DR, better resiliency, etc. and so the decision can be quite dependent on the particular execs in the chain of command and their subjective values.


A lot of organisations were already doing “cloud” before the big migrations to AWS and Azure started in the EU enterprise scene. Around 2010 it became a much better business case to get into the local cloud, which basically meant buying our own hardware and putting it in rented rack space at local businesses specialised in running it, instead of having it in our own basements. This mostly didn’t scale beyond the national, or even regional, level but then that is still not a big issue for most organisations today.

Then came the move to AWS and Azure, again because the business case made sense. A big part of this was Microsoft being good at selling their products in a bundle, so that getting into Azure sort of just made sense because of how well the IT-operations side benefitted from moving into things like Azure AD (now Entra ID for whatever reason), Azure Automation instead of scheduled tasks and so on. If your IT-operations infrastructure is already largely in Azure then your development will also be in Azure because it’s going to be such a tiny blip on your Azure bill that nobody will ever notice.

With the price hikes, and also just how complicated it is to actually run and maintain IT-Operations in Azure (I can’t speak for AWS) we’re now in a world where more and more organisations here in Denmark are looking to get out of Azure/AWS to reduce cost, and many have already made the move.

Of course no one is leaving Office365, so many of us are mostly just trying to broker better deals with Microsoft.


> With the price hikes, and also just how complicated it is to actually run and maintain IT-Operations in Azure (I can’t speak for AWS) we’re now in a world where more and more organisations here in Denmark are looking to get out of Azure/AWS to reduce cost, and many have already made the move.

It's an endless loop, I think. I've observed a very different behaviour in other parts of the world, where companies which either migrated away from the cloud already, or companies which never moved to the cloud to begin with, are now moving things into the cloud and other managed services in order to better focus on their core competencies.

Dropbox, for instance, shifted everything out of AWS in 2014 -- but are now moving a lot of their corporate infra back into AWS in order to focus on effective and efficient storage, which is what they're good at, without having to build everything else themselves. Likewise hedge funds - if you look at job postings for a lot of systematic trading firms (e.g. Jane Street, HRT, 2Sigma, etc.), you'll find loads of them are hiring extensively for people with cloud experience, because there's been a sort of wake-up moment where companies are breaking free of the idea that everyone can be Google, and realising that you can't build everything in-house if you only have 1000 employees.

So, interesting to hear of a move in the opposite direction in Denmark. I'm sure that ten years down the line, a new generation of engineers will be coming in and asking, "Why in the world are we managing all of this infrastructure on our own?"


Is colocation the cloud now?


At work, there is a push to move on prem applications to cloud but as much as possible without changing any of the code. We end up basically renting expensive instances for a significant period of time and recreating all the on prem services on them.

So anyway, more like the reverse: cloud is colocation but expensive.


I'd argue taking on-prem applications and putting them in large (expensive) virtual machines on AWS EC2 or Azure Windows VMs directly, while postponing or deprioritizing any refactoring, is not the cloud either.


Yes. And no. Depending on what you mean by "cloud", how much kit you have in colo, and how you manage the resources of that kit.

You can certainly use colo to provide the facilities most need from cloud infrastructure.

Just stuffing a server in someone's rack isn't cloud, but larger installations are often managed in similar ways.


I am curious, which companies in Denmark are trying to get out of cloud?


Basecamp/HEY is one publicly famous example.


Another example is Ahrefs


This is based on the assumption that Azure has modern tech, hires well, and offers better DR and resiliency than LinkedIn's "cloud" for LinkedIn's needs. There's a bit of a problem around incentives here, where Azure is built to sell to Azure's customer base, whereas LinkedIn has evolved their own stack over the years.

The questions become:

1. Does it make sense to dump our special features in the stack, or move them to a higher level in the stack?

2. Does Azure have comparable capabilities for the LinkedIn stack?

3. Is LinkedIn worth it to Azure to sell to?

---

Oftentimes, "at scale", you can support custom solutions outside of cloud providers that are purpose-built, and oftentimes more resilient and efficient than the cloud providers.

AWS has taken a very interesting approach of building an incredibly wide set of solutions to support every customer under the sun, and their approach to being "customer obsessed" leads to them building super niche solutions if the deal is worth it.

I'm not sure how Google and Azure handle these engagements.


Google doesn't acknowledge that the customer exists or has wishes.

Microsoft will allow you access to the PM of the product if you are big enough, but his answer will be a not-exactly-committing "in 3 years" because their roadmap is full.


Every once in a while Google acknowledges the customer exists, but usually for a brief moment, either to close their account, or to tell them they are using the product wrong. They promptly go back to work after such brief outings, so as to not fall into the mistake of actually listening.


> 3. Is LinkedIn worth it to Azure to sell to?

Microsoft owns both, so that question gets convoluted. It's not really about whether LinkedIn pays Azure enough to care.


Telco is a really good example of how hard it can be to move to the cloud no matter how hard you want to. I imagine financial services are similar. Telcos built their own clouds because there was no other way to get the primitives that they needed (classic case: a floating IP that could appear in data centers on different sides of the country and be ready to serve clients in the middle of a call in under 100ms without dropping the audio). I mean, cloud just was not built for that.

The flip side is that it is freakishly expensive to keep doing that, and telco is not a business that is freakishly profitable.

I'm watching efforts like Azure Operator Nexus (https://learn.microsoft.com/en-us/azure/operator-nexus/overv...) which looks to be taking a stab at this, but even there, it's more like OpenShift on specific hardware with some Azure dust, rather than a remaking of Azure primitives like redundancy for the telco universe.


Microsoft purchased tech and people from AT&T to help boost Azure's telco capabilities as a cloud solution[0].

I haven't followed what's happened since it was spun out, but I know that Network Cloud was heavily OpenStack based, so it would be interesting to see how that may have changed, but I no longer work in the telco space either.

[0]: https://azure.microsoft.com/en-us/blog/improving-the-cloud-f...


Telco used to be all for OpenStack. They have all started the shift to containers and Kubernetes.


> direct costs

> vs ""modern tech"" FOMO

Very tough decision, I'll need two MBA degrees to figure this out.


1000% Agree. So many ROI calculations I've seen in the past ignore key-person risks from internal systems, hiring premiums needed to keep specialized knowledge, and also carve-away infra human costs into other departments. Once you carve out enough stuff from the ROI, you can sway the ROI calculation heavily.

The folks who are too senior aren't sufficiently technical to understand. The folks who understand benefit from the key-person risk and premiums paid, and thus are disincentivized to challenge the ROI comparison.


> and also carve-away infra human costs into other departments

That's another great point; with the on-prem system where we run our own massive k8s/vm/data lake cloud, every team just throws things at it willy nilly and even though we know how to calculate total operating cost it's very difficult to set up backpressure to individual orgs and their budgets so there is quite a bit of waste I feel. With something like AWS the billing system is very detailed and is easy to attach to teams upfront. I'm not sure if it's good or bad that things are more efficient due to teams being incentivized to pay attention, but I kinda like it.


Key-person risk should be understandable, as a redundancy concern, to just about any exec…at least I would hope so.


It's not really 'the cloud' as much as it's a managed mainframe you allocate resources from. Only it's actually quite expensive to allocate resources but it becomes more palatable with a monthly bill compared to setting up on-prem.

Costs more money but easier on the cash flow.


Yeah, based on my own experience with AWS and Azure (which has nothing to do with LinkedIn), my immediate reaction to the headline was, "well, you can be keen on Azure, but 'stuck' on AWS for a myriad of other reasons". Reading the article pretty much confirmed it.


This describes GitHub to a tee.

Think very hard before you start building your cloud-native application on a specific cloud.


I'm not sure how this has anything to do with AWS. But I guess you can use it to confirm your idea about vendor lock-in if you're bound and determined to convince yourself you're right. For what it's worth, working at AWS, I see a lot of companies who use Entra for AD and then federate into AWS using that.


I think you misunderstood OPs anecdote and took it a little personally?

They were simply saying that much as you can love one platform, there are many valid reasons why your deployment on another makes it hard to shift that and recapture that configuration. Nowhere did they intimate it was a fault of the current platform's capabilities...


From that repo:

> At LinkedIn, we are focusing our efforts on advanced automation to enable a seamless, LinkedIn-wide migration from Rest.li to gRPC. gRPC will offer better performance, support for more programming languages, streaming, and a robust open source community. There is no active development at LinkedIn on new features for Rest.li. The repository will also be deprecated soon once we have migrated services to use gRPC. Refer to this blog for more details on why we are moving to gRPC.


If anything, this highlights the dangers of building your app around the details of a particular cloud infrastructure.


I guess in this case it was the other way around: building the cloud architecture specific to the needs of the app.


It’s not a good cloud if they built the wrong abstractions that aren’t generic


> I personally appreciate LI management's ability to be clear-eyed about whether the ROI was there.

Kind of a confusing statement, as they threw away tens of millions of dollars in a failed attempt to migrate.


The sunk-cost fallacy is a brutal one. Better to cut your losses than throw good money after bad.


Sunk cost doesn't necessarily mean good management.

It actually means the opposite. Instead of management identifying a boondoggle ahead of time they got the company into a mess.


This kind of sentiment always shows up after the fact but the hard part is actually knowing it ahead of time.

Emphasis on knowing. I’m sure some people thought it was a bad idea ahead of time. But there are always people who think any new idea or big change is a bad idea.

The reality is that anything significant has to be tried out, for real, to find out. That’s the premise behind proven useful decision frameworks like agile or MVP. And as the GP says: good management shows up when they read the attempt clearly and resist the sunk cost fallacy.

Bad management sounds like “let’s never change anything” or “we’ve already come this far, we need to finish it no matter what.”

And to a large enterprise, tens of millions of dollars is not an existential amount of money to lose on trying something significant. Big sports teams lose that much on bad player moves all the time. “We can’t ever waste money on trying new things” is a great way to sink deeper and deeper into whatever rut you happen to be in at the time.


> Emphasis on knowing

Literally your primary job as an executive. Strategic knowledge and wisdom

If wasting tens of millions on a bullshit initiative isn't an example of bad management then literally nothing is.

The only place these types of terrible managers can exist are too-big-to-fail companies like LinkedIn and the government, where there's zero repercussions for poor management decisions because money keeps pouring in.

In smaller leaner companies where money is actually important executives actually have to be good at their jobs.

> "We can’t ever waste money on trying new things"

There's a difference between a good strategic decision and starting a "facebook for pets".

Some ideas are terrible at the outset.

And executives' whole reason for existence is to make this determination.

Blueshift was an objectively terrible decision and whoever made that decision is objectively bad at their jobs.


It was MS buying LI and then wanting them on Azure. There's a huge Azure branding play here that - I'm guessing - drove this work. A distant second was some cost savings in the far future.

The hit to Azure's brand for this is probably significant. Who was looking at a migration to Azure before who's now discouraged?


> was MS buying LI and then wanting them on Azure.

LinkedIn's been extremely independent of Microsoft and still is, just like GitHub and OpenAI.

This was a 100% internal LinkedIn management decision.


Could LI have selected a competitor?


LI is a monopoly in professional social networking.

There are no competitors.


I meant a competitor to Azure


The only other option is their own internal stack.

I can't imagine, being owned by MS, that they would be allowed to use Google Cloud.


At least they canceled the project when it became apparent things weren't going to work out as expected.

I've seen massive projects that have been train wrecks since day one pushed through until the bitter end because management can't admit failure.


That's not good management though.

Recognizing you fucked up and only cost the company tens of millions instead of hundreds of millions is "Not Terrible" management.


It’s not perfect, but it’s good. Or better than spending more.


I'd like to be your financial manager.


We have that going on in Norway right now. Our state hospitals are switching several aging systems to some Epic-based monstrosity, and so far it's been a complete shit-show[1].

A major hospital has been running it for a year with lots of issues, some of which have certainly affected patients gravely. It was rolled out despite heavy protests from doctors and nurses who had seen or experienced the new system at the smaller pilot hospitals.

Yet management is pushing on, and just decided the next big hospital should get the new system this spring, despite the many severe issues that are still unsolved.

[1]: https://www.nrk.no/emne/helseplattformen-1.15838445


Any amount of employee time spent exploring options that aren’t pursued could be framed that way.

I think it’s a pretty successful operation when you can make the exploration and come to the conclusion that it’s not going to work, then pivot.

Sticking to the decision no matter what would be much worse.


I'd like to be your financial manager.


All a matter of perspective. Did you lose $10 mil to avoid losing $1 billion in wasted effort?

Totally worth it. Your entire org also learned in the process.


It's like buying both Blockbuster and Radio Shack stock while their stock was in freefall, then selling right before they declare bankruptcy and being proud of how good an investor you are for not losing more money?


I feel like you’re reaching really hard to make this work.

Companies, teams and people do research and test projects all the time to determine direction. Making a decision like moving the entire LinkedIn data center is just a larger version of that.


a management decision lost tens of millions of dollars.

you're saying they're good managers because they didn't lose billions.

going by your "logic" it's literally impossible for management to make a mistake because there's always some slippery slope they didn't slide down.

"""we put the company out of business but at least we didnt cause the collapse of matter into a black hole! we deserve some bonuses!!"""

this is how people who work at monopoly tier companies actually think lol


That's actually not what I said at all. What I said was...

> Any amount of employee time spent exploring options that aren’t pursued could be framed that way.

Employee time costs money. If 20 people are in a meeting for 1 hour, that 1 hour meeting costs the equivalent of 1 hour of each of attendee's salary. 20 people attend a meeting making $100 / hour, you just spent $2,000 for that meeting.

All time spent planning, also costs money. Every Agile "spike" to do research for a story costs money. Every architecture document. Every time you have somebody do a couple of prototype projects to test your options before you make a decision. Sometimes you can't easily find out what's involved until you really dive into the work. It's the nature of the beast.

For LinkedIn, purchased by Microsoft who also owns Azure, I'm quite certain that during the process they assumed consolidation within Azure would be on the table. If anything it was probably expected to be a huge marketing push for Azure to be able to say that LI ran on Azure, so it's likely a management decision that involved a lot of pressure to make it happen.

Being willing to abandon that decision after exploring to see what was involved is a big deal. In many companies you'd have just heard platitudes like "well we have to do it, find a way" perpetually extending the project while taking away from more valuable initiatives.

I used to work for a business telco that was acquired by a residential telco. They were so convinced they could roll this newly acquired company into their existing systems that they ran off all of the staff who built the systems. 2 years later, they finally realized that business telco was a lot more involved but by then it was too late.

I understand the thinking that got them there, but despite a lot of people trying to stop them they proceeded forward until their hand was forced. This LinkedIn decision could have gone much, much worse.


> being willing to abandon that decision after exploring to see what was involved is a big deal

Yeah, but spending several years and many millions of dollars isn't 'exploring'.

I'd say it's a solid fail for the management.

But we can agree to disagree on this. I see you strongly believe that losing tens of millions is good management.

For what it's worth, there are many posts nowadays on Blind about how the culture of LinkedIn has tanked, and I'm assuming management probably believes they're doing a great job in that area too.


You're really big on putting words in my mouth on this.

> I see you strongly believe that losing tens of millions is good management.

Costs are relative to size. There are a lot of people involved, it's going to cost a lot of money. Microsoft's operating expenses are estimated to be just shy of $350 million per day. At that scale, 10s of millions is not as impactful.

Could it have been done better? Of course.

Could it have been much worse? Yep. Significantly.


This was cancelled over a year ago - which the article notes - and is old news. It was clear the effort would have needed a very significant push that would have required a large halt in product development, and management wasn't willing to stomach it due to high growth in 2020/2021. Which made sense. But LinkedIn revenue growth has heavily slowed with the pullback in tech hiring, so they had the space to do it and consider it optimization time.

Also, as part of Blueshift, the plan was to do batch processing first, but LinkedIn had a cultural belief in colocation of batch compute & storage, which runs against the disaggregated-storage paradigm we see now. IMO this led to some dragging of feet.

Source: Worked at LinkedIn 12 years, am a director at Databricks now.


Not only that, but the Hadoop team literally had the guy who wrote the original HDFS whitepaper. Moving a service with that much in-house expertise first never made sense. I worked on one of the original Azure PoCs for Hadoop, even before Blueshift, and it was immediately clear that we operated at a scale that Azure couldn't handle at the time. Our biggest cluster had over 500PB and in total we had over an exabyte as of 2021 [1]. It was exorbitantly expensive to run a similar setup on VMs, and at the scale that we had I think it would have taken over 4,000 - 5,000 separate Azure Data Lake namespaces to support one of our R&D clusters. I believe most of this "make the biggest cluster you can" mentality was a holdover from the Yahoo! days.

[1] https://engineering.linkedin.com/blog/2021/the-exabyte-club-...


Since LinkedIn is using Kafka quite extensively, I'm wondering whether Azure's poor support of Apache Kafka is at least partially responsible for this.

Azure is developing its own inhouse message broker that they call EventHubs. And while it is able to speak the Kafka protocol, it has some weird limitations and quirks¹. I'm wondering why they are so hellbent on developing their own inferior version of something when they could just use the real deal and host it as SaaS.

¹ No support for KIP-71 (Enable log compaction and deletion to co-exist), a limit of 10 topics (Azure lingo: EventHubs) per broker (Azure lingo: EventHubs namespace), mandatory client config parameters that you must not forget to set, even though the documentation just calls them "recommended" (https://learn.microsoft.com/en-us/azure/event-hubs/apache-ka...), some weird protocol-level incompatibilities that aren't documented anywhere, for example a consumer group state transitions to "Dead" after the last consumer left instead of "Empty", and auto scaling that only ever scales up, but never down.
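(For context on what "speaking the Kafka protocol" means in practice: an ordinary Kafka client can point at an Event Hubs namespace with roughly the settings below. This is a sketch based on the public docs, with a placeholder namespace and connection string, not a statement about which extra parameters Event Hubs actually requires.)

```python
from confluent_kafka import Producer

# Placeholder values: an Event Hubs namespace exposes a Kafka endpoint on
# port 9093 and authenticates via SASL/PLAIN with "$ConnectionString" as
# the username and the namespace connection string as the password.
conf = {
    "bootstrap.servers": "my-namespace.servicebus.windows.net:9093",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "$ConnectionString",
    "sasl.password": "Endpoint=sb://my-namespace.servicebus.windows.net/;...",
}

producer = Producer(conf)
producer.produce("my-topic", key=b"k", value=b"hello from a plain Kafka client")
producer.flush()
```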


Lots of Azure stuff is like that. Some big services internally at Microsoft refused to use their Service Fabric product, it couldn't handle the scale & complex deployment needs.


Even at a very small scale I found their Service Fabric platform painful to work with. For comparison, AKS is also quite wonky, but miles ahead of SF.


Their hosted postgres was pitiful until very recently too.


> Azure is developing its own inhouse message broker that they call EventHubs

This seems to imply Event Hubs is some sort of upcoming beta or something. Event Hubs is one of the oldest Azure services - I think it launched in 2014?


I didn't want to imply that it's brand new or beta. Just that it is still being actively developed. For example, the compacted topic support did get into GA in May 2023.


You could still run your own Kafka cluster just using their VMs, right? Or is EventHubs cheaper?


well, looks like you know what you're talking about.



My company has invested in moving to Azure except where we need to stay on Google. Apparently MS gave us a package on all of their products if we use Azure and it was enough to sway the execs.

We were then given the directive that everyone at my level would need to get some certifications so we could properly use Azure, assist the architects and more jr devs. It’s a good idea but my god the training is so poorly executed. I want to like Azure but it also seems like an uncoordinated mess.

Maybe I’m just a grumpy dev. Anyone else have a better and more positive perspective? Who has good training for certs such as the Data Engineer or AI Engineer?


> Apparently MS gave us a package on all of their products if we use Azure and it was enough to sway the execs.

For a long time this is why Azure has been beating GCP. GCP has been adamant (not sure if they still are) that prices are “what you see is what you get” and hope that small projects will suck you in. Microsoft however would wine and dine execs and then offer a 50% discount on a xx million commitment.

I agree on training and the like. Azure feels like such a mess.


Can confirm that even where I work, a state government department in Australia, which would be considered a medium-size enterprise in most aspects, our standard price is a discount of just over 50% off the published price. When I want to calculate the price of a solution, I use their calculator, and then basically divide it in half.

I assume if you're bigger, you probably get an even larger discount.


> GCP has been adamant (not sure if still are) that prices are “what you see is what you get.

Yes and no. GCP is going after the non-cloud products instead of the big cloud providers now. So going after Snowflake customers with BigQuery.


I'm pretty sure it's Snowflake that went after BigQuery.


>I want to like Azure but it also seems like an uncoordinated mess.

It's literally a "me too" cloud.

Azure is notorious for not being "finessed" out and godspeed if you have to use AKS.


What are your main problems with AKS? I’m setting up a new greenfield project and have to use it. Would be nice to avoid some pitfalls.


I second this. Seriously, don't use AKS. It looks so nice in the docs. In practice, you'll run into so many issues, I'm convinced no one actually uses AKS in production for anything major.

If your apps are simple enough that you can get away with Azure Container Instances/Apps, use them.

If you need proper managed k8s, use OpenShift. Microsoft has a partnership with Red Hat and RH provides full support.


Some examples of those issues would be nice to back up your claim.



Alright that’s pretty bad but also 5 years ago.


Literally 4 weeks ago we had to migrate over to another host because we couldn't create another node pool.

I don't think the age of the article matters.


Interesting, thanks!


At my employer teams are preparing to extensively use hybrid AKS. I'd really appreciate hearing any of the issues encountered.


Microsoft uses AKS for their production workloads, but only for “minor” projects like “Office 365” which only has a “few” (200 million) users.

https://customers.microsoft.com/en-us/story/1536483517282553...


What on God's green earth makes you think MS is going to release their dirty laundry to the public about one of their products using another one of their own products?

Like there's a litany of literature about how AKS completely dropped the ball, you don't have to take anyone's word on it from this thread. The difference between AKS and other offerings is _striking_.


They come in different shapes and sizes. Stability of the host environment is the main thing. Don't have to take my word for it: https://movingfulcrum.com/horrors-of-using-azure-kubernetes-...


Architect here: I work in Azure; I love Azure; I totes feel your pain about Azure. Half of my solutions are hopscotching networks and NICs between services that should all just be one service. There are no obvious or repeatable patterns or configurations; every answer is "the right answer", which as an architect drives me up the f*king wall.

I know MSFT likes to sell that as "complete and flexible configuration" but more often than not it's like: if your prescription to use this one service is so rigid as to use these four other services in such a way, then why not just package them completely as such?


When I was last working with Azure it was impossible to just do one training every half year or so, because you needed three completed courses to complete the certificate, but they changed them every year, so you would need to take three different ones next year. Nonsense.


We are working to migrate from Azure to on-prem because it's so bad. It is an uncoordinated mess indeed.


Any move to any cloud is going to depend on the environment you're coming from. I've been in on decisions not to use X, Y, Z ... doesn't mean there was anything wrong with them, we just weren't ready for that yet or had different priorities or the ever present weird deal-breaker issue / requirement.


You have to

1) Be comfortable routing traffic between on-prem and cloud over the internet or at least over a VPN, and

2) avoid the temptation to build your own platform (Terraform templates are a liability, not an asset!) and

3) Move tiny stuff first and bigger stuff later.

It’s amazing how many companies fail at that.


Doing things the right way and learning from before usually requires a certain degree of humility, and more often than not the person leading these projects is either required to be, or is deluding themselves to be a hotshot who can succeed in a bold new way.


You mean somebody who posts 3 bullet points on the internet and claims that they solve everything ? ;)


I mean, we invented the term Promotion Oriented Architecture for a reason.

It’s like politics; the best person for the job is a sufficiently experienced person who does not want it.


Resume driven development is another similar paradigm, where they don't wait to be promoted but instead skip around companies.


Exactly. The cost to retool for the cloud is not insignificant.


This is why AWS is better. They dogfood when it is hard.

I remember when S3 and EC2 were coming out. They tried to make those of us working on the retail website move year after year. Our excuses became their roadmap. We (developers working on not-AWS) really didn’t want to move to AWS. It was 100% worse than what we had before.

“Network is slow”, “it’s not PCI compliant”, “the drive is unreliable”, “the load balancer is missing a feature”.

It took years and years. But they did it. Google and Microsoft don’t have the willpower to force this, and it's why they will always be behind AWS.

When Google tries to sell you Google Cloud, remember they don’t use it for anything internally. They don’t think it’s good enough. So why should I?


... You think Microsoft doesn't dogfood?

Where do you think Teams, Bing, OpenAI, the billion internal systems they use, and Xbox are hosted?


A lot of Microsoft products have had years-long complaints finally resolved in the post-Azure world.

It seems that while some of the features implemented pre-Azure were supported, they weren't fully working as intended because they were sets of features that weren't necessarily in use by Microsoft themselves. But they became an issue once Microsoft was interested in them working properly, after they started using that capability, or parts of the Azure infrastructure depended on it, or some big-money customers "'how about'ed" these features. I'm talking minor but important aspects of larger products, or interoperability.

Windows Server had a lot of life breathed into it post Azure. I'm too lazy to find specific incidents, or recall the past annoyances as a sysadmin of years past (PTSD?), but they absolutely dogfood and we are all better off about it. What they can never dogfood however is the pricing.

I'm in the k-12 education sector (almost 200,000 staff and students in our district) and we get what I would imagine is one of the sweetest of discounts, and if we were to move all of our operations to Azure, our TCO actually increases to maintain the same level of service and availability.

That being said, the performance hits aren't that terrible, but it's the little things that add up quickly at scale.

Of course, as publicly funded k-12, it's sometimes a decision that gets made when having to decide between buying new things for the data centre (capital cost) vs. subscribing to Azure (operational cost). The money for both piles comes from the same vault but the destination pile is loaded with implicit reasoning and excuses. For example, if we spend $200,000 a year on a service as a subscription, it's easier to get money for that versus requesting $400,000 on something that would last us 3 or 4 years.

It takes great leadership at the IT level to liaison this impact to the business movers and shakers, and sometimes that's not there.


Yeah reading that comment felt like the twilight zone. Microsoft is foremost a software company and they host everything on Azure. Amazon's software services offering is a speck of dust in comparison. Amazon cannot dogfood AWS more than Microsoft dogfoods Azure.


Well, to take it full circle, they don't dogfood LinkedIn on Azure. How about GitHub for that matter? Wouldn't that indicate Amazon can dogfood more?


I think those are different because they are acquisitions.

It doesn’t mean they will never use Azure, just that they’re being rational about what to use.

It would be very different if they designed GitHub today and didn’t use Azure.


Add to that, future product offerings determined by real world use cases based on the reasons why they don't or can't do it today.


I can’t imagine they use Teams internally.


Microsoft mostly gives different groups autonomy in what tools to use, where to host etc.

Most groups in Microsoft use the Microsoft stack as it's best supported and already compliant and works pretty well - so most groups use teams, office, Azure etc.

Groups are free to use other tech and may for example use zoom+slack (some groups do) or GSuite. This is also true for technology (e.g. a team can use yarn and flow instead of npm and TypeScript). It's just very rare they do.


This is not true. Microsoft uses Teams heavily internally. Zoom, Slack and others are banned unless it's a customer requirement.

Source: msft employee


I engage frequently with the Canada Education team and office 365 national reps and they are wonderful and the way Microsoft uses their own stuff to interact with their customers is a true blessing. Teach by showing instead of telling.


I think the point is that the scale of requirement intake that came into AWS from the website is orders of magnitude more than whatever level of dogfooding that Azure might undergo from internal solutions. Not that they don’t dogfood at all, only that the massive amount of engineers moving one of the most complex websites in world’s history into AWS, and AWS taking them in as real input into the product, made AWS a lot better than the old AWS.


AWS is extremely overengineered for startups. In order to create a simple SaaS on AWS you have to invest much more time compared to Azure. So, I only use their S3/Route53/Email services and do the rest on Azure.


We started on AWS in ~2014 but it got too complicated for us to tolerate. My latest AWS complexity trigger was trying to set up a public S3 bucket. It's almost like they want you to screw it up on purpose. We were mostly working with .NET/Windows Server so we looked at alternatives sometime around 2020.
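(For anyone wondering why that trips people up: making a bucket public now takes at least two separate steps - relaxing the default Block Public Access settings and then attaching a policy - and an account-level block can still override both. A rough boto3 sketch with a placeholder bucket name:)

```python
import json
import boto3

s3 = boto3.client("s3")
bucket = "my-public-assets-bucket"  # placeholder name

# Step 1: relax the bucket-level Block Public Access settings,
# which are enabled by default on newly created buckets.
s3.put_public_access_block(
    Bucket=bucket,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": False,
        "IgnorePublicAcls": False,
        "BlockPublicPolicy": False,
        "RestrictPublicBuckets": False,
    },
)

# Step 2: attach a bucket policy that actually grants public read access.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "PublicRead",
        "Effect": "Allow",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": f"arn:aws:s3:::{bucket}/*",
    }],
}
s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```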

Our stack today has us using AWS for domain registration & S3. We use Azure for everything else. We actually log into AWS by going to the Microsoft MyApps portal and authenticating via our AAD/Entra credentials. Microsoft's docs regarding how to set up SCIM/SAML auth to AWS are excellent [0].

In Azure, we use ~5 products: AAD/Entra ID, DNS, Azure Functions, SQL Server, Azure Blob Storage. That's it. There isn't really any VM/container presence in our go-forward infra. Everything is managed/'serverless' now. There are some VMs but they are supporting legacy systems that will eventually be turned off. We have ZERO custom networking. I couldn't explain how to manage an azure vnet to save my life. We don't need VPN tech anymore either.

GitHub Actions->Azure Functions is pretty much the end game CI/CD experience for me. I am not rocking this boat. I never want to try anything else. I've spent a decade of my life keeping some flavor of Jenkins alive.

Could we do all this "cheaper"? Sure. But at what cost? My mental state is something that is a non-zero factor in all of this. Keeping track of hundreds of moving pieces in production is simply overwhelming. It's unnecessary suffering. I need something I can explain in 20 minutes to a new hire if we want to remain competitive.

[0]: https://learn.microsoft.com/en-us/azure/architecture/referen...


Technically you could start off with just Lightsail if you want to start off as simple as possible.


But then you cannot update without tearing down and rebuilding. (Since it's based on Bitnami.) Unless there is something I'm missing.


>remember they don’t use it for anything internally

Google does use Google Cloud internally for some things. Source: I work for Google.


For anything critical?


You see a story how a company manages to build out their internal infrastructure such that there is no appreciable benefit to migrating the whole thing to the cloud, and your takeaway is that "AWS is better"?

I feel like you might not have read the article.


Microsoft is long game. Azure will be here long after AWS. Microsoft is like the Borg. Embrace, extend, extinguish. Resistance? Futile.


As someone who worked at LI.

They spent years and god knows how many millions TRYING to move to Azure with the Blueshift project... before pulling the plug. They hired armies of contractors.

They didn't stop by choice.

They stopped because their tech stack is a giant over engineered unmovable turd.


As a current employee, there's things I don't like, but the infrastructure is more custom than bad (far better than my last job)


Custom is usually bad, just takes longer to reveal the problems. Sometimes companies need to create custom things, the mistake is continuing to invest in them when a community project appears and doubling down for years until nobody in your org knows how it's done "normally" and nobody that you hire knows anything about your stack.


The entire company has no Q.A team.

The amount of bugs I troubleshot in that tech stack was staggering just for basic day-to-day stuff.

promotion based engineering means an engineer shits something out that looks cool in a demo then bails to the next cool demo project while the rest of us are stuck with the reality of the turd.


^ It’s not an Azure issue but a LinkedIn issue.


Didn't LinkedIn create Kafka? Was that some of the overengineering?


Kafka was made 15 years ago.


Microsoft bought Hotmail back in 1997. Hotmail was powered by Unix servers until 2004, despite MS's best efforts to transition to their own Wintel-powered backend [0]. These things take time.

[0] https://news.softpedia.com/news/Windows-Live-Hotmail-Was-Pow...


This doesn't appear to be about Microsoft's cloud but rather Public Cloud.

The whole migration of LinkedIn from their own data centers to the public cloud (Microsoft's) isn't going well.

It appears they are still going to operate on-premise for many things. Some things moving or have moved to the public cloud.

Isn't this more a shot at the public cloud for all the things than at any specific one?


Yes I came away with the same thing. It's The Register's modus operandi to use cheeky clickbait titles.


I don’t see anything that points to it being a general public cloud issue. And instead they talk about Azure software specifically as something that they couldn’t take advantage of, no?


I would not assume that it is a specific Azure problem from that statement. Many, many teams struggle to take advantage of cloud infrastructure because of habits and knowledge retained for operating the existing systems.

It’s possible that, given what they have, it’s simply best to keep it on premise - at least to some degree. That would likely not be true with a successful re-architecture, but not everyone is up for that.


It may not be about the teams. For example, when you control the data center you can do certain things around performance and scale you can't do in a public cloud.

There are so many unknowns about how things are setup that it's hard to know.


It's incredibly difficult for a mature software business to justify infrastructure and tooling investments. This is why we think that startups are a haven for modern tooling and the largest legacy firms are ... well ... difficult.

The last 15 years possibly broke this rule by virtue of low interest rates, enabling the justification of large internal teams focused on modernization efforts which sometimes went as far as moving the state of computing forward.

I wouldn't be surprised to see legacy enterprises return to form now that interest rates are 7%


On premises hardware can hardly be called legacy. Clouds are way too expensive for startups; they are a milking machine for the big corpo.


Startups are typically capex constrained, at least until series C. Clouds are favorable to capex constraints.

Having personally tried to be cheaper than the cloud in 2013 through large hardware buys, negotiated contracts, etc., I found that the ROI relative to a progressively discounting cloud wasn't there.


Maybe I'm a bit contrarian on this one but once I saw data center, Azure, and the phrase "lift and shift" it filled in a lot of context for me. I spent a lot of my early to mid career participating in these strategies. They don't work. VM images almost always are different in some way, there's something one vendor provides that another doesn't - in general there's enough minute details that add up to make a series of mini-mountains in terms of blockers.


Yep, there are always differences. Just one thing I stumbled into recently was one of our program images that has long worked fine in AWS can't start in Azure because something their hypervisor does to the virtual address layout conflicts with the way that we remap .text to a huge page. It is both trivia and a showstopper.


Yeah, there is a vast gulf between "it works for us" and "every dependency was implemented strictly according to open standards and is therefore seamlessly portable". See also the joke of migrating between "SQL" databases.


4 years ago: "LinkedIn is moving to Microsoft’s Azure public cloud three years after $27 billion acquisition"

https://www.cnbc.com/2019/07/23/linkedin-is-moving-to-micros...


Most cloud migration projects at large companies fail. It usually takes 3 or 4 tries at least before all the necessary lessons are learned.


Our customers are really good at making a total shitshow of cloud and our jobs very easy as a result. Much sentiment over cloud looks to me like stick-in-bike-wheels meme. It's hard because you are making it hard.

Strategically, I don't think you try to help someone with cloud until they've burned out their ego trying some grand delusion. Even one bad actor can throw the whole thing off. Cloud is about doing more with less, so committee bullshit is cursed.

Once you've got your customer completely humbled by reality, they'll be willing to listen to reason and you will save so much frustration.

We've got a huge one in the pipeline right now. They've been trying to "go to the cloud" for about 5 years now and executive management is ready to reset the entire tech org chart over the lack of progress.

Cloud native is the solution but many technologists perceive it as the end of their careers. Anyone pitching a "cloud native" solution that still has container spam managed by the tenant owner is either incompetent or trying to protect their career at this point.


Sad but very true. When careers have been crafted on the current architecture of a company, it's hard to shake.


My company migrated to Azure a few years back. They can give good bulk discounts, but on the flip side the experience with some of their infra and AKS has been choppy at times, with the support team taking time to fix or RCA hardware issues. They do come back though. Would love to know how this compares to long-term experiences with other major cloud providers.


I've been with all 3 major clouds and Azure is the only one that even feels like it has support. Even so, unless you are a big customer on some special agreement, you aren't going to get any red carpet treatment. We pay for the $100/m support plan right now and it's pretty goddamn mediocre. Maybe submitting tickets for "legacy" AD domain services outages doesn't touch the rockstar support team these days. Support quality is probably variable across product teams at some level.


How do you move that much data over to another cloud provider?

Without losing data or disrupting the customer?

Or do the databases just stay in the data center and not migrate.


Roughly: pick a date and start writing all new data to both, while running ETLs to backfill data from old to new. Once that is done, you use a feature flag to do a small % of reads from the new system and wait and pray. If nothing major pops up, you slowly ramp up the % of reads until you're confident (as much as one can be) that everything is working, then you move 100% of reads to the new system. Finally, you turn off writes to the old system, clean up the feature flag and remove any old unneeded code.

jk, of course you never do any of the other stuff after turning off writes to the old system; you just leave the FF turned on at 100% and never think about it again :-p
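(A minimal sketch of that dual-write plus ramped-read pattern, with entirely hypothetical store and flag names rather than anyone's actual migration tooling:)

```python
import random

READ_FROM_NEW_PCT = 5  # the "feature flag": ramp 1 -> 5 -> 25 -> 100 over time

class DictStore:
    """Stand-in for a real datastore."""
    def __init__(self):
        self.data = {}
    def put(self, key, value):
        self.data[key] = value
    def get(self, key):
        return self.data.get(key)

old_store, new_store = DictStore(), DictStore()

def write_record(key, value):
    # Dual-write phase: every new write lands in both systems, so the new
    # store stays current while backfill ETLs catch up on historical data.
    old_store.put(key, value)
    new_store.put(key, value)

def read_record(key):
    # Route a configurable percentage of reads to the new system and
    # compare/monitor before ramping the flag up further.
    if random.uniform(0, 100) < READ_FROM_NEW_PCT:
        return new_store.get(key)
    return old_store.get(key)

write_record("member:42", {"name": "Ada"})
print(read_record("member:42"))
```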


I did a migration like this before, intercloud between ES clusters. Just wanted to additionally confirm that this is broadly the way that I've seen it done before.

If you can afford downtime, or data loss that does make it easier. But that's not an engineering question, more a product one.


We (when I worked at LinkedIn) did it with ETL clusters; we had already built them out for moving data between datacenters nightly. They would mirror an HDFS cluster, then run batch jobs to transfer either directly to the outbound cluster or to another ETL cluster in another DC.

We used one of our ETL clusters to ship data to MSFT for various LinkedIn integrations, like seeing LinkedIn profile information in Outlook or Office products.


Which tools were you using for ETL? Or were they completely custom?


Live replicas (perhaps initialized with a cold backup, initially, if the dataset's really huge), carving off parts of it for separate migration if that's at all feasible, and some expensive folks doing a lot of butt-clenching-worthy activity for an hour or two (unless it goes very poorly...) for the final cut-over, some evening.


There are plenty of issues with Azure, but LinkedIn is hardly at the vanguard of innovation. And that was still the case before Microsoft vastly overpaid for it.


I left LinkedIn 1.5 years ago. I was there 12 years. I saw the revenue & profitability growth that occurred post acquisition. I am very very confident LinkedIn would be worth north of $100B on public markets today and Microsoft made the acquisition for $26B. You might argue that in the subsequent 6 years post acquisition that wasn't enough growth and they should have bought back shares instead but it was completely a debt financed acquisition and very high ROI for Microsoft.


Adjusting for inflation, 26B from 2016 is worth 32B now. Going by just market returns, 26B in the S&P 500 in 2016 would be worth 70B today.

Also, 26B is just the initial investment; MS surely invested more money in the division in the last 8 or so years. LinkedIn was not exactly a highly profitable entity in 2016, and while it was not burning a lot of money, the growth experienced in the last few years would have needed additional investment in the business.

I don't have a specific opinion on whether MS overpaid or not, just want to point out that even a 100B valuation today does not necessarily mean it's a high ROI for MS yet.
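(Rough arithmetic behind those two comparisons, with assumed annualized rates - about 3% inflation and about 14% S&P total return over roughly 7.5 years - chosen only to reproduce the ballpark figures above:)

```python
def future_value(principal_bn, annual_rate, years):
    # Simple compound growth: FV = PV * (1 + r)^t
    return principal_bn * (1 + annual_rate) ** years

years = 7.5  # mid-2016 acquisition to late 2023
print(f"Inflation-adjusted:  ~${future_value(26, 0.032, years):.0f}B")  # ~ $33B
print(f"At S&P-like returns: ~${future_value(26, 0.14, years):.0f}B")   # ~ $69B
```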


LinkedIn was already very FCF positive. They tightly managed margins to get to net income positive (accounting for dilution and so on) but it took maybe 2 years after the acquisition.


Of course, they were and are a healthy company. I was just trying to say that while very healthy, it was not so profitable that MS could have grown the business to today's size just by using the free cash flow; it likely required external cash infusion, and therefore $26 billion is likely not the only money MS has spent on LinkedIn.

As far as M&As go, it is a very successful outcome; the vast majority of them fail spectacularly. ROI is perhaps not the right metric to judge a strategic acquisition.


Financially, perhaps the numbers made sense. It was (and still is) a very basic social networking site with poor UX. Microsoft paid for a user base and a rudimentary surveillance system. It would have probably been better for them to just start from scratch. I don’t know a single person who likes LinkedIn. Most see presence on it as an obligation for certain industries. Most of the commenters on there are far worse than what you see on Facebook. It’s usually just one long stream of retired or underemployed old men complaining about “wokeness” and environmentalists ruining the world.


Sounds like they would've faced a similar set of issues moving to AWS or GCP.


Why would you ever want to move to cloud if you have a functional non-cloud setup?

Cloud platforms charge you much more than what the hardware is worth. The only advantages are that they provide a simplified resource management system (which is mostly an illusion, as any non-trivial system will require you to build in-house tech regardless) and the ability to scale more easily (which is done at a prohibitive cost, so a good reason to switch away if you need to scale up).


I wonder what sort of scale LinkedIn operates at in terms of server count.

And GitHub, also under Microsoft, seems to be doing fine with on-prem as well. Why force LinkedIn to use Azure?


When I was there it was in the low hundreds of thousands. Probably more now, as user base growth was still in double-digit percentages per year.


>When I was there it was in the low hundreds of thousands.

Blows my mind every time I see these kind of numbers.


Indeed. According to this in 2022[0], Stack Overflow is still running off of 9 web servers.

[0] https://www.datacenterdynamics.com/en/news/stack-overflow-st...


A social media org like LinkedIn is hardly comparable with an SO-style KB; the R/W workloads are much higher with social media than with a KB where the vast majority of transactions can be cached. Wikipedia would be a better comparison to SO.

Also LinkedIn is roughly 70x the size in users: 1B vs just around ~15M for Stack Overflow.


There are 1B users on LinkedIn? WOAH!

All of a sudden the scale makes much more sense, although I will have to fact check this because the number seems absurdly high for a professional platform.


Here you go, they hit 1B milestone last month

https://news.yahoo.com/linkedin-now-1b-users-turns-125845570...


Thanks. Considering Facebook only has 3B users worldwide, I'd say a social media platform with 1B users is extremely impressive.


DAU/MAU is the more accurate metric to track by. Not as many people would log in to LinkedIn every day; also, people create a profile and never log in again, or only log in when they are actively looking for a new job.

Facebook itself has had a significant loss of engagement in recent years. Instagram and WhatsApp are far more likely to have higher DAU than the FB main app.


Quick math: 10000 users per machine?

Not considering that 90% of all users are likely sleepy accounts.


Rumor has it, half of them are busy sending out all those millions of unread emails I have in my inbox, about how my “career is on a roll”.


FYI even at their scale the headcount cost is greater than hardware cost.


If I had to guess, there are hordes of businesses out there that maintain operations on prem, and a large lift like this is great for the resume.

Of course, I could also be entirely wrong, but I also am not going to pretend that IT resume padding then jumping ship and leaving a shart of an architecture behind doesn’t happen all the time in this industry.


$$$


Only really benefits a company if they lack the technologies or cloud solution.

Not like they are paying anyone to host something besides themselves.


I guess Azure has run out of hardware. AI requires too much computation power.


If their stack works fine, and I assume it is, why the fuck do they need to move it at all?


Well they decided they didn't need to. That being said, you can't necessarily wait until things no longer work fine before deciding to move. If you can reasonably anticipate a problem in the future, sometimes even if things work fine today you might be best off moving in order to avoid future pain.


>"you can't necessarily wait until things no longer work fine before deciding to move"

You mean cloud is the holy grail where everything works and everything else is a failure waiting to happen? I claim BS.


No I don’t mean that at all, which is why I didn’t say that.

What I mean is that in planning anything in tech you can’t always wait until problems arise before you take action to deal with them, particularly if the consequences of failure are extreme. The specifics of what is the right setup are going to depend heavily on the circumstances of the use case. Definitely being on cloud isn’t right for everyone, nor is trying to maintain on-prem hardware the right choice for everyone.


> why … do they need to move it at all?

Because they’re owned by Microsoft and it looks bad if Microsoft doesn’t dogfood its own cloud infrastructure. Plus it’d be more economical long term if they didn’t invest in a parallel cloud infrastructure. They will migrate eventually, just apparently it’s not ready for them yet.


>"it looks bad if Microsoft doesn’t dogfood its own cloud infrastructure"

Their cloud infra is already mostly Linux. A bought company which already works fine isn't going to add insult to injury.

>"Plus it’d be more economical long term if they didn’t invest in a parallel cloud infrastructure"

This is a big if. It may or it may not. Unless you are deep into their internal kitchen and have all the info, your suggestion amounts to nothing more than a wild guess.


LinkedIn doesn't have competent people. Anyone who has peeked behind that curtain sees they struggle with very simple things.


I don't have deep intel, but all the ex-LI people I've worked with or met seemed pretty competent.


Isn't LinkedIn owned by Microsoft?


> LinkedIn was having a hard time taking advantage of the cloud provider's software. Sources told CNBC that issues arose when LinkedIn attempted to lift and shift its existing software tools to Azure rather than refactor them to run on the cloud provider's ready made tools.

I think I need this translated back into tech-speak.


“Lift and shift” is a term for when you move to “the cloud” but really just replace your physical servers with clones in cloud VMs. It’s a relatively cheap (in terms of effort) way to get on “the cloud” but gains you basically zero of the benefits. The term’s in wide use, talk to anyone involved with cloud-anything and they’ll be familiar with it.

I’m not sure what else needs to be translated? Nothing, I think?


Lift and shift is a sales term to make it sound like the internal team is trying to over-complicate the migration. The sales guy will normally phrase it as "just lift and shift."


I love it. I could totally believe the etymology starts as slick sales persuasion trying to downplay the implementation difficulty of something that's being sold.

And then people also pick it up for non-persuasion, because it also sounds like a catchy name for an engineering approach we already had.

Of course it can still be used for persuasion for a while, but it will accumulate baggage over time as efforts linked to the term don't play out that way.


Thanks for the explanation, but no need to imply that someone is out of the loop if they didn't know it.

The term didn't sound familiar to me (though the concept was), and the term might not have been familiar to some others.

People might not want to contradict an assertion because of language like "The term’s in wide use, talk to anyone involved with cloud-anything and they’ll be familiar with it. [...] I’m not sure what else needs to be translated? Nothing, I think?"


> People might not want to contradict an assertion because of language like "The term’s in wide use, talk to anyone involved with cloud-anything and they’ll be familiar with it. [...] I’m not sure what else needs to be translated? Nothing, I think?"

> > LinkedIn was having a hard time taking advantage of the cloud provider's software. Sources told CNBC that issues arose when LinkedIn attempted to lift and shift its existing software tools to Azure rather than refactor them to run on the cloud provider's ready made tools.

The only other terms I can see that are jargon are "cloud provider" and "refactor", and those are already technical (more or less) so don't need to be translated into technical language.

As for the other bit, I just meant that it's a widely-used term so one may continue to encounter it in these contexts. It truly is ubiquitous in discussion of and around "enterprise transformations" to the cloud, and among cloud practitioners more generally, so anyone connected to that space will know what it means. It's also kinda already a technical term, in that developer/devops and SRE sorts throw it around and do mean a specific thing by it, which doesn't need to be translated for other technical folks in that area.


"Ten Thousand" https://xkcd.com/1053/

The original person might've instead asked for an explanation in a way that didn't come across as criticizing the article.

But probably best not to insist that everyone should already know the term; just explain it.


Yeah, you're probably right. Feedback received.


Lift and shift is a cloud migration strategy which involves moving your applications to the cloud with little to no modification. For example, you have an application running on a server in your data-centre, you then deploy a VM in the cloud with a similar spec and install the application.

It's usually done to avoid the engineering cost of making the services more cloud native. What tends to happen a lot is that after a considerable portion of the migration is completed, the cost of the lift-and-shift effort starts to overtake the savings, and the projected costs dwarf the future savings.

I suspect this is what happened with Linkedin.


Which savings? It's never been obvious to me that cloud is cheaper if you're a large company.


It really depends on workloads. Imagine you need massive spikes of compute for, say, flash sales, or people watching the Super Bowl on your streaming service. Buying all that hardware for just the spikes might not make sense vs. just scaling up VMs in a cloud provider and scaling them down again.

In the real world, for baseline load, the big advantage for many large companies isn't price, but the massive lack of alacrity of many inhouse ops teams. If it takes me 3+ months to provision compute for the simplest, lowest demand services (as is the custom in many large companies full of red tape and arguments about who bears costs), letting teams just spin up anything they want and get billed directly is often a winner, even if it's more expensive. Having entire teams waste months before they can put something in prod is a very different kind of expense in itself.


The simplest example is if you have on-prem hardware, you need to have capacity for your peak load. In a lift and shift, you would replace your fleet of 96 core xeons with a fleet of 96 core xeons in AWS.

The cloud native approach would be to modify your app so that it can be scaled up and down so you keep a few machines always running, and scale up and down with your traffic so you only run at capacity when you need it.
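
For a rough sketch of what that "scale with traffic" half looks like in practice (a generic AWS example with made-up names, not anything LinkedIn-specific): once the app tolerates instances coming and going, a single target-tracking policy does the resizing for you.

    # Hypothetical sketch: let the autoscaling group track CPU instead of
    # sizing the fleet for peak. Group and policy names are made up.
    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-west-2")

    autoscaling.put_scaling_policy(
        AutoScalingGroupName="web-fleet",
        PolicyName="cpu-target-tracking",
        PolicyType="TargetTrackingScaling",
        TargetTrackingConfiguration={
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            },
            "TargetValue": 60.0,  # keep average CPU around 60%
        },
    )

The hard part is never this call; it's making the application stateless (or state-aware) enough that an instance can disappear mid-traffic without anyone noticing.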


The hiccup with this arithmetic is that the big cloud providers charge 7x to 10x the price you’d pay for an on-premises VM.

Sure, sure, you’re about to say something about discounts? Granted, that’s available, but only for commitments starting at one year or longer!

Okay, fine, I actually agree that there are savings available by reducing head count. The entire network and storage teams can be made redundant, for starters. Even considering that DevOps and cloud infra engineers need to be hired at great expense, this can be a net win…

…but isn’t in my experience. Managers are unwilling or unable to make many people redundant at once, so they stick around and find things to do…

… things like reproducing the mess that kept them employed, but in the cloud.

I’m watching this unfold at about a dozen of my large enterprise customers right now.

Got to get back to work and send the fifty-seventh email about spinning up a single VM. Got to run that past every team! It’s no rush, it’s only been about fourteen months now…


This doesn’t demonstrate anything about the savings.

Anecdotally, when my previous company was looking at costs, cloud unequivocally came out significantly more expensive, and that wasn’t even a large company (only 2,000 or so employees).

I will grant that we did not have globalization problems to solve (but I’d also wager that lots of businesses prematurely “what if” this scenario anyway).


> This doesn’t demonstrate anything about the savings.

If you need 4 CPUs for your peak load for 4 hours per day, and only 1 of them for the other 20 hours a day, you can save by scaling down to 1 CPU for roughly 83% of the day.
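
Back-of-the-envelope, with a made-up hourly rate (only the ratio matters):

    # Peak-sized fleet all day vs. scaling down off-peak. Rate is hypothetical.
    rate = 0.05                       # $ per vCPU-hour (made up)
    static = 4 * 24 * rate            # 4 CPUs around the clock
    scaled = (4 * 4 + 1 * 20) * rate  # 4 CPUs for 4h, 1 CPU for 20h

    print(static, scaled)             # 4.8 vs 1.8 -> ~62% less per day

That saving is before any per-unit price premium, which is a separate question.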


This assumes a lot about the cost of those CPUs and related resources in the respective environments. Cost per equivalently performing unit on a managed server vs. cloud instances is often vastly different.

It's also extremely uncommon to have loads that spiky.

And when you do, hybrid is often a solution (use a provider that can provide colo or managed servers for your base load and cloud instances for your peaks, or scale across providers).


It's pretty hard to capture the nuance of any possible solution in 2 paragraphs without someone coming along and picking it apart. The guy I replied to didn't know even the most basic information.

Even at that, you said yourself that you can use "cloud" to scale into your spikes.


Yes, but ironically being prepared to handle spikes with cloud tends to make it even less cost effective to do so, because it means you can plan for far higher utilisation of your on prem/colo/managed servers with little risk.

It takes very unusual load patterns for cloud to win on cost. It does occasionally happen, but far less often than developers tend to think.

There are many reasons to choose cloud services, but cost is almost never one of them.


> It does occasionally happen, but far less often than developers tend to think.

Guy asked what is a cloud workload, I responded. Nitpicking every tiny detail doesn't help.

> There many reasons to choose cloud services, but cost is almost never one of them.

It's cheaper to pay me to manage IAM roles for lambdas and ECS instances for 5% of my time than it is to pay someone full-time to manage some sort of VMware or other system. It's easier and cheaper to find someone with AWS experience who can provide value to the team and product than it is to find someone who can manage and maintain a cobbled-together set of scripts to update apps. There are click-and-go options for deploying major self-hosted services like Grafana and k8s, with secure details, that I can use without spending any time (and time == $$$) learning about the developer's preferred deployment scheme.
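
For a sense of scale, "managing IAM roles for lambdas" at that 5%-of-my-time level is mostly small, boring pieces like this (names are hypothetical; a sketch, not anyone's actual setup):

    # Minimal execution role a Lambda can assume, plus the AWS-managed
    # policy that lets it write CloudWatch logs. Role name is made up.
    import json
    import boto3

    iam = boto3.client("iam")

    trust = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }

    iam.create_role(
        RoleName="report-generator-role",
        AssumeRolePolicyDocument=json.dumps(trust),
    )
    iam.attach_role_policy(
        RoleName="report-generator-role",
        PolicyArn="arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole",
    )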


This isn't nitpicking, it's why the cloud option is very rarely cheapest. I've costed this out for many organizations over the years and tested the assumptions.

> It's cheaper to pay me to manage IAM roles for lambdas and ECS instances for 5% of my time than it is to pay someone full-time to manage some sort of VMware or other system.

True, but it's a false equivalence, and one I often see used by people unaware of how easy it is to contract this out on a fractional basis.

I used to make a living cleaning up after people who thought cloud was easier, who ended up often spending a fortune untangling years of accumulated cruft that just never happened for my non-cloud customers.


Compute is at a premium, but you can shift opex/capex around which might be more suitable. It can also be cheaper in headcount since you need fewer operators and less expertise in datacenter operations.


> you need fewer operators and less expertise in datacenter operations

Because you are paying someone else for them.

This is considered rational because those operators are presumably more productive in a pool of people using similar skills to support many customers rather than just one. It is similar to hiring a cleaning service rather than employing individual cleaners in a department of cleaning because cleaning things is not a core competency of business.

It might be less rational if some amount of compute is part of the core competency of the business. Since "software is eating the world," compute is a core competency of all businesses except for the ones that don't realize it yet.


I think there is a difference in competency between using Cloud Compute and running your own Datacenter. Perhaps some companies have the overlap, but I suspect this is an additional skillset they need to cultivate to achieve the savings.


> It can also be cheaper in headcount since you need fewer operators and less expertise in datacenter operations.

I've not really seen this work out well. I think it might be true for simple set-ups, letting a tiny developer team also handle infra and support without going nuts doing it, if they set it up that way from the beginning, but more-complex setups always seem to have so damn many sharp edges and moving pieces that support ends up looking similar to what a far more DIY approach (short of building one's own datacenter outright) would, in terms of time lost to it.

... and so does downtime, for that matter.


When I've done devops consulting, clients with cloud deployments invariably spent more money on devops because of the often significantly more complex environments.


It's at least more predictable. You don't pay for staff with datacenter skills (which are somewhat in short supply), you don't need to make large investments early on to build the datacenter, and you don't have a huge headache if you need to scale operations up or down.


It’s easier to scale cloud infrastructure.


Even if you never need to scale, it's cheaper not to have to physically maintain your own data center. If the broken servers, building power, building internet, access control, real estate costs... are all handled in the cloud, there are savings there as well.


But those costs don't go away -- the cloud provider is going to charge you for them, along with a premium for profit?

I'm used to organizations moving out of the cloud when they realize that it's more expensive if you don't have very peaky load demands.


Very few people who don't use cloud use their own data centers. The dominant alternative to cloud is colo or managed servers.

And it's difficult to make that as expensive as a cloud deployment.


But that's somewhat negated if you lift and shift, because your application is not designed to leverage that capability in that way.


Although, it must be unusual, right? This is not one company porting their service to the cloud, this is Microsoft porting their LinkedIn service from whatever servers came along with LinkedIn, to their own servers, on which they also run a cloud business.

Which… isn’t to say anything about which way we should expect that to swing things. But it seems quite unusual, as most companies have not been bought by a cloud provider. Yet…


> this is Microsoft porting their LinkedIn service from whatever servers came along with LinkedIn, to their own servers

Nope, LinkedIn executes completely independently.


That is not really true. If it were, LinkedIn would have evaluated all the clouds in a fair process and gone with AWS as the best fit in 2019, if they had wanted to move in the first place.

The pressure to move to a cloud, and to Azure specifically, comes entirely from MS. LinkedIn was perfectly happy, then and now, running its own setup; these 4 years of trying to move are because of MS ownership.


If your architecture is chatty enough, you will be sharding things so that most traffic stays in one rack, room, or data center.

If you treat us-west-1 as a single data center, you may find you are spending a lot on traffic between AZs.

A lift and shift might treat us-west-1 like a single data center. A more sophisticated strategy might treat it as three.
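
Rough illustration of why that matters (cross-AZ transfer is billed per GB in each direction, on the order of a cent per GB; exact rates vary, and the traffic figure here is made up):

    # Chatty east-west traffic that was "free" rack-to-rack becomes a
    # metered line item once it crosses AZs. Numbers are illustrative only.
    gb_per_day = 50_000             # hypothetical internal chatter
    usd_per_gb_each_way = 0.01      # ballpark cross-AZ rate per direction

    daily = gb_per_day * usd_per_gb_each_way * 2   # billed on both sides
    print(daily, daily * 30)        # ~$1,000/day, ~$30,000/month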


There is no such thing as "lift and shift". It is something Azure account reps like to say to make it sound like moving is easy. It sounds like you're picking up some boxes from one side of the room and moving them to the other. When in reality you're rewriting your infra code mostly from scratch.

When we were acquired by MSFT we had the same project. We had to move from AWS to Azure. I made them all stop saying "lift and shift" because in reality it is "throw away all of your provisioning code and rewrite it using Azure primitives which don't work the same way as AWS ones".

It is more akin to writing an iOS app to work on Android.


"Lift and shift" isn't just an Azure-specific phrase. Many people use it pejoratively, and point to it as an anti-pattern, and something to avoid.

Similar terminology is "forklift"... been hearing that one for well over a decade.

Migrations are oftentimes an opportunity to revisit scaling, configuration, build and deployment pipelines, platform primitives, etc. Every migration I've been involved in has a (probably necessary) tension between getting the job done efficiently, while not repeating all the mistakes of the past.


“Lift and shift” came into the conversation once we started talking about how we were paying too much for AWS. The obvious stuff was things like less bin packing, and bandwidth for third party services, like telemetry dashboards.

And it's not just the service fees. I blanch to think of the opportunity costs we accrued by focusing for that long on infrastructure to the exclusion of new product and features. It's truly breathtaking.

And then there’s the burnout, and the ruffled feathers.


I've become convinced that most migrations are absolute losers in terms of opportunity costs.

Even if done skillfully with valid rationale, they don't show any value until you come out the other side successfully.


Definitely. We migrated to a new telemetry vendor and I'm pretty sure it'll take 10 years for us to recoup the cost savings in man power and opportunity cost.

They were worried the old vendor might go under. My own track record with predicting company failures is pretty bad, so I suspect they'll still be around ten years from now.


The IaC has to be rewritten, but often the application itself needs major modifications due to pervasive use of proprietary managed services. Vendor lock in is a major problem. It's almost never simple unless an app is entirely running on a standalone VM. And if it is, you're probably wasting money running it in the cloud anyway...


I'm gonna bet that many Azure customers had no such thing as "provisioning code".


But lift and shift is not that, is it? It's having applications running directly on the OS (without containerisation or separation of dependencies like the database or physical disks) and moving them to "the cloud" to be run on a VM in the same fashion.

I mean, if you're already with AWS using their services (besides EC2 for hosting) such as RDS or S3; moving to Azure SQL (or DB for MySQL or whatever) and Blob Storage is not just lift-and-shift anymore, since you are actually changing from a cloud provider to a different one.

AFAIK an actual migration to the cloud would involve rewriting some parts of the application to be cloud-native, such as using Service Bus for queues instead of a local Redis/RabbitMQ instance, using GCS instead of local disks, and using RDS instead of hosting your own single MySQL server.
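
To make "rewriting some parts to be cloud-native" concrete, here's a toy sketch of the kind of change involved: the same enqueue call, with a self-hosted RabbitMQ swapped for a managed Azure Service Bus queue (queue name and env var are hypothetical):

    import os

    # Before: publish to a RabbitMQ broker you run and patch yourself.
    def enqueue_local(payload: bytes) -> None:
        import pika
        conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
        channel = conn.channel()
        channel.queue_declare(queue="orders")
        channel.basic_publish(exchange="", routing_key="orders", body=payload)
        conn.close()

    # After: publish to a managed queue; no broker of your own to operate.
    def enqueue_managed(payload: bytes) -> None:
        from azure.servicebus import ServiceBusClient, ServiceBusMessage
        conn_str = os.environ["SERVICEBUS_CONNECTION_STRING"]
        with ServiceBusClient.from_connection_string(conn_str) as client:
            with client.get_queue_sender(queue_name="orders") as sender:
                sender.send_messages(ServiceBusMessage(payload))

The code diff is small; the real work is everything around it: retries, dead-lettering, auth, local dev, and the dozens of other places that assumed the old broker's semantics.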


There's no formal definition of "lift and shift", certainly nothing that would dictate specific virtualization strategies.

I've always read it as being roughly analogous to "like for like," and dependent on the specific circumstances and status quo.


To be fair, AWS also used the exact term when we moved a project out of a tiny, expensive-to-operate (through lack of scale) datacenter that hadn't been retired only because we had a 30+ year old COBOL app suite on a z system.


ELI5: For any sufficiently complex enterprise system (e.g. LinkedIn, or Google), any plain vanilla architecture is infeasible for lift & shift. Moreover, the vanilla services may not comply with internal security requirements, or play nicely with internal CI/CD tools, or internal databases / data structures / data processing pipelines / analytics.


No five year old is going to understand a single sentence of that.


You spend a week building a castle with legos, and suddenly your mom asks if you can change some parts to use the new <lego competitor>. You can try to make the old and new parts fit together, but it isn’t going to be easy most of the time, and you can’t be certain that the lego competitor will have the same pieces or do the same things as your lego version. By the time you are done redoing those parts, you’ll end up having to recreate large portions of your castle to make everything work together again, and even then you might miss something important that breaks a functionality of your castle.


Maybe in this case "ELI5" stands for "Explain it like I have 5 yrs experience in software development but gave up in year two"?


ELI40YO Engineer


Still need it dumbed down a bit /s


Arguably, if the five-year-old were reading Hacker News, they might. :) Point taken, though, but honestly this doesn't seem like the place to simplify things quite to the five-year-old level.


My five-year-old loves her complex enterprise system.


That's all correct in theory, but in my experience these things still happen and are usually outgrowths of bad engineering culture / shadow IT / not wanting to be reliant on your cloud infra / platform team (often for irrational reasons, sometimes not). They get built with entire teams taking responsibility on paper, but then before you know it, nobody from that team still works at the company or on that team. Usually these systems are also GDPR nightmares if they contain user data, because these people don't understand when you tell them they need to have a plan for deleting user data. They don't even consider it a legal barrier, they think you're putting stones in their way.

I've been on enough Cloud Archeology expeditions into the land of VMs where nobody knows what they do that it might as well be my job title now.


Is that supposed to make it seem better?

If refactoring is too hard for a Microsoft owned company what am I to think about my tech stack?


Beyond ludicrously small systems, refactoring of live 24/7 production systems is never easy.

Reality has a surprising amount of detail, and any non-trivial, customer-facing system will have accumulated weird code paths to account for obscure but nonetheless expensive edge cases. A codebase built across >20 years, scaled to support millions of concurrent users is going to be absolutely filled to the brim with weird things.

When you add the need for live migrations with zero downtime, done every few years to account for next order of magnitude loads, you end up with a proper Frankenstein's monster. It's not called "rebuilding an airplane while flying" for a lark.

Every round includes a long, complex engineering effort of incremental live migration, with parallel read/write patterns between old and new systems and all their annoying semantic differences. And then, to add insult to injury, while your core team was going through the months-long process of migrating one essential service, half a dozen upstream teams have independently realised they can depend on some weird side effects of the intermediate state and embedded its assumptions into a Critical Business Process[tm] responsible for a decent fraction of your company's monthly revenue. Breaking their implicit workflow will make your entire company go from black to red, so your core team is now saddled with supporting the known-broken assumptions.
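
A minimal sketch of that parallel read/write phase, with entirely hypothetical store interfaces (real migrations layer backfills, consistency checkers, and feature flags on top of this):

    import logging

    log = logging.getLogger("migration")

    # Toy dual-write/shadow-read wrapper used mid-migration: writes land in
    # both systems; reads are served from the old one while the new one's
    # answers are compared and logged. All names here are hypothetical.
    class DualStore:
        def __init__(self, old_store, new_store):
            self.old = old_store
            self.new = new_store

        def write(self, key, value):
            self.old.write(key, value)           # source of truth, for now
            try:
                self.new.write(key, value)       # best-effort shadow write
            except Exception:
                log.exception("shadow write failed for %s", key)

        def read(self, key):
            value = self.old.read(key)           # still served from old system
            try:
                if self.new.read(key) != value:  # shadow read, compare only
                    log.warning("mismatch for %s", key)
            except Exception:
                log.exception("shadow read failed for %s", key)
            return value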

Then you get to add wildly differing latency profiles to the mix. While you were running on your own hardware, the worst-case latency was rack-to-rack. Implicit assumptions on massive but essential workloads may depend, unknowingly, on call-to-call latencies that only rarely exceed 100 microseconds. In a cross-AZ cloud setting you may suddenly have a p90 floor of 0.2ms. A lot of software can break in unexpected ways when things are consistently just a little bit too slow.
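
The compounding is easy to underestimate for anything that fans out sequentially; a quick illustration with made-up call counts:

    # Made-up numbers: one user request triggering sequential internal calls.
    calls = 400
    rack_to_rack = 0.0001   # ~100 microseconds per call on-prem
    cross_az = 0.0002       # ~0.2 ms p90 floor across AZs

    print(calls * rack_to_rack * 1000, "ms")  # 40 ms of pure network wait
    print(calls * cross_az * 1000, "ms")      # 80 ms -- same code, visibly slower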

Welcome to the wonderful world of distributed systems and cloud migrations. At some point the scars will heal. Allegedly.


That is tech speak: they tried to redeploy their existing architecture into Azure and it failed.


The headline notwithstanding, this doesn't seem like anything particularly Azure specific. They'd likely have had many of the same issues trying to mostly lift and shift to any of the big public cloud providers.


It could mean multiple things. My guess is they used vendor-specific services that don't translate as well as basic building blocks like vanilla S3/EC2.


I think this is an awkward way of saying they tried to add an abstraction on top of their AWS dependencies so that their services would work on Azure without a refactor.


They don't use AWS, but primarily bare metal.


Also, this reminds me of the time Microsoft bought Hotmail and couldn't port it to WinNT. They had to leave it on its BSD variant for a long time; NT couldn't handle it.


Arstechnica forum discussion on the topic in 2001:

https://arstechnica.com/civis/threads/i-thought-hotmail-was-...


I'm surprised it can today.


I wonder how the GitHub to Azure migration is going


I have it on good authority they’re trying a lift and shift too and it’s not going well, at least as of ~9mo ago.


[flagged]


Having used Azure after long stints with AWS and GCP, I found Resource Groups a much more logical bucket, more so than anything on AWS/GCP.


Have you ever set up an AzureAD tenant that can be used with Auth0 to validate any Office365 user (without setting up a specific connection to their own AD/tenant)? I'm having trouble doing this, and so any time I'm on a forum where someone seems to exhibit some experience with Azure, I ask for help. Can pay. DM me if interested, pls.


> pls

Lmao, almost spit out my coffee. Sorry, hope you find help.


AWS doesn't have resource groups because AWS regions are purely independent, and it tries to maintain that as a tenet.

Since Azure resource groups can contain resources that span regions, Azure violates this tenet, or maybe it's just not a tenet for them.


Unfortunately, resource groups seem to be the only thing going for it (IMO, of course). AWS does have resource groups, but I have not used them.


In AWS, you can pay for "premium support" and get help from them for issues like this. You can even pay, get help, and then cancel the sub; it's like $50 for the month.

Maybe azure has an equivalent.


Resource groups are fine, but it is quite unclear why a group has a region when the resources within it may be in other regions.


The object has to be in a database in one of the regions I guess.


That is correct. Though all that metadata is replicated


Is there a case where it matters to the user which region their group is in if all of the resources live in a different group of regions?


For public regions, I don’t think so. At worst, there might be a tiny bit of extra latency as the orchestrator works across datacenters. But region is also used by Azure Stack and the rules might be different then. I worked on this a number of years ago and if you are really concerned, you should ask support.


I liked Resource Groups too, but not so much that it changed my (extremely negative, painful) experience using Azure over AWS and GCP.


And yet it's Microsoft's biggest department, with the most revenue.


Yes, a product can be bad and a commercial success.


And that's because MS sets up the reporting for it to look that way. You can thank O365.


It's been a while since I tried it, but I really like that the UI doesn't "cheat" with internal APIs.

Sure, the UI is horrendous, I'm not going to defend it, but you can open the web inspector, see what it's doing, and use the exact same APIs in your code. I can't say that for AWS.


Isn't that the norm nowadays? I know some early AWS products had logic just for the console, but I also know that that was seen as a huge mistake. I assumed that was industry wide.


Seems like it’s still a bit of an issue. This [1] is rather a long read, but definitely worth it if you are interested in platforms.

[1]: https://gist.github.com/chitchcock/1281611


That is a pretty old rant, at this point. Amusingly, Google+ doesn't even exist anymore. :D


Maybe I'm odd, but even having more experience with AWS, and even with the UI on AWS being superior, Azure's jargon and organization seem easier to learn than AWS's.


Me: I don't get how the GUI is confusing.

Seriously: don't use the GUI. Do you use the GUI for anything else?

Use Az PowerShell or the Az CLI.


GUIs are nice for setting up experiments, seeing what you can set up, and changing stuff quickly. For example, setting up a network and a subnet from scratch, creating a VM, connecting to it from the browser and setting things up. Could you do that with PowerShell, the CLI, ARM templates or Terraform?

Yes, but it'd be slower, since you'd have to go through the docs to find the name of what you're looking for and type it in. Then type another command to see if the changes were applied. Another one to test the result... A huge PITA for experimenting.

Of course, for actual production use, you should definitely have some sort of IaC set up and use that. But for testing or home use (see the tech guy that uses a storage container and Front Door for a static website) the GUI is good enough.


It may surprise you to discover that yes – people use GUIs for things.


Or annoyingly, the GUI is the only way.

Case in point, if you are using Microsoft 365 and Exchange Online, you can do a lot of administrative tasks via PowerShell modules. But if you want to run a report on how many emails a set of mailboxes have received in the past 30 days, the message trace PowerShell command can only do up to the past 10 days. Anything beyond 10 days requires going into the Exchange Online admin portal and requesting a report that Microsoft will generate for you several hours later.


Could you script and schedule, say, a daily 1-day report, store the result, and have a script to get the last 30 (or however many you want) days?


That is possible. It's annoying that you can get the last 10 days in a second but need to duct-tape something together if you never want to touch the GUI, and then, instead of a few hours, you have to collect the data over the entire 30 days (and only after you start doing this).



