They sell servers, but as a finished product. Not as a cobbled-together mess of third-party stuff where the vendor keeps shrugging if there is an integration problem. They integrated it. It comes with all the features they expect you to want if you're building your own cloud.
Also, they wrote the software. And it's all open source. So no "sorry, but the third-party vendor dropped support for the BIOS". You get the source code. Even if Oxide goes bust, you can still salvage things in a pinch.
Ironically this looks like the realization of Richard Stallman's dream where users can help each other if something doesn't work.
It's a huge deal. I'm biased, though, because my own takes on how things should evolve were very similar. I was, however, completely unsuccessful in getting those ideas into production! And that, that is a huge deal. Throughout my career it has been interesting to meet people with great ideas who are unable to get them into production; then, when the idea finally does make it into production, everyone feels like "Wow, this is so obvious, why didn't we do it sooner?" while some folks are banging their heads against the wall :-).
One of the more interesting discussions I had during my tenure at Google was about the "size" of the unit of clusters. If you toured Google you got the whole "millions of cheap replaceable computers" mantra. Sitting in Building 42 was a "rack" which had cheap PC motherboards on "pizza dishes" without all that superfluous sheet metal. Bunches of these in a rack and a festoon of network cables. What are the "first class" elements of these machines? Compute? Networking? Storage? Did you replace components? Or a whole "pizza slice" (which Google called an 'index' at the time). Really a great systems analysis problem.
FWIW I'm more of a "chunk" guy (which is the direction 0xide went) and less of a "cluster" guy (which is the way Google organized their infrastructure). A lot of people associated with 0xide are folks I worked with at Sun in the early days, when the first hints of "Beowulf" clusters vs. "supercomputers" were appearing and the question was whether memory was one thing (UMA) or varied from place to place (NUMA). I have a paper I wrote from that time about "compute viscosity", where the effective compute rate (which at the time largely focused on transactional databases) scaled up with resources (more memory, more transactions/sec, for example) and scaled down with viscosity (higher latencies to get to state meant fewer transactions/sec). Sun was invested heavily in the TPC-C benchmarks at the time, but they were just one load pattern one could optimize for.
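To make that "compute viscosity" idea concrete, here is a minimal toy model (my own sketch with invented numbers, not anything from the original paper): throughput scales with how much state is cheap to reach and is discounted by the latency of everything else.

    # Toy model of "compute viscosity" (hypothetical numbers, illustrative only).
    # Effective transaction rate rises with resources (how much state is resident
    # locally) and falls as the latency to reach the rest of the state grows.
    def effective_tps(base_tps, resident_fraction, local_latency_us, remote_latency_us):
        # Average cost to touch state, weighted by how much of it is local.
        avg_latency = (resident_fraction * local_latency_us
                       + (1 - resident_fraction) * remote_latency_us)
        # Viscosity discount: more latency per access means fewer transactions/sec.
        return base_tps * (local_latency_us / avg_latency)

    print(effective_tps(100_000, 0.50, 1, 100))  # ~2,000 tps: half the state is far away
    print(effective_tps(100_000, 0.95, 1, 100))  # ~16,800 tps: more memory, less viscosity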
These guys have capitalized on all that history and it is fucking amazing! I just hope they don't get killed by acquisition[1].
[1] KbA ("killed by acquisition") is a technique where people who are invested in the status quo and have resources available use those resources to force the investors in a disruptive technology to sell to them, and then they quietly bury the disruptive technology.
Can you clarify a bit what you mean by "chunk" guy? Are you alluding to the ability to distribute work by an isolation mechanism such as cgroups vs. whole machines à la Borg/Google?
More on infrastructure composition; software is an abstraction above that.
Is the unit of composition a rack (chunk), a server (smaller chunk), or a blade (smallest chunk)? In what I think of as classic systems architecture you've got a 'store' (storage), 'state' (memory), 'threads' (computation), and 'interconnect' (fabrics). In the '90s a lot of folks focused on fabrics (Cray, Connection Machine, Sun, etc.), somewhat on threads (compute blades), and state came along for the ride. How these systems were composed was always a big thing; then along came the first Beowulf clusters that used off-the-shelf motherboards (a "chunk" of threads/state/store) with a generic fabric (Ethernet). Originally NASA showed that you could do some highly parallel processing on these sorts of systems, and Larry and Sergey at Stanford applied it to the process of internet search.
Collectively you have a 'system resource' and with software you can make it look like anything you want. When you do compute with it, its performance becomes a function of its systems balance and the demands of the workload. It's all computer sciencey and yes, there is a calculus to it. This isn't something that most people dive into (or are even interested in[1]) but it was one of the things that captured my imagination early on as an engineer. I was consumed with questions like: what was the difference between a microprocessor, a mini-computer, a workstation, and a mainframe? Why do they each exist? What does one do that the other can't? Things like that.
[1] At Google I worked in what they called 'Platforms' early on and clearly most of the company didn't really care about the ins and outs of the systems bigtable/gfs/spanner/etc ran on, they just wanted APIs to call. But they also didn't care about utilization or costs. By the time I left some folks had just figured out (and one guy was building his career on) the fact that utilization directly affected operational costs. They still hadn't started thinking about non-uniform rack configurations for different workloads.
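A rough sketch of what that "systems balance" calculus looks like in practice (all figures below are invented for illustration): for each resource, compare what the workload demands per unit of work against what the chunk supplies, and the scarcest resource bounds throughput.

    # Hypothetical roofline-style balance check; every number is made up.
    chunk = {                       # supply per second
        "threads": 128 * 3e9,       # core-cycles/s
        "state":   400e9,           # memory bandwidth, bytes/s
        "store":   10e9,            # storage bandwidth, bytes/s
        "fabric":  4 * 25e9 / 8,    # network bandwidth, bytes/s (4x25GbE)
    }
    workload = {                    # demand per transaction
        "threads": 2e6,             # cycles
        "state":   4e6,             # bytes touched
        "store":   64e3,            # bytes persisted
        "fabric":  16e3,            # bytes on the wire
    }
    bottleneck = min(chunk, key=lambda r: chunk[r] / workload[r])
    print(bottleneck, int(chunk[bottleneck] / workload[bottleneck]))
    # -> "state 100000": memory bandwidth caps this workload at ~100k transactions/sec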
> I just hope they don't get killed by acquisition.
Private equity in the US has collectively determined that no company shall exist outside of investment ownership. I don't know what the ownership structure looks like, but generally speaking, it seems that nearly everyone has a "fuck you" number. Now that Oxide is venturing into Dell and HP's turf, I worry someone will get a fix on what Brian's number is.
Coupling vs Decoupling is not some one-sided thing. It's a major trade-off.
One of the most obvious examples of the problem with this approach is that they're shipping previous generation servers on Day 1. One can easily buy current generation AMD servers from a number of vendors.
They will also likely charge a significant premium over decoupled vendors that are forced to compete head-to-head for a specific role (server vendor, switch vendor, etc).
Their coupling approach will most likely leave them perpetually behind and more expensive.
But there are advantages too. Their stuff should be simpler to use and require less in-house expertise to operate well.
This is probably a reasonable trade-off for government agencies and the like, but will probably never be ideal for more savvy customers.
And I don't know how truly open source their work is, but if it is, they'll most likely find themselves turned into a software company with an open core model. Other vendors that are already at scale can almost certainly assemble hardware better than they can.
> will probably never be ideal for more savvy customers
IDK about every use case, but slightly older generations of CPUs would affect me roughly zero. I'm sure there are things so compute-intensive that one would care very much, but a lot of people probably wouldn't bat an eye about that, and not because they're unsavvy.
To the extent that these things are supported as a whole by the vendor, rather than with a bunch of finger-pointing, that could be massive, specifically in terms of how many staff members you could "not hire" compared to having to employ someone to both build and continually maintain it.
I'm posting this not to invalidate what you're saying, just to say that a predictable upfront amount of money (the premium) will be spent very happily by lots of people who value predictability and TCO over initial price.
If you're not rapidly scaling it probably doesn't matter. But if you're still buying (and maybe even using) Haswell CPUs in 2023, you may be missing out in a big way.
A moderately large Haswell cluster is equivalent in power to a moderately powerful modern server.
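A rough back-of-the-envelope for that claim, using public core counts and ignoring clocks, IPC, and memory (so treat it as a hedged sketch, not a benchmark):

    # Core counts only: a top-end dual-socket Haswell box vs. a current
    # dual-socket EPYC box (ignores per-core gains, which widen the gap further).
    haswell_node_cores = 2 * 18   # e.g. Xeon E5-2699 v3 class, 18 cores/socket
    modern_node_cores  = 2 * 96   # e.g. EPYC 9654 class, 96 cores/socket
    print(modern_node_cores / haswell_node_cores)  # ~5.3 Haswell nodes per modern server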
No, not buying new, just using what was bought years ago. It still works, it does the job. Is it the best performance per watt? Clearly no, but the budget for electricity and the budget for new capital expenses are two different things.
If you go on Google Cloud and select an E2 instance type (at least in `us-central1`, where my company runs most of its infra) you'll usually get Broadwell chips.
> They will also likely charge a significant premium over decoupled vendors
It seems like they're trying to hit a middle ground between cloud vendors and fully decoupled server equipment companies.
Using Oxide is likely cheaper over the life of the hardware than using a cloud vendor. A company who already has in-house expertise on running racks of systems may be less the target market here than people who want to do cloud computing but under their own control.
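As a purely hypothetical illustration of that "cheaper over the life of the hardware" claim (every figure below is invented and is not Oxide's, AWS's, or anyone's actual pricing), the break-even math looks roughly like this:

    # Hypothetical owned-rack vs. rented-cloud break-even; all numbers invented.
    rack_price = 1_500_000         # one-time purchase
    rack_opex_per_month = 25_000   # power, space, support contract
    cloud_per_month = 90_000       # renting equivalent capacity

    months = 0
    while rack_price + rack_opex_per_month * months > cloud_per_month * months:
        months += 1
    print(f"owning breaks even after ~{months} months")  # ~24 months with these numbers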
> A company who already has in-house expertise on running racks of systems may be less the target market here than people who want to do cloud computing but under their own control.
True, but Oxide may find themselves competing against Dell or HP if those companies adopt Oxide's software for their respective servers. Additionally, Oxide may find itself competing against consultants and vendors in specialized verticals (e.g. core banking software + Oracle DB + COTS servers + Oxide software). Oxide and their competitors are going for people who used to buy racks of Sun hardware.
HP and Dell would have to fundamentally change the way they design hardware and software to be that kind of threat, and if that ever happens I think I would be pretty okay with that outcome.
> One of the most obvious examples of the problem with this approach is that they're shipping previous generation servers on Day 1. One can easily buy current generation AMD servers from a number of vendors.
> Their coupling approach will most likely leave them perpetually behind
This is a startup that took years to get their initial hardware developed. The time between this version and the version using the next version of AMD chips will be shorter than the time it took to develop this product. This is not an inherent issue with coupling vs decoupling.
Also, most servers are rarely running on the most recent cpus anyway. At least in companies I've worked at with on-site hardware they're usually years (sometimes even a decade) out of date getting the last life sucked out of them before too many internal users start complaining and they get replaced.
Coupling requires more integration work, including writing and testing custom firmware. Oxide will be a tiny market player for a long time, even if things go very well. Are AMD and Broadcom really going to spend as much time helping Oxide as they do helping Dell? Of course not, Oxide's order volume will be a rounding error.
I'm sure they'll improve their processes over time but the lag will probably always be a non-zero value. Hopefully they'll be able to keep it low enough that it's not an important factor but as a customer it's certainly something one should consider.
It would be surprising if they don't run into some nasty issue that leaves their customers 6+ months behind on servers or switches at some point.
From listening to their talks they've actually gotten pretty good direct responses from AMD and AMD likes them quite a bit. They've done what no other system integrator has done and brought up the CPU without using AMD's AGESA firmware bootloader. By simplifying the system they've reduced the workload on what they need to handle.
As to your second point, unless AMD somehow becomes supply constrained and only wants to ship to their most important customers first I don't see a future where there would be any lag. Again, the delay this time is from how long it took from company start until product release. Future delays will be based on the time it takes from them getting early development parts to released products, which they could even possibly beat Dell to market on given the smaller company size and IMO more skilled employees.
> It would be surprising if they don't run into some nasty issue that leaves their customers 6+ months behind on servers or switches at some point.
I mean they've already hit tons of nasty issues, for example finding two zero-day vulnerabilities in their chosen security processor. They've shown they can work around issues pretty well.
> it would be surprising if they don't run into some nasty issue that leaves their customers 6+ months behind on servers or switches at some point.
I just think your premise is wrong - most customers don't care about not having the absolute latest and greatest. Indeed they will often avoid them because
1. They are new, so more likely to have as-yet-undiscovered issues (hardware or drivers).
2. If you buy top end, they sell at a premium well above their performance premium.
i.e. the customers who are perennially chasing the latest hardware are in the minority.
Most customers care about having the best of the available options. Rarely would any company deliberately choose to be behind where their competitors can be.
1. The way to run into undiscovered issues is to choose a completely custom firmware/hardware/software stack that almost no one else in the world is running.
2. Not sure where you're getting this from. There is almost always a price:performance calculation (a toy sketch follows below) that results in the current generation smashing the previous generation with server and switch hardware. Often this means not buying the flagship chips but still the current generation.
And a major reason to get off old generations of hardware is that they become unavailable relatively quickly. It's always easier to buy current generation hardware than previous generation hardware, especially a couple years into the current generation. This has nothing to do with chasing the latest hardware.
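To illustrate the kind of price:performance calculation meant above, here is a toy comparison; all prices and scores are invented placeholders, not real SKUs or benchmark results:

    # Toy perf-per-dollar comparison with made-up numbers.
    servers = {
        "previous-gen box": {"price": 14_000, "perf": 100},
        "current-gen box":  {"price": 17_000, "perf": 180},
        "flagship box":     {"price": 35_000, "perf": 210},
    }
    for name, s in servers.items():
        print(name, round(s["perf"] / s["price"] * 1000, 1), "perf per $1k")
    # With numbers like these the current generation wins on perf-per-dollar,
    # while the flagship carries a premium well beyond its performance edge.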
> And a major reason to get off old generations of hardware is that they become unavailable relatively quickly.
That's not in the customers' interests per se - in fact it's a pain. Having control of their own stuff could mean they could offer a much longer effective operational life.
> The way to run into undiscovered issues is to choose a completely custom firmware/hardware/software stack that almost no one else in the world is running.
What breaks stuff is change - sure, when they are starting up it's higher risk - but again, if they can manage the lifecycle better and not have change for change's sake, then they could be much more reliable.
> Not sure where you're getting this from.
I was talking about not taking the flagship stuff - which is typically a few months ahead of the best price/performance stuff.
If they standardize and open the server shape and plug interface then it gets really cool. Then I could go design a GPU server myself and add it to their rack. The rack is no longer a hyperconverged single-user proprietary setup and becomes something that can be extended and repurposed.
I don't see it as a big deal - rather, I see it as a huge amount of venture cap spent on some very bright people to build something no one really wants, or, at best, is niche.
Also, it has little to do with the cloud; it is yet another hyperconverged infra.
Weirdly, it is attached to something very few people want: Solaris. This relates to the people behind it who still can't figure out why Linux won and Solaris didn't.
When you're deploying VMs, which is the use case here, the substrate OS becomes significantly less important. Those VMs will mostly just be linux.
Yes they are using illumos/Solaris to host this but they don't sell on that, they sell on the functionality of this layer — allowing people to deploy to owned infra in a way that is similar to how they'd deploy to AWS or Azure. How much do you ever think about the system hosting your VM on those clouds? You think about your VMs, the API or web interface to deploy and configure, but not the host OS. With Oxide racks the customers are not maintaining the illumos substrate (as long as Oxide is around).
You could be right about demand, there is risk in a venture like this. But presumably the team thought about this - I think folks who worked at Sun, Oracle, Joyent, and Samsung and made SmartOS probably developed a decent sense of market demand, enough to make a convincing case to their funders.
I have a feeling they knew exactly from the start who their customers would be: People who have the budget to care about things like trust and observability in a complex system. But these would also be the kind of customers who require absolute secrecy and so this why you don't hear about them even though they might have bankrolled a sizable portion of the operation. Just like the first Cray to officially be shipped was actually serial number 2...
> When you're deploying VMs, which is the use case here, the substrate OS becomes significantly less important. Those VMs will mostly just be linux.
Now you need to know both the OS they chose and the OS you chose...
(No, I don't believe it'll be 100% hands-off for the host. This is an early stage product, with a lot of custom parts, their own distributed block storage, hypervisor, and so on.)
This is true for other hypervisors too. Enterprises are still paying hundreds of millions to VMware, and who knows what's going on in there?
I wouldn't have picked OpenSolaris, but it's a lot better than other vendors that are either fully closed source, or thin proprietary wrappers over Linux with spotty coverage where you're not allowed to touch the underlying OS for risk of disrupting the managed product.
What's more important is that the team actually knows Illumos/Solaris inside out. You can work wonders with a less than ideal system. That said, Illumos is of high quality in my opinion.
Seems risky considering how small of a developer pool actively works on illumos/Solaris. The code is most definitely well engineered and correct, but there are huge teams all around the world deploying on huge pools of Linux compute that have contributed back to Linux.
They had a bug in the database they are using that was due to a Go system library not behaving correctly specifically on illumos. They've got enough engineering power to deal with such a thing but damn..
Linux grew up in the bedrooms of teenagers. It was risky in the era of 486 and Pentiums. The environment and business criticality of a $1-2M rack-size computer is quite different.
I had similar thoughts about VMware (large installations) back in the day. Weird proprietary OS to run other operating systems? Yet they turned out fine.
This appears to be a much better system than VMware, it is free software, and it builds upon a free software operating system with a lineage that predates Linux.
I say this in the most critical way possible, as someone who has built multiple Linux-based "cloud systems", and as a GNU/Linux distribution developer: I love it!
It was totally a risky choice for companies in the 1990s and early 2000s to put all their web stuff onto Linux on commodity hardware instead of proprietary Unix or Windows servers. Many did it when their website being up was totally mission critical. Lots did it on huge server farms. It paid off very quickly but it's erasing history to suggest that it didn't require huge amounts of guts, savvy and agility to even attempt it.
Indeed, for me GNU/Linux was always a cheap way to have UNIX at home, given that Windows NT POSIX support never was that great.
The first time I actually saw GNU/Linux powering something in production was in 2003, when I joined CERN and they were replacing their use of Solaris; eventually, alongside Fermilab, they came up with Scientific Linux in 2004.
Later at Nokia, it took them until 2006 to consider Red Hat Linux a serious alternative to their HP-UX infrastructure.
Completely tangential, but this reminds me of an interview I had for my first job out of college in 1995. I mentioned to the interviewer that I had some Linux experience. "Ah, Linux" he said. "A cool little toy that's gonna take over the world".
In hindsight of course it was remarkably prescient. This from a guy at a company that was built entirely around SGI at the time.
This is a skewed view - the critical piece that made Linux "enterprise-ish" was the memory management system contributed by IBM, part of the SCO lawsuit.
Back in the day... Sun Micro was a GOAT and pushed the envelope on Unix computing 20-30 years ago. Solaris was stable and high performing.
I don't run on-prem clusters or clouds but know a couple people who do and, at large enough scale, it is a constant "fuck-shit-stack on top of itself" (to quote Reggie Watts). There is almost always something wrong and some people upset about it.
The promise of a fully integrated system (compute HW, network HW, all firmware/drivers written by experts using Rust wherever possible) that pays attention to optimizing all your OpEx metrics is a big deal.
It may take Oxide a couple more years to really break into the market in a big way, but if they can stick it out, they will do very well.
It won't. In the same way that AWS customers aren't debugging hypervisor, or Dell customers aren't debugging the BIOS, or Samsung SSD customers aren't debugging the firmware. Products choose where to draw the line between customer-serviceable parts and those that require a support call. In this case, expect Oxide to fix it when something doesn't work right.
When Apple supports OS X for consumers, they don't exactly surface the fact that there's BSD semi-hidden in there somewhere.
That's because they own the whole stack, from CPU to GUI and support it as a unit. That's the benefit of having a product where a single owner builds and supports it as a whole.
My impression of Oxide is that that's the level of single source of truth they are bringing to enterprise in-house cloud. So, I strongly doubt the innards would ever become customer-facing (unless the customer specifically wants that, being open source after all).
Apple is a horrible example, with Apple when you have a problem, you often end up with an unfixable issue that Apple won't even acknowledge. You definitely don't want to taint Oxide's reputation with that association.
As for why I think Helios will become customer facing: Oxide is a small startup. They have limited resources. Their computers are expensive enough to be very much business critical. You'll get some support by Oxide logging in remotely to customer systems and digging around, but pretty soon the customer will want to do that themselves, to monitor and troubleshoot the problems as they happen.
Imagine you're observing a recurring but rare I/O slowdown that seems to trigger under some certain conditions, and tell me a competent sysadmin wouldn't want to log in on all the related boxes (client Helios, >=3 server Helioses for the block store) and look at the logs & stats.
> As for why I think Helios will become customer facing: Oxide is a small startup. They have limited resources.
Have you looked at the pedigree of many of the people behind the project? I don't say this because "these guys smart", but because these guys bent over backwards for their customers when they were Sun engineers. Bryan didn't write dtrace for nothing.
> Imagine you're observing a recurring but rare I/O slowdown that seems to trigger under some certain conditions, and tell me a competent sysadmin wouldn't want to log in on all the related boxes (client Helios, >=3 server Helioses for the block store) and look at the logs & stats.
I think you're simultaneously over-estimating and under-estimating the people who will deploy this. There's a lot of companies who would want a "cloud in a box" that would happily plug hardware in and submit a support ticket if they ever find an issue, because their system engineers either don't have the time, desire, or competence (unfortunately common) to do anything more. The ones who are happy to start debugging stuff on their own would have absolutely wonderful tooling at their fingertips (dtrace) and wouldn't have any issue figuring out how to adapt to something other than Linux (hell, I've been running TrueNAS for the better part of a decade and being on a *BSD has never bothered me).
Apple is a great example of the benefits of an integrated system where the hardware and software are designed together. There are tons of benefits to that.
What makes Apple evil (IMO, many people disagree) is how everything is secret and proprietary and welded shut. But that doesn't take away from the benefits of an integrated hardware/software ecosystem.
Oxide is open source so it doesn't suffer from the evil aspect but benefits from the goodness of engineered integration. Or so I hope.
In practice I don't think it's as good as in theory. I had Apple Macbook Pro with Apple Monitor, and 50% of the time when unplugging the monitor the laptop screen would stay off. Plugging back in to the monitor wouldn't work at that point so all I could do was hold the power button to force it off and reboot. That's with Apple controlling the entire stack - software, hardware, etc.
I think the real benefit is being able to move/deprecate/expand at will. For example, want an app that would require special hardware? You can just add it. Want to drop support for old drivers? Just stop selling them and then drop (deprecate) the software support in the next release.
I fully agree about the evilness, and it baffles me how few people do!
Android is potentially a better example. Compare Android to trying to get Linux working on <some random laptop>. You might get lucky and it works out of the box or you might find yourself in a 15 page "how to fix <finger print reader, ambient light sensor, etc>" wiki where you end up compiling a bunch of stuff with random patches.
Afaik Android phones tend to have a lot more hardware than your average laptop, too (cell modem, gps, multiple cameras, gyro, accelerometer, light sensors, finger print readers)
Apple is the survivor of 16-bit home micro integration. PC clones only happened because IBM failed to prevent Compaq's reverse engineering from taking over their creation; they even tried to get hold of it again afterwards via PS/2 and MCA.
As we see nowadays on tablets and laptops, most OEMs are quite keen on returning to those days, as otherwise there is hardly any money left in PC components.
Funny how your mentioning BSD got me thinking of the Sony PlayStation and Nintendo Switch, which are proprietary and not user serviceable. A Steam Deck, Fairphone, or Framework laptop is each less proprietary, more of a FOSS stack, and user serviceable. Which a user may or may not want to do themselves; at the very least they can pay someone and have them manage it.
Also, Apple is just the one who survived. Previously I'd have thought of SGI, DEC, Sun, HP, IBM, Dell, some of whom survived, some not.
Those three consumer products I mentioned each provide a platform for a user and business space to flourish and thrive. I expect a company doing something similar for cloud computing to want the same. But it will require some magick: momentum, money, trust. That kind of stuff, and loads of it. (With some big names behind it and a lot of FOSS they got me excited, but I don't matter.)
If you have a bug in how a lambda function is run on AWS, do you find yourself looking for the bug in firecracker? It is open source, so you technically could, but I just don't see many customers doing that. Same can be said about KNative on GCP.
Their choice in foundation OS (for lack of a better term) really should not matter to any customer.
Ok, but then that is purely additive, right? Like, "have to find someone with Illumos expertise to fix something that was never intended to be customer-facing" may not be easy, but is still easier than the impossibility of doing the same thing on AWS / Azure / Google Cloud.
Right, who wants or benefits from open source firmware anyway.
Also, there are many situations where renting, for example a flat, makes a lot of sense. And there are many situations where the financials and/or the options enabled by owning something make a lot of sense. Right now, the kind of experience you get with AWS and co. can only be rented, not bought. Some people want to buy houses instead of renting them.
Well, you can buy your own hardware and set it up with OpenStack and use it as a private cloud. Companies like Canonical or Redhat make a lot of money by providing software (mostly open source) to support exactly that use case.
> Well, you can buy your own hardware and set it up with OpenStack and use it as a private cloud. Companies like Canonical or Redhat make a lot of money by providing software (mostly open source) to support exactly that use case.
Sure you can, but then who will diagnose and fix your hardware/OS interaction problems when you have parts from five vendors in the mix?
If you haven't lived through this, the answer is: nobody. Everyone points fingers at the other 4 and ignore your calls.
Back in the day you could buy a fully integrated system (from CPU to hardware to OS) from Sun or SGI or HP and you had a single company to answer all the calls, so it was much better. Today you can't really get this level of integration and support anymore.
(Actually, you probably can from IBM, which is why they're still around. But I have no experience in the IBM universe.)
This is why Oxide is so exciting to me. I hope I can be in a company that becomes a customer at some point.
>Sure you can, but then who will diagnose and fix your hardware/OS interaction problems when you have parts from five vendors in the mix?
Dell is a single vendor that will diagnose and fix all of your hardware issues.
With Oxide you're locked into what looks like a Solaris derivative OS running on the metal and you're only allowed to provision VMs which is a huge disadvantage.
I run a fleet of over 30,000 nodes in three continents and the majority is Flatcar Linux running on bare metal. Also have a decent amount of RHEL running for specific apps. We can pick and choose our bare metal OS which is something you cannot do with Oxide. That's a tough pill to swallow.
> Dell is a single vendor that will diagnose and fix all of your hardware issues.
I've been a Dell customer at a previous company. I know for a fact that's not true.
I had a support ticket for a weird firmware bug open for two years, they could never figure it out. I left that job but for all I know the case is still open many years later.
Dell doesn't know how to fix things like that because they don't design and engineer the systems they sell. Dell is a reseller who puts components together from a bunch of vendors and it mostly works but when it doesn't, there's nobody on staff who can fix it.
I've been a Dell customer for decades at this point and I know for a fact it's true.
I've had support tickets open for all kinds of weird firmware, hardware, etc. bugs and they've been well resolved, even if it meant Dell just replaced the part with something comparable (NIC swap).
>Dell doesn't know how to fix things like that because they don't design and engineer the systems they sell.
Of course they do. That's like saying Oxide doesn't know how to fix stuff because they don't design the CPU, NVMe, DIMMs, etc. Oxide is still going to vendors for these things.
Ironically, Dell's total inability to resolve a pathological rash of uncorrectable memory errors is very much part of the origin story of Oxide: this issue was very important to my employer (who was a galactic Dell customer), and as the issue endured and Dell escalated internally, it became increasingly clear that there was in fact no one at Dell who could help us -- Dell did not understand how their own systems work.
At Oxide, we have been deliberate at every step, designing from first principles whenever possible. (We -- unlike essentially everyone else -- did not simply iterate from a reference design.)
To make this concrete with respect to the CPU in particular, we have done our own lowest-level platform enablement software[0] -- we have no BIOS. No one -- not the hyperscalers, not the ODMs and certainly not Dell -- has done this, and even AMD didn't think we could pull it off. Why did we do it this way? Because all along our lodestar was that problem that Dell was useless to us on -- we wanted to understand these systems from first principles, because we have felt that that is essential to deliver the product that we ourselves wanted to buy.
There are plenty of valid criticisms of Oxide -- but that we don't understand our system simply isn't one of them.
As a side question, what's the name of your custom firmware that is the replacement of the AGESA bootloader? I tried searching on the oxide github page but couldn't find anything that seemed to fit that description.
(The AGESA bootloader -- or ABL -- is in the AMD PSP.) In terms of our replacement for AGESA: the PSP boots to our first instruction, which is the pico host bootloader, phbl[0]. phbl then loads the actual operating system[1], which performs platform enablement as part of booting. (This is pretty involved, but to give you a flavor, see, e.g. initialization of the DXIO engine.[2])
Thanks, are the important oxide branches of illumos-gate repo (and any other cloned repos) defined anywhere? I definitely wouldn't have found that branch without you mentioning it here.
Interestingly enough, I also ran into something somewhat related with Dell that they were not able to resolve, so they ended up working in a replacement from another vendor.
Nonetheless, it is quite interesting what you've built, but as the end user I'm not quite convinced that it matters. Sure, you can claim it reduces attack vectors and such, but we'll still see Dells and IBMs in the most restricted and highest-security-postured sites in the world. Think DoD and such. coreboot/libreboot with a RoT will get me through compliance the same.
The software management plane y'all built is the headlining feature IMHO, not so much what happens behind the scenes that the vast majority of the time will not have a fatal catastrophic upstream effect.
>There are plenty of valid criticisms of Oxide -- but that we don't understand our system simply isn't one of them.
That's not what I said. There's a line in the sand that you must cross when it comes to understanding the true nature of the componentry that you're using. At the end of the day, your AMD CPUs may be lying to you, to all of us, but we just don't know it yet.
> Off by a few orders of magnitude. Dell on-site SLA with pre-purchased spares was about 6 hours.
You're talking about replacement parts. Yes Dell is good about that.
The discussion above is asking them to diagnose and fix a problem with the interaction of various hardware components (all of which come from third parties).
But they _are_ writing the firmware that runs most of them and need to understand those devices at a deep level in order to do that, unlike Dell. Dell slaps together hardware and firmware from other vendors with some high level software of their own on top. They don't do the low level firmware and thus don't understand the low level intricacies of their own systems.
No they're not unless I'm mistaken. They're not writing the firmware that runs on the NVMe drives, nor the NICs (they're not even writing the drivers for some of the NICs), etc.
I'm not speaking hypothetically. If you hit a "zero-day" bug that Dell has never seen it's going to take time. And somehow every large customer finds bugs that Dell certification didn't.
> And somehow every large customer finds bugs that Dell certification didn't.
It's a law of computer engineering.
In the Apollo 11 descent sequence the Rendezvous Radar experienced a hardware bug[0] not uncovered during simulation. They found it later, but until then, the solution was adding a "turn off Rendezvous Radar" checklist item.
[0] The Rendezvous Radar would stop the CPU, shuttle some data into areas where it could be read, and wake the CPU back up to process it. The bug caused it to spuriously do this dance just to say "no new data", which then caused other systems to overload.
It's ironic coming from a company whose CTO has harped about containers on bare metal for years. Maybe a large swath only need to deploy VMs, but the future will most definitely involve bare metal for many use cases, and oddly Oxide doesn't support that currently.
See the pattern? Dell only care about the big guys.
Set aside the childish tone ...
> Dell is a single vendor that will diagnose and fix all of your hardware issues.
There are two anecdotes here disagreeing with you, and frankly that's enough to say what you said above isn't true, not universally so. I doubt Oxide is targeting big deployments like yours, but more like theirs. Whether they will succeed is another matter, but they do have a valid sales pitch and the expertise to pull it off.
So OpenBMC is fine (happy for them!), but having open firmware is much deeper and broader than that: yes, it's the service processor (in contrast to the BMC which is a closed part on Dell machines) -- but it's also the root-of-trust and (especially) the host CPU itself. We at Oxide have open source software from first instruction out of the AMD PSP; I elaborated more on our approach in my OSFC 2022 talk.[0]
Dell uses trusted platform modules (TPMs). The TPM is a separate chip from the BMC.
For a mostly open source solution, not only would you need open source BMC firmware, you must have an open source UEFI/BIOS/boot firmware like CoreBoot, LinuxBoot, Oreboot, Uboot, etc.
The fact that it's not on linux is one of the great things about it. There is too much linux on critical infrastructure already and the monoculture just keeps on growing.
At least with Oxide there is a glimmer of hope for a better future in this regard.
They sell rack-as-compute.[0] Their minimum order is one rack: You plug in power and network, connect to the built-in management software (API), and start spinning up VMs.
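For a sense of what "start spinning up VMs" against a rack-level API means in practice, here is a purely hypothetical sketch; the endpoint paths, field names, and auth scheme are invented for illustration and are not Oxide's actual API:

    # Hypothetical: provision a VM via a rack's management API.
    # Endpoint, fields, and auth below are placeholders, not Oxide's real API.
    import requests

    RACK_API = "https://rack.example.internal/v1"
    TOKEN = "operator-issued-token"  # placeholder credential

    resp = requests.post(
        f"{RACK_API}/projects/demo/instances",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"name": "web-01", "ncpus": 4, "memory_gib": 16, "boot_image": "debian-12"},
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json())  # instance metadata: id, state, addresses, ...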
It would be interesting to sell a data center in a container. Cooling, power supply, compute, storage, and network, all in a box. You supply power, a big network pipe, and the piping to external heat exchangers.
IIJ has a project like this, data center in a container, just add power. They build it all up in Japan, ship it to rural areas across the world to basically jumpstart a local data center (I imagine mostly for industrial sites). They had a fun project where they had a half rack, powered by solar and connected to the net via Starlink.
There are whitebox Windows laptops, OpenWRT routers, Arduino boards, ArduPilot drones, etc. It almost sounds strange that there are no prepopulated 12U racks intended for OpenStack (is that still a thing?).
No idea how they do things today, but v0.1 of Azure (before it was called Azure) was a bunch of containers in a field. I remember seeing an aerial photo at the time.
Yeah including networking and storage together with virtualization is what makes hyperconverged infrastructure hyperconverged. Otherwise it's usually just called converged infrastructure.
It's not a new concept, it's a new product. Ideas do not become uninteresting the moment the first person instantiates the idea. Further iterations on a possibly-good idea are also interesting.
Obviously different marketing copy speaks to different people. But that is referring to how, when you buy a rack from us, you don't need to put everything together and cable it all up: you pull it out of the box, plug in networking and power, boot the thing up, and you're good to go. Installation time is hours, not days or weeks, which is the norm.
"no cables no assembly just cloud" is completely misleading to any kind of people - tech or marketing or not.
When people hear cloud, it means that aspects such as electricity costs, electricity stability, Internet, bandwidth, fire protection, safety, etc etc are abstracted away.
Oxide IS on-premise, right? The website is very vague and wishy-washy.
It is on premises. You interact with the rack the same way you interact with the public cloud: as a pool of resources. The specifics are abstracted away. “Private cloud” is pretty well established terminology in this space, and that’s what we’re doing.
At this stage of the company, everyone gets a white-glove installation process. I suspect that will change over time but I don't work on that part of things, so I don't personally know the details.
Sorry to be slightly obtuse, which details are you referring to here? Help upon installation? At the moment, we are helping customers individually, yeah. But we do have a documented process we are following https://docs.oxide.computer/guides/system/rack-installation-... (and more on other pages there)
Ah yeah, so the "facilities" section of https://oxide.computer/product/specifications has some of these things, probably the closest we have to publicly publishing that in a general sense.
Yes I understand, but will your included service actually verify that everything is set up correctly, meets advertised parameters, and sign off on it? (Such that the customer can start using it immediately afterwards.)
Or does the customer need to take on some risk and hazard associated with installation, configuration, initial boot up, etc.?
e.g. If someone buys with the intention of using it up to X FLOPS, and the machine only delivers Y FLOPS once it's all said and done, what happens?
It’s not the area of the company I personally work on, so I don’t know those details, to be honest. We certainly make sure that everything is working properly.
I mean, we absolutely sell support. I just don't know anything about the details personally. You shouldn't take my lack of knowledge as a "no," just a "steve doesn't personally know."
To be super clear about it, this is referring to not needing to cable up all of the individual sleds to the rack upon installation. It doesn't mean that we recommend connecting a rack of compute to your data center via wifi.
This is pretty big, as someone who has deployed servers to datacenters before. Remote hands are very good at plugging in the network uplink and the PDUs. Doing a complete leaf-spine 25GbE network with full redundancy is something they are pretty much guaranteed to screw up at some point.
I wouldn't be dismissive of people telling you that the product description can be improved. My opinion is that the description of the product in this thread will outperform your site 10 to 1.
I'll try to explain, not in the spirit of being argumentative, but with the hope of being useful.
The comment you replied to was not questioning the value of integrated cabling. It was pointing out that the product description on the site does not make sense.
"Cloud computer" sounds like a server you rent from AWS. It's kind of like calling Rust "cloud compiler."
If you choose to use words that your audience doesn't understand, or even worse understands to mean the opposite of what you want them to mean, it's a good idea to explain these words immediately using conventional words with conventional meaning. The comments by throw0101a did that.
The product seems really cool, but there is no way I would've understood what it was from the website.
I understand that's what you're saying, and I understand what the parent is saying. I chose to explain what that alluded to, in case anyone in this conversation is also finding it hard to understand what is meant by that specific copy. That doesn't mean I don't understand the broader point, or that I think the website copy is perfect.
Perhaps if you don't understand what the copy means, then that is a sign that you are not the target audience, rather than that the copy is bad? From what I've gathered from reading other comments in this thread, that copy will make perfect sense to Oxide's target audience, as it uses words in a way that will be very familiar and make perfect sense to the kind of person who might make a purchasing decision for a system like this.
And for what it's worth, I don't think you need to explain what's happening to Steve, it seems to me that he understands perfectly well. To me you come across as being rather condescending and in my opinion Steve is being commendably polite in response.
"Real" mainframes have RAS (Reliability, Availability, Servicing) features such as hotswapping for all hardware components and automated HA/workload migration across physical racks. They can also do SSI (single system image), i.e. run a single workload across physical nodes/racks as if it was just multiple 'cores' in a single shared-memory computer. Oxide computers will probably end up doing at least some of this (namely workload migration across racks for HA) but saying that it can comprehensively replace mainframe hardware as-is is a bit of a stretch. In terms of existing hardware it's closer to a midrange computer.
The Oxide and Friends podcast had an episode on virtualizing time, specifically for the purpose of live-migrating a VM from rack to rack without the guest being aware, and allowing operators to take the rack offline on their schedule. Otherwise, apparently, you end up having to leave racks running because you cannot evacuate all of the VMs currently running on them. (e.g. perhaps your contracts or SLAs are such that you cannot afford even the few seconds of downtime a shut-down-here-and-spin-up-elsewhere would cause)
I believe the episode name was "Virtualizing Time"
The first iMac famously made it easy to connect to the Internet; The 'i' in iMac was for "Internet". Its setup manual was a couple of pages long, mostly pictures and IIRC, just 37 words.
Existing vendors will provide rack integration services and deliver a turn-key solution like this. Also, vendors of virtualization management software have partnerships with hardware suppliers and are happy to deliver fully integrated solutions if you're buying by the rack. The difference is that in those cases you have flexibility in the design, which seems to be missing here.
Proxmox and a full rack of Supermicro gear would not be as sophisticated, but end result is pretty much the same, with I imagine far far better bang for buck.
I like it, but it doesn't seem like a big deal or revolutionary in any way.
Those of us who've bought large "turn-key" solutions from Dell etc. have often discovered that it's actually just a cobbled-together bunch of things which may or may not work well together on a good day, depending on what you're trying to do. Just because it's all got the word "Dell" written on it, doesn't mean that the components were all engineered by people who were working together to build a single working system.
Total agreement. Another point: Having the "Dell" name on the front doesn't give you a "throat to choke" as so many people seem to think is important. Unless you're very large scale then, at best, you can threaten them that they don't get your next business. You're certainly not going to get help.
You're no worse off with Oxide from that perspective. Their open source firmware means that the opportunity to pay somebody else to support you at least exists.
Even small shops can use bad experience as leverage for credits and discounts, especially if the vendor has account managers. This is one of the (few) benefits of having a human involved in invoicing vs. self-serve.
Same is true of Oxide, it'll be up to actual experience to see how well it works. Oxide seems to have written their own distributed block storage system (https://github.com/oxidecomputer/crucible), have their own firmware, kernel and hypervisor forks, etc -- when any of that breaks, good luck!
The premise is that you don't need luck, you can call Oxide. As you said, they wrote all of it, so they own all the interaction so they can diagnose all of it.
When I call Dell with a problem between my OS filesystem and the bus and the hardware RAID, there's at least three vendors involved there so Dell doesn't actually employ anyone that knows all of it so they can't fix it.
Sure, Oxide now needs to deliver on that support promise but at least they are uniquely positioned to be able to do it.
> That's the same premise as with all "turn-key" solutions. If it didn't come with software support, it wasn't really turn-key.
Just about any company will sell your company a support contract.
The more interesting question is, can they back it up with action when push comes to shove? I suspect most people have plenty of stories of opening support tickets with big name vendors that never get resolved. And through the grapevine you find out that they won't fix it because they can't fix it. They might not even have access to the source code or anyone on staff who has a clue about it because it came from who knows where. Sales is happy to sell you the support contract but it doesn't mean your problems can be fixed. BTDT.
From listening to the Oxide podcasts, my impression is that Oxide actually can technically fix anything in the stack they sell, which would make them vastly different from Dell et al.
Skill-wise, yes for sure (except perhaps for storage -- I haven't heard them talk about that much). Bandwidth-wise, though?
I used to work for a company targeting Fortune 500s. At that level of spend, when a client had a problem, somebody got on a plane. Only a fraction of those problems escalated all the way to R&D, which is where Oxide's skills are. That's where VMware etc. are hard to beat.
The premise is that the bandwidth needed will be orders of magnitude less, because the engineering will be orders of magnitude better. The opportunity makes sense as we've long been climbing up the local maximum peak of enterprise sales driven tech behemoths built on a cobbled together mix of open source and proprietary pieces held together with bubblegum.
Can an engineering first approach break into the cloud market? Hard to say as enterprise sales is very powerful, and the numerous "worse is better" forces always loom large in these endeavours. That said, enterprise sales driven companies are fat, slow and complacent. Oxide is lean and driven, and a handful of killer use cases and success stories is probably enough to sustain them and could be the thin end of the wedge on long-term success. We can hope anyway.
> Proxmox and a full rack of Supermicro gear would not be as sophisticated, but end result is pretty much the same, with I imagine far far better bang for buck.
I think the question is how well they can do the management plane. Dealing with the "quirks" of a bunch of grey box supermicro stuff is always painful in one way or another. The drop shipped, pre-cabled cab setups are definitely nice but that's only a part of what Oxide is doing here. No cables and their own integrated switching sounds nice too (stuff from the big vendors like UCS is closer to this ballpark but also probably closer to the cost too).
I suspect cooling and rack density could be better in the Oxide solution too, not having to conform to the standards might afford them some possibilities (although that's just a guess, and even if they do improve there these may not be the bottlenecks for many).
> Existing vendors will provide rack integration services and deliver a turn key solution like this.
My experience with the likes of Dell is that they'll deliver it but they won't support it.
Sure, there's a support contract. And they try. But while they sell a box that says Dell, the innards are a hodgepodge of stuff from other places. So when certain firmware doesn't work with something else, they actually can't help because they don't own it, they're just a reseller.
AWS Outposts has been in the market for a long time. I am sure there are differences, but to say existing cloud vendors were blind to on-prem requirements is a stretch.
Also future datacenter builds are going to be focusing on specific applications which means specific builds. I think Nvidia has a much better chance here with their superpod than Oxide. The target use case is pretty unclear.
On-prem buyers are doing cost reduction and cost reduction targets things like, as one example, the crazy cost of GPU servers on the CSPs. Your run of the mill stuff is very hard to cost reduce.
You can see their sort of lack of getting it by using Tofino2 as their switch. That’s just a very bad choice that was almost certainly chosen for bad reasons.
You don't build a new greenfield compute pod because you want to, you do it because it makes sense. Making sense is about cost and non-cost needs like data gravity and regulatory issues.
The cost case only works for GPU heavy workloads which this isn’t - wrong chassis, wrong network, etc.
Tofino2 is the wrong choice because even when they made that choice it would have been clear that it’s doa. Intel networking has not been a success center in, well, ever. That’s a selection that could only have been made for nerd reasons and not sensible business goals alignment or risk mitigation.
When you make an integrated solution you’d better be the best or close to the best at everything. This does not seem to be the best at anything. I will grant that it is elegant and largely nicer than the hyper converged story from other vendors but in practical terms this is the 2000s era rack scale VxBlock from Cisco or whatever Dell or HPE package today. Marginally better blade server is not a business.
They also make a big deal and have focused on things no one who actually builds data center pods cares about.
I actually hope they get bought by Dell or HPE or SuperMicro. Those companies could fix what’s wrong here and benefit a lot from the attention to detail and elegance on display.
>They sell servers, but as a finished product. Not as a cobbled together mess of third party stuff where the vendor keeps shrugging if there is an integration >problem. They integrated it.
I’m actually extremely impressed. I want one. I haven’t worked in a data center in years, but I’d be tempted to do it again just to get my hands on one.
I wish they’d sell a tabletop version for hobbyists, but realize this is probably a distraction. But… the problem with a lot of these systems (including the old Sun boxes and things like ibm mainframes and the AS/400) is that they sound cool but there’s no real way for the typical new developer to “get into them” for fun and, as a result, you lose the chance for some developer selling it to their company based on his experience with the things.
Apparently (I don't remember it, although I probably did read the Byte magazine at the time) there was a rumor in the early 1980s that IBM's PC was going to be a shrunken 370, called the 380. [1][2]
I wish IBM would shrink their LinuxONE Rockhopper 4 Rack Mount down to at least an "under the desktop" model. To my knowledge, IBM still makes quality products and has excellent customer service. They have fun names too (Rockhopper and Emperor are types of penguins!) and they even have 3D models of their rack mount cloud computers with shadows. [3] In fact, when I first read about Oxide a year or two ago, I searched for "IBM cloud server", and left it at that. So IBM, could you please send someone from the LinuxONE down to Boca Raton to create our new PC? :) Thanks!
I own a Turing Pi 2 but the hardware it is running on is proprietary. The switch isn't managed. The management software is very archaic. Yes, it is modular and stackable and probably thousands of times more hobbyist friendly than Oxide, but so is edge computing in general.
For example, this form factor looks really nice for a “hobbyist edition” or “evaluation edition”: https://zimacube.zimaboard.com/. I would probably buy an Oxide rack like this as soon as pre-orders were announced.
They won't even tell you how much a rack will cost. Infuriating typically B2B "talk to Sales so we can decide exactly how much we can get out of you and segment the market on the fly" approach persists even here, it seems.
I wouldn’t expect anything else for a full rack in this segment: it’s going to be tens or hundreds of thousands of dollars, and big enough that there will be some inevitable negotiation about prices.
That doesn't mean you can't have a thin spec builder and a pricing page, even if what that mostly gets used for is devs putting together a comparison of that to a cloud deployment or similar and taking that to the procurement department to argue it's worth opening the conversation.
Same here. I really want to work on one of these. I got into the industry at the tail end of the time when people used Sun and DEC gear. I got to use just a little bit of it and it seemed so much more "put together" than PC stuff is even now.
Oxide feels like it'll be that "integrated" experience, but with the added benefit of software freedom.
I would assume so. They've said before you can make modifications to the firmware and deploy it yourself if you so wish. That's one of the major reasons that making the firmware open source is so useful.
While working in telecom data centers circa 2016 I saw many single-rack computers from Dell, IBM, HP, Huawei... Not sure this is a new idea, except for the open source bits.
You own this. AWS Outpost is leased and you still also pay for the resource usage on top of the outpost unit itself. And this would not be integrated with your AWS account.
It's mostly not true that you still pay for resource usage on top of the Outpost unit. That's only true, AFAIK, for EBS local snapshots and Route 53 Resolver endpoints. The big boys--EC2 instances, S3 storage, and EBS volumes--are all "free" on Outposts. That is, included in the cost of the unit and not double-charged.
Charging for EBS local snapshots on your own Outpost S3 storage and Route 53 Resolvers on your own compute is a weird one. I don't know how they defend that. To me, it seems indefensible.
> You can purchase Outposts servers capacity for a three-year term and choose between three payment options: All Upfront, Partial Upfront, and No Upfront. … At the end of your Outposts servers term, you can either renew your subscription and keep your Outposts server(s), or return your Outposts server(s). If you do not notify AWS of your selection before the end of your term, your Outposts server(s) will be renewed on a monthly basis, at the rate of the No Upfront payment option corresponding to your Outposts server configuration.
> You can purchase Outposts rack capacity for a 3-year term … either renew your subscription and keep your existing Outposts rack(s), or return your Outposts rack(s)
Route 53 on outposts doesn’t charge for your own compute. The outpost resolver is free. The oupost resolver endpoint must be backed by an in-region resolver endpoint, which doesn’t become free just because you own an outpost. What you’re paying for is the ability to sustain high query volume from instances in your VPC to an on-prem DNS server.
This is a turn-key solution, ready to use, without eventually having to deal with multiple devices, each with its own firmware and caveats that only reveal themselves after they're put to work together.
The closest to that is the AWS managed rack that works with the web APIs you already know.
> Ironically this looks like the realization of Richard Stallman's dream where users can help each other if something doesn't work.
that's only true if you think that "users" means "people who operate cloud computers", which is about as far from understanding what Stallman is talking about as is possible. Someone who makes SaaS and runs it on an Oxide computer is no less of a rentier capitalist than someone who makes SaaS and runs it on AWS.
I know you are half joking, but it would really be helpful to have ball-park pricing available. Are we talking Sun-level markups here, or how should we think about it? Given the enterprise sales contact form, I'm thinking yes, but I'd love to know for sure.
I am not in sales and so I hesitate to speak on it in case I am incorrect, but the way that I personally think about it is that it is true that it is not an inexpensive product: there's a LOT of computer here. But the goal is to be competitively priced.
The last time (one of them) Oxide hit HN there were some ballpark estimates based on the CPUs in use, switches etc. Someone else said 500K and up.
I wish there was a 4U version (10-25K, but I don't think that they could come close to that price point - regardless, even that is out of reach for me to ever get to play on one :-/).
It's meant for orgs running at a certain scale, but you'd be surprised how early that starts making sense. AWS isn't exactly passing the economy-of-scale savings on to you.
True, but companies operating at scale not only operate on AWS but on other cloud providers as well as legacy data centers… business-wise it's a hard sell. It may sell for a couple of cycles to build a new DC or use existing ones, but then it will be back to the cloud again for many more cycles.
I've found that the scale these start making sense is only a handful of racks. We're not talking full DCs, but a room in a building or some colo space.
And beyond that, there's all sorts of weird environments that need a lot of local compute. You'd be shocked how many servers are on a cruise ship for one example among many.