> Graviton4 processors deliver up to 30% better compute performance, 50% more cores, and 75% more memory bandwidth than Graviton3.
This seems ambiguous. Presumably this is 50% more cores per chip. What about "30% better compute performance" and "75% more memory bandwidth": is that per core, or per chip? If the latter, then per-core compute performance would actually be lower (1.30 / 1.50 ≈ 0.87, i.e. roughly 13% less per core).
Also, "up to" could be hiding almost anything. Has anyone seen a source with clearer information as to how per-core application performance compares to earlier Graviton generations?
I would assume that "up to" means that for all of the workloads that they benchmarked the best result was 30% better compute performance. Not a very useful number as your workload is very unlikely to hit the right set of requirements to see that uplift.
Because AWS doesn't rent you chips, it rents you cores. If a chip has 50% more cores, that doesn't help you as a renter of cores; it just means AWS gets to rent out more cores per chip.
Of course it might give Amazon room to offer these instances at a lower hourly rate per core, which would ultimately cash out as improved cost/performance for AWS customers.
At least in the "old days" there was (and still is) a secondary market for used server parts.
I don't know how companies like Amazon, Microsoft, and Google would frame a question like this so their "green" narratives aren't hurt, but I'm sure they'll do an excellent job.
They don't sell these. They reuse them and perform maintenance on them until their last breath and part them out once they die.
Hyperscalers design their own datacenter "SKUs" for storage/compute, all the way from power delivery to networking to chassis. These servers are going to be heavily customized and it's unlikely that even if they fit normal form factors that they will work in the same way as COTS devices or things you would buy from Supermicro.
You could possibly make it work. If they sold them. But they don't, and if you're in the market for that stuff, Supermicro will just design it for you anyway, because presumably you have actual money.
And the reality is they're probably either breaking even or greener doing it this way, as opposed to washing their hands of it and selling servers on eBay so they can eventually get thrown into landfills wholesale by nerds once their startups fail or they get bored of them. Just because you stick your head in the sand doesn't mean it doesn't end up in a landfill.
I haven’t heard about CPUs failing that often, though. Usually it’s some other part of the server that dies, like the motherboard. In that light, the grandparent’s question is still valid — normally these servers that “died” would be torn apart and the non-broken parts refurbished and resold on the aftermarket.
No, but I spat out a vague answer rather quickly and was too flippant ("maybe you could do something"), so it's a fair question. Realistically, even the motherboard design, including landing pad on the PCB and boot sequence of the chip, from the root of trust to initial firmware bringup, is going to be custom on systems like Graviton4. For example, these use the Nitro system, which exists as hardware, and it is a key point of the whole design. And AWS designs their services to even resist some level of operator compromise, e.g. an operator trying to exfiltrate secrets from the Nitro system, so the amount of people who can exert influence there is extremely limited. Individual parts like the CPU are as good as useless without the chassis (and power supply, and attached switch equipment) they belong to. Even if you had the whole thing, you might very well not be able to do anything with it, making it as good as a brick.
Even if Nitro was out of the picture or whatever, and you just had the raw package -- it's not like you can really make a motherboard magically from thin air for these devices based on just the CPU pinout, and the tolerances just for power delivery and memory buses are pretty tight, not to mention a gazillion other things.
More broadly, designing compute that is used purely in-house versus large-scale high-volume COTS designs, through e.g. OEM partners, is literally a difference of years and tens or hundreds of millions of dollars. Support, documentation, supply chain relationships, etc. These take a lot of money to do right, and when you buy servers, part of the purchase goes to those departments, to fund them. Most places are better off just talking to Supermicro if they actually need servers, for that reason. But hyperscalers literally save ridiculous amounts of money by doing it themselves and not doing the other things Supermicro does, like OEM work, support, and NRE on generalist designs that are useful outside to third parties.
If you haven't used AWS a lot you might not know this, but the old instance types stick around and you can still use them, especially as "spot" instances, which let you bid for server time.
I had a science project that was CPU-bound, and it turns out that because people bid based on performance, the old chips end up costing about the same in terms of CPU work done per dollar (older chips cost less per hour but also do less).
AWS, though, was by far the most expensive, so switching to something like Oracle with their Ampere ARM instances was a lot cheaper for me.
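Roughly the kind of comparison I mean, as a sketch (describe_spot_price_history is a real boto3 call; the per-vCPU performance factors are made-up placeholders, not measurements):

```python
# Rough sketch: current spot price per vCPU, weighted by a guess at relative
# per-vCPU performance. The "perf" factors below are made-up placeholders
# you'd replace with your own benchmark numbers.
import boto3

CANDIDATES = {
    "c5.xlarge":  {"vcpus": 4, "perf": 1.0},   # older Xeon generation (baseline)
    "c6g.xlarge": {"vcpus": 4, "perf": 1.1},   # Graviton2 (assumed uplift)
    "c7g.xlarge": {"vcpus": 4, "perf": 1.3},   # Graviton3 (assumed uplift)
}

ec2 = boto3.client("ec2", region_name="us-east-1")

for itype, info in CANDIDATES.items():
    resp = ec2.describe_spot_price_history(
        InstanceTypes=[itype],
        ProductDescriptions=["Linux/UNIX"],
        MaxResults=1,  # sample one recent price entry
    )
    price = float(resp["SpotPriceHistory"][0]["SpotPrice"])
    # "Work per dollar" proxy: (vCPUs * relative perf) / hourly spot price.
    work_per_dollar = info["vcpus"] * info["perf"] / price
    print(f"{itype}: ${price:.4f}/hr, ~{work_per_dollar:.1f} perf-units per $")
```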
This - I've been seeing recent Fargate workloads mysteriously scaling due to high CPU even though there's no traffic.
I started logging the CPU as part of task start-up and I've seen five year old Xeons running my workloads.
The price is the same, though, regardless of what I'm getting, and I wouldn't care, except that in my non-prod environments everything runs fine on one class of processor, while in my prod environment things didn't run fine and my cluster was maxed out because it was running on some old processor.
I know that notionally it can and will run on different hardware in different environments, but if I can run a certain workload (idling at 10% in one environment), I expect to be able to do the same in another environment.
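A minimal sketch of that start-up logging (assuming a Linux container where /proc/cpuinfo is readable):

```python
# Minimal sketch of logging which CPU a task landed on at start-up.
# Assumes a Linux container where /proc/cpuinfo is readable.
import platform

def cpu_model() -> str:
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                # x86 kernels expose "model name"; many ARM kernels don't,
                # so fall back to the machine architecture string.
                if line.lower().startswith("model name"):
                    return line.split(":", 1)[1].strip()
    except OSError:
        pass
    return platform.machine()

if __name__ == "__main__":
    print(f"Running on: {cpu_model()}")
```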
Depending on the numbers involved, previous-generation hardware can waterfall down to infrastructure apps that are throughput-based.
Things accessed through network APIs and billed per op or in aggregate. Distributed file systems, databases, even build and regression suite systems.
Another key point is that older generations of servers for full custom cloud environments tend to co-evolve with their environments. The amount of power and cooling for a rack may not support a modern deployment.
Especially if a generation lasts 6 years. You might be able to cascade gen N+1 hardware into a gen N facility, but N+6 may require a full retrofit. A 6-year-old data center that is only partially filled as individual servers fail may justify waiting for N+7 or even N+8 to cover the cost of the downtime and retrofit.
There is a reason Google announced that they are depreciating servers over 6 years and Meta is at 5 years, vs the old accounting standard of 3 years.
Then of course there is a secondary market for memory and standard PCI cards, but the market for 6 year old tech is mainly spares, so it is unlikely to absorb the full size of the N-6 year data center build.
If you are considering a refurb-style resale market for 6-year-old tech, the performance per dollar is often a non-starter because of the amount of power the older tech consumes.
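A rough back-of-the-envelope, with every number a placeholder rather than a measured figure, just to show how power erodes the purchase-price advantage:

```python
# Back-of-the-envelope: does the refurb purchase-price advantage survive
# the power bill? Every number here is a hypothetical placeholder.
HOURS_PER_YEAR = 8766
ELECTRICITY = 0.12   # $/kWh, assumed
YEARS = 3

def cost_per_perf(purchase_usd, watts, relative_perf):
    power_usd = watts / 1000 * HOURS_PER_YEAR * YEARS * ELECTRICITY
    return (purchase_usd + power_usd) / relative_perf

refurb = cost_per_perf(purchase_usd=1_500, watts=600, relative_perf=1.0)
current = cost_per_perf(purchase_usd=8_000, watts=400, relative_perf=3.0)

print(f"refurb box:  ${refurb:,.0f} per unit of performance over {YEARS} years")
print(f"current gen: ${current:,.0f} per unit of performance over {YEARS} years")
```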
They just... don't retire them? The most expensive thing in a DC is the chips, so it's worth it to just build more datacenter space and keep the old ones around.
In 2019, before I left the EC2 Networking / VPC team, we were using M3 instances for our internal services... those machines were probably installed in 2013 or 2014, making them over 5 years old.
With the slowdown in Moore's law and chip speeds, I'd wager that team is still using those M3s now.
Eventually the machines actually start failing, so they need to be retired, but a large portion of machines likely make it to 10 years.
They for sure can find a use for them internally. Hat-tip to the less-shiny teams like Glacier that have to endlessly put out fires on dilapidated old S3 compute/array hand-me-downs.
Not much to discuss until there is pricing. I have a bunch of Graviton2 instances that it didn't make sense to upgrade to any Graviton3 instance, due to the pricing bump for 16 GB / 4 cores (t4g.xlarge).
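A quick sketch of the break-even math I'm describing; the prices and uplift below are placeholders, not real rates:

```python
# Quick break-even check for a Graviton2 -> Graviton3 move. The prices and
# the performance uplift are hypothetical placeholders; plug in the real
# rates for your region and your own benchmark numbers.
g2_price = 0.1344   # $/hr, placeholder for the 4 vCPU / 16 GB Graviton2 size
g3_price = 0.17     # $/hr, placeholder for a newer-generation equivalent
uplift   = 1.25     # assumed Graviton3 speedup for this particular workload

print(f"Graviton2: ${g2_price / 1.0:.4f} per unit of work")
print(f"Graviton3: ${g3_price / uplift:.4f} per unit of work")
# The upgrade only pays off if the price ratio stays below the uplift:
print("worth upgrading:", g3_price / g2_price < uplift)
```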
Neoverse V2, so this will probably be the first widely available ARMv9 server with SVE2, a server-class SKU you can actually get your hands on (i.e. not a mobile phone, Grace, or Fugaku). It's about damn time!
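Once you can actually get a shell on one, a quick way to confirm SVE2 is exposed (Linux/aarch64 lists it in /proc/cpuinfo):

```python
# Quick check for SVE/SVE2 once you can get a shell on an ARM instance:
# on Linux/aarch64 the kernel lists CPU features in /proc/cpuinfo.
def arm_features() -> set:
    feats = set()
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("Features"):
                    feats.update(line.split(":", 1)[1].split())
    except OSError:
        pass
    return feats

features = arm_features()
print("SVE: ", "sve" in features)
print("SVE2:", "sve2" in features)
```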
In my opinion the key takeaway is that compute is becoming commoditized much more rapidly than anyone expected, and that IP is becoming less and less relevant compared to fabrication, energy, and land. For consumers of cloud infrastructure, there is little concern about how many teraflops per cubic meter or per watt-hour they get, outside of very locality-specific edge use cases.
Laptops and phones are already an SoC with IO in a particular form factor, and server farms will go in the same direction, with minor differences in energy or rack density that come out in the wash.
It feels a bit weird that MS, AWS, and probably other cloud owners develop their own CPUs and AI-oriented chips and tell the world about the specs, yet nobody will ever get one to play with in real life. I can't hope to buy one a year from now and stuff it in my home office.
What will all this mean for consumer-oriented CPUs?
Would it be accurate to say that Intel funds part of the development of consumer CPUs with the server CPUs (or is it the other way around)? It seems like Xeon chip advances drip downwards after a while. If AWS and Azure stop buying chips from Intel and AMD, presumably that would be interesting.
It is interesting how late the Cortex X3 (Neoverse V2, as in Graviton4) arrives on servers, when the X4 is already in use and close to shipping in the millions within months or weeks.
By the end of next year we will get 128-core Neoverse V3 / Cortex X4 parts on 3nm, and 3nm Zen 5 EPYC.
Any guesses what the various chips on the package are?
I'd guess maybe the two directly abutting the core die are memory controllers, but maybe they are the stacked memory? Maybe the top and bottom chips are I/O controllers? It felt like destiny that Nitro would eventually be on-package; maybe those are basically big honking Nitro-like chips?
The scale they are quoting, 100,000-chip clusters and 65 exaflops, seems impossible. At 800W per chip, that's 80MW of power! Unless they literally built an entire DC of these things, nobody is training anything on the entire cluster at once. It's probably 10-20 separate datacenters being combined for marketing reasons here.
That's about what I thought the H100 was; it's 700W, actually. But even at, say, 400W, that's 40MW of power. From some quick googling, some datacenters are built in the 40-100MW range, but I really doubt they can actually network 100,000 chips together in any sort of performant way; that's supercomputer-level interconnect. I don't think most datacenters support the highly interlinked network fabric this would need, either.
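Spelling out the arithmetic we're both doing (the per-chip wattages are guesses, not published figures):

```python
# Total cluster power for 100,000 chips at a few assumed per-chip power
# draws (guesses, not published figures).
CHIPS = 100_000

for watts_per_chip in (200, 400, 800):
    total_mw = CHIPS * watts_per_chip / 1_000_000
    print(f"{watts_per_chip} W/chip -> {total_mw:.0f} MW for {CHIPS:,} chips")
```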
They have instances with 16 chips, so I presume there are at least 16 chips per server. I'd also expect the power consumption to be more like 100-200W, given they seem more like Google's TPUs than an H100.
As for the interconnect, I doubt this is their typical fabric, but it doesn't seem completely unreasonable. Even when not running massive clusters, they'll still need it to tie together the random collections of machines people are using.
Well, think of it this way -- individual 1U servers can easily consume 1000W, or 1kW. Put about forty of those in a single rack, and that's 40kW. Divide 80MW for the datacenter by 40kW per rack and that's not very many racks to comprise the entire datacenter, right?
The footprint for 2,000 racks would be over 1,000 m²; when you add the necessary spacing as well as supplementary utilities (power/networking), that probably means double that footprint.
I guess at the scale these companies are operating at, it's not that big, but that's still quite a large building!
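The same back-of-the-envelope in code, with an assumed ~0.6 m² per-rack footprint before aisles:

```python
# Racks needed for an 80 MW deployment and the raw floor area. The ~0.6 m^2
# per-rack footprint is an assumption (roughly a 600 mm x 1000 mm rack),
# before aisles and utility space.
DC_POWER_W   = 80_000_000    # 80 MW for the whole cluster
RACK_POWER_W = 40 * 1_000    # ~forty 1 kW 1U servers per rack
RACK_AREA_M2 = 0.6           # assumed footprint of a single rack

racks = DC_POWER_W / RACK_POWER_W
area = racks * RACK_AREA_M2

print(f"racks needed: {racks:.0f}")                      # ~2000
print(f"raw rack footprint: {area:.0f} m^2")             # ~1200 m^2
print(f"with aisles and utilities: ~{2 * area:.0f} m^2")
```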