
Thanks! I think computers are fun and I want reading about them to be fun too.

I was also reminded of HazyResearch's MegaKernels. Didn't want to distract from the main thrust of the post, but definitely think that's a promising approach.


There's some interesting work in NeurIPS this year on fused kernels for MoE too: https://flash-moe.github.io/

Hey, one of the authors here!

Reductively, software engineering means taking an idea and mapping it into code. So one form of "reverse" engineering would be taking the code and extracting the ideas. That's what we did here.

Because the source is public, there's quite a lot to work with from the start -- the warp specializations are named and there are helpful comments in many places.

But for many components, we didn't have much. Maybe the clearest case of "reverse engineering" explained in the post is with the cubic approximation for the rational part of the exponentiation. That required staring at some inline assembly and doing math.
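
(To give a flavor of what that looks like -- this is not the kernel's actual code or constants, just a minimal C++ sketch of the standard trick: split off the integer part of the exponent, evaluate a small polynomial on the fractional part, and scale by a power of two. The cubic coefficients below come from a Hermite fit of 2^f on [0, 1); a real kernel would use minimax-fitted constants and bit manipulation rather than library calls.)

    #include <cmath>
    #include <cstdio>

    // Hypothetical sketch, not the kernel under discussion: approximate 2^x by
    // splitting x into integer and fractional parts. The cubic below matches the
    // value and slope of 2^f at f = 0 and f = 1 (max relative error ~0.06%).
    float exp2_cubic(float x) {
        float n = std::floor(x);   // integer part
        float f = x - n;           // fractional part in [0, 1)
        float p = 1.0f + f * (0.6931472f + f * (0.2274113f + f * 0.0794415f));
        return std::ldexp(p, static_cast<int>(n));   // p * 2^n
    }

    int main() {
        const float xs[] = {-1.5f, 0.0f, 0.5f, 3.25f};
        for (float x : xs)
            std::printf("2^%.2f ~= %.6f (exact %.6f)\n", x, exp2_cubic(x), std::exp2(x));
    }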


I've never heard this definition of reverse engineering -- when one has the actual, unobfuscated source code, I'd usually call it reading the code, or something like summarization.

Not trying to be uncharitable, I found your article informative. Reverse engineering has historically been reserved for cases where there is an adversarial aspect, as with binaries or server APIs. Anyhow, cheers and thank you, sincerely.


That is the traditional explanation of why it is called reverse engineering. The term originated in hardware engineering. When it was originally applied to software, it was common to create requirements documents and design documents before coding, even if the actual process did not strictly follow the "waterfall" idea.

Thus it was natural to call the process of producing design documents from undocumented software "reverse engineering". These days coding without any formal design documents is so common that it seems the original meaning of reverse engineering has become obscured.


In what time period and field did you come across this usage? As I've seen it used, 'reverse engineering' generally referred to creating docs from executables or watching network protocols, rather than from source.

Back in the 1990's. As an example, back then the Rational Rose design software had a feature to generate UML diagrams from existing source code, and it was called "reverse engineering".

https://en.wikipedia.org/wiki/IBM_Rational_Rose


Having the source code and understanding how it works are two different things, especially when it runs on state-of-the-art hardware. If I had just read the source I would not have gained as much knowledge as this article taught me. Where did this extra info come from? They read the source too, but then they did something more. I wouldn't call it summarization either, as any summary I wrote about the code would pale in comparison.

I think "explained" is a reasonable term for this. If I remember correctly there where books of the form "The Linux Source Code Explained".

Certainly I can't get on board with reverse engineered.


That time when I reverse engineered J.R.R. Tolkien's Lord of the Rings from symbols engraved on dead trees. Took me three summers…

It's more properly just software archaeology: recovering design intent from artifacts. https://en.m.wikipedia.org/wiki/Software_archaeology

You've never had to reverse engineer the thinking and ideas that went behind code written by someone else/you a year ago?

No, because so far you "engineered" nothing. You just studied it, tried to understand it, and explain or teach it.

If you had reverse engineered it, you would have tried to recreate something that does not exist in order to do the same thing.

So, if you have binary code, you recreate source code that in theory could allow you to recreate the binary.

If you have the source code, I guess it would only count when you are missing pieces of information that would allow you to run the code the way others do...


Disagree that reverse engineering necessarily requires something to be recreated.

For example, simple hardware reversing can just be learning what, how, and why something works; you don't need to "recreate" anything other than ideas.


You guys are being obtuse. Engineering is turning a spec into a more technical artifact, whether that's source code, machine code, physical hardware, or something else. Reverse engineering is then reversing the process of engineering: recovering the semantic artifact from the engineering artifact. That the OP is using the term in the sense of recovering the semantic insights from the CUDA kernels is a fine application of the concept.

I have to say this is kind of funny given that you also had this in the blog post:

> cudnn kernels are closed source, so Jensen only knows what’s going on in there.


It's the 'hacker' argument all over again.

I reverse engineered above comment by reading it and extracting the idea.

Cool paper! The authors use the fact that the M1 chip supports both ARM's weaker memory consistency model and x86's total order to investigate the performance hit from using the latter, ceteris paribus.

They see an average of 10% degradation on SPEC and show some synthetic benchmarks with a 2x hit.


This comment is a two sentence summary of the six sentence Abstract at the very top of the linked article. (Though the paper claims 9%, not 10% -- to three sig figs, so rounding up to 10% is inappropriate.)

Also -- 9% is huge! I am kind of skeptical of this result (haven't yet read the paper). E.g., is it possible the TSO mode on this ARM chip isn't optimal, providing weaker relative performance than a TSO-native platform like x86?

> An application can benefit from weak MCMs if it distributes its workload across multiple threads which then access the same memory. Less-optimal access patterns might result in heavy cache-line bouncing between cores. In a weak MCM, cores can reschedule their instructions more effectively to hide cache misses while stronger MCMs might have to stall more frequently.

So to some extent, this is avoidable overhead with better design (reduced mutable sharing between threads). The impact of TSO vs WO is greater for programs with more sharing.

> The 644.nab_s benchmark consists of parallel floating point calculations for molecular modeling. ... If not properly aligned, two cores still share the same cache-line as these chunks span over two instead of one cache-line. As shown in Fig. 5, the consequence is an enormous cache-line pressure where one cache-line is permanently bouncing between two cores. This high pressure can enforce stalls on architectures with stronger MCMs like TSO, that wait until a core can exclusively claim a cache-line for writing, while weaker memory models are able to reschedule instructions more effectively. Consequently, 644.nab_s performs 24 percent better under WO compared to TSO.

Yeah, ok, so the huge magnitude observed is due to some really poor program design.

> The primary performance advantage applications might gain from running under weaker memory ordering models like WO is due to greater instruction reordering capabilities. Therefore, the performance benefit vanishes if the hardware architecture cannot sufficiently reorder the instructions (e.g., due to data dependencies).

Read the thing all the way through. It's interesting and maybe useful for thinking about WO vs TSO mode on Apple M1 Ultra chips specifically, but I don't know how much it generalizes.
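
To make the TSO-vs-WO difference concrete, here's a sketch (mine, not the paper's) of the classic message-passing pattern in C++:

    #include <atomic>
    #include <cassert>
    #include <thread>

    // Illustration only (not from the paper): the classic "message passing"
    // pattern. On x86/TSO hardware, plain (relaxed) stores and loads already
    // behave as intended, because stores aren't reordered with stores and loads
    // aren't reordered with loads. On weakly ordered ARM hardware, the relaxed
    // version can observe flag == 1 but data == 0, so the release/acquire pair
    // below is needed -- and, conversely, the hardware gets extra freedom to
    // reorder around ordinary accesses, which is where the paper's WO speedups
    // come from. (The C++ memory model formally allows the reordering either
    // way; this is about what the silicon does in practice.)
    std::atomic<int> data{0}, flag{0};

    void producer() {
        data.store(42, std::memory_order_relaxed);
        flag.store(1, std::memory_order_release);   // weaken to relaxed: fine on TSO, broken on ARM
    }

    void consumer() {
        while (flag.load(std::memory_order_acquire) == 0) { }
        assert(data.load(std::memory_order_relaxed) == 42);
    }

    int main() {
        std::thread a(producer), b(consumer);
        a.join();
        b.join();
    }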


My understanding is that x86 implementations use speculation to be able to reorder beyond what's allowed by the memory model. This is not free in area and power, but allows recovering some of the cost of the stronger memory model.

As TSO support is only a transitional aid for Apple, it is possible that they didn't bother to implement the full extent of the optimizations possible.


Or chose not to fully implement it. Speculative execution has its share of security issues, so they may have chosen to be cautious.


Based on the value speculation they do, side-channel security doesn't seem to have been one of the primary goals.


I’m not an expert… but it seems like it could be even simpler than program design. They note false sharing occurs due to data not being cacheline aligned. Yet when compiling for ARM, that’s not a big deal due to WO. When targeting x86, you would hope the compiler would work hard to align them! So the out of the box compiler behavior could be crucial. Are there extra flags that should be used when targeting ARM-TSO?


False sharing mostly needs to be avoided with program design. I'm not aware of any compiler flags that help here.
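
For what it's worth, the standard structural fix is to give each thread's hot data its own cache line. A minimal C++ sketch (the 64-byte line size is an assumption; std::hardware_destructive_interference_size is the portable spelling where available):

    #include <atomic>
    #include <thread>
    #include <vector>

    // Each counter gets its own cache line, so concurrent updates from
    // different threads don't bounce a shared line between cores.
    struct alignas(64) PaddedCounter {
        std::atomic<long> value{0};
    };

    PaddedCounter counters[4];

    void bump(int t, long iters) {
        for (long i = 0; i < iters; ++i)
            counters[t].value.fetch_add(1, std::memory_order_relaxed);
    }

    int main() {
        std::vector<std::thread> threads;
        for (int t = 0; t < 4; ++t)
            threads.emplace_back(bump, t, 10000000L);
        for (auto& th : threads) th.join();
    }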


This raises questions.

For example, modern x86 architectures still readily out-perform ARM64 in performance-engineered contexts. I don't think that is controversial. There are a lot of ways to explain it, e.g. x86 is significantly more efficient in some other unrelated areas, x86 code is tacitly designed to minimize the performance impact of TSO, or the Apple Silicon implementations nerf TSO because it isn't worth the cost to optimize a compatibility shim. TSO must have some value in some contexts; it wasn't chosen arbitrarily.

Apple Silicon is also an unconventional implementation of ARM64, so I wonder the extent to which this applies to any other ARM64 implementation. I’d like to see more thorough and diverse data. It feels like there are confounding factors.

I think it is great that this is being studied, I’m just not sure it is actionable without much better and more rigorous measurement across unrelated silicon microarchitectures.


The programs that see the most benefit of WO vs TSO are poorly written multithreaded programs. Most of the software you actually use might be higher quality than that?

> TSO must have some value in some contexts, it wasn’t chosen arbitrarily.

Ehhh. I think they might have just backed themselves into it? I believe Intel initially claimed SeqCst but the chips never implemented that and the lack was observable. TSO happened to accurately describe the existing behavior of early multicore Intel chips and they can't exactly relax it now without breaking existing binaries.

Google's AI slop claims Intel published something vague in 2007, and researchers at Cambridge came up with the TSO name and observation in 2009 ("A Better x86 Memory Model: x86-TSO").

https://www.cl.cam.ac.uk/~pes20/weakmemory/x86tso-paper.tpho...


Intel initially claimed Processor Ordering, which, IIRC, allows processors doing independent reads of independent writes (IRIW) to observe different orderings. This is slightly weaker than TSO.

In practice Intel never took advantage of this and, given the guarantees provided by the memory barriers, it was hard to formally recover SC, so Intel slightly strengthened it to TSO, which is what was actually implemented in hardware anyway.

I don't think Intel ever claimed SC since their first CPU with built-in support for cache coherency (it was the PPro, I think?), and the memory model was not well defined before that and was left to external chips.
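
For anyone who hasn't seen it, IRIW is a litmus test with two independent writers and two readers that read the variables in opposite orders; the question is whether the readers can disagree about the order of the writes. A C++ sketch (illustrative, not from any spec):

    #include <atomic>
    #include <thread>

    // Sketch of the IRIW litmus test mentioned above: two threads write x and y
    // independently, two threads read them in opposite orders. The "disagreeing"
    // outcome r1==1 && r2==0 && r3==1 && r4==0 means the readers saw the writes
    // in different orders. TSO (and seq_cst here) forbids it; the weaker model
    // described in the parent comment would permit it. Swap seq_cst for
    // acquire/relaxed and the C++ model permits it too.
    std::atomic<int> x{0}, y{0};
    int r1, r2, r3, r4;

    int main() {
        std::thread w1([] { x.store(1, std::memory_order_seq_cst); });
        std::thread w2([] { y.store(1, std::memory_order_seq_cst); });
        std::thread rd1([] { r1 = x.load(std::memory_order_seq_cst);
                             r2 = y.load(std::memory_order_seq_cst); });
        std::thread rd2([] { r3 = y.load(std::memory_order_seq_cst);
                             r4 = x.load(std::memory_order_seq_cst); });
        w1.join(); w2.join(); rd1.join(); rd2.join();
        // With seq_cst, (r1,r2,r3,r4) == (1,0,1,0) can never be observed.
    }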


The Apple M4 CPU is pretty much king in terms of single-threaded performance. In multithreaded workloads the M4 Ultra of course loses against extreme high-core-count server CPUs. But I think it's wrong to say that x86 readily outperforms ARM64. Apple essentially dominates in all CPU segments they are in.


I can’t replicate that on the server on a per core basis, which is the only thing I care about.

As a key exhibit, AVX-512 native code destroys Apple Silicon. To be clear, I like and use Apple Silicon, but they can't carry a workload like x86 and they seem uninterested in trying. Super-efficient for scalar code though.


But x86_64 does outperform ARM64 in high-performance workloads. High-performance workloads are not single-threaded programs. Maybe if Apple decides one day to manufacture a server CPU, which I believe they will not, since they would have to open their chips to Linux. OTOH server aarch64 implementations such as Neoverse or Graviton are not as good as x86_64 in terms of absolute performance. Their core design cannot yet compete.


> Maybe if Apple decides one day to manufacture the server CPU, which I believe they will not since they would have to open their chips to Linux.

They already are open enough to boot and run Linux, the things that Asahi struggles with are end-user peripherals.

> OTOH server aarch64 implementations such Neoverse or Graviton are not as good as x86_64 in terms of absolute performance. Their core design cannot yet compete.

These are manufactured on far older nodes than Apple Silicon or Intel x86, and it's a chicken-egg problem once again - there will be no incentive for ARM chip designers to invest into performance as long as there are no customers, and there are no customers as long as both the non-Apple hardware has serious performance issues and there is no software optimized to run on ARM.


> They already are open enough to boot and run Linux, the things that Asahi struggles with are end-user peripherals.

That's for entertainment and for geeks such as ourselves but not realistically for hosting a service in a data center that millions of people would depend on.

> These are manufactured on far older nodes than Apple Silicon

True, but I don't think this is the main bottleneck, though perhaps it is. IMO it's the core design that is lacking.

> there will be no incentive for ARM chip designers to invest into performance as long as there are no customers

Well, AWS is hosting a multitude of Graviton4-based EC2 instances (Neoverse V2 cores). This implies that there are customers.


> Well, AWS is hosting a multitude of their EC2 instances - Graviton4 (Neoverse V2 cores). This implies that there are customers.

AWS has a bit of a different cost-benefit calculation though. For them, similar to Apple, ARM is a hedge against the AMD/Intel duopoly, and they can run their own services (for which they have ample money for development and testing) far more cheaply because the power efficiency of ARM systems is better than x86's - and, like in the early days of AWS, which started off as Amazon selling spare compute capacity, they expose to the open market what they don't need.


Sure, there's a different cost-benefit calculation. My argument was that there is an incentive to optimize for ARM64 because that translates to $$$. It's not only Amazon but Oracle and Microsoft too.


> That's for entertainment and for geeks such as ourselves but not realistically for hosting a service in a data center that millions of people would depend on.

Why not? Well, form factor is an issue. But you can easily fit a few Mac Pros in a couple of Us. Support is generally better than with some HP or Dell servers.


Are you serious? But maybe you're not aware of how such businesses are run - Linux is not officially supported by Apple, and someone has to take the liability when something goes wrong, whether you lose your data or your CPU melts down or whatever.


Do you think HP or Dell will take liability? Tell me you have never dealt with any large OEM without telling me you have never dealt with any large OEM. No way they will take any responsibility for loss of life, data loss, or literally anything at all. The best they do is send some cannon fodder to replace the hardware if it fails. Perhaps it's different if you have a few hundred thousand of their devices running, but my experience with small operations is that it's basically impossible to deal with them.


You're misrepresenting what CPUs do exactly, and the opaque term "high-performance workloads" does not help, either. M-class chips have per-"socket" memory bus options of 256 bits (M4) and 512 bits (M3 Max), going as high as 1024 bits total in the M2 Ultra, which is significantly wider than the 64-bit and 128-bit DDR5 buses you get in x86 CPUs. For example, my relatively modern datacenter AMD EPYC 8434PN CPU (based on Zen 4c cores) is a six-channel DDR5 part, effectively 384-bit at 200 GB/s bidirectional bandwidth. Apple Silicon beats it by a factor of 5x. You can get somewhat better with Turin, but not by much, and at a perhaps unreasonable premium.

Now, like with everything in life, there are of course highly specialised datapaths like AVX-512, but then again these only contribute towards single-threaded performance, and you yourself said that "High-performance workloads are not single-threaded programs." As your compute network grows larger, the implementation details of the memory fabric (NUMA, etc.) become more pronounced. Suffice it to say the SoC co-packaging of CPU and GPU cores, along with some coprocessors, did wonders for Apple Silicon. Strix Halo exists, but it's not competitive by any stretch of the imagination. You could say it's unfair, but then again, the AMD MI300A (LGA6096 socket) exists, too! Do we count 20k APUs that only come in eights, bundled up in a proprietary Infinity Fabric-based chassis, towards "outperforming ARM64 in high-perf workloads"... really? Compute-bound is a far cry from high-performance, where the memory bus and the idiosyncrasies of message-passing are King as the number of cores in the compute network continues to grow.


> Apple Silicon beats it by a factor of 5x

Really, 1TB/s of memory bandwidth to and from system memory?

I don't believe it, since that's impossible from a HW-limits PoV - there's no DRAM that would allow such performance, and Apple doesn't design their memory sticks ...

Their 512-, 768-, or 1024-bit memory interface is also nothing special, since it is not designed by them, nor is it exclusive to Apple. Intel has it. AMD has it as well.

However, regardless of that, and regardless of the way you're skewing the facts, I would be happy to see a benchmark that shows, for example, a sustained load bandwidth of 1TB/s. Do you have one? I couldn't find it.

> You can get somewhat better with Turin

High-end Intel/AMD server-grade CPUs can achieve a system memory bandwidth of 600-700GB/s. So not somewhat better but 3x better.


> Really, 1TB/s of memory bandwidth to and from system memory?

5x is false; it's more like 4x. Apple doesn't use memory sticks, they use on-package DRAM ICs.

The M3 Ultra has 8 memory channels at 128 bits per channel, for a 1024-bit memory bus in total. It uses LPDDR5-6400, so it has 1024 bits * 6400 MT/s / 8 bits per byte = 819.2 gigabytes per second of memory bandwidth.
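
That's the theoretical figure, of course; sustained bandwidth is what a STREAM-style benchmark measures. A rough C++ sketch of the triad kernel (array sizes and thread count are illustrative; a serious run would pin threads, handle NUMA first-touch, and take the best of several passes):

    #include <algorithm>
    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Rough sketch of a STREAM-style triad to estimate *sustained* bandwidth,
    // as opposed to the theoretical bus figure. Counts 3 streams per element
    // (two reads + one write); no warm-up pass, so treat results as indicative.
    int main() {
        const std::size_t n = std::size_t(1) << 26;   // 64 Mi doubles (~512 MiB) per array
        std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);
        const unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());

        auto t0 = std::chrono::steady_clock::now();
        std::vector<std::thread> pool;
        for (unsigned t = 0; t < nthreads; ++t)
            pool.emplace_back([&, t] {
                const std::size_t lo = n * t / nthreads, hi = n * (t + 1) / nthreads;
                for (std::size_t i = lo; i < hi; ++i)
                    c[i] = a[i] + 3.0 * b[i];         // triad
            });
        for (auto& th : pool) th.join();
        auto t1 = std::chrono::steady_clock::now();

        const double secs  = std::chrono::duration<double>(t1 - t0).count();
        const double bytes = 3.0 * double(n) * sizeof(double);
        std::printf("sustained triad bandwidth: %.1f GB/s\n", bytes / secs / 1e9);
    }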


You're deceiving yourself and falling for Apple marketing. Regardless of whether it's stick or SoC memory, which has been the case with pretty much every SoC since the 2010s (nowadays I have no idea), it is not possible to drive the memory at such high speeds.


This is definitely citation needed. I very much expect a combined GPU/CPU/NPU load to saturate the memory channels if necessary. This is not some marketing fluff. The channels are real, and the RAM ICs are physically there and connected.


We are talking about the memory bandwidth available to the CPU cores and not all the co-processors/accelerators present in the SoC, so you're pulling in an argument that is not valid.

https://web.archive.org/web/20240902200818/https://www.anand...

> While 243GB/s is massive, and overshadows any other design in the industry, it’s still quite far from the 409GB/s the chip is capable of.

> That begs the question, why does the M1 Max have such massive bandwidth? The GPU naturally comes to mind, however in my testing, I’ve had extreme trouble to find workloads that would stress the GPU sufficiently to take advantage of the available bandwidth.

> Granted, this is also an issue of lacking workloads, but for actual 3D rendering and benchmarks, I haven’t seen the GPU use more than 90GB/s (measured via system performance counters)


The cited article is pretty clear: the M1 Max maxes out at approximately 100 GB/s for a single CPU core, 243 GB/s for a CPU cluster, and 409 GB/s for the entire SoC.

They did not (or, rather, could not) measure the theoretical peak GPU core saturation for the M1 Max SoC because such benchmarks did not exist at the time, due to the sheer novelty of such wide hardware.


> The cited article is pretty clear: the M1 Max maxes out at approximately 100 GB/s for a single CPU core, 243 GB/s for a CPU cluster, and 409 GB/s for the entire SoC.

So, which part of "We are talking about the memory bandwidth available to the CPU cores and not all the co-processors/accelerators present in the SoC" you didn't understand?


There is no need to respond with an insult if one can't hold an amicable and civilised conversation concerning an interesting technical matter.


I don't think it was an insult, at least it is not what I intended, but rather trying to make the fallacy in your response to my comment more explicit. I don't know of a better way, and I can't run circles around people in comments just to prove my point.


Conversations are almost never about proving a point; they are about exchanging ideas (however controversial or conflicting they may be), contemplating and debating the ideas, and drawing insights from that. A disagreement can always be expressed softly.

Personally, I learn, reflect, and gain a lot from engaging in conversations with other people, as – not infrequently – the others complement my understanding of points I might have previously not considered or missed. I call it knowledge, and knowledge is power.


Well I think we talked about memory channels and the maximum speed reachable. And you claimed it was marketing fluff. I don't think it's unreasonable to say that if that speed is reachable using some workload it's not marketing fluff. It was not clear at all to me you limited your claims to CPU speed only. Seems like a classic motte-and-bailey to me.


https://web.archive.org/web/20250125040351/anandtech.com/sho...

You're realistically going to hit power/thermal limits before you saturate the memory bandwidth. Otherwise, I'd like to hear about a workload that uses the CPU, GPU, NPU, etc. heavily enough to back up Apple's marketing point.


POWER10 has offered that much for a while now. Per socket. And you can join up to, IIRC, 16 sockets together into a coherent single-Linux-kernel machine.


Not sure which part of my comment you were referring to, but if it was about the 1TB/s of memory bandwidth, it seems it is rather 409GB/s per socket.

From https://www.ibm.com/support/pages/ibm-aix-power10-performanc...

> The Power10 processor technology introduces the new OMI DIMMs to access main memory. This allows for increased memory bandwidth of 409 GB/s per socket. The 16 available high-speed OMI links are driven by 8 on-chip memory controller units (MCUs), providing a total aggregated bandwidth of up to 409 GBps per SCM. Compared to the Power9 processor-based technology capability, this represents a 78% increase in memory bandwidth.

And that is again a theoretical limit, which usually isn't that interesting; what matters is the practical limit the CPU is able to hit.


For one, OMI is, like PCIe, full-duplex; second, OMI with DDR4-3200 is substantially lacking in throughput vs. e.g. the GDDR6 that was shown in the early Power10/OMI slides.

Also worth noting is the substantial overprovisioning of the lanes to handle lane-localized transmission issues without degrading observed performance.


You're right, I looked it up: the hardware limit is actually 800 GB/s for the M2 Ultra. You're also right that the actual bandwidth in real workloads is typically lower than that, due to the aforementioned idiosyncrasies in caches, message-passing, prefetches or lack thereof, etc. The same is the case for any high-end Intel/AMD CPU, though. If you wish to compare benchmarks, the single most relevant benchmark today is LLM inference, where M-series chips are a contender to beat. This is almost entirely due to the combination of high-bandwidth, high-capacity (192 GB) on-package DRAM available to all CPU and GPU cores. The closest x86 contender is AMD Strix Halo, and it's only somewhat competitive in high-sparsity, small-MoE setups. NVIDIA were going to produce a desktop one based on their Grace superchip, but it turned out to be a big nothing.

Now, I'm not sure whether it's fair to compare Apple Silicon to AMD's Turin architecture, where 600 GB/s is theoretically possible, considering that at this point you're talking about a 5K euro CPU with a smidge under 600W TDP. This is why I brought up Sienna specifically, which gives comparable performance in a comparable price bracket and power envelope. Have you seen how much 12 channels of DDR5-6400 would set you back? The "high-end AMD server-grade" system, to borrow your words, would set you back 10K at a minimum, and it would still have zero GPU cores, and you would still have a little less memory bandwidth than a three-year-old M2 Ultra.

I own both a Mac Studio and a Sienna-based AMD system.

There are valid reasons to go for x86, mainly its PCIe lanes, various accelerator cards, MCIO connectivity for NVMe stuff, hardware IOMMU, SR-IOV networking and storage; in fact, anything having to do with hardware virtualisation. This is why people get "high-end" x86 CPUs, and indeed, this is why I used Sienna for the comparison, as it's at least comparable in terms of price, and not some abstract idea of redline performance, where x86 CPUs, by the way, absolutely suck at the single most important general-purpose task, i.e. LLM inference. If you were going for the utmost bit of oomph, you would go for a superchip anyway. So your choice is not even whether you're getting a CPU; instead it's how big and wide you wish your APU cluster to be, and what you're using for interconnect, as that's the largest contributing factor to your setup.

Update: I was unfair in my characterisation of NVIDIA DGX Spark as "big nothing," as despite its shortcomings, it's a fascinating platform in terms of connectivity: the first prosumer motherboard to natively support 200G, if I'm not mistaken. Now, you could always use a ConnectX-6 in your normal server's PCIe 5.0 slot, but that would already set you back many thousands of euros for datacenter-grade server specs.


Memory bandwidth is just a marketing term for Apple at this point. Sure, the bus is capable of reaching that bandwidth, but how much can your code actually use? You'd be mistaken if you think the CPU can make use of all that bandwidth, or even the GPU!


It's solely dependent on the workload's memory access patterns. The higher you go in thread count, the more you're constrained by contention, caches, etc. The paper in the OP demonstrates how relatively subtle differences in the memory model lead to substantial differences in performance on actual hardware. The same as having lots of FLOPS on paper doesn't necessarily mean you'll get to use all that compute, if you're waiting on memory all the time. M-series processors have a packaging advantage that is very hard to beat, and indeed is yet to be beaten, in the consumer and prosumer segments.

See my reply to the adjacent comment; hardware is not marketing, and LLM inference stands as witness.


> The same as having lots of FLOPS on paper doesn't necessarily mean you'll get to use all that compute, if you're waiting on memory all the time.

The opposite case is also possible. You can be compute limited. Or there could be bottlenecks somewhere else. This is definitely the case for Apple Silicon because you will certainly not be able to make use of all of the memory bandwidth from the CPU or GPU. As always, benchmark instead of looking at raw hardware specifications.


> […] but how much can your code actually use?

All of it, and it is transparent to the code. The correct question is «how much data does the code transfer?»

Whether you are scanning large string ropes for a lone character or multiplying huge matrices, no manual code optimisation is required.


Are you well-read enough into the platform so that you can attest to it requiring no manual code optimisation for high-performance datapaths? I'm only familiar with Apple Silicon-specific code in llama.cpp, and not really familiar with either Accelerate[0] or MLX[1] specifically. Have they really cracked it at homogenous computing so that you could use a single description of computation, and have it emit efficient code for whatever target in the SoC? Or are you merely referring to the full memory capacity/bandwidth being available to CPU in normal operation?

[0]: https://developer.apple.com/documentation/accelerate

[1]: https://ml-explore.github.io/mlx/build/html/usage/quick_star...


> Are you well-read enough into the platform so that you can attest to it requiring no manual code optimisation for high-performance datapaths?

Yes.

> Have they really cracked it at homogenous computing […]

Yes.

> have it emit efficient code […]

Yes. I had also written compilers and code generators for a number of platforms (all RISC) decades before Apple Silicon became a thing.

> […] for whatever target in the SoC?

You are mistaking the memory bus width that I was referring to for CPU-specific optimisations. You are also disregarding the fact that the M1-M4 Apple SoCs have the same internal CPU architecture, differing mostly in the ARM instruction sets they support (ARM64 v8.2 in the M1 through to ARM64 v8.6 in the M4).

> Or are you merely referring to the full memory capacity/bandwidth being available to CPU in normal operation?

Yes.

Is there truly a need to be confrontational in what otherwise could have become an insightful and engaging conversation?


Have you tested it or is that just what you expect?


Yes, I have actually tested it.

Others also have. The https://lemire.me/blog/ blog has a wealth of insights across multiple architectures, which include all of the current incumbents (Intel, Apple, Qualcomm, etc.)

Do you have any detailed insights? I would be eager to assess them.


There is also IBM's POWER11, with regard to memory bandwidth :)


It's quite impressive what they were able to achieve with ppc64el in recent years, including Linux support for it, too. Unfortunately, they turned the wrong way with proprietary encryption of memory, which may or may not be deliberate as far as backdoors come and go, but in all honesty so much of it is contingent on IBM's proprietary fabric (OSC or what was it?) implementation for tiered memory anyway. There are similar setups from Samsung, even including fully transparent swapping to NVMe for persistence, which is really cool and hard to match in an open source setting.

I think their slogan could be "unlimited, coherent, persistent, encrypted high-bandwidth memory is here, and we are the only ones that really have it."

Disclaimer: proud owner of a thoroughbred OpenPOWER system from Raptor


Yeah, and outside of benchmarks you also have to consider the power envelope and the platform on top, which is definitely out somewhere on its own.


I’ve seen the stronger x86 memory model argued as one of the things that affects its performance before.

It's neat to see real numbers on it. The effect didn't seem to be very big in many circumstances, which I guess would have been my guess.

Of course Apple just implemented that on the M1 and AMD/Intel had been doing it for a long time. I wonder if later M chips reduced the effect. And will they drop the feature once they drop Rosetta 2?


I'm really curious how exactly they'll wind up phasing out Rosetta 2. They seem to be a bit coy about it:

> Rosetta was designed to make the transition to Apple silicon easier, and we plan to make it available for the next two major macOS releases – through macOS 27 – as a general-purpose tool for Intel apps to help developers complete the migration of their apps. Beyond this timeframe, we will keep a subset of Rosetta functionality aimed at supporting older unmaintained gaming titles, that rely on Intel-based frameworks.

However, that leaves much unsaid. Unmaintained gaming titles? Does this mean native, old macOS games? I thought many of them were already no longer functional by this point. What about Crossover? What about Rosetta 2 inside Linux?

https://developer.apple.com/documentation/virtualization/run...

I wouldn't be surprised if they really do drop some x86 amenities from the SoC at the cost of performance, but I think it would be a bummer if they dropped Rosetta 2 use cases that don't involve native apps. Those ones are useful. Rosetta 2 is faster than alternative recompilers. Maybe FEX will have bridged the gap most of the way by then?


> However, that leaves much unsaid. Unmaintained gaming titles? Does this mean native, old macOS games? I thought many of them were already no longer functional by this point. What about Crossover? What about Rosetta 2 inside Linux?

Apple keeps trying to be a platform for games. Keeping old games running would be a step in that direction. Might include support for x86 games running through wine/apple game porting toolkit/etc


> Apple keeps trying to be a platform for games. Keeping old games running would be a step in that direction. Might include support for x86 games running through wine/apple game porting toolkit/etc

Well... They'd need to bring back 32-bit support also then. This is what killed most of my Mac-compatible Steam library....

And I do not see that happening.


They dropped Rosetta 1; what makes you think they will keep supporting this one?


Rosetta 1 was licensed third-party technology, back when the company wasn't exactly rolling in money.

https://www.wikipedia.org/wiki/QuickTransit

If you have to pay the licensing fee again every time you want to release a new version of the OS, you've got a fiscal incentive to sunset Rosetta early.

Rosetta 2 was developed in-house.

Apple owns it, so there is no fiscal reason to sunset it early.


> Apple owns it, so there is no fiscal reason to sunset it early.

Except not having to pay to maintain it.


> so there is no fiscal reason to sunset it early.

Silicon (or verification thereof) isn't free.


Rosetta 1 wasn't really useful for much because PowerPC was a dead platform by the time Apple switched off of it. Rosetta 2 is used for much more than just compatibility with old macOS apps.


I think they’re trying to maintain the stick for ordinary “Cocoa” app developers, but otherwise leave themselves the room to keep using the technology where it makes sense.


Author here! Worked on this project generating QR codes with diffusion models and ensuring they actually scan.

Ended up having to scan a few thousand generated codes by hand in order to build up a dataset to evaluate fully automated systems against.

We landed on QReader. It was such a good match for a human with an iPhone that we were able to use it to scale up inference -- basically, generate eight codes, each of which has a ~30% chance of scanning, then show the user any that did scan. That gives you a (1-0.3)^8 ~= 6% chance of failure.
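
(If you want to play with the batch-size math: with a per-code scan probability p and a batch of n codes, the chance that none scan is (1-p)^n, so the batch size needed for a target failure rate eps is ceil(ln(eps)/ln(1-p)). A toy sketch, with p and eps as illustrative parameters rather than project numbers:)

    #include <cmath>
    #include <cstdio>

    // Toy calculation, not project code: how many codes to generate so that
    // the chance of *none* of them scanning stays below a target.
    int main() {
        const double p   = 0.30;   // assumed per-code scan probability
        const double eps = 0.05;   // target probability that the whole batch fails

        const int n = static_cast<int>(std::ceil(std::log(eps) / std::log(1.0 - p)));
        std::printf("batch size needed: %d\n", n);                  // -> 9
        std::printf("failure rate with a batch of 8: %.1f%%\n",
                    100.0 * std::pow(1.0 - p, 8));                  // -> ~5.8%
    }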


I used to print cards that had an image (say, a photo or art reproduction) on the front and a QR code and documentation on the back.

I like to stick the cards to the wall, in which case you can't scan the QR code without removing the card, so lately I've transitioned to alpha-blending the QR code into the image on the front, and I get results like

https://mastodon.social/@UP8/114439589867642821

I've thought about how to optimize the process in terms of scan reliability vs the image looking good but I also have thought about communicating to people that the QR code is an affordance that is there for them. I think people have a lot of understanding that you can scan a QR code and the alpha blended QR codes are recognizable to most people.

Those diffusion ones, though, I think might get a reaction from most people that "this is fucked up". People who are enthusiastic about AI images don't recognize that a lot of people have a visceral negative reaction to them or feel they are under assault from trash images. AI image enthusiasts oddly don't seem to be bothered that they'll draw, say, a picture of a pretty girl with three belly buttons, but a lot of people find the errors in those images off-putting -- so if a QR code looks melted like a Dali painting, people might think "the lights are on and nobody is home" and not "my phone will scan it just fine".

Also, I think it's a mistake to expect users to be using an iPhone; there are plenty of people out there with an Android phone with a crummy camera and off-brand QR reader software, so I think it's best to be conservative about the images you make.


That's a fair point! I was thinking of the Yamaha chips in the Sega consoles mentioned in that comment -- which certainly defined the sound of the 1990s for me as a child. But my small-town Midwestern upbringing was behind the curve!

Will replace with the lore-accurate "late 1900s".


We looked into this at Modal! We put out vGPUs but didn't see demand and our internal benchmarks for MPS and Green Contexts didn't indicate a big win.

The tricky thing here is that many GPU workloads saturate at least one of the resources on the GPU -- arithmetic throughput, memory bandwidth, thread slots, registers -- and so there's typically resource contention that leads to lowered throughput/increased latency for all parties.

And in a cloud (esp serverless/auto-scaling) computing context, the variety of GPU SKUs means you can often more easily right-size your workload onto whole replicas (on our platform, from one T4 up to 8 H100s per replica).


When I'm feeling sassy, I like to tell people that Modal is "Enterprise Java Beans for AI".


Tomcat wanted to be some sort of compile once, run anywhere Docker.


We've talked to them and there's some impressive technology there!


Nice article! I had to restrain myself from ranting on our blog :)


Oh, I wrote this! Thanks for sharing it.


Anything you feel is worth adding for the HN crowd while you've got our attention? :)

(Thanks for writing this btw!)


Hmm, hard to say!

In the few months since I originally wrote this, I've come to an even greater appreciation of just how hard it is to maximize utilization of the Tensor Cores. It's a lot more than just kernel parameter tuning and using a few parallel programming tricks (parallel reduce, unrolling). It really borks your CUDA code -- you need warp specialization, you need to break warp uniformity, you need to work with explicit asynchrony. Hoping to write about this for the Modal blog/GPU Glossary soon!

I also spent a bit of time working with ncu/"NSight Compute". I'd probably include a bit about it in the section on how to improve your MFU if I rewrote the article today. But tl;dr use the profiling tool, Luke! And a good way to learn is to watch NVIDIA's GTC talks.

That said, I've also noticed even more cases where GPU kernel utilization is well below target. I think (and Horace He has argued) that that comes in part from optimized GEMMs running so fast on Tensor Cores that host overhead becomes the bottleneck (classic Amdahl). This unfortunately means more host logic needs to be compiled -- either graph-compiled as in torch.compile or moved into a compiled language.
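
For the host-overhead part, one concrete (if partial) mitigation is CUDA Graphs capture/replay -- as I understand it, roughly what torch.compile's reduce-overhead mode leans on. A sketch with a placeholder kernel, not code from any real workload:

    #include <cuda_runtime.h>
    #include <cstdio>

    // Placeholder kernel: short and launch-bound, the kind of work where the
    // CPU-side launch cost starts to dominate (Amdahl on the host).
    __global__ void tiny_kernel(float* x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = x[i] * 1.0001f + 1e-6f;
    }

    int main() {
        const int n = 1 << 16, block = 256, grid = (n + block - 1) / block;
        float* d_buf = nullptr;
        cudaMalloc(&d_buf, n * sizeof(float));
        cudaMemset(d_buf, 0, n * sizeof(float));

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Capture a chain of tiny launches into a graph once...
        cudaGraph_t graph;
        cudaGraphExec_t exec;
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        for (int i = 0; i < 20; ++i)
            tiny_kernel<<<grid, block, 0, stream>>>(d_buf, n);
        cudaStreamEndCapture(stream, &graph);
        cudaGraphInstantiate(&exec, graph, 0);  // CUDA 12 signature; CUDA 11 takes error-node/log-buffer args

        // ...then replay the whole chain with one host call per step, so the
        // CPU stops being the bottleneck for short, fast kernels.
        for (int step = 0; step < 1000; ++step)
            cudaGraphLaunch(exec, stream);
        cudaStreamSynchronize(stream);

        cudaGraphExecDestroy(exec);
        cudaGraphDestroy(graph);
        cudaStreamDestroy(stream);
        cudaFree(d_buf);
        std::printf("done\n");
    }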

