One challenge was that while I started working on the Xbox 360 about three years...

Zeetah · 2024-12-20T00:20:24 1734654024

You have an awesome memory, Dinarte!

Eric Mejdric from IBM called on Friday and said we have the chips, when are you guys getting here?

I took a red eye that night and got to Austin on Saturday morning.

We brought up the board, the IBM debugger, and then got stuck.

I remember calling you on Sunday morning. You had just gotten a big-screen TV for the Super Bowl and had people over, and in between hosting them you dropped us new bits so we could make progress.

I think Tracy came on Sunday or Monday and with you got the Kernel booted.

Good times!

This is Harjit by the way.

Edit: added Super Bowl.

saturn8601 · 2024-12-20T00:55:45 1734656145

OMG Harjit! I saw you in the documentary! You and the entire team are total rockstars! I just cannot fathom ever being in a position to design something that provided so much joy and happy core memories to countless people around the world...you guys did it!

Just the thought of how many people you touched with your work....just amazing! :)

Had a question if you don't mind: Can you talk about the thought process behind the power supply design? It's very large, even in the super slim models. Were you following a specific design driven by the hardware architecture, or were there other reasons? I always wondered about that.

maroonblazer · 2024-12-20T01:33:48 1734658428

>I saw you in the documentary!

I presume you're referring to this one: https://www.xbox.com/en-US/power-on#watch

saturn8601 · 2024-12-20T02:59:32 1734663572

Yes! It is worth watching every second. What an amazing production.

dinartem · 2024-12-20T02:01:58 1734660118

Actually Tracy never made it to Austin. He was going to fly in later in the week to continue bring-up, but since we were done by Wednesday, we just sent the chips to Redmond and he continued there. He was of course always available on the phone to answer the kernel questions I had.

markus_zhang · 2024-12-20T01:34:19 1734658459

This is really some blast from the past. Can you please shed more light on the simulator? Is it interpretation or JIT? But then I realize XBOX uses Pentium III, so maybe virtualization? Edit: Sorry, it was the XBOX 360, so it's not a Pentium.

As someone who recently got interested in emulation and wrote two lc-3 emulators, would really love to learn from the masters.

brokenmachine · 2024-12-19T23:26:04 1734650764

Kick ass. This kind of post is why HN is the best.

fragmede · 2024-12-19T22:48:17 1734648497

damn, that's bad ass. did that simulator run on a Windows system or was it something more esoteric?

dinartem · 2024-12-19T23:04:11 1734649451

I called the simulator Sbox and it was just a simple console app. I didn't implement the GPU, so no graphics, just the hypervisor and kernel and some simple non-graphics apps. I made it so that you could build the Xbox 360 kernel on your Windows machine, then just run sbox.exe and it would automatically find the just-built kernel image targeting the PPC64 and boot it. Then if you typed Control-C it would drop into the kernel debugger as a subprocess, and you could poke around at the machine state as if it were the real Xbox hardware, showing all the PPC instructions and registers. It was a lot of fun writing it, and quite useful.

Zeetah · 2024-12-20T00:10:21 1734653421

You should also talk about the lwarx/stwcx bug. IIRC, in the first version of the chip there was a bug in one or both of these instructions. Your code booted on SBox but didn't on the hardware. You compared the two and then figured out it was these instructions.

You filed a bug report and then dug into them and used SBox to figure out what must have been going wrong.

The chip supplier came back with a workaround and within five minutes you simulated it on SBox and said it wouldn't work, why, and then said how it should be fixed.

The supplier didn't believe you at first. And you worked out a workaround so we could be unblocked. Two weeks later they agreed with your fix...

maximilianburke · 2024-12-20T01:18:20 1734657500

I recall an issue when trying to use lwarx/stwcx on Xbox 360 directly that the compiler (or maybe even the kernel, on program load? it's been a while) raised an error and said to use the Interlocked intrinsics instead -- is that related?

dinartem · 2024-12-20T02:59:20 1734663560

So the PPC instruction set uses lwarx (load word and reserve indexed), and stwcx (store word conditional indexed), along with variations for word size, to implement atomic operations such as interlocked-increment and test-and-set.

So on PPC interlocked-increment is implemented as:

  loop:
    lwarx   r4,0,r3    # Load and reserve r4 <- (r3)
    addi    r4,r4,1    # Increment the value
    stwcx.  r4,0,r3    # Store the incremented value if still reserved
    bne-    loop       # Loop and try again if lost reservation

The idea is that the lwarx places a reservation on an address that it wants to update at some later time. It doesn't prevent any other thread or processor from reading or writing to that address, or cause any sort of stall, but if an address being reserved is written to, conditional or otherwise, then the reservation is lost. The stwcx instruction will perform the store to memory and clear the NE flag if the reservation still exists; otherwise it doesn't do the write and sets the NE flag, and software should just try again until it succeeds.
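
To make the reservation behavior concrete, here is a toy Python model of it. This is only a sketch of the semantics described above, not how the hardware or Sbox worked; the `ReservationMemory` class, its method names, and the `interfere` hook are all made up for illustration.

```python
class ReservationMemory:
    """Toy model of lwarx/stwcx: at most one reservation per hardware thread."""

    def __init__(self):
        self.cells = {}      # address -> value
        self.reserved = {}   # thread id -> address it currently has reserved

    def lwarx(self, thread, addr):
        """Load the value and place a reservation on addr for this thread."""
        self.reserved[thread] = addr
        return self.cells.get(addr, 0)

    def store(self, addr, value):
        """Ordinary store: any reservation on this address is lost."""
        self.cells[addr] = value
        self._clear(addr)

    def stwcx(self, thread, addr, value):
        """Conditional store: only writes if the reservation still exists."""
        if self.reserved.get(thread) != addr:
            self.reserved.pop(thread, None)
            return False
        self.cells[addr] = value
        self._clear(addr)
        return True

    def _clear(self, addr):
        for t in [t for t, a in self.reserved.items() if a == addr]:
            del self.reserved[t]


def interlocked_increment(mem, thread, addr, interfere=None):
    """Retry loop mirroring lwarx / addi / stwcx. / bne- from the listing."""
    attempts = 0
    while True:
        attempts += 1
        value = mem.lwarx(thread, addr)
        if interfere is not None and attempts == 1:
            interfere()  # hypothetical hook: someone else writes in between
        if mem.stwcx(thread, addr, value + 1):
            return value + 1, attempts


mem = ReservationMemory()
mem.store(0x100, 41)
print(interlocked_increment(mem, 0, 0x100))  # (42, 1)
# Another writer sneaks in after the first lwarx, so the first stwcx fails.
print(interlocked_increment(mem, 0, 0x100,
                            interfere=lambda: mem.store(0x100, 100)))  # (101, 2)
```

The second call shows the point of the retry loop: the interfering store kills the reservation, the conditional store is skipped, and the loop reloads the fresh value before trying again.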

On the Xbox 360 we provided the compiler which would emit sequences like these for all atomic intrinsics, but developers could also write assembler code directly if they wanted to. We'll get back to this point in a moment.

As the V1 version of the Xbox 360 CPU was being tested by IBM, they discovered an error in the hardware implementation of these two instructions and issued an errata for software to work around it, which we implemented. Unfortunately, after further testing IBM discovered that the errata was insufficient, so they issued a second errata, which we also implemented, and we assumed all was well.

Then the V2 version of the CPU comes out and months go by. But early one morning I get a phone call from IBM letting me know that the latest errata was still insufficient and that the bug is in the final hardware. Further, Microsoft had already started final production of CPU parts, even before testing was fully complete (risk buy), so that they could have sufficient supply for the upcoming November release. I was told that they were stopping manufacturing of additional CPUs, and that I had 48 hours to figure out if there was anything software could do to work around the hardware issue. They also casually mentioned that millions of dollars of parts would need to be discarded and a hardware fix implemented, which would take weeks, before production could resume from scratch.

Bottom line is that, yes, there was a set of software changes that would work around the bug, but it required very specific sequences of instructions, the disabling of interrupts around these sequences, a change to the hypervisor, and updating the compiler to emit the new sequences. To make sure that developers didn't introduce code sequences that used lwarx/stwcx in a way that would expose the bug (via inline assembly, for example), the loader would scan the code and refuse to load code that didn't obey the new rules.

Interesting fact: the hardware bug existed in every version of the Xbox 360 ever shipped. Because software needed to run on any console ever shipped, there was no advantage to ever fixing the bug, since software always needed to work around it anyway.

markus_zhang · 2024-12-20T04:35:36 1734669336

Thank you so much. This is so awesome to know and learn.

I'm just curious, what are the instructions that replace the lwarx/stwcx "atomic" pair? From my understanding, basically you need to generate a pair of load-reserved/store-conditional instructions, and you have to replace the pair with a series of instructions. But I don't understand why you have to disable interrupts -- is it because multiple instructions were actually used to facilitate the load, and an interrupt may disturb a value stored in a register?

Sorry I know little about assembly and arch.

dinartem · 2024-12-20T13:06:36 1734699996

We still used the lwarx/stwcx pair to implement atomic operations, but to avoid the hardware bug a strict rule needed to be followed.

Rule: On a given hardware thread (there are two hardware threads per processor on the Xbox 360), every lwarx reservation of an address must be paired with a stwcx conditional store to that same address before a reservation is made to a different address. So a sequence like lwarx A / lwarx B / stwcx B / stwcx A is forbidden. But lwarx A / stwcx A / lwarx B / stwcx B is fine.
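
As a rough illustration of the rule, here is a small Python check over a list of operations executed on one hardware thread. It is only a sketch: the function name and the `(mnemonic, address)` tuple format are invented for this example, and it is not the real compiler or loader logic, which worked on actual PPC code.

```python
def obeys_pairing_rule(ops):
    """Check a trace of (mnemonic, address) pairs from ONE hardware thread.

    The trace is rejected as soon as an lwarx reserves a different address
    while an earlier reservation has not yet been paired with its stwcx.
    """
    pending = None  # address reserved by an lwarx that has no stwcx yet
    for mnemonic, addr in ops:
        if mnemonic == "lwarx":
            if pending is not None and pending != addr:
                return False  # new reservation before the old one was paired
            pending = addr
        elif mnemonic == "stwcx" and addr == pending:
            pending = None    # reservation paired; a new address is now allowed
    return True


forbidden = [("lwarx", "A"), ("lwarx", "B"), ("stwcx", "B"), ("stwcx", "A")]
allowed = [("lwarx", "A"), ("stwcx", "A"), ("lwarx", "B"), ("stwcx", "B")]
print(obeys_pairing_rule(forbidden))  # False
print(obeys_pairing_rule(allowed))    # True
```

Re-reserving the same address (as a retry loop does after a failed stwcx) passes the check, since the rule as stated only forbids moving on to a different address.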

So I changed the compiler to emit atomic intrinsics that obeyed this rule.

But there was still the issue of logical thread scheduling. Imagine there are two logical threads running, one has a sequence of lwarx A / stwcx A and the other has lwarx B / stwcx B. The first thread is running on a hardware thread and just after executing lwarx A, the timer interrupt fires and the kernel decides to switch to the second logical thread, which executes lwarx B, and thus violates the rule.

To make sure that never happens, the compiler also emits disable-interrupts / lwarx A / stwcx A / enable-interrupts. That prevents the scheduler from switching threads in the middle of the atomic sequence.
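
The scheduling problem and the fix can be sketched by enumerating every way a preemptive scheduler could interleave two logical threads on one hardware thread. This is a simplified model under assumptions of my own: each thread is a list of "units" that cannot be preempted, every stwcx succeeds on the first try, and all the names are hypothetical.

```python
def violates(trace):
    """True if the trace reserves a new address before pairing the old one."""
    pending = None
    for op, addr in trace:
        if op == "lwarx":
            if pending not in (None, addr):
                return True
            pending = addr
        elif op == "stwcx" and addr == pending:
            pending = None
    return False


def interleavings(a, b):
    """Yield every order in which the scheduler could run the units of a and b."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def count_violations(thread1, thread2):
    """Return (schedules that break the rule, total schedules)."""
    bad = total = 0
    for order in interleavings(thread1, thread2):
        trace = [op for unit in order for op in unit]
        total += 1
        bad += violates(trace)
    return bad, total


# Interrupts enabled: the timer can fire between lwarx and stwcx.
preemptible_1 = [[("lwarx", "A")], [("stwcx", "A")]]
preemptible_2 = [[("lwarx", "B")], [("stwcx", "B")]]
print(count_violations(preemptible_1, preemptible_2))  # (4, 6)

# Interrupts disabled around each pair: the pair is one unpreemptible unit.
atomic_1 = [[("lwarx", "A"), ("stwcx", "A")]]
atomic_2 = [[("lwarx", "B"), ("stwcx", "B")]]
print(count_violations(atomic_1, atomic_2))  # (0, 2)
```

In this model four of the six possible schedules break the rule when the pair can be split, and none do once each pair runs with interrupts off.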

But there was still one more problem. It is possible for a page-fault to occur in the middle of the sequence should it span the end of one page and the beginning of another, and the second page is not in the TLB. So the thread is running along and executes disable-interrupts / lwarx A, then when trying to fetch the next instruction it faults to the hypervisor because it isn't yet mapped by the TLB. The hypervisor executes a bunch of code to add the mapping of the new page to the TLB and then returns to the faulting thread to complete the stwcx A / enable-interrupts sequence.

The problem is that the TLB is a shared resource between the two hardware threads of a processor, so the two hardware threads need a way to atomically update the TLB, and the obvious way to do that is to use a spin-lock that is naturally implemented by a lwarx B / stwcx B pair of instructions. But the hypervisor TLB handler can't use those instructions because the code causing the TLB fault might be in the middle of using them and thus would cause the hardware bug to manifest.

The solution was to use non-reservation load/store instructions to implement a simple spin-lock. If the two hardware threads were both trying to update the TLB then hardware thread 2 would simply wait for hardware thread 1 to clear its lock before proceeding.
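
The comment above doesn't say exactly how that spin-lock was built, so the following is purely an illustration that mutual exclusion between two parties is possible with nothing but plain loads and stores. It uses Peterson's algorithm, which is my choice for the sketch and not necessarily what the hypervisor did, and it checks every possible interleaving of the two threads in a small Python model. A real version on a weakly ordered CPU would also need memory barriers, which this model ignores.

```python
from collections import deque


def step(state, i):
    """Advance hardware thread i by one plain load or store.

    state is (pcs, flags, turn). Returns the next state, or None if
    thread i has already finished.
    """
    pcs, flags, turn = list(state[0]), list(state[1]), state[2]
    other = 1 - i
    pc = pcs[i]
    if pc == 0:        # plain store: announce that I want the lock
        flags[i] = 1
        pcs[i] = 1
    elif pc == 1:      # plain store: give priority to the other thread
        turn = other
        pcs[i] = 2
    elif pc == 2:      # plain load: is the other thread interested?
        pcs[i] = 4 if flags[other] == 0 else 3
    elif pc == 3:      # plain load: whose turn is it?
        pcs[i] = 4 if turn == i else 2
    elif pc == 4:      # inside the critical section; release with a plain store
        flags[i] = 0
        pcs[i] = 5
    else:              # pc == 5: finished
        return None
    return (tuple(pcs), tuple(flags), turn)


def explore():
    """Breadth-first search over every reachable interleaving of the two threads."""
    start = ((0, 0), (0, 0), 0)
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for i in (0, 1):
            nxt = step(state, i)
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


states = explore()
# pc == 4 means "holding the lock"; both threads must never be there at once.
print(all(s[0] != (4, 4) for s in states))  # True
```

Because the search covers every reachable state, the final check shows that no schedule lets both threads hold the lock at the same time in this model.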

markus_zhang · 2024-12-21T14:35:51 1734791751

Thanks so much for the input! I vaguely know a little bit about everything you talked about--the threads, TLB and such, but I have never worked with them in practice. This is so interesting.

fragmede · 2024-12-19T23:08:01 1734649681

That sounds like a lot of fun! What was your career like before that, that led you to be in the place to do such fun things?

kridsdale1 · 2024-12-19T22:50:15 1734648615

This is the coolest HN post I’ve read in months.

Cheers.

cbanek · 2024-12-20T01:36:34 1734658594

Small world! I worked on Yellow Door / Golden Gate automation for releasing 360 titles and patches to prod, and the beta group / KDC service code.

saturn8601 · 2024-12-20T00:58:53 1734656333

You sir are an inspiration! I am but a mediocre Angular developer and reading this has me in complete awe! The kind of drive you must have had to get this done... well, I don't know how people manage to do it, but it is so cool to see! :)

n144q · 2024-12-20T00:04:47 1734653087

> we knew that the custom CPU would not be available until early 2005

Sounds a little bit like the situation with Xbox Series? The SDKs were released late because Microsoft was waiting for certain features in the AMD APU.

bananaboy · 2024-12-19T22:49:25 1734648565

Amazing! Thanks for sharing! What sort of things are you working on now?