Do we actually know how they're degrading? Are there still Pascals out there? If not, is it because they actually broke or because of poor performance? I understand it's tempting to say near-100% workload for multiple years = fast degradation, but what are the actual stats? Are you talking specifically about the actual compute chip or the whole compute system -- I know there's a big difference now with the systems Nvidia is selling. How long do typical Intel/AMD CPU server chips last? My impression is a long time.
If we're talking about the whole compute system like a gb200, is there a particular component that breaks first? How hard are they to refurbish, if that particular component breaks? I'm guessing they didn't have repairability in mind, but I also know these "chips" are much more than chips now so there's probably some modularity if it's not the chip itself failing.
I watch a GPU repair guy and it's interesting to see the different failure modes...
* memory IC failure
* power delivery component failure
* dead core
* cracked BGA solder joints on core
* damaged PCB due to sag
These issues are compounded by
* huge power consumption and heat output of core and memory, compared to system CPU/memory
* physical size of core leads to more potential for solder joint fracture due to thermal expansion/contraction
* everything needs to fit in PCIe card form factor
* memory and core not socketed, if one fails (or supporting circuitry on the PCB fails) then either expensive repair or the card becomes scrap
* some vendors have cards with design flaws which lead to early failure
* sometimes poor application of thermal paste/pads at the factory (e.g., only half of the core making contact)
* and, in my experience acquiring 4-5 year old GPUs to build gaming PCs with (to sell), almost without fail the thermal paste has dried up and the card is thermal throttling
These failures of consumer GPUs may not be applicable to datacenter GPUs, as the datacenter ones are used differently, in a controlled environment, have completely different PCBs, different cooling, and different power delivery, and are designed for reliability under constant max load.
Yeah, you're right. Definitely not applicable at all, especially since Nvidia often supplies them tied into the DGX units with cooling etc., i.e. a controlled environment.
With a consumer GPU you have no idea if they've shoved it into a hotbox of a case or not.
Could AI providers follow the same strategy? Just throw any spare inference capacity at something to make sure the GPUs are running 24/7, whether that's model training, crypto mining, protein folding, a "spot market" for non-time-sensitive/async inference workloads, or something else entirely.
I have to imagine some of them try this. I know you can schedule non-urgent workloads with some providers that run when compute space is available. With enough workloads like that, assuming they have well-defined or relatively predictable load/length, it would be a hard but approximately solvable optimization problem.
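To make that concrete, here's a toy sketch of the kind of scheduling I mean (all names and numbers are made up, not anything a real provider actually does): forecast how many GPUs will be idle each hour, then greedily pack deferrable jobs into that spare capacity, earliest deadline first.

```python
# Rough sketch (all names/numbers invented): greedily pack deferrable jobs
# into whatever GPU capacity is forecast to be idle, earliest deadline first.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    gpu_hours: float    # estimated total GPU-hours the job needs
    deadline_hr: int    # must finish within this many hours from now

def schedule_spare(jobs, spare_capacity):
    """spare_capacity[h] = forecast number of idle GPUs in hour h."""
    capacity = list(spare_capacity)
    placed = []
    # Earliest-deadline-first is a simple heuristic; a real scheduler would
    # also handle preemption, forecast error, multi-GPU jobs, etc.
    for job in sorted(jobs, key=lambda j: j.deadline_hr):
        remaining = job.gpu_hours
        for h in range(min(job.deadline_hr, len(capacity))):
            if remaining <= 0:
                break
            take = min(capacity[h], remaining)
            capacity[h] -= take
            remaining -= take
        if remaining <= 0:
            placed.append(job.name)
    return placed

# e.g. a 12 GPU-hour protein-folding batch due in 24h and a 4 GPU-hour
# eval run due in 6h, packed into a made-up idle-capacity forecast
print(schedule_spare(
    [Job("protein-fold", 12, 24), Job("batch-eval", 4, 6)],
    spare_capacity=[3, 3, 1, 1, 2, 2] + [4] * 18,
))
# -> ['batch-eval', 'protein-fold']
```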
I've seen things like that, but I haven't heard of any provider with a bidding mechanic for allocation of spare compute (like the EC2 spot market).
I could imagine scenarios where someone wants a relatively prompt response but is okay with waiting in exchange for a small discount and bids close to the standard rate, where someone wants an overnight response and bids even less, and where someone is okay with waiting much longer (e.g. a month) and bids whatever the minimum is (which could be $0, or some very small rate that matches the expected value from mining).
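For what it's worth, the clearing mechanism could be as simple as something like this (purely illustrative; the prices and the mining-revenue floor are my assumptions, not any provider's actual mechanism): sort bids from highest to lowest, hand out spare GPUs until they run out, and never accept a bid below what mining would earn per GPU-hour.

```python
# Toy spot-market clearing (numbers invented): take the highest bids first
# until spare GPUs run out, and never accept a bid below what crypto mining
# would earn per GPU-hour (the effective reserve price).
def clear_spot_market(bids, spare_gpus, mining_floor=0.05):
    """bids: list of (user, gpus_wanted, bid_per_gpu_hour)."""
    accepted = []
    clearing_price = mining_floor
    for user, gpus, bid in sorted(bids, key=lambda b: b[2], reverse=True):
        if spare_gpus <= 0 or bid < mining_floor:
            break
        take = min(gpus, spare_gpus)
        spare_gpus -= take
        accepted.append((user, take))
        clearing_price = bid  # lowest accepted bid so far sets the price
    return accepted, clearing_price

winners, price = clear_spot_market(
    [("prompt-ish", 40, 0.90),     # wants a fairly quick answer, bids near list price
     ("overnight", 100, 0.30),     # happy to wait until tomorrow
     ("month-long", 500, 0.05)],   # will wait a month, bids the floor
    spare_gpus=200,
)
print(winners, price)
# -> [('prompt-ish', 40), ('overnight', 100), ('month-long', 60)] 0.05
```

In that toy version the lowest accepted bid sets the price everyone pays, which is roughly how bid-based spot markets have worked in the past.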