Hacker News new | past | comments | ask | show | jobs | submit login

How much redundancy do modern CPU chips have? If a single transistor goes bad in say, a cache area, is the CPU guaranteed to be a brick?



I think the usual reply to that question is that it depends on which transistor.

Not quite the same thing, but there was a Raspberry Pi board on the homepage earlier today that had been hacksawed in half and still worked. The person who did it has also cut some microprocessors in half successfully, and they kept working because he was only cutting off parts he didn't plan on using, parts that aren't required for the rest of the device or chip to function.

I am sure an AMD Ryzen CPU would work without a core or two. In fact, manufacturers often disable cores before shipping by zapping a fuse. But if the same transistor on every core somehow blew, you would probably be left with a dead CPU.


To be fair, the guy who cut the RPi merely cut off the USB ports, the RJ45 jack, the Ethernet controller, and maybe a couple of caps. I don't think chopping off a few low-pin-count peripherals far away from the SoC and DRAM counts as "cutting the board in half".


I did say it wasn't exactly the same thing, but at least on some previous *Lake Intel CPUs the integrated graphics took up quite a lot of the die. If you are using an external GPU, quite a lot of transistors could fail and you could still use the CPU, just like hacksawing off the Ethernet chip.


All on-chip caches these days have ECC, so to answer your question literally: if it's a single transistor, it would likely be fine.
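To illustrate how a single bad bit can be tolerated, here is a toy Hamming(7,4) single-error-correcting code in Python. This is just an illustrative sketch: real caches use wider SECDED codes (e.g. 8 check bits protecting 64 data bits), implemented in hardware, but the syndrome-based correction idea is the same.

```python
# Hamming(7,4): 4 data bits protected by 3 parity bits. Any single
# flipped (or stuck) bit in the 7-bit codeword can be corrected.

def encode(d):
    """d: list of 4 data bits -> 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4        # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4        # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4        # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def decode(c):
    """c: 7-bit codeword -> corrected 4 data bits.
    The syndrome is the 1-based position of a single flipped bit."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3     # 0 means no error detected
    if pos:
        c = c[:]
        c[pos - 1] ^= 1            # flip the bad bit back
    return [c[2], c[4], c[5], c[6]]

word = [1, 0, 1, 1]
cw = encode(word)
cw[4] ^= 1                         # model one flipped/stuck cell
assert decode(cw) == word          # ECC recovers the original data
```

So a single dead cell in a cache line looks, to the rest of the chip, like a correctable error on every read of that line.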

However, electronics doesn't really fail like that. A single transistor might be "zapped" by a cosmic ray, but that's a transient error. Permanent failures are more often caused by electromigration, which breaks the copper interconnects between parts of the circuit (https://en.wikipedia.org/wiki/Electromigration#Practical_imp...), especially the parts that carry higher current for power distribution around the chip. I had an Intel C2000 fail in a server after three years because of this (https://www.theregister.com/2017/02/06/cisco_intel_decline_t...).


The Intel C2000 issues are nasty. I'm personally refusing to buy a J3xxx-series Atom-based Synology because of it.


Synology were actually great about it. The server was indeed a Synology 8-bay NAS from ~2016, and even though it failed after about 3½ years, they replaced it with a refurbished one (which I couldn't tell apart from new) within a few days. I didn't pay a penny. I hope they're charging all their costs back to Intel.


My thoughts exactly. If a CPU could self-test reliably and turn off bad parts of its cache, cores, bus lines, etc., I don't think random failures would be a major problem.

Of course, it might be too expensive to design such a feature.



