What I'm curious about is that some Nvidia cards, like the A2000, have ECC but only enough chips for a regular round amount of RAM, like 6GB or 12GB. So when ECC is enabled, 6.25% of the RAM is used for the ECC bits. [0; 1, in the notes]
Since desktop ECC gets around this by physically having more RAM ICs (usually 9 instead of 8, for example), what is the impediment to Nvidia having a similar solution? I'd readily take a hit to memory capacity* and performance in exchange for ECC.
Why can't the memory controller already do this?
I should note, I'm mostly thinking of my NAS. I know ZFS can be run without ECC and some consumer solutions do. However, it seems ZFS should be run with ECC. I've already experienced observable bitrot with older images and video files, I'd rather not let it progress.
[*] in this case, 12.5% if we follow typical desktop ECC allocations
The primary problem is that on CPU memory systems, all the requests are always 64 bytes, and the entire system starting from the CPU caches and ending at the arrays in the DIMMs is designed to efficiently serve those requests at the lowest possible latency.
In-band ECC means significant sacrifice of performance on a system not designed for it. Random read throughput doesn't go down by 6.25%, it goes down by half.
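A toy model (my own sketch, with assumed numbers) shows why the hit is so uneven between access patterns: one 64-byte ECC line holds the check bits for 16 data lines (~6.25% overhead), so sequential reads amortize the extra fetch while random reads pay it on every access.

```python
# Toy model of in-band ECC: check bits live in ordinary DRAM lines.
# One 64 B ECC line covers 16 data lines, so the cost of fetching it
# depends entirely on how often it can be reused.

ECC_COVERAGE = 16    # data lines covered by one ECC line (assumed)

def transactions(n_reads, sequential):
    if sequential:
        # one extra ECC fetch per ECC_COVERAGE data lines
        return n_reads + -(-n_reads // ECC_COVERAGE)
    return 2 * n_reads   # every random read drags in its own ECC line

print(transactions(1000, sequential=True))    # 1063 -> ~6% overhead
print(transactions(1000, sequential=False))   # 2000 -> halved throughput
```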
> In-band ECC means significant sacrifice of performance on a system not designed for it. Random read throughput doesn't go down by 6.25%, it goes down by half.
But adjusting DDR for that could be pretty easy. Instead of a burst of 16 transfers, do 18. It's already set up to stream longer transfers when desired.
There will be more overhead than making the sizes properly match, but it shouldn't be anywhere near cutting throughput in half.
> But adjusting DDR for that could be pretty easy. Instead of a burst of 16 transfers, do 18. It's already set up to stream longer transfers when desired.
That's not really how DDR5 works. The granularity of column addresses is (iirc) 32 bytes, and you cannot do transfers that are of any length other than 64 or 32 bytes (and 32 bytes only with burst chop, which means that the bank is busy for the remaining 8 cycles). Bursts longer than 16 are really just multiple adjacent requests, with an optimized command.
You could change this, by completely changing how the memory modules themselves work, and by widening the column address for more granularity. Can't do it well by just tweaking the memory controllers.
Ah. If you make any silicon changes at all, it is orders of magnitude more expensive and "harder" than just using extra chips like normal ECC DRAM does.
> I know ZFS can be run without ECC and some consumer solutions do. However, it seems ZFS should be run with ECC. I've already experienced observable bitrot with older images and video files, I'd rather not let it progress.
From my understanding, the only risk to your data from non-ECC is a bit flip in RAM, pre-checksum calculation. In that unlikely scenario, you commit bad data to disk as good data (valid checksum). Bitrot isn't a factor at all.
This means that ECC RAM and ZFS are completely orthogonal concerns.
If your data is important enough to warrant ECC RAM, you should get ECC RAM whether you use ZFS or not.
If you want to use ZFS (for its volume management, compression, mirroring, healthchecks, whathaveyou), you should do so whether or not you have ECC RAM.
That is bitrot: you save correct data and it’s not retrievable. The fact that it happens in RAM rather than on the storage media, controller, or I/O channel just makes it a different category.
It is also far, far more likely that an uncorrected bit flip happens outside the relatively small portion of time the kernel spends in filesystem code. This is not a ZFS-specific problem by any means.
> From my understanding, the only risk to your data from non-ECC is a bit flip in RAM, pre-checksum calculation. In that unlikely scenario, you commit bad data to disk as good data(valid checksum).
Wouldn't an option to do it twice in different memory regions be nice? I'm pretty sure in many use cases sacrificing performance for greater reliability wouldn't be an issue. Given how many cores we have available nowadays, it could potentially not even have that much impact on performance.
Also, are there any software solutions (like a kernel patch) which would do "software ECC"? I imagine in this case the performance hit would be quite devastating, but it still could be an acceptable trade-off for NAS-like systems where you want to have lots of RAM for dedup and cache but it's not a busy system.
There is still a race condition: if you read data from disk into a buffer, make a copy of the buffer, then do 2 checksums, the bit flip can still occur before the 2nd copy is created.
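As a sketch of why the window narrows but never closes (hypothetical code, not any real kernel patch): if the flip lands before the first copy is taken, both copies agree and the bad data gets committed with a valid checksum anyway.

```python
import hashlib
import os

def double_checksum_write(buf):
    """Hypothetical 'software ECC' for a write path: checksum two
    independent copies and only commit when they agree."""
    copy_a = bytes(buf)
    # A bit flip in `buf` BEFORE this line corrupts both copies:
    # the two checksums still agree, and bad data is committed as
    # good. The race window is narrower, not closed.
    copy_b = bytes(buf)
    digest_a = hashlib.sha256(copy_a).digest()
    if digest_a != hashlib.sha256(copy_b).digest():
        raise IOError("memory corruption detected, retry the write")
    return copy_a, digest_a        # data + checksum to commit

data = os.urandom(4096)
blob, digest = double_checksum_write(data)
assert digest == hashlib.sha256(blob).digest()
```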
A fuller instance of the first quote:
"Through careful and thorough fault injection, we show that ZFS is robust to a wide range of disk faults. We further demonstrate that ZFS is less resilient to memory corruption, which can lead to corrupt data being returned to applications or system crashes."
...i.e., ZFS is 'less resilient' in comparison to its own robust disk fault handling, not less resilient to memory corruption in comparison to other filesystems. The parent quotation above implies that ZFS is more sensitive to memory corruption than other filesystems, but that is not claimed in the referenced paper.
Market segmentation on what platforms/controllers support ECC: fine, whatever. But market segmentation of what is an “ECC DIMM” vs. a “regular DIMM”? It makes no sense that the commodity memory manufacturers have any leverage to enforce that segmentation.
Is it just laziness on the part of the platform vendors (who do have leverage) to not simply allow ECC with any DIMMs by giving over 1/k {bits, lines, pages, chips, whatever-granularity-they-reason-about} to parity?
However, Intel guarantees ECC will work on their "workstation" chipsets. AMD doesn't guarantee ECC will work on their desktop/workstation chipsets; you have to go up to Epyc to find guaranteed/tested ECC.
I’m constantly surprised that it’s not commonplace to use on-disk parity files.
It’s so uncommon that the PAR3 format was never really finished and no one has created a replacement that handles subfolders.
Why I’m surprised: Not only does it solve the problem of bit-rot, but the parity files can be moved to USB sticks, NAS drives, Mobile devices, etc and the original files can be verified/repaired by any device that understand the parity file format. PAR2 is still great for photos/audio/video, as well as any flat-folder assets.
PAR is somewhat unwieldy to use. In addition to needing to explicitly create it (and it not being particularly fast, on a large enough data set), PAR2s can't be 'updated'.
The PAR3 spec allows for some limited updating, but it's far from ideal.
It often makes more sense for the file system to deal with ECC in my opinion. PAR probably makes more sense for archived files that aren't expected to change, but may be moved across file systems.
PAR2 handles subfolders by the way, just not empty folders.
Not exactly: The current PAR format does not make sense for this use-case (including because of the limitations you mentioned), but IMO the technology does.
Files with on-disk ECC can be moved from cloud to cloud, cloud to desktop, filesystem to filesystem, desktop to stick, then stick to NAS all without losing ECC protection. No single filesystem can do that.
Fair question. What I’m referring to is file backup and archive for anything up to enterprise level.
So specifically: photography archives, videos (including b-roll for content producers/videographers), project backups, personal files, important documents, etc. Up to and including anything that could be posted to r/datahoarders
1. Tiny files in bulk
Some folders unavoidably have tiny files in bulk (document backups can be like this; one other example that jumps to mind: macOS applications with translation files)
In these cases, PAR/PAR2 have issues with the block size (they can only have one file per block, which leads to a lot of wasted space)
2. Tracking changes across filenames
This is counterintuitive, but I've run into this enough to mention it: if the item to archive is a folder where the contents might change over time, any single file might get renamed and its contents slightly modified. A parity file tool could look at the blocks that have not changed, recognize the rename, and "correct" the reference before doing more processing. If it's a valid change to the file, that saves the work required to recalculate the whole file; if it's damage to the renamed file, it can be repaired simply.
3. Being able to update in-place
Sometimes the ideal is to create parity files for a folder, even if that folder is actively used (say for example b-roll that changes by 10% maybe once a month). A parity tool could update that 10% without having to recalculate the whole thing (Ideally this would be adding files similar to ‘git add’ so that someone does not accidentally add file damage to the parity set)
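As a rough illustration of point 1 (my own arithmetic, with made-up numbers): because a PAR2 input block may hold data from only one file, every file smaller than the block size pads out a whole block.

```python
import math

def par2_slack(file_sizes, block_size):
    """Padding wasted when each input block may hold data from
    only one file (the PAR2 constraint)."""
    used = sum(math.ceil(s / block_size) * block_size
               for s in file_sizes)
    return used - sum(file_sizes)

# e.g. 10,000 tiny 1 KiB translation files with a 64 KiB block size:
slack = par2_slack([1024] * 10_000, 64 * 1024)
print(f"{slack / 2**20:.0f} MiB of slack")   # 615 MiB
```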
I see.
Yeah, PAR3 tries to address the 'lots of files' scenario better than PAR2.
But updates are problematic. You could 'delete' parts of a recovery file if the data is present. However a file being updated typically means that the original data (before the update) is no longer present, meaning you can't really 'delete' that part of the recovery to replace it with the new data.
You could try and retrieve the old data by recovering it, but at this point, it may just be easier to recreate the entire PAR set again.
If it's just adding to the recovery set, PAR3 does have provision for that.
I should thank you, though it’s also meant I’ve lost most of the last 3 days:
After our discussion I went and found the current work on PAR2/PAR3 (which, for those in the know: the current PAR3 format is not the old one that was never finished, but a spec that's been re-written and is close to completion, with real-world tools close behind) and I have a lot more hope for the future of parity files.
I still think they are wildly under-utilized (BackBlaze has always used them, but they are the only business I know of), but we might be having a different conversation in 2-3 years.
The impediment to adding more chips is the same as it is without ECC: more die space/heat/power. The reason they use a standard number of chips, I expect, is that it's easier to manage and GPUs don't care very much about weirder access sizes.
I'm currently thinking of purchasing a Synology NAS that comes with BTRFS. Just wondering, do you happen to know if BTRFS also requires ECC RAM to function correctly?
It should be noted however that a bunch of cheap brands do not even offer ECC variants, and those may dominate the lower end of the price spectrum. So getting ECC memory may also involve choosing a more pricey brand.
This is irrelevant. Back in 286/30-pin SIMM times, when every motherboard supported ECC, the price difference was ~10%. ECC was still supported on almost every 386/486 board. Then came Intel with the 430 chipsets, artificially segmenting ECC exclusively into the expensive server-market HX offerings. https://en.wikipedia.org/wiki/Intel_430HX
Of course Intel segmenting ECC out of the client market is a major factor…
… which does not mean other factors are "irrelevant"…
… and also the price differential in ECC memory is not only a factor but also a metric as it shows that ECC memory has a "non-mainstay" surcharge created by the vast majority of users going for non-ECC. If demand were equal, the price would be 9/8 = 1.125×.
Demand can't be equal when nothing on the desktop supported ECC for 22 years (1995-2017), and even now it's only a select fragment of ~28% market share. Things would look different if you could reuse server RAM in any desktop board. 16-24GB DDR3 ECC RAM sticks sell at a 4x discount compared to regular old DDR3. I haven't looked into the DDR4 difference.
To be pedantic - this is HN after all - market segmentation is a business policy and relates to how a business is run. The word politics is used correctly here, but in a somewhat old-fashioned way; policies would probably be more usual. Linus indeed bemoans that the industry is using ECC as a differentiating factor.
The market is always downstream from politics. Karl Marx famously used the term political economy to reference both political and economic forces, because neither is coherent without the other.
The term political economy far predates Marx. I've not heard of Marx famously using it, but I don't doubt he did use it since it was an established field of inquiry at the time--the predecessor to the modern field of economics.
“In the 21st century, Karl Marx is probably the most famous critic of political economy, with his three volume magnum opus Capital: A Critique of Political Economy as one of his most famous books.”
I'm not sure what this is supposed to show. Marx being famous for his critiques of political economy is not the same as Marx famously using the term political economy in a certain way.
Political economy as an area of inquiry took off in the late 18th century--in no small part due to Adam Smith--and was already very well established before Marx was born [1]. Absent evidence to the contrary it seems questionable to assert that Marx's use of the term was anything other than the common use of the time.
I believe this is more about Intel who for probably a decade have sold consumer CPUs with memory controllers capable of using ECC memory but either disabling the feature in the chip, or more recently with Alder Lake, locking the feature behind an enterprise motherboard chipset despite the chipset not being relevant to ECC support.
Having to pay a 20% premium on RAM for the stability of ECC is one thing, having to pay double, triple, or quadruple the system price to disable arbitrary locks is another.
With AMD that limitation is somewhat lifted with normal consumer CPUs and quite a few motherboards having support. Unfortunately the RAM itself has very poor availability. Very few manufacturers offer it and it tends to stick to the standard speeds without any of the overclocked-from-factory parts available with ECC. That last bit is strange because ECC is reportedly very good as a validation tool for overclocking builds so it seems like the RGB-lighted market could end up valuing it as a high-end feature too. One of the RAM manufacturers should rebrand ECC to "Extreme Clocking Capacity" and start selling it as a feature to differentiate their RAM from the competition.
> Unfortunately the RAM itself has very poor availability.
That doesn't match up with my experience, even though I've seen it mentioned several times by people as if it's the case.
For example, when I went looking for ECC UDIMM sticks for my Ryzen 5600X build a year or so ago, I went looking at the website of my local computer supplier.
But it's not like there's any kind of availability problem. And the Kingston RAM I bought happily overclocked to 3200MHz without any effort on my part. Using an ASRock B550M Pro4 motherboard, for reference, if that's useful.
It will depend on where you are. I'm in Europe and it used to be that the usual retailers had no DDR4 ECC options at all. Now there are 1 or 2 available and still no options for DDR5 ECC. Looking at those pages your suppliers seem much better than what I've found so far and yet also have no DDR5 options. Meanwhile DDR5 non-ECC is very broadly available and DDR4 non-ECC has great and mature options.
> That last bit is strange because ECC is reportedly very good as a validation tool for overclocking builds
Well, yes, but it's not the only tool. In practice you'd use some validation software suite to find a reasonably stable configuration and afterwards pray. Overclockers, who rather than spending extra money on a CPU which assuredly delivers the requested performance stress their hardware beyond specifications, are the least likely to pay for the extra bits.
If non-ECC RAM is good enough for most faults to only cause noticeable problems rarely, it's possible to sell slightly less perfect modules than with ECC, where even otherwise unnoticeable problems are detected and reported.
The ratio of modules returned to the manufacturer is probably higher for ECC than non-ECC, unless ECC modules are already better selected than non-ECC at the manufacturer.
Because given the same chips you can push ECC RAM to speeds beyond what non-ECC RAM can do before corrupting. You can push beyond what the chips are capable of and rely on the error correction to keep your system stable.
ECC in common desktops/workstation/servers will correct all single bit errors and detect all 2 bit errors.
So sure, you can run DIMMs ever so slightly faster and fix the occasional single-bit flip, but even a single double-bit flip and a process or your kernel crashes.
Seems much saner to go for a safe, robust, and reliable system at standard clocks with ECC, instead of trying to get slightly more performance, which increases the chances of errors, corruption, crashes, and a shorter service life.
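The correct-one/detect-two behaviour above comes from an extended Hamming (SECDED) code. Here's a minimal sketch on a 4-bit value; real DIMMs use a (72,64) code, but the logic is the same.

```python
# Extended Hamming (8,4): SECDED on a 4-bit value. Positions 1,2,4
# hold parity, positions 3,5,6,7 hold data, position 0 is the
# overall parity bit that distinguishes one flip from two.

def encode(nibble):
    d3, d5, d6, d7 = [(nibble >> i) & 1 for i in range(4)]
    p1 = d3 ^ d5 ^ d7          # covers positions 1,3,5,7
    p2 = d3 ^ d6 ^ d7          # covers positions 2,3,6,7
    p4 = d5 ^ d6 ^ d7          # covers positions 4,5,6,7
    word = [p1, p2, d3, p4, d5, d6, d7]
    p0 = p1 ^ p2 ^ d3 ^ p4 ^ d5 ^ d6 ^ d7   # overall parity
    return [p0] + word         # list index i = code position i

def decode(word):
    syndrome = 0
    for pos in range(1, 8):    # XOR of the positions of set bits
        if word[pos]:
            syndrome ^= pos
    overall = 0
    for bit in word:
        overall ^= bit
    if syndrome and overall:   # single-bit error: correctable
        word = word[:]
        word[syndrome] ^= 1
        status = "corrected"
    elif syndrome:             # syndrome set but parity ok: two flips
        return None, "double error detected"
    else:
        status = "ok" if not overall else "p0 flipped"
    nibble = word[3] | word[5] << 1 | word[6] << 2 | word[7] << 3
    return nibble, status

cw = encode(0b1010)
cw[5] ^= 1                     # one flipped bit -> silently repaired
print(decode(cw))              # (10, 'corrected')
cw[2] ^= 1                     # a second flip -> detected, not fixed
print(decode(cw))              # (None, 'double error detected')
```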
How many seconds do you need to think about this to come up with counter examples? There are entire businesses based on selling things which are better or helping buyers find them, suggesting that the rush of perceived superiority isn’t warranted.
Some people can’t afford to buy anything but the cheapest but in most cases it comes down to not thinking that they use it enough to matter (e.g. the common homeowner advice to buy a cheap tool the first time & replace it better if you break it), not having a good way to tell whether something is actually better, or the market being such that there is a huge gap between the product bands. ECC falls into the latter two categories: the average buyer isn’t familiar with the issue and probably thinks the outcome would be a crash rather than silent data corruption, and Intel’s market segmentation means that you don’t have a choice in the consumer space and have to move into far more expensive and limited categories. It’s not reasonable to say price-sensitive consumers are the problem when that’s also saying “stop buying laptops”.
My 4-year-old laptop has ECC. To see if anything has changed, I just checked the recent models. The manufacturer's website didn't provide any possibility to filter for ECC or mobile Xeon, but an explicit text search returned some results. ECC still exists in laptops, it's just a bit difficult to find.
Why pay for something I don’t need when there’s the option to not pay for it? - me
Just yesterday, I bought an extra 64GB for my home Linux PC. I absolutely couldn’t care less about it crashing or calculating the wrong result every blue moon (in practice: never), but I did choose the RAM sticks that were $10 cheaper.
That's totally fine. The problem is people that do need it not having the option to pay the extra without getting a totally different "workstation-class" computer.
torrents have checksums on the blocks, which your client already checks, because the internet doesn't guarantee packets arrive without corruption. it doesn't matter where the corruption comes from.
I’d go further than that and claim that almost everything of importance has checksums: Google Docs, Google Sheets, Git commits, all my important web accounts, the traffic with my bank's website, and so forth.
When I look at my daily home computer usage, it’s remarkable how little I calculate on my local computer that’s actually worth protecting.
How many people complain about memory corruption? This is a ridiculously rare issue, whose impact is minimal, if not nil, for so many people. There's an actual cost to providing ECC memory though; there's actually more hardware necessary to cover it. Obviously if more people needed it, scale might lower the cost, but it will still be more. For some it's important, and they will surely pay for it, so let them pay for it...
It's not like it wasn't available, Linus just forgot to put a reminder to buy some once available (backordering them would have been even more prudent).
Ridiculously rare? That's why CDs, DVDs, Blu-rays, caches, SSDs, spinning rust, etc. all have ECC, right? Even ancient file transfer protocols like XMODEM and ZMODEM have error correction.
Bit flips aren't uncommon; compression just makes them much more noticeable. Without compression users might just see an occasional application crash and relaunch the process. It seems common for people collecting photos, music, etc. for years to go back and find some corrupted. It's hard to say exactly what happened, but with compression a bit flip turns into a seriously corrupted file.
Keep in mind that radiation-caused bit flips are rare; some failure in the chip, pin, DIMM, DIMM connector, motherboard, CPU pin, or CPU is much more common. It's MUCH nicer to see "error on DIMM X, row Y, column Z" than a randomly crashing machine. Even Linus sounds like he spent hours tracking it down and thought he was hitting a kernel bug.
Isn't it worth 1/9th more memory chips to make the system robust in the face of wide variety of errors that can corrupt memory, which can lead to corrupted storage?
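The compression point is easy to demonstrate; here is a small Python sketch using zlib as a stand-in for any compressed format. A single flipped bit in the stream either breaks decoding outright or decodes to the wrong bytes, where the same flip in an uncompressed file would touch one character.

```python
import zlib

data = b"all work and no play makes jack a dull boy. " * 200
comp = bytearray(zlib.compress(data))
comp[len(comp) // 2] ^= 0x01     # flip a single bit mid-stream

try:
    out = zlib.decompress(bytes(comp))
    corrupted = out != data      # decoded, but to the wrong bytes
except zlib.error:
    corrupted = True             # stream or its checksum is now invalid
print("one flipped bit noticed:", corrupted)   # True
```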
Largely ignorant buyers of mass-market, commodified devices lead to intense cost wars that shed features that aren't "essential." If a buyer wants ECC in a laptop or desktop, typically they have to purchase a workstation-class machine that costs 4-5x of low-budget models.
Memory issues can be caused by the PSU, motherboard, the CPU, or the memory itself. Personally, I always run memtest86 and memtest86+ for 2 days on any new components.
The mitigation of bitsquatting requires ECC in network gear also. Furthermore, ECC isn't just about the memory type but having integrity in data buses, caches, storage, and network protocols also.
Home system is 96 thread, 512 GiB (16 x 32GB ECC Registered DDR4-3200), numerous SSDs and HDDs running RAID.
It does look like some Intel 12th gen and later parts support ECC [0, 1] - but not all as I don't see it on the i3, Pentium or Celeron parts... That said you apparently require a W680 chipset motherboard [2], so that's still going to be expensive. I much prefer the situation with AMD where ECC should work on all parts, even if not all motherboard mfgs enable or validate it.
It's funny because the Pentium G1610T that my old Microserver G8's came with did have it. Because those Pentiums are Xeon-Pentiums somehow. But they lack hyperthreading and AES-NI making them really crap for server tasks.
Intel's marketing is really in a league of their own. As soon as their branding starts making sense they will change it.
I'm actually still running one of those old G8 Microservers, though I swapped the crappy Pentium for one of the Xeon parts. I've looked a few times and I still haven't found anything that would be a good replacement.
The Microserver G10 Plus is not too bad. Replaceable CPU, iLO option, half the size of the G8 (though external power brick)
The regular G10 one is pretty useless obviously. Soldered AMD CPU, cheap design etc. No iLO. It's more of a home entertainment thingy, nothing enterprise-class at all.
I agree I'm looking for something lower power too. I have three G8's but they are each doing 50W idle so I can't run them 24/7. And this is with the lowest-TDP processor that is available for them! (E3-1220Lv2, 17W TDP). The iLO alone is consuming 5W even when it's off.
I'm not sure how much power the G10 Plus draws, but I haven't really considered it. It's too expensive still. I bought all my G8's for 175-200 euro each, and 2 of them even had a 60 euro cashback on top of that!! So I barely paid more than 100 for them brand new: crazy cheap pricing for a well-built 4-bay server. Each of the drives inside them cost more than the server itself :)
But 15-20W I think is very ambitious with 4 3.5" drive slots. 30W would be doable I think.
I'm confused, I've built a couple of computers and I've never had issues with the power supply. The only weird issue I've had so far was with a first generation ryzen processor causing hard locks on the computer when it was left on for more than a couple of days.
Also with some bad memory causing really really weird issues all across the system.
Power supplies? And some of the cheap ones I've used have just done their job and not caused any problems.
Do you have any examples of what's going on with this?
It's probably that they made good coolers and then decided to capitalize on the name by selling everything under the moon—i.e. reselling someone else's stuff.
The total wattage of a PSU isn't the issue. As another comment points out, it's the number of rails and amperage per rail. If the motherboard is drawing a lot of amps because of the CPU and you stick another high-current device on it, you will run into issues. If you had a SATA SSD that drew the same current, you likely wouldn't have had any issue, since it would have been on a different rail.
When possible I try to buy from companies that make a product, as opposed to just name/market/make cool stickers.
ThermalTake is just a reseller; Seasonic is an example of a company that makes power supplies. Seasonic sells to others for relabeling, but also sells direct to consumers.
I think a lot of people still don't know this stuff and it's quite reasonable to expect people not to know this stuff.
Not everyone is an electronic engineer, and if the overall wattage doesn't give the information people need, the sales/marketing should be forced to provide the necessary details right there on the box in "plain language"; RAM speeds are still a pet-peeve of mine.
That said, I've been building PC's for ~25 years and I'm yet to have a PSU fail on me. Anecdotal I know, and with some skew as I've always tended to build near or at the top end.
That said, the Ryzen hard lockups were a real disappointment for me, after spending north of £3500 on my desktop in the OG Ryzen era with an R7 1800X, I was left with a machine that frequently hard-locked when compiling code on all cores and AGESA is such a mess that if it boots at all, it takes several minutes at times to make it past POST.
I really hope the next gen build I make, whatever it is is more stable because I'm not planning to go back to Intel any time soon.
It's believed to be a fault with the Ryzen chips, though I don't believe AMD ever officially addressed it (although they were RMAing them repeatedly for the issue). I sent it back a couple of times for replacements, but the lottery wasn't in my favour, and although the one I ended up with is better, it's not perfect.
I was an early adopter, and I accept that, but it still left a bitter taste nonetheless, and ultimately it just meant that my pretty expensive computer barely got used because I hated dealing with it (I still have it but barely boot it these days).
It's more like: If you buy a very old PSU it won't devote enough to 12v.
The stock grey metal models are going to be out of date, but they're not that out of date. At this point I'd expect them to have a reasonable rail balance for a modern computer.
I can't speak to what the other poster was talking about but NVidia 3000 series caused a huge stir because it would jump in power consumption much faster than previous generations and many power supplies couldn't provide the wattage fast enough and would brown out the system. It was especially bad for low and mid-tier PSUs that were nearing their rated wattage limit at peak power.
edit: I misremembered some aspects of this. The 3000 series was actually spiking much higher than rated TDP and tripping overcurrent protection, versus just browning out. NVidia did recommend 850W PSUs for that first release of cards for that reason, but IIRC some 850W units still had issues.
I put heavy blame on Nvidia for that one. There is no reason their cards should be pulling these huge transient spikes. By putting these cards out with this flaw, then pointing at the power supply manufacturers, they are causing a problem and blaming someone else for it.
The power supply is just doing its job, when you far exceed its rating for even a short amount of time it's going to try to shut down to protect the computer because it thinks something inside of your computer is burning up.
You're right, I forgot about the fact the spikes were far in excess of the TDP of the card. NVidia should have taken a lot more responsibility rather than letting PSU manufacturers take a lot of the heat.
It (a 3090) actually tripped my quite expensive, quite high-end Seasonic Prime Titanium 850 watt power supply. But it was perfectly fine on my old EVGA Bronze 850 watt.
My team ran a desktop fleet for a few years. Power supplies were a top 3 failure component, and tended to go in waves when an OEM got screwed by counterfeit capacitors or other components.
My guess was that bad power was a contributor to many other issues, but the nature of the SLO was such that more complex issues resulted in a device swap.
Also, any kind of dirty environment drives higher AFRs.
Two builds is nothing. Since everything computers do is powered by electricity, a good PSU is paramount and literally the only thing I never go cheap with. Consider yourself lucky because a bad one can easily wreck your entire system.
Cheap cases used to come with crap power supplies even 15 years ago. It's great that's no longer the norm. Everyone wised up a bit on this one
Not really two; I have used 6 power supplies in total.
Maybe I've just been lucky, or I've just been buying quality stuff. Two of the ones I'm running now are 1,200W; the other two are in computers that don't use much power because they are basically idle servers all the time. They are also semi-reasonable quality.
But I have been running a computer on a 450W power supply that was very much having its limits pushed.
I also ran on a bottom-of-the-barrel super cheap 500W supply that I bought in the middle of 2020 when the whole supply crisis was happening.
That is why I was curious. My experiences didn't line up with what the OP was talking about, so I was interested to see examples of what was going on.
I think PSUs are a very common factor in system instability. If you're suffering from lock-ups or reboots and you aren't overclocking, I would look to the PSU first. That being said, you can go a little crazy if you pay too much attention to detailed PSU benchmarks; there's a threshold where a PSU is good enough. I mainly look at the efficiency curve when shopping for a PSU. A more efficient supply won't get as hot for a given load, so its components should last longer and have more headroom to degrade before they eventually do fall out of spec.
I reused a 10-year-old "gold" 850W Corsair PSU in my Zen 3 build. It had some noticeable inductor whine when the system was idle and I moved the mouse. After a year I got a 6800 XT GPU, and it kept trucking along for another few months. I did some GPU overclocking and stress testing one Saturday, and my PC wouldn't start the next day. It had a good run for PSUs, I suppose; it outlasted its 7-year warranty. I bought another 850W Corsair PSU to replace it, with "titanium" efficiency rating and a 10 (or was it 12?) year warranty.
I see issues all the time with external power cubes. A device goes wonky; I look up the spec, find it's DC 12V @ 2 amps or whatever, and replace the power supply. About 90% of the time the device is happy again.
This took me a while to learn when I built my first couple of machines. I never go cheap on motherboard or power supply nowadays. Not saying I buy top tier, but I don't go cheap and I refuse to help anyone build one with cheapo power supply.
I would argue it's amateur PC builders not speccing the PSU correctly that cause these issues.
In prebuilt PCs you don't see such issues because the manufacturer controls everything. They might not make the PSU itself but they will certainly get it made with the right specs.
In a home-built PC you need to consider not just the total power (wattage) but also the number of rails and the current (amperage) per rail. Exceed that and you will run into stability problems. Also, cheap means you get what you pay for. Get something made by Delta or Seasonic and you will have far fewer problems. Some of the cheap no-name brands don't even specify what kind of rails are in there.
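The per-rail bookkeeping described above is simple enough to sketch. All the rail names, ratings, and component draws below are made up for illustration:

```python
# Hypothetical build check: compare per-rail current draw against
# each rail's amp rating, not just the PSU's total wattage.
rails = {"12V_cpu": 20.0, "12V_pcie": 30.0, "5V": 20.0}  # amp ratings

loads = [                 # (rail, amps) per component, illustrative
    ("12V_cpu", 12.5),    # CPU under load
    ("12V_pcie", 25.0),   # GPU sustained draw
    ("12V_pcie", 8.0),    # GPU transient headroom
    ("5V", 4.0),          # SATA SSDs
]

draw = {}
for rail, amps in loads:
    draw[rail] = draw.get(rail, 0.0) + amps

for rail, amps in sorted(draw.items()):
    verdict = "ok" if amps <= rails[rail] else "OVERLOADED"
    print(f"{rail}: {amps:.1f} A of {rails[rail]:.1f} A -> {verdict}")
```

With these numbers the 12V PCIe rail comes out overloaded (33 A against a 30 A rating) even though the PSU's total wattage would look fine on paper.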
It's wild to me that we gave up hardware error correction on memory at the same time we increased memory sizes about 1000x, shrinking the die (and thus reliability) by a roughly similar amount.
This is true, but even today bit flips per GB/hour are still really low.
However failures in the memory chip -> chip pin -> dimm -> dimm connector -> motherboard -> CPU socket -> CPU pin -> CPU are pretty common. Sure ECC helps with random bitflips, but it's also very useful to diagnose something is broken in the CPU <-> memory chip pipeline. It's very frustrating to debug something when the main sign of the problem is a reboot.
I mostly just love the idea of Linux kernel development slowing down or halting because Linus got a bad stick of RAM and doesn't like working on his laptop.
It's not that he doesn't like his laptop it's just that horsepower wise it simply doesn't compare to his monster Threadripper machine when it comes to compiling kernel trees.
It's not politics, but economics. Near textbook perfect price discrimination (market segmentation) by a profit-maximizing quasi-monopoly (Intel) to extract the maximum surplus out of consumers.
I miss the old, less politically-correct Linus who didn't pull his punches.
> PS. And yes, my system is all set up for ECC - except I built it
during the early days of COVID when there wasn't any ECC memory
available at any sane prices.
How much was ECC ram at that point? Linus is incredibly wealthy and was building a machine that would be used to gate Linux releases, so I'm very curious how much was not "sane" for his use case.
When I last looked DDR4 ECC UDIMMs were 2x the price of non-ECC UDIMMs of the same capacity.
This feels very wrong when the only difference is one additional chip that in terms of material only should increase the BOM price about 12.5% (going from 8->9 memory chips).
It's a 3-5 order-of-magnitude difference in volume (and most of the time people buy high-capacity modules). ECC errors make it very obvious that a module needs replacing, while unregistered errors on non-ECC modules are just another generic 'something went wrong' failure that can't be diagnosed easily.
Also, one of the primary markets for ECC UDIMMs is servers/pro-workstations; this alone adds at least 10%.
Research has shown that a computer with 4GB of memory has a 96% chance of having a random “bit flip” every three days. That's a crazy high chance of data corruption occurring on your computer.
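Taking that quoted 96%-in-three-days figure at face value and modeling flips as a Poisson process, the implied rate works out to roughly one flip per day per 4 GB:

```python
# Back-of-the-envelope: if P(at least one flip in 3 days) = 0.96 for
# 4 GB, a Poisson model gives the implied flip rate. The 96% figure is
# the (unsourced) claim above, taken at face value.
import math

p_flip = 0.96
hours = 3 * 24
rate_4gb = -math.log(1 - p_flip) / hours   # flips per hour, per 4 GB

print(f"{rate_4gb:.4f} flips/hour per 4 GB")   # ~0.045, i.e. ~1 per day
print(f"{rate_4gb / 4:.4f} flips/hour per GB")
print(f"{rate_4gb * 24 * 365:.0f} flips/year per 4 GB")
```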
Not all of these can be corrected, but uncorrectable ones are extremely rare. A 1 Gigabit ECC DRAM contains 16 million blocks of 64-bit data words, and one error per 64-bit word is correctable. In other words: statistically, one out of 16 million hits might be a double-bit error.
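The single-error-correct / double-error-detect behavior described above can be sketched with a textbook extended Hamming (72,64) code. This is a toy model of how side-band ECC stores 64 data bits plus 8 check bits per word, not any vendor's actual scheme:

```python
# Toy SECDED (single-error-correct, double-error-detect) code over a
# 64-bit word, mirroring the 72-bit codewords ECC DIMMs store: a
# textbook extended Hamming code, not any vendor's actual scheme.

# Data lives at positions 1..71 that are not powers of two; check bits
# sit at the powers of two; index 0 holds an overall parity bit.
DATA_POS = [i for i in range(1, 72) if i & (i - 1)]  # 64 positions

def encode(data: int) -> list[int]:
    code = [0] * 72
    for i, p in enumerate(DATA_POS):
        code[p] = (data >> i) & 1
    for c in range(7):                      # check bits at 1,2,4,...,64
        mask = 1 << c
        parity = 0
        for i in range(1, 72):
            if i & mask:
                parity ^= code[i]
        code[mask] = parity
    for b in code[1:]:                      # overall parity over the word
        code[0] ^= b
    return code

def decode(code: list[int]):
    """Return ("clean"|"corrected"|"uncorrectable", data or None)."""
    code = list(code)
    syndrome = 0
    for i in range(1, 72):
        if code[i]:
            syndrome ^= i                   # points at a single bad bit
    overall = 0
    for b in code:
        overall ^= b
    if syndrome and overall:                # one flip: fix it in place
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome:                          # two flips: detect only
        return "uncorrectable", None
    else:
        status = "corrected" if overall else "clean"
    data = 0
    for i, p in enumerate(DATA_POS):
        data |= code[p] << i
    return status, data

word = 0x0123456789ABCDEF
cw = encode(word)
cw[13] ^= 1                                 # single bit flip
print(decode(cw)[0])                        # -> corrected
```

Flip one bit and the syndrome pinpoints and repairs it; flip two and the code can only report that the word is bad, which is exactly the "one out of 16 million" case above.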
All my workstations at home use ECC Reg RAM purchased online from ebay/alibaba. Most of them are from vendors selling "decommissioned" server parts. There are several good incentives to do that -
- such second-hand server parts are cheaper than normal consumer-grade parts
- you get extra protection & performance
256GB (32GB x 8) Samsung DDR4-2933 ECC Reg RAM for USD $550, it is pretty hard to say no to that.
I replaced the RAM in my storage machine because of this earlier this year too.
ZFS was reporting all sorts of errors, yet drive tests were showing no issues. I bought about 3 new drives before I realised what the root cause was. A real PITA. Next time I do a hardware refresh, ECC is definitely on the menu.
In reality ECC is like twice the price, and CPU support is close to non-existent too (Intel has just been disabling it in the memory controller for ages... unless it's an i3 laptop - then it's available again)
Just try and buy a reasonable non-server class machine that has ECC.
AFAIR, every AMD64 CPU has ECC support. But not every motherboard had necessary layout and BIOS support. That's why I'm 20 years with AMD and choosing components carefully. Every system with > 4GB RAM should have ECC. Proven decades ago.
Not sure where you are. But I just compared Crucial DDR5 32GB DDR5-4800 (PC5-38400) and the non-ECC (on CDW.com) is $186.99 for part (CT32G48C40U5) and $229.99 for part (MTC20C2085S1EC48BA1R) a 22% premium.
All Alder Lake chips (the ones shipping in volume today) support ECC; check ark.intel.com to be sure. For example the popular i7 model (https://ark.intel.com/content/www/us/en/ark/products/134591/...) All Ryzen desktop chips (5000 and 7000) do as well.
Granted popular consumer products don't have ECC, but it's pretty straight forward to built it yourself. Just make sure the motherboard you buy (Intel or AMD) supports ECC.
It's also slower. So if you also want performance parity you need faster chips. The scam is with Intel making it hard for consumers to opt into ECC without getting the blessed chipset/CPU combos.
The problem is not just the price, but also that you need a “server” CPU and/or mainboard/chipset, which comes with additional trade-offs. Intel has been artificially restricting ECC support to its server product lines.
> it was literally a DIMM going bad in my machine randomly after 2.5 years of it being perfectly stable. Go figure.
I suspect that the final phrase above is sarcastic, as well as instructive and even possibly bragging, and Linus is perfectly aware that his machine gets an exceptional amount of use compared to the average Peanuts character. RAM wears out, "And some of the degradation is noticiable if you use it intensively (as servers do)."[1]
IIRC Intel "consumer" parts are qualified for 3 (or 5?) years of usage at a 30% duty cycle (and you might not find that figure publicly, but honestly you should, and Intel should be mandated to publish it). Now that is likely not exactly what RAM vendors do, but it likely gives a rough idea of what you should expect from consumer electronics. I'm not even sure all small Xeons have better specs, but for sure you can source some models in volume at a quite low extra cost, qualified for a higher duty cycle and more years of usage.
> qualified for a higher duty cycle and more years of usage.
Which is what? Double the 30% duty cycle? Would it be so beyond the realm of belief that Linus' box has a 80%-90% duty cycle? I'm not exactly sure where your position falls, that the RAM was defective, or that it lasted longer than expected given its increased use.
Expected by whom? Those who have access to private info? As far as I'm concerned, I don't see why I should expect anything less than 100% usage if a vendor does not warn me otherwise, and there are countries where the mandatory warranty is 2 years.
And that's just about the law. Don't get me started on resource consumption, or on my perfectly fine 10-year-old computers I'll be forced to stop using soon because of literal planned obsolescence (Microsoft's, which is another story, but they're a major actor of the computer industry too, and it will yield even more environmental destruction)
I've been through this ordeal recently, but I'm probably missing something anyway.
You need to have a compatible CPU/motherboard/chipset. For normal CPUs: AMD Ryzen non-Pro APUs don't have support for it, the rest of AMD's CPUs and chipsets have unofficial support for it. You'll have to check the motherboard vendor's support page if a certain board also has support for ECC. Then you need ECC memory modules and you should stick near the qualified vendors list (QVL) here since systems are kind of pickier with ECC memory. For Intel, you're out of luck except for the W680 chipset, but motherboards seem to be scarce.
For high-end desktop (HEDT) and workstations CPUs: AMD's Threadripper lineup have official ECC support, but still check with the motherboard vendor first. For Intel, most Xeons should do it, but check before you buy. The same caveat about motherboards applies here, too: Check if there's ECC support first and stick to the QVL to be safe.
I just built a desktop machine with such a board, a matching Ryzen and 128GB Kingston ECC memory. Works like a charm, the only problem is the on-board Intel Ethernet chip which ignores Wake-on-LAN (although it's supposed to handle it) so I had to add a PCIe ethernet card to get WoL running. Asus and Intel seem to discuss whose fault it is since two years, sigh.
You need system software to make it work right. You want to configure your system to halt as soon as possible after uncorrectable errors. You also need to prominently log correctable errors. How you achieve this is going to vary by hardware platform and operating system.
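On Linux, a minimal way to do the "prominently log correctable errors" part is to poll the kernel's EDAC counters in sysfs. This sketch assumes the standard EDAC sysfs layout and simply reports nothing on machines without ECC or without an edac driver loaded:

```python
# Sketch: poll Linux EDAC error counters from sysfs. Assumes the
# kernel's standard EDAC layout; on a machine without ECC (or with no
# edac driver loaded) the glob matches nothing and we report that.
import glob
import os

def edac_counts() -> dict:
    counts = {}
    for mc in glob.glob("/sys/devices/system/edac/mc/mc*"):
        def read(name):
            try:
                with open(os.path.join(mc, name)) as fh:
                    return int(fh.read().strip())
            except (OSError, ValueError):
                return None
        counts[os.path.basename(mc)] = {
            "correctable": read("ce_count"),
            "uncorrectable": read("ue_count"),
        }
    return counts

counters = edac_counts()
if not counters:
    print("no EDAC memory controllers found (no ECC, or driver not loaded)")
for mc, v in counters.items():
    # Any nonzero uncorrectable count is halt-worthy; a rising
    # correctable count on one module is the early warning to replace it.
    print(mc, v)
```

In practice you'd run something like this from cron (or just use rasdaemon) and alert on any increase.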
Linus uses a Threadripper machine, which supports ECC and non-ECC.
Most mobos support non-ECC, and maybe ECC if it's AMD and the board vendor wired it up. (ASUS and someone else I can't recall seem to do so; Gigabyte does not appear to.)
Thanks! ASUS pro MB here (on the sole non-Apple device around here lol) so maybe it will work. Also will keep all this in mind when upgrading, which could be in the cards.
I understand why any commentary from Torvalds is notable, but when I read this my takeaway isn't that we need ECC everywhere, its that a hardware component failed much sooner than it should have (I would have liked to have known who the manufacturer was) and that his ability to accurately diagnose the failing component was hampered by a shortage of effective diagnostic tools and perhaps his own lack of hardware troubleshooting experience. But maybe I'm missing something here.
I wonder if he couldn't have used the memtest kernel parameter:
    memtest=    [KNL,X86,ARM,M68K,PPC,RISCV] Enable memtest
                Format: <integer>
                default : 0 <disable>
                Specifies the number of memtest passes to be
                performed. Each pass selects another test
                pattern from a given set of patterns. Memtest
                fills the memory with this pattern, validates
                memory contents and reserves bad memory
                regions that are detected.
DDR5 does have "on chip ECC", but that doesn't protect as much as full ECC. Bit errors can be introduced after the bits leave the chip. It could happen on the ram chip pins, in the DIMM, the DIMM connector, the motherboard, the CPU pins, or in the CPU.
So it's not just the error reporting, it's protecting against errors on the whole pipeline, not just inside the chip.
Historical note: PC RAM modules (SIMMs) in the 80's and 90's commonly had 9 bits per byte. The extra bit was parity, a simple form of error detection. Eventually, as a cost-saving measure, some modules faked the parity bit.
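The 9th-bit scheme is easy to illustrate: one extra bit per byte keeps the total number of ones even, so any single flipped bit is detectable (though never correctable, and an even number of flips slips through):

```python
# The old 9-bit scheme: store one parity bit per byte so the total
# number of ones is even. Detects any single-bit flip; corrects
# nothing, and an even number of flips passes unnoticed.

def parity_bit(byte: int) -> int:
    return bin(byte).count("1") & 1        # 1 iff the byte has odd ones

def check(byte: int, stored: int) -> bool:
    return parity_bit(byte) == stored

b = 0b1011_0010                        # four ones -> parity bit 0
p = parity_bit(b)
assert check(b, p)                     # intact byte passes
assert not check(b ^ 0b0000_0100, p)   # one flipped bit is caught
assert check(b ^ 0b0000_0110, p)       # ...but two flips slip through
print("parity demo ok")
```

The "fake parity" modules mentioned above skipped the storage entirely and just generated the expected bit on read, so every check trivially passed.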
I don't get it for SFPs either; it seems easy enough to get SFP modules coded for whatever NIC or switch you have, so the segmentation doesn't even seem to work.
Intel is a big company, so over its many decades of activity many Intel employees have done a lot of good for the progress of the computing industry. But there have also been too many Intel employees who made horrible decisions that cost millions of Intel customers an amount of money and time that is impossible to evaluate, because in such cases there is not enough information to disambiguate the causes of incidents between various kinds of hardware problems and software bugs.
The most damaging Intel decision was about ECC memory, but there were many other, less impactful ones, e.g. the various ugly workarounds for Microsoft's laziness in adding necessary features to Windows, such as System Management Mode or the Management Engine.
The problem with ECC memory was created by Intel in 1994-1995, when Intel split its top line of CPUs into two branches, Pentium (the second generation, at 90 or 100 MHz) and Pentium Pro.
For more than a decade, since the introduction of the IBM PC, all compatible personal computers had implemented memory error detection, even if it was possible to use memory modules without error detection if one did not care about the reliability of the computer.
With Pentium and Pentium Pro, Intel decided to introduce a market segmentation feature: it reserved the use of ECC memory for the "professional" Pentium Pro, while removing support for memory error detection from the Triton chipsets made for the Pentium CPUs (at that time, before AMD integrated the memory controller, the memory controller was still part of the external northbridge chip).
The successors of Pentium Pro were rebranded as "Xeon" and for a long time continued to be the only Intel CPUs with ECC support.
The so called "market segmentation", even if it is practiced by a large number of companies, is just a combination of fraud with blackmail, which should have been forbidden by law in most cases.
To introduce market segmentation, a company takes advantage of the fact that the majority of its customers are naive and they are not able to evaluate correctly the quality of a product that they purchase.
The company then uses this fact to extract much more money from the fewer customers who actually know how to evaluate quality. It convinces the naive customers that a lower quality is good enough for them, then lowers by various means the quality of the products sold at a decent price, in order to be able to charge a premium to the quality-aware customers, who are forced to pay because they have no alternative: the products at the right price do not have the right quality.
This scheme would not work in a competitive market, but when there are few competitors they usually follow the example of the first company which did that and they introduce the same market segmentation policy, because this will increase the profits for all.
Now, because AMD did not disable ECC in Ryzens (even if AMD provided much worse software support for the feature than Intel, in the EDAC device drivers, at least until recently), Intel has eventually been forced to enable ECC in many models of Alder Lake and Raptor Lake.
Nevertheless, after many decades of lack of support in consumer CPUs, there is a lot of inertia to overcome in the availability of ECC.
Even if now it is easy to find Intel desktop CPUs with ECC support, the ECC support on motherboards requires the special workstation chipsets, so the socket LGA 1700 motherboards with ECC support are hard to find and they are either expensive or with underwhelming features.
All Intel CPUs for mobile applications (the U, P and H series) continue to lack ECC support. Only the HX series for laptops, which are desktop chips packaged in BGAs, have ECC support.
Previously, even AMD had implemented a market segmentation by disabling ECC in their laptop CPUs. That has changed in the Ryzen 6000 Rembrandt series, which have ECC support. However the ECC support remains theoretical, because until now no laptop manufacturer has introduced any laptop with an AMD mobile CPU and with ECC memory, so there is no competition yet for the Intel mobile workstations.
Yes. It should have been the default all along, but Intel has been using a strategy of market segmentation for many decades. This has made ECC rare in home computers (even in servers it isn’t ubiquitous), and more expensive than it needs to be. You have to pay extra for the CPU that supports it (even if the Xeon chip you buy is otherwise identical to the i7 you could have bought). The motherboard costs extra too, naturally. The memory itself has to have a ninth memory chip on it so you expect it to cost a little more, but usually it costs a lot more and isn’t made to the same specs. You end up with the sad choice of overclocked ram or safe ram in your gaming machine, and most people go with fast.
Doesn't have to be default, just nice to have it as an option.
Currently, it is difficult to buy a NAS (where bit flips are arguably more important to avoid) with ECC memory, unless custom-building and carefully selecting parts.
- The amount of memory that leaves the factory that has errors is a low percentage, not sure what it is but we can all agree it's low.
- You can run memtest when you first install memory for several hours or so to be 99.99% sure your memory is good
- There are outside influences and EXTREMELY RARE cases (cosmic radiation etc.) where you may still get an error. If you had ECC it would protect you.
- If you start getting errors (as in more than a one-off cosmic ray flip) later, you can test the memory again and determine it's damaged
- Most memory you buy for builds has a lifetime warranty
- ECC costs more and reduces performance by some amount
So the only danger of not having ECC is exceedingly rare memory errors. If those rare instances mattered that much then you should get ECC but for everyone else how is it worth it?
The argument in this post is that he wouldn't have had to go through this since the ECC would have covered the error. I just don't see the value of adding this safety system.
The problem with faulty memory (that could go bad after purchasing) is not knowing about it. ECC doesn't always protect you (it can only fix one bit flip, not more), but at least you will know the memory is bad. You will not keep working with data that is being corrupted.
Yes, early warnings about bad memory modules are probably the most useful ECC feature.
I have always used only ECC memory in any computer larger than an Intel NUC.
When the memory modules were new, they always had very low rates of correctable errors, e.g. one error after 3 to 6 months of continuous operation.
Nevertheless, I had several cases when a certain memory module started to have very frequent errors after several years of working fine. Due to ECC, I was able to identify it and replace it, before causing irreparable data corruption in files.
Moreover, in one case I had a laptop with ECC memory that seems to have used some poor-quality SODIMM sockets. After not being used for several months (which made it more sensitive to air humidity, since it wasn't kept warm by use), the contacts of the sockets appear to have oxidized, so when I used the laptop again I saw very frequent memory errors.
Eventually, after some time wasted on investigation, I scrubbed the contacts of the SODIMM sockets and reseated the memory modules, and the errors disappeared.
ECC is somewhat less necessary in those laptops and small computers that have soldered DRAM chips, both because the total amount of RAM is small (the error frequency is proportional with the total amount of RAM) and because there are no sockets and long PCB traces (which are susceptible to electrical noise) between CPU and RAM.
At least for all computers that have socketed memory, there should have been a customer protection law forbidding the sale of such computers without ECC memory, because it is not acceptable to use a computer that may produce at any time undetectable errors.
I get what you're saying, but nobody is going to run three trials of memtest86 on their machine and chalk the data corruption up to cosmic radiation. When memory dies are broken, they're broken. It's pretty simple to ascertain that, if you suspect your memory has gone bad.
You're assuming you have any decent, prompt way of detecting emergent errors before it causes damage. There is none. Whether the corruption hits a pointer (likely to cause a crash) or a tax document is effectively completely random.
That's assuming the thing you are currently doing is what crashes, and not instead creating data (a document, a compiled binary, a financial transaction, a crypto key) which is silently corrupted and the corruption is only detected way later. C.f. xerox photocopier bug (not saying that was a memory bug, example is an insidious bug).
A lot of damage can be done before you get an obvious memory-induced crash. Software crashes all the time for reasons other than memory issues, even someone with deep technical knowledge wouldn't necessarily jump to that conclusion. And everyone else would A) probably never even think of the possibility, B) not know how to test their memory, C) probably buy a whole new laptop instead
I've had memory errors make many directory entries disappear in one directory on XFS, noticed only months later. When you write out bad data and the computer says everything is fine, you won't necessarily be able to get your data back.
So let me understand this. Your data is important and you don't want even rare errors to occur, so you want ECC memory. However you use a non-journaling file system which is more likely to lose data?
"So the only danger of not having ECC is exceedingly rare memory errors. If those rare instances mattered that much then you should get ECC but for everyone else how is it worth it?"
The fact that DDR5 has built-in error correction for data once it has arrived at the memory chip should tell the story. It's still not full ECC, as that requires cooperation with the memory controller, yet the price in materials is already there. But yes, without any shadow of a doubt: the first and most important property of any program is correctness; optimizations come afterwards.
The internal ECC of DDR5 is not good enough. Its only purpose is to restore the reliability of DDR5 to the level of DDR4, despite having smaller cells and higher throughput.
The only useful ECC for the user is the one computed in the memory controller inside the CPU, stored in the DRAM and verified after returning to the memory controller.
This allows the CPU to be aware of any error and it also corrects or detects the errors caused by electrical noise on the PCB traces or by bad memory sockets, not only those caused by bit flips inside the memory cells.
So using your same rhetorical style, let me respond:
- While the error rate of memory is "low" (however you define that to mean), it is not zero, so the risk of memory errors persists.
- A machine without ECC memory has no reliable way to detect memory errors without some type of external diagnostic.
- While a memory test can (hopefully) detect faulty memory, it takes the computer out of operation for however long the test is run, and even then, it's simply a point-in-time test. It cannot detect memory errors that happened in the past, or that will happen in the future once the test has ended.
- ECC provides a mechanism to reliably detect memory errors as they occur, continuously, while the machine is running and performing useful work.
- While many memory manufacturers offer lifetime warranties on their memory modules, they cannot possibly warrant against data corruption and malfunctions caused by memory errors, which can have a much higher cost to the user than the modules themselves (and would almost certainly be more than the BOM cost difference between ECC and non-ECC modules).
- ECC has been cited by Microsoft and Linus Torvalds as desirable and something that should be broadly adopted, and ECC is commonly found in a wide variety of memory products (e.g., caches and solid state storage), with the glaring exception of main memory on consumer PC hardware.
- While ECC does cost more (all else being equal), the side-band ECC being discussed is the same effective speed as non-ECC memory. The overhead of ECC is canceled out by the ECC DIMM's extra capacity and bandwidth relative to the non-ECC DIMM.
What do you need me to provide a source for? I normally don't cite for common knowledge and generally accepted facts. However what I consider common knowledge in this community could be wrong.
But the value is not having to go through what he did, right?
And the whole points made above are pretty much the exact political points Linus is referencing. It’s clear in his opinion that ECC should just be normal, or at least not so elite.
It's the policy of CPU manufacturers to separate ECC/Non-ECC Support.
For example, 12th gen intel CPUs that already exist, can now support ECC because a new chipset enables it. It was "policy" to not release ECC support for a period.
Since desktop ECC gets around this by having physically more RAM ICs (usually 9 instead of 8, for example), what is the impediment from having a similar solution to Nvidia? I'd readily take a hit to memory capacity* and performance in exchange for ECC.
Why can't the memory controller already do this?
I should note, I'm mostly thinking of my NAS. I know ZFS can be run without ECC and some consumer solutions do. However, it seems ZFS should be run with ECC. I've already experienced observable bitrot with older images and video files, I'd rather not let it progress.
[*] in this case, 12.5% if we follow typical desktop ECC allocations
[0] https://www.nvidia.com/content/Control-Panel-Help/vLatest/en...
[1] https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/ind...