What I'm curious about is that some Nvidia cards, like the A2000, have ECC but only enough chips for a regular round amount of RAM, like 6GB or 12GB. So when ECC is enabled, 6.25% of the RAM is used for the ECC bits. [0; 1, in the notes]
Since desktop ECC gets around this by physically having more RAM ICs (usually 9 instead of 8, for example), what is the impediment to Nvidia having a similar solution? I'd readily take a hit to memory capacity* and performance in exchange for ECC.
Why can't the memory controller already do this?
I should note, I'm mostly thinking of my NAS. I know ZFS can be run without ECC and some consumer solutions do. However, it seems ZFS should be run with ECC. I've already experienced observable bitrot with older images and video files, I'd rather not let it progress.
[*] in this case, 12.5% if we follow typical desktop ECC allocations
The primary problem is that on CPU memory systems, all the requests are always 64 bytes, and the entire system starting from the CPU caches and ending at the arrays in the DIMMs is designed to efficiently serve those requests at the lowest possible latency.
In-band ECC means significant sacrifice of performance on a system not designed for it. Random read throughput doesn't go down by 6.25%, it goes down by half.
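A toy model (my own sketch, with assumed numbers) shows why the hit is so uneven between access patterns: one 64-byte ECC line holds the check bits for 16 data lines (~6.25% overhead), so sequential reads amortize the extra fetch while random reads pay it on every access.

```python
# Toy model of in-band ECC: check bits live in ordinary DRAM lines.
# One 64 B ECC line covers 16 data lines, so the cost of fetching it
# depends entirely on how often it can be reused.

ECC_COVERAGE = 16    # data lines covered by one ECC line (assumed)

def transactions(n_reads, sequential):
    if sequential:
        # one extra ECC fetch per ECC_COVERAGE data lines
        return n_reads + -(-n_reads // ECC_COVERAGE)
    return 2 * n_reads   # every random read drags in its own ECC line

print(transactions(1000, sequential=True))    # 1063 -> ~6% overhead
print(transactions(1000, sequential=False))   # 2000 -> halved throughput
```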
> In-band ECC means significant sacrifice of performance on a system not designed for it. Random read throughput doesn't go down by 6.25%, it goes down by half.
But adjusting DDR for that could be pretty easy. Instead of a burst of 16 transfers, do 18. It's already set up to stream longer transfers when desired.
There will be more overhead than making the sizes properly match, but it shouldn't be anywhere near cutting throughput in half.
> But adjusting DDR for that could be pretty easy. Instead of a burst of 16 transfers, do 18. It's already set up to stream longer transfers when desired.
That's not really how DDR5 works. The granularity of column addresses is (iirc) 32 bytes, and you cannot do transfers that are of any length other than 64 or 32 bytes (and 32 bytes only with burst chop, which means that the bank is busy for the remaining 8 cycles). Bursts longer than 16 are really just multiple adjacent requests, with an optimized command.
You could change this, by completely changing how the memory modules themselves work, and by widening the column address for more granularity. Can't do it well by just tweaking the memory controllers.
Ah. If you make any silicon changes at all, it is orders of magnitude more expensive and "harder" than just using extra chips like normal ECC DRAM does.
> I know ZFS can be run without ECC and some consumer solutions do. However, it seems ZFS should be run with ECC. I've already experienced observable bitrot with older images and video files, I'd rather not let it progress.
From my understanding, the only risk to your data from non-ECC is a bit flip in RAM, pre-checksum calculation. In that unlikely scenario, you commit bad data to disk as good data (valid checksum). Bitrot isn't a factor at all.
This means that ECC RAM and ZFS are completely orthogonal concerns.
If your data is important enough to warrant ECC RAM, you should get ECC RAM whether you use ZFS or not.
If you want to use ZFS (for its volume management, compression, mirroring, healthchecks, whathaveyou), you should do so whether or not you have ECC RAM.
That is bitrot: you save correct data and it’s not retrievable. The fact that it happens in RAM rather than on the storage media, controller, or I/O channel just makes it a different category.
It is also far, far more likely that an uncorrected bit flip happens outside the relatively small portion of time the kernel spends in filesystem code. This is not a ZFS-specific problem by any means.
> From my understanding, the only risk to your data from non-ECC is a bit flip in RAM, pre-checksum calculation. In that unlikely scenario, you commit bad data to disk as good data(valid checksum).
Wouldn't an option to do it twice in different memory regions be nice? I'm pretty sure in many use cases sacrificing performance for greater reliability wouldn't be an issue. Given how many cores we have available nowadays, it could potentially not even have that much impact on performance.
Also, are there any software solutions (like a kernel patch) which would do "software ECC"? I imagine in this case the performance hit would be quite devastating, but it still could be an acceptable trade-off for NAS-like systems where you want to have lots of RAM for dedup and cache but it's not a busy system.
There is still a race condition: if you read data from disk into a buffer, make a copy of the buffer, then do 2 checksums, the bit flip can still occur before the 2nd copy is created.
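As a sketch of why the window narrows but never closes (hypothetical code, not any real kernel patch): if the flip lands before the first copy is taken, both copies agree and the bad data gets committed with a valid checksum anyway.

```python
import hashlib
import os

def double_checksum_write(buf):
    """Hypothetical 'software ECC' for a write path: checksum two
    independent copies and only commit when they agree."""
    copy_a = bytes(buf)
    # A bit flip in `buf` BEFORE this line corrupts both copies:
    # the two checksums still agree, and bad data is committed as
    # good. The race window is narrower, not closed.
    copy_b = bytes(buf)
    digest_a = hashlib.sha256(copy_a).digest()
    if digest_a != hashlib.sha256(copy_b).digest():
        raise IOError("memory corruption detected, retry the write")
    return copy_a, digest_a        # data + checksum to commit

data = os.urandom(4096)
blob, digest = double_checksum_write(data)
assert digest == hashlib.sha256(blob).digest()
```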
A fuller instance of the first quote:
"Through careful and thorough fault injection, we show that ZFS is robust to a wide range of disk faults. We further demonstrate that ZFS is less resilient to memory corruption, which can lead to corrupt data being returned to applications or system crashes."
...i.e., ZFS is 'less resilient' in comparison to its own robust disk fault handling, not less resilient to memory corruption in comparison to other filesystems. The parent quotation above implies that ZFS is more sensitive to memory corruption than other filesystems, but that is not claimed in the referenced paper.
Market segmentation on what platforms/controllers support ECC: fine, whatever. But market segmentation of what is an “ECC DIMM” vs. a “regular DIMM”? It makes no sense that the commodity memory manufacturers have any leverage to enforce that segmentation.
Is it just laziness on the part of the platform vendors (who do have leverage) to not simply allow ECC with any DIMMs by giving over 1/k {bits, lines, pages, chips, whatever-granularity-they-reason-about} to parity?
However, Intel guarantees ECC will work on their "workstation" chipsets. AMD doesn't guarantee ECC will work on their desktop/workstation chipsets; you have to go up to Epyc to find guaranteed/tested ECC.
I’m constantly surprised that it’s not commonplace to use on-disk parity files.
It’s so uncommon that the PAR3 format was never really finished and no one has created a replacement that handles subfolders.
Why I’m surprised: Not only does it solve the problem of bit-rot, but the parity files can be moved to USB sticks, NAS drives, Mobile devices, etc and the original files can be verified/repaired by any device that understand the parity file format. PAR2 is still great for photos/audio/video, as well as any flat-folder assets.
PAR is somewhat unwieldy to use. In addition to needing to explicitly create it (and it not being particularly fast, on a large enough data set), PAR2s can't be 'updated'.
The PAR3 spec allows for some limited updating, but it's far from ideal.
It often makes more sense for the file system to deal with ECC in my opinion. PAR probably makes more sense for archived files that aren't expected to change, but may be moved across file systems.
PAR2 handles subfolders by the way, just not empty folders.
Not exactly: The current PAR format does not make sense for this use-case (including because of the limitations you mentioned), but IMO the technology does.
Files with on-disk ECC can be moved from cloud to cloud, cloud to desktop, filesystem to filesystem, desktop to stick, then stick to NAS all without losing ECC protection. No single filesystem can do that.
Fair question. What I’m referring to is file backup and archive for anything up to enterprise level.
So specifically: photography archives, videos (including b-roll for content producers/videographers), project backups, personal files, important documents, etc. Up to and including anything that could be posted to r/datahoarders
1. Tiny files in bulk
Some folders unavoidably have tiny files in bulk (document backups can be like this; one other example that jumps to mind: macOS applications with translation files)
In these cases, PAR/PAR2 have issues with the block size (they can only have one file per block, which leads to a lot of wasted space)
2. Tracking changes across filenames
This is counterintuitive, but I've run into this enough to mention it: if the item to archive is a folder where the contents might change over time, any single file might get renamed and its contents slightly modified. A parity file tool could look at the blocks that have not changed, recognize the rename, and "correct" the reference before doing more processing. If it's a valid change to the file, that saves the work required to recalculate the whole file; if it's damage to the renamed file, it can be repaired simply.
3. Being able to update in-place
Sometimes the ideal is to create parity files for a folder, even if that folder is actively used (say for example b-roll that changes by 10% maybe once a month). A parity tool could update that 10% without having to recalculate the whole thing (Ideally this would be adding files similar to ‘git add’ so that someone does not accidentally add file damage to the parity set)
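As a rough illustration of point 1 (my own arithmetic, with made-up numbers): because a PAR2 input block may hold data from only one file, every file smaller than the block size pads out a whole block.

```python
import math

def par2_slack(file_sizes, block_size):
    """Padding wasted when each input block may hold data from
    only one file (the PAR2 constraint)."""
    used = sum(math.ceil(s / block_size) * block_size
               for s in file_sizes)
    return used - sum(file_sizes)

# e.g. 10,000 tiny 1 KiB translation files with a 64 KiB block size:
slack = par2_slack([1024] * 10_000, 64 * 1024)
print(f"{slack / 2**20:.0f} MiB of slack")   # 615 MiB
```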
I see.
Yeah, PAR3 tries to address the 'lots of files' scenario better than PAR2.
But updates are problematic. You could 'delete' parts of a recovery file if the data is present. However a file being updated typically means that the original data (before the update) is no longer present, meaning you can't really 'delete' that part of the recovery to replace it with the new data.
You could try and retrieve the old data by recovering it, but at this point, it may just be easier to recreate the entire PAR set again.
If it's just adding to the recovery set, PAR3 does have provision for that.
I should thank you, though it’s also meant I’ve lost most of the last 3 days:
After our discussion I went and found the current work on PAR2/PAR3 (which, for those in the know: the current PAR3 format is not the old one that was never finished, but a spec that's been re-written and is close to completion, with real-world tools close behind) and I have a lot more hope for the future of parity files.
I still think they are wildly under-utilized (BackBlaze has always used them, but they are the only business I know of), but we might be having a different conversation in 2-3 years.
The impediment to adding more chips is the same as it is without ECC: more die space/heat/power. The reason they use a standard number of chips, I expect, is that it's easier to manage and GPUs don't care very much about weirder access sizes.
I'm currently thinking of purchasing a Synology NAS that comes with BTRFS. Just wondering, do you happen to know if BTRFS also requires ECC RAM to function correctly?
It should be noted however that a bunch of cheap brands do not even offer ECC variants, and those may dominate the lower end of the price spectrum. So getting ECC memory may also involve choosing a more pricey brand.
This is irrelevant. Back in 286/30-pin SIMM times, when every motherboard supported ECC, the price difference was ~10%. ECC was still supported on almost every 386/486 board. Then came Intel with the 430 chipsets, artificially segmenting ECC exclusively into the expensive server-market HX offerings. https://en.wikipedia.org/wiki/Intel_430HX
Of course Intel segmenting ECC out of the client market is a major factor…
… which does not mean other factors are "irrelevant"…
… and also the price differential in ECC memory is not only a factor but also a metric as it shows that ECC memory has a "non-mainstay" surcharge created by the vast majority of users going for non-ECC. If demand were equal, the price would be 9/8 = 1.125×.
Demand can't be equal when nothing on the desktop supported ECC for 22 years (1995-2017), and even now it's only a select fragment of ~28% market share. Things would look different if you could reuse server RAM in any desktop board. 16-24GB DDR3 ECC RAM sticks sell at a 4x discount compared to regular old DDR3. I haven't looked into the DDR4 difference.
To be pedantic - this is HN after all - market segmentation is a business policy and relates to how a business is run. The word politics is used correctly here, but in a somewhat old-fashioned way; policies would probably be more usual. Linus indeed bemoans that the industry is using ECC as a differentiating factor.
The market is always downstream from politics. Karl Marx famously used the term political economy to reference both political and economic forces, because neither is coherent without the other.
The term political economy far predates Marx. I've not heard of Marx famously using it, but I don't doubt he did use it since it was an established field of inquiry at the time--the predecessor to the modern field of economics.
“In the 21st century, Karl Marx is probably the most famous critic of political economy, with his three volume magnum opus Capital: A Critique of Political Economy as one of his most famous books.”
I'm not sure what this is supposed to show. Marx being famous for his critiques of political economy is not the same as Marx famously using the term political economy in a certain way.
Political economy as an area of inquiry took off in the late 18th century--in no small part due to Adam Smith--and was already very well established before Marx was born [1]. Absent evidence to the contrary it seems questionable to assert that Marx's use of the term was anything other than the common use of the time.
I believe this is more about Intel who for probably a decade have sold consumer CPUs with memory controllers capable of using ECC memory but either disabling the feature in the chip, or more recently with Alder Lake, locking the feature behind an enterprise motherboard chipset despite the chipset not being relevant to ECC support.
Having to pay a 20% premium on RAM for the stability of ECC is one thing, having to pay double, triple, or quadruple the system price to disable arbitrary locks is another.
With AMD that limitation is somewhat lifted with normal consumer CPUs and quite a few motherboards having support. Unfortunately the RAM itself has very poor availability. Very few manufacturers offer it and it tends to stick to the standard speeds without any of the overclocked-from-factory parts available with ECC. That last bit is strange because ECC is reportedly very good as a validation tool for overclocking builds so it seems like the RGB-lighted market could end up valuing it as a high-end feature too. One of the RAM manufacturers should rebrand ECC to "Extreme Clocking Capacity" and start selling it as a feature to differentiate their RAM from the competition.
> Unfortunately the RAM itself has very poor availability.
That doesn't match up with my experience, even though I've seen it mentioned several times by people as if it's the case.
For example, when I went looking for ECC UDIMM sticks for my Ryzen 5600X build a year or so ago, I went looking at the website of my local computer supplier.
But it's not like there's any kind of availability problem. And the Kingston RAM I bought happily overclocked to 3200MHz without any effort on my part. Using an ASRock B550M Pro4 motherboard, for reference, if that's useful.
It will depend on where you are. I'm in Europe and it used to be that the usual retailers had no DDR4 ECC options at all. Now there are 1 or 2 available and still no options for DDR5 ECC. Looking at those pages your suppliers seem much better than what I've found so far and yet also have no DDR5 options. Meanwhile DDR5 non-ECC is very broadly available and DDR4 non-ECC has great and mature options.
> That last bit is strange because ECC is reportedly very good as a validation tool for overclocking builds
Well, yes, but it's not the only tool. In practice you'd use some validation software suite to find a reasonably stable configuration and afterwards pray. Overclockers, who rather than spending extra money on a CPU which assuredly delivers the requested performance stress their hardware beyond specifications, are the least likely to pay for the extra bits.
If non-ECC RAM is good enough for most faults to only cause noticeable problems rarely, it's possible to sell slightly less perfect modules than with ECC, where even otherwise unnoticeable problems are detected and reported.
The ratio of modules returned to the manufacturer is probably higher for ECC than non-ECC, unless ECC modules are already better selected than non-ECC at the manufacturer.
Because given the same chips you can push ECC RAM to speeds beyond what non-ECC RAM can do before corrupting. You can push beyond what the chips are capable of and rely on the error correction to keep your system stable.
ECC in common desktops/workstation/servers will correct all single bit errors and detect all 2 bit errors.
So sure, you can run DIMMs ever so slightly faster and fix the occasional single-bit flip, but even a single double-bit flip and a process or your kernel crashes.
Seems much saner to go for a safe, robust, and reliable system at standard clocks with ECC, instead of trying to get slightly more performance, which increases the chances of errors, corruption, crashes, and a shorter service life.
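The correct-one/detect-two behaviour above comes from an extended Hamming (SECDED) code. Here's a minimal sketch on a 4-bit value; real DIMMs use a (72,64) code, but the logic is the same.

```python
# Extended Hamming (8,4): SECDED on a 4-bit value. Positions 1,2,4
# hold parity, positions 3,5,6,7 hold data, position 0 is the
# overall parity bit that distinguishes one flip from two.

def encode(nibble):
    d3, d5, d6, d7 = [(nibble >> i) & 1 for i in range(4)]
    p1 = d3 ^ d5 ^ d7          # covers positions 1,3,5,7
    p2 = d3 ^ d6 ^ d7          # covers positions 2,3,6,7
    p4 = d5 ^ d6 ^ d7          # covers positions 4,5,6,7
    word = [p1, p2, d3, p4, d5, d6, d7]
    p0 = p1 ^ p2 ^ d3 ^ p4 ^ d5 ^ d6 ^ d7   # overall parity
    return [p0] + word         # list index i = code position i

def decode(word):
    syndrome = 0
    for pos in range(1, 8):    # XOR of the positions of set bits
        if word[pos]:
            syndrome ^= pos
    overall = 0
    for bit in word:
        overall ^= bit
    if syndrome and overall:   # single-bit error: correctable
        word = word[:]
        word[syndrome] ^= 1
        status = "corrected"
    elif syndrome:             # syndrome set but parity ok: two flips
        return None, "double error detected"
    else:
        status = "ok" if not overall else "p0 flipped"
    nibble = word[3] | word[5] << 1 | word[6] << 2 | word[7] << 3
    return nibble, status

cw = encode(0b1010)
cw[5] ^= 1                     # one flipped bit -> silently repaired
print(decode(cw))              # (10, 'corrected')
cw[2] ^= 1                     # a second flip -> detected, not fixed
print(decode(cw))              # (None, 'double error detected')
```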
How many seconds do you need to think about this to come up with counter examples? There are entire businesses based on selling things which are better or helping buyers find them, suggesting that the rush of perceived superiority isn’t warranted.
Some people can’t afford to buy anything but the cheapest but in most cases it comes down to not thinking that they use it enough to matter (e.g. the common homeowner advice to buy a cheap tool the first time & replace it better if you break it), not having a good way to tell whether something is actually better, or the market being such that there is a huge gap between the product bands. ECC falls into the latter two categories: the average buyer isn’t familiar with the issue and probably thinks the outcome would be a crash rather than silent data corruption, and Intel’s market segmentation means that you don’t have a choice in the consumer space and have to move into far more expensive and limited categories. It’s not reasonable to say price-sensitive consumers are the problem when that’s also saying “stop buying laptops”.
My 4-year-old laptop has ECC. To see if anything has changed, I just checked the recent models. The manufacturer's website didn't provide any possibility to filter for ECC or mobile Xeon, but an explicit text search returned some results. ECC still exists in laptops, it's just a bit difficult to find.
Why pay for something I don’t need when there’s the option to not pay for it? - me
Just yesterday, I bought an extra 64GB for my home Linux PC. I absolutely couldn’t care less about it crashing or calculating the wrong result every blue moon (in practice: never), but I did choose the RAM sticks that were $10 cheaper.
That's totally fine. The problem is people that do need it not having the option to pay the extra without getting a totally different "workstation-class" computer.
torrents have checksums on the blocks, which your client already checks, because the internet doesn't guarantee packets arrive without corruption. it doesn't matter where the corruption comes from.
I’d go further than that and claim that almost everything of importance has checksums: Google Docs, Google Sheets, Git commits, all my important web accounts, the traffic with my bank's website, and so forth.
When I look at my daily home computer usage, it’s remarkable how little I calculate on my local computer that’s actually worth protecting.
How many people complain about memory corruption? This is a ridiculously rare issue, whose impact is minimal, if not nil, for so many people. There's an actual cost to providing ECC memory though; there's actually more hardware necessary to cover it. Obviously if more people needed it, scale might lower the cost, but it will still be more. For some it's important, and they will surely pay for it, so let them pay for it...
It's not like it wasn't available, Linus just forgot to put a reminder to buy some once available (backordering them would have been even more prudent).
Ridiculously rare? That's why CDs, DVDs, Blu-rays, caches, SSDs, spinning rust, etc. all have ECC, right? Even ancient file transfer protocols like XMODEM and ZMODEM have error correction.
Bit flips aren't uncommon; compression just makes them much more noticeable. Without compression users might just see an occasional application crash and relaunch the process. It seems common for people collecting photos, music, etc. for years to go back and find some corrupted. It's hard to say exactly what happened, but with compression a bit flip turns into a seriously corrupted file.
Keep in mind that radiation-caused bit flips are rare; some failure in the chip, pin, DIMM, DIMM connector, motherboard, CPU pin, or CPU is much more common. It's MUCH nicer to see "error on DIMM X, row Y, column Z" than a randomly crashing machine. Even Linus sounds like he spent hours tracking it down and thought he was hitting a kernel bug.
Isn't it worth 1/9th more memory chips to make the system robust in the face of wide variety of errors that can corrupt memory, which can lead to corrupted storage?
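The compression point is easy to demonstrate; here is a small Python sketch using zlib as a stand-in for any compressed format. A single flipped bit in the stream either breaks decoding outright or decodes to the wrong bytes, where the same flip in an uncompressed file would touch one character.

```python
import zlib

data = b"all work and no play makes jack a dull boy. " * 200
comp = bytearray(zlib.compress(data))
comp[len(comp) // 2] ^= 0x01     # flip a single bit mid-stream

try:
    out = zlib.decompress(bytes(comp))
    corrupted = out != data      # decoded, but to the wrong bytes
except zlib.error:
    corrupted = True             # stream or its checksum is now invalid
print("one flipped bit noticed:", corrupted)   # True
```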
Largely ignorant buyers of mass-market, commodified devices lead to intense cost wars that shed features that aren't "essential." If a buyer wants ECC in a laptop or desktop, typically they have to purchase a workstation-class machine that costs 4-5x of low-budget models.
Memory issues can be caused by the PSU, motherboard, the CPU, or the memory itself. Personally, I always run memtest86 and memtest86+ for 2 days on any new components.
The mitigation of bitsquatting requires ECC in network gear also. Furthermore, ECC isn't just about the memory type but having integrity in data buses, caches, storage, and network protocols also.
Home system is 96 thread, 512 GiB (16 x 32GB ECC Registered DDR4-3200), numerous SSDs and HDDs running RAID.
It does look like some Intel 12th gen and later parts support ECC [0, 1] - but not all as I don't see it on the i3, Pentium or Celeron parts... That said you apparently require a W680 chipset motherboard [2], so that's still going to be expensive. I much prefer the situation with AMD where ECC should work on all parts, even if not all motherboard mfgs enable or validate it.
It's funny because the Pentium G1610T that my old Microserver G8's came with did have it. Because those Pentiums are Xeon-Pentiums somehow. But they lack hyperthreading and AES-NI making them really crap for server tasks.
Intel's marketing is really in a league of their own. As soon as their branding starts making sense they will change it.
I'm actually still running one of those old G8 Microservers, though I swapped the crappy Pentium for one of the Xeon parts. I've looked a few times and I still haven't found anything that would be a good replacement.
The Microserver G10 Plus is not too bad. Replaceable CPU, iLO option, half the size of the G8 (though external power brick)
The regular G10 one is pretty useless obviously. Soldered AMD CPU, cheap design etc. No iLO. It's more of a home entertainment thingy, nothing enterprise-class at all.
I agree I'm looking for something lower power too. I have three G8's but they are each doing 50W idle so I can't run them 24/7. And this is with the lowest-TDP processor that is available for them! (E3-1220Lv2, 17W TDP). The iLO alone is consuming 5W even when it's off.
I'm not sure how much power the G10 Plus draws, but I haven't really considered it. It's too expensive still. I bought all my G8's for 175-200 euro each, and 2 of them even had a 60 euro cashback on top of that!! So I barely paid more than 100 for them brand new: crazy cheap pricing for a well-built 4-bay server. Each of the drives inside them cost more than the server itself :)
But 15-20W I think is very ambitious with 4 3.5" drive slots. 30W would be doable I think.
I'm confused, I've built a couple of computers and I've never had issues with the power supply. The only weird issue I've had so far was with a first generation ryzen processor causing hard locks on the computer when it was left on for more than a couple of days.
Also with some bad memory causing really really weird issues all across the system.
Power supplies? And some of the cheap ones I've used have just done their job and not caused any problems.
Do you have any examples of what's going on with this?
It's probably that they made good coolers and then decided to capitalize on the name by selling everything under the moon—i.e. reselling someone else's stuff.
The total wattage of a PSU isn't the issue. As another comment points out, it's the number of rails and amperage per rail. If the motherboard is drawing a lot of amps because of the CPU and you stick another high-current device on it, you will run into issues. If you had a SATA SSD that drew the same current, you likely wouldn't have had any issue, since it would have been on a different rail.
When possible I try to buy from companies that make a product, as opposed to just name/market/make cool stickers.
ThermalTake is just a reseller; Seasonic is an example of a company that makes power supplies. Seasonic sells to others for relabeling, but also sells direct to consumers.
I think a lot of people still don't know this stuff and it's quite reasonable to expect people not to know this stuff.
Not everyone is an electronic engineer, and if the overall wattage doesn't give the information people need, the sales/marketing should be forced to provide the necessary details right there on the box in "plain language"; RAM speeds are still a pet-peeve of mine.
That said, I've been building PC's for ~25 years and I'm yet to have a PSU fail on me. Anecdotal I know, and with some skew as I've always tended to build near or at the top end.
That said, the Ryzen hard lockups were a real disappointment for me, after spending north of £3500 on my desktop in the OG Ryzen era with an R7 1800X, I was left with a machine that frequently hard-locked when compiling code on all cores and AGESA is such a mess that if it boots at all, it takes several minutes at times to make it past POST.
I really hope the next gen build I make, whatever it is is more stable because I'm not planning to go back to Intel any time soon.
It's believed to be a fault with the Ryzen chips, though I don't believe AMD ever officially addressed it (although they were RMAing them repeatedly for the issue). I sent it back a couple of times for replacements, but the lottery wasn't in my favour, and although the one I ended up with is better, it's not perfect.
I was an early adopter, and I accept that, but it still left a bitter taste nonetheless, and ultimately it just meant that my pretty expensive computer barely got used because I hated dealing with it (I still have it but barely boot it these days).
It's more like: If you buy a very old PSU it won't devote enough to 12v.
The stock grey metal models are going to be out of date, but they're not that out of date. At this point I'd expect them to have a reasonable rail balance for a modern computer.
I can't speak to what the other poster was talking about but NVidia 3000 series caused a huge stir because it would jump in power consumption much faster than previous generations and many power supplies couldn't provide the wattage fast enough and would brown out the system. It was especially bad for low and mid-tier PSUs that were nearing their rated wattage limit at peak power.
edit: I misremembered some aspects of this. The 3000 series was actually spiking much higher than rated TDP and tripping overcurrent protection, versus just browning out. NVidia did recommend 850W PSUs for that first release of cards for that reason, but IIRC some 850W units still had issues.
I put heavy blame on Nvidia for that one. There is no reason their cards should be pulling these huge transient spikes. By putting these cards out with this flaw, then pointing at the power supply manufacturers, they are causing a problem and blaming someone else for it.
The power supply is just doing its job, when you far exceed its rating for even a short amount of time it's going to try to shut down to protect the computer because it thinks something inside of your computer is burning up.
You're right, I forgot about the fact the spikes were far in excess of the TDP of the card. NVidia should have taken a lot more responsibility rather than letting PSU manufacturers take a lot of the heat.
It (a 3090) actually tripped my quite expensive, quite high-end Seasonic Prime Titanium 850 watt power supply. But it was perfectly fine on my old EVGA Bronze 850 watt.
My team ran a desktop fleet for a few years. Power supplies were a top 3 failure component, and tended to go in waves when an OEM got screwed by counterfeit capacitors or other components.
My guess was that bad power was a contributor to many other issues, but the nature of the SLO was such that more complex issues resulted in a device swap.
Also, any kind of dirty environment drives higher AFRs.
Two builds is nothing. Since everything computers do is powered by electricity, a good PSU is paramount and literally the only thing I never go cheap with. Consider yourself lucky because a bad one can easily wreck your entire system.
Cheap cases used to come with crap power supplies even 15 years ago. It's great that's no longer the norm. Everyone wised up a bit on this one
Not really two; I have used 6 power supplies in total.
Maybe I've just been lucky, or I've just been buying quality stuff. Two of the ones I'm running now are 1,200W; the other two are in computers that don't use much power because they are basically idle servers all the time. They are also semi-reasonable quality.
But I have been running a computer on a 450W power supply that was very much having its limits pushed.
I also ran on a bottom-of-the-barrel super cheap 500W supply that I bought in the middle of 2020 when the whole supply crisis was happening.
That is why I was curious. My experiences didn't line up with what the OP was talking about, so I was interested to see examples of what was going on.
I think PSUs are a very common factor in system instability. If you're suffering from lock-ups or reboots and you aren't overclocking, I would look to the PSU first. That being said, you can go a little crazy if you pay too much attention to detailed PSU benchmarks; there's a threshold where a PSU is good enough. I mainly look at the efficiency curve when shopping for a PSU. A more efficient supply won't get as hot for a given load, so its components should last longer and have more headroom to degrade before they eventually do fall out of spec.
I reused a 10-year-old "gold" 850W Corsair PSU in my Zen 3 build. It had some noticeable inductor whine when the system was idle and I moved the mouse. After a year I got a 6800 XT GPU, and it kept trucking along for another few months. I did some GPU overclocking and stress testing one Saturday, and my PC wouldn't start the next day. It had a good run for PSUs, I suppose; it outlasted its 7-year warranty. I bought another 850W Corsair PSU to replace it, with "titanium" efficiency rating and a 10 (or was it 12?) year warranty.
I see issues all the time with external power cubes. A device goes wonky; I look up the spec, find it's DC 12V @ 2 amps or whatever, and replace the power supply. About 90% of the time the device is happy again.
This took me a while to learn when I built my first couple of machines. I never go cheap on motherboard or power supply nowadays. Not saying I buy top tier, but I don't go cheap and I refuse to help anyone build one with cheapo power supply.
I would argue it's amateur PC builders not speccing the PSU correctly that cause these issues.
In prebuilt PCs you don't see such issues because the manufacturer controls everything. They might not make the PSU itself but they will certainly get it made with the right specs.
In a home-built PC you need to consider not just the total power (wattage) but also the number of rails and the current (amperage) per rail. Exceed that and you will run into stability problems. Also, cheap means you get what you pay for. Get something made by Delta or Seasonic and you will have far fewer problems. Some of the cheap no-name brands don't even specify what kind of rails are in there.
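The per-rail bookkeeping described above is simple enough to sketch. All the rail names, ratings, and component draws below are made up for illustration:

```python
# Hypothetical build check: compare per-rail current draw against
# each rail's amp rating, not just the PSU's total wattage.
rails = {"12V_cpu": 20.0, "12V_pcie": 30.0, "5V": 20.0}  # amp ratings

loads = [                 # (rail, amps) per component, illustrative
    ("12V_cpu", 12.5),    # CPU under load
    ("12V_pcie", 25.0),   # GPU sustained draw
    ("12V_pcie", 8.0),    # GPU transient headroom
    ("5V", 4.0),          # SATA SSDs
]

draw = {}
for rail, amps in loads:
    draw[rail] = draw.get(rail, 0.0) + amps

for rail, amps in sorted(draw.items()):
    verdict = "ok" if amps <= rails[rail] else "OVERLOADED"
    print(f"{rail}: {amps:.1f} A of {rails[rail]:.1f} A -> {verdict}")
```

With these numbers the 12V PCIe rail comes out overloaded (33 A against a 30 A rating) even though the PSU's total wattage would look fine on paper.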
It's wild to me that we gave up hardware error correction on memory at the same time we increased memory sizes about 1000x, shrinking the die (and thus reliability) by a roughly similar amount.
This is true, but even today bit flips per GB/hour are still really low.
However failures in the memory chip -> chip pin -> dimm -> dimm connector -> motherboard -> CPU socket -> CPU pin -> CPU are pretty common. Sure ECC helps with random bitflips, but it's also very useful to diagnose something is broken in the CPU <-> memory chip pipeline. It's very frustrating to debug something when the main sign of the problem is a reboot.
I mostly just love the idea of Linux kernel development slowing down or halting because Linus got a bad stick of RAM and doesn't like working on his laptop.
It's not that he doesn't like his laptop it's just that horsepower wise it simply doesn't compare to his monster Threadripper machine when it comes to compiling kernel trees.
It's not politics, but economics. Near textbook perfect price discrimination (market segmentation) by a profit-maximizing quasi-monopoly (Intel) to extract the maximum surplus out of consumers.
I miss the old, less politically-correct Linus who didn't pull his punches.
> PS. And yes, my system is all set up for ECC - except I built it
during the early days of COVID when there wasn't any ECC memory
available at any sane prices.
How much was ECC ram at that point? Linus is incredibly wealthy and was building a machine that would be used to gate Linux releases, so I'm very curious how much was not "sane" for his use case.
When I last looked DDR4 ECC UDIMMs were 2x the price of non-ECC UDIMMs of the same capacity.
This feels very wrong when the only difference is one additional chip that in terms of material only should increase the BOM price about 12.5% (going from 8->9 memory chips).
It's a 3-5 order-of-magnitude difference in volume (and most of the time people buy high-capacity modules). ECC errors make it very obvious that a module needs replacing, while unregistered errors on non-ECC modules are just another generic 'something went wrong' failure that can't be diagnosed easily.
Also, one of the primary markets for ECC UDIMMs is servers/pro-workstations; this alone adds at least 10%.
Research has shown that a computer with 4GB of memory has a 96% chance of having a random “bit flip” every three days. That's a crazy high chance of data corruption occurring on your computer.
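Taking that quoted 96%-in-three-days figure at face value and modeling flips as a Poisson process, the implied rate works out to roughly one flip per day per 4 GB:

```python
# Back-of-the-envelope: if P(at least one flip in 3 days) = 0.96 for
# 4 GB, a Poisson model gives the implied flip rate. The 96% figure is
# the (unsourced) claim above, taken at face value.
import math

p_flip = 0.96
hours = 3 * 24
rate_4gb = -math.log(1 - p_flip) / hours   # flips per hour, per 4 GB

print(f"{rate_4gb:.4f} flips/hour per 4 GB")   # ~0.045, i.e. ~1 per day
print(f"{rate_4gb / 4:.4f} flips/hour per GB")
print(f"{rate_4gb * 24 * 365:.0f} flips/year per 4 GB")
```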
Not all of these can be corrected, but uncorrectable ones are extremely rare. A 1 Gigabit ECC DRAM contains 16 million blocks of 64-bit data words, and one error per 64-bit word is correctable. In other words: statistically, one out of 16 million hits might be a double-bit error.
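The single-error-correct / double-error-detect behavior described above can be sketched with a textbook extended Hamming (72,64) code. This is a toy model of how side-band ECC stores 64 data bits plus 8 check bits per word, not any vendor's actual scheme:

```python
# Toy SECDED (single-error-correct, double-error-detect) code over a
# 64-bit word, mirroring the 72-bit codewords ECC DIMMs store: a
# textbook extended Hamming code, not any vendor's actual scheme.

# Data lives at positions 1..71 that are not powers of two; check bits
# sit at the powers of two; index 0 holds an overall parity bit.
DATA_POS = [i for i in range(1, 72) if i & (i - 1)]  # 64 positions

def encode(data: int) -> list[int]:
    code = [0] * 72
    for i, p in enumerate(DATA_POS):
        code[p] = (data >> i) & 1
    for c in range(7):                      # check bits at 1,2,4,...,64
        mask = 1 << c
        parity = 0
        for i in range(1, 72):
            if i & mask:
                parity ^= code[i]
        code[mask] = parity
    for b in code[1:]:                      # overall parity over the word
        code[0] ^= b
    return code

def decode(code: list[int]):
    """Return ("clean"|"corrected"|"uncorrectable", data or None)."""
    code = list(code)
    syndrome = 0
    for i in range(1, 72):
        if code[i]:
            syndrome ^= i                   # points at a single bad bit
    overall = 0
    for b in code:
        overall ^= b
    if syndrome and overall:                # one flip: fix it in place
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome:                          # two flips: detect only
        return "uncorrectable", None
    else:
        status = "corrected" if overall else "clean"
    data = 0
    for i, p in enumerate(DATA_POS):
        data |= code[p] << i
    return status, data

word = 0x0123456789ABCDEF
cw = encode(word)
cw[13] ^= 1                                 # single bit flip
print(decode(cw)[0])                        # -> corrected
```

Flip one bit and the syndrome pinpoints and repairs it; flip two and the code can only report that the word is bad, which is exactly the "one out of 16 million" case above.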
All my workstations at home use ECC Reg RAM purchased online from ebay/alibaba. Most of them are from vendors selling "decommissioned" server parts. There are several good incentives to do that -
- such second-hand server parts are cheaper than normal consumer-grade parts
- you get extra protection & performance
256GB (32GB x 8) Samsung DDR4-2933 ECC Reg RAM for USD $550, it is pretty hard to say no to that.
I replaced the RAM in my storage machine because of this earlier this year too.
ZFS was reporting all sorts of errors, yet drive tests were showing no issues. I bought about 3 new drives before I realised what the root cause was. A real PITA. Next time I do a hardware refresh, ECC is definitely on the menu.
In reality ECC is like twice the price, and CPU support is close to non-existent too (Intel has just been disabling it in the memory controller for ages... unless it's an i3 laptop - then it's available again)
Just try and buy a reasonable non-server class machine that has ECC.
AFAIR, every AMD64 CPU has ECC support. But not every motherboard had necessary layout and BIOS support. That's why I'm 20 years with AMD and choosing components carefully. Every system with > 4GB RAM should have ECC. Proven decades ago.
Not sure where you are. But I just compared Crucial DDR5 32GB DDR5-4800 (PC5-38400) and the non-ECC (on CDW.com) is $186.99 for part (CT32G48C40U5) and $229.99 for part (MTC20C2085S1EC48BA1R) a 22% premium.
All Alder Lake chips (the ones shipping in volume today) support ECC; check ark.intel.com to be sure. For example the popular i7 model (https://ark.intel.com/content/www/us/en/ark/products/134591/...) All Ryzen desktop chips (5000 and 7000) do as well.
Granted popular consumer products don't have ECC, but it's pretty straight forward to built it yourself. Just make sure the motherboard you buy (Intel or AMD) supports ECC.
It's also slower. So if you also want performance parity you need faster chips. The scam is with Intel making it hard for consumers to opt into ECC without getting the blessed chipset/CPU combos.
The problem is not just the price, but also that you need a “server” CPU and/or mainboard/chipset, which comes with additional trade-offs. Intel has been artificially restricting ECC support to its server product lines.
> it was literally a DIMM going bad in my machine randomly after 2.5 years of it being perfectly stable. Go figure.
I suspect that the final phrase above is sarcastic, as well as instructive and even possibly bragging, and Linus is perfectly aware that his machine gets an exceptional amount of use compared to the average Peanuts character. RAM wears out, "And some of the degradation is noticiable if you use it intensively (as servers do)."[1]
IIRC Intel "consumer" parts are qualified for 3 (or 5?) years of usage at a 30% duty cycle (and you might not find that figure publicly, but honestly you should, and Intel should be mandated to publish it). Now that is likely not exactly what RAM vendors do, but it likely gives a rough idea of what you should expect from consumer electronics. I'm not even sure all small Xeons have better specs, but for sure you can source some models in volume at a quite low extra cost, qualified for a higher duty cycle and more years of usage.
> qualified for a higher duty cycle and more years of usage.
Which is what? Double the 30% duty cycle? Would it be so beyond the realm of belief that Linus' box has a 80%-90% duty cycle? I'm not exactly sure where your position falls, that the RAM was defective, or that it lasted longer than expected given its increased use.
Expected by whom? Those who have access to private info? As far as I'm concerned, I don't see why I should expect anything less than 100% usage if a vendor does not warn me otherwise, and there are countries where the mandatory warranty is 2 years.
And that's just about the law. Don't get me started on resource consumption, or on my perfectly fine 10-year-old computers I'll be forced to stop using soon because of literal planned obsolescence (Microsoft's, which is another story, but they're a major actor of the computer industry too, and it will yield even more environmental destruction)
I've been through this ordeal recently, but I'm probably missing something anyway.
You need to have a compatible CPU/motherboard/chipset. For normal CPUs: AMD Ryzen non-Pro APUs don't have support for it, the rest of AMD's CPUs and chipsets have unofficial support for it. You'll have to check the motherboard vendor's support page if a certain board also has support for ECC. Then you need ECC memory modules and you should stick near the qualified vendors list (QVL) here since systems are kind of pickier with ECC memory. For Intel, you're out of luck except for the W680 chipset, but motherboards seem to be scarce.
For high-end desktop (HEDT) and workstations CPUs: AMD's Threadripper lineup have official ECC support, but still check with the motherboard vendor first. For Intel, most Xeons should do it, but check before you buy. The same caveat about motherboards applies here, too: Check if there's ECC support first and stick to the QVL to be safe.
I just built a desktop machine with such a board, a matching Ryzen and 128GB Kingston ECC memory. Works like a charm, the only problem is the on-board Intel Ethernet chip which ignores Wake-on-LAN (although it's supposed to handle it) so I had to add a PCIe ethernet card to get WoL running. Asus and Intel seem to discuss whose fault it is since two years, sigh.
You need system software to make it work right. You want to configure your system to halt as soon as possible after uncorrectable errors. You also need to prominently log correctable errors. How you achieve this is going to vary by hardware platform and operating system.
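On Linux, a minimal way to do the "prominently log correctable errors" part is to poll the kernel's EDAC counters in sysfs. This sketch assumes the standard EDAC sysfs layout and simply reports nothing on machines without ECC or without an edac driver loaded:

```python
# Sketch: poll Linux EDAC error counters from sysfs. Assumes the
# kernel's standard EDAC layout; on a machine without ECC (or with no
# edac driver loaded) the glob matches nothing and we report that.
import glob
import os

def edac_counts() -> dict:
    counts = {}
    for mc in glob.glob("/sys/devices/system/edac/mc/mc*"):
        def read(name):
            try:
                with open(os.path.join(mc, name)) as fh:
                    return int(fh.read().strip())
            except (OSError, ValueError):
                return None
        counts[os.path.basename(mc)] = {
            "correctable": read("ce_count"),
            "uncorrectable": read("ue_count"),
        }
    return counts

counters = edac_counts()
if not counters:
    print("no EDAC memory controllers found (no ECC, or driver not loaded)")
for mc, v in counters.items():
    # Any nonzero uncorrectable count is halt-worthy; a rising
    # correctable count on one module is the early warning to replace it.
    print(mc, v)
```

In practice you'd run something like this from cron (or just use rasdaemon) and alert on any increase.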
Linus uses a Threadripper machine, which supports ECC and non-ECC.
Most mobos support non-ECC, and maybe ECC if it's AMD and the board vendor wired it up. (ASUS and someone else I can't recall seem to do so; Gigabyte does not appear to.)
Thanks! ASUS pro MB here (on the sole non-Apple device around here lol) so maybe it will work. Also will keep all this in mind when upgrading, which could be in the cards.
I understand why any commentary from Torvalds is notable, but when I read this my takeaway isn't that we need ECC everywhere, its that a hardware component failed much sooner than it should have (I would have liked to have known who the manufacturer was) and that his ability to accurately diagnose the failing component was hampered by a shortage of effective diagnostic tools and perhaps his own lack of hardware troubleshooting experience. But maybe I'm missing something here.
I wonder if he couldn't have used the memtest kernel parameter:
    memtest=    [KNL,X86,ARM,M68K,PPC,RISCV] Enable memtest
                Format: <integer>
                default : 0 <disable>
                Specifies the number of memtest passes to be
                performed. Each pass selects another test
                pattern from a given set of patterns. Memtest
                fills the memory with this pattern, validates
                memory contents and reserves bad memory
                regions that are detected.
DDR5 does have "on chip ECC", but that doesn't protect as much as full ECC. Bit errors can be introduced after the bits leave the chip. It could happen on the ram chip pins, in the DIMM, the DIMM connector, the motherboard, the CPU pins, or in the CPU.
So it's not just the error reporting, it's protecting against errors on the whole pipeline, not just inside the chip.
Historical note: PC RAM modules (SIMMs) in the 80's and 90's commonly had 9 bits per byte. The extra bit was parity, a simple form of error detection. Eventually, as a cost-saving measure, some modules faked the parity bit.
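The 9th-bit scheme is easy to illustrate: one extra bit per byte keeps the total number of ones even, so any single flipped bit is detectable (though never correctable, and an even number of flips slips through):

```python
# The old 9-bit scheme: store one parity bit per byte so the total
# number of ones is even. Detects any single-bit flip; corrects
# nothing, and an even number of flips passes unnoticed.

def parity_bit(byte: int) -> int:
    return bin(byte).count("1") & 1        # 1 iff the byte has odd ones

def check(byte: int, stored: int) -> bool:
    return parity_bit(byte) == stored

b = 0b1011_0010                        # four ones -> parity bit 0
p = parity_bit(b)
assert check(b, p)                     # intact byte passes
assert not check(b ^ 0b0000_0100, p)   # one flipped bit is caught
assert check(b ^ 0b0000_0110, p)       # ...but two flips slip through
print("parity demo ok")
```

The "fake parity" modules mentioned above skipped the storage entirely and just generated the expected bit on read, so every check trivially passed.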
I don't get it for SFPs either; it seems easy enough to get SFP modules coded for whatever NIC or switch you have, so the segmentation doesn't even seem to work.
Intel is a big company, so over its many decades of activity many Intel employees have done a lot of good for the progress of the computing industry. But there have also been too many Intel employees who made horrible decisions that cost millions of Intel customers an amount of money and time that is impossible to evaluate, because in such cases there is not enough information to disambiguate the causes of incidents between various kinds of hardware problems and software bugs.
The most damaging Intel decision was about ECC memory, but there were many other, less impactful ones, e.g. the various ugly workarounds for Microsoft's laziness in adding necessary features to Windows, such as System Management Mode or the Management Engine.
The problem with ECC memory was created by Intel in 1994-1995, when Intel split its top line of CPUs into two branches, Pentium (the second generation, at 90 or 100 MHz) and Pentium Pro.
For more than a decade, since the introduction of the IBM PC, all compatible personal computers had implemented memory error detection, even if it was possible to use memory modules without error detection if one did not care about the reliability of the computer.
With Pentium and Pentium Pro, Intel decided to introduce a market segmentation feature: it reserved the use of ECC memory for the "professional" Pentium Pro, while removing support for memory error detection from the Triton chipsets made for the Pentium CPUs (at that time, before AMD integrated the memory controller, the memory controller was still part of the external northbridge chip).
The successors of Pentium Pro were rebranded as "Xeon" and for a long time continued to be the only Intel CPUs with ECC support.
The so called "market segmentation", even if it is practiced by a large number of companies, is just a combination of fraud with blackmail, which should have been forbidden by law in most cases.
To introduce market segmentation, a company takes advantage of the fact that the majority of its customers are naive and they are not able to evaluate correctly the quality of a product that they purchase.
The company then uses this fact to extract much more money from the fewer customers who actually know how to evaluate quality. It convinces the naive customers that a lower quality is good enough for them, then lowers by various means the quality of the products sold at a decent price, in order to be able to charge a premium to the quality-aware customers, who are forced to pay because they have no alternative: the products at the right price do not have the right quality.
This scheme would not work in a competitive market, but when there are few competitors they usually follow the example of the first company which did that and they introduce the same market segmentation policy, because this will increase the profits for all.
Now, because AMD did not disable ECC in Ryzens (even if AMD provided much worse software support for the feature than Intel, in the EDAC device drivers, at least until recently), Intel has eventually been forced to enable ECC in many models of Alder Lake and Raptor Lake.
Nevertheless, after many decades of lack of support in consumer CPUs, there is a lot of inertia to overcome in the availability of ECC.
Even if now it is easy to find Intel desktop CPUs with ECC support, the ECC support on motherboards requires the special workstation chipsets, so the socket LGA 1700 motherboards with ECC support are hard to find and they are either expensive or with underwhelming features.
All Intel CPUs for mobile applications (the U, P and H series) continue to lack ECC support. Only the HX series for laptops, which are desktop chips packaged in BGAs, have ECC support.
Previously, even AMD had implemented a market segmentation by disabling ECC in their laptop CPUs. That has changed in the Ryzen 6000 Rembrandt series, which have ECC support. However the ECC support remains theoretical, because until now no laptop manufacturer has introduced any laptop with an AMD mobile CPU and with ECC memory, so there is no competition yet for the Intel mobile workstations.
Yes. It should have been the default all along, but Intel has been using a strategy of market segmentation for many decades. This has made ECC rare in home computers (even in servers it isn’t ubiquitous), and more expensive than it needs to be. You have to pay extra for the CPU that supports it (even if the Xeon chip you buy is otherwise identical to the i7 you could have bought). The motherboard costs extra too, naturally. The memory itself has to have a ninth memory chip on it so you expect it to cost a little more, but usually it costs a lot more and isn’t made to the same specs. You end up with the sad choice of overclocked ram or safe ram in your gaming machine, and most people go with fast.
Doesn't have to be default, just nice to have it as an option.
Currently, it is difficult to buy a NAS (where bit flips are arguably more important to avoid) with ECC memory, unless custom-building and carefully selecting parts.
- The amount of memory that leaves the factory that has errors is a low percentage, not sure what it is but we can all agree it's low.
- You can run memtest when you first install memory for several hours or so to be 99.99% sure your memory is good
- There are outside influences and EXTREMELY RARE cases (cosmic radiation etc.) where you may still get an error. If you had ECC it would protect you.
- If you start getting errors (as in more than a one-off cosmic ray flip) later, you can test the memory again and determine it's damaged
- Most memory you buy for builds has a lifetime warranty
- ECC costs more and reduces performance by some amount
So the only danger of not having ECC is exceedingly rare memory errors. If those rare instances mattered that much then you should get ECC but for everyone else how is it worth it?
The argument in this post is that he wouldn't have had to go through this since the ECC would have covered the error. I just don't see the value of adding this safety system.
The problem with faulty memory (that could go bad after purchasing) is not knowing about it. ECC doesn't always protect you (it can only fix one bit flip, not more), but at least you will know the memory is bad. You will not keep working with data that is being corrupted.
Yes, early warnings about bad memory modules are probably the most useful ECC feature.
I have always used only ECC memory in any computer larger than an Intel NUC.
When the memory modules were new, they always had very low rates of correctable errors, e.g. one error after 3 to 6 months of continuous operation.
Nevertheless, I had several cases when a certain memory module started to have very frequent errors after several years of working fine. Due to ECC, I was able to identify it and replace it, before causing irreparable data corruption in files.
Moreover, in one case I had a laptop with ECC memory that seems to have used some poor-quality SODIMM sockets. After not being used for several months (which made it more sensitive to air humidity, since it wasn't kept warm by use), the contacts of the sockets appear to have oxidized, so when I used the laptop again I saw very frequent memory errors.
Eventually, after some time wasted on investigation, I scrubbed the contacts of the SODIMM sockets and reseated the memory modules, and the errors disappeared.
ECC is somewhat less necessary in those laptops and small computers that have soldered DRAM chips, both because the total amount of RAM is small (the error frequency is proportional with the total amount of RAM) and because there are no sockets and long PCB traces (which are susceptible to electrical noise) between CPU and RAM.
At least for all computers that have socketed memory, there should have been a customer protection law forbidding the sale of such computers without ECC memory, because it is not acceptable to use a computer that may produce at any time undetectable errors.
I get what you're saying, but nobody is going to run three trials of memtest86 on their machine and chalk the data corruption up to cosmic radiation. When memory dies are broken, they're broken. It's pretty simple to ascertain that, if you suspect your memory has gone bad.
You're assuming you have any decent, prompt way of detecting emergent errors before it causes damage. There is none. Whether the corruption hits a pointer (likely to cause a crash) or a tax document is effectively completely random.
That's assuming the thing you are currently doing is what crashes, and not instead creating data (a document, a compiled binary, a financial transaction, a crypto key) which is silently corrupted and the corruption is only detected way later. C.f. xerox photocopier bug (not saying that was a memory bug, example is an insidious bug).
A lot of damage can be done before you get an obvious memory-induced crash. Software crashes all the time for reasons other than memory issues, even someone with deep technical knowledge wouldn't necessarily jump to that conclusion. And everyone else would A) probably never even think of the possibility, B) not know how to test their memory, C) probably buy a whole new laptop instead
I've had memory errors make many directory entries disappear in one directory on XFS, noticed only months later. When you write out bad data and the computer says everything is fine, you won't necessarily be able to get your data back.
So let me understand this. Your data is important and you don't want even rare errors to occur, so you want ECC memory. However you use a non-journaling file system which is more likely to lose data?
"So the only danger of not having ECC is exceedingly rare memory errors. If those rare instances mattered that much then you should get ECC but for everyone else how is it worth it?"
The fact that DDR5 has built-in error correction for data once it has arrived at the memory chip should tell the story. It's still not full ECC, as that requires cooperation with the memory controller, yet the price in materials is already there. But yes, without any shadow of a doubt: the first and most important property of any program is correctness; optimizations come afterwards.
The internal ECC of DDR5 is not good enough. Its only purpose is to restore the reliability of DDR5 to the level of DDR4, despite having smaller cells and higher throughput.
The only useful ECC for the user is the one computed in the memory controller inside the CPU, stored in the DRAM and verified after returning to the memory controller.
This allows the CPU to be aware of any error and it also corrects or detects the errors caused by electrical noise on the PCB traces or by bad memory sockets, not only those caused by bit flips inside the memory cells.
So using your same rhetorical style, let me respond:
- While the error rate of memory is "low" (however you define that to mean), it is not zero, so the risk of memory errors persists.
- A machine without ECC memory has no reliable way to detect memory errors without some type of external diagnostic.
- While a memory test can (hopefully) detect faulty memory, it takes the computer out of operation for however long the test is run, and even then, it's simply a point-in-time test. It cannot detect memory errors that happened in the past, or that will happen in the future once the test has ended.
- ECC provides a mechanism to reliably detect memory errors as they occur, continuously, while the machine is running and performing useful work.
- While many memory manufacturers offer lifetime warranties on their memory modules, they cannot possibly warrant against data corruption and malfunctions caused by memory errors, which can have a much higher cost to the user than the modules themselves (and would almost certainly be more than the BOM cost difference between ECC and non-ECC modules).
- ECC has been cited by Microsoft and Linus Torvalds as desirable and something that should be broadly adopted, and ECC is commonly found in a wide variety of memory products (e.g., caches and solid state storage), with the glaring exception of main memory on consumer PC hardware.
- While ECC does cost more (all else being equal), the side-band ECC being discussed is the same effective speed as non-ECC memory. The overhead of ECC is canceled out by the ECC DIMM's extra capacity and bandwidth relative to the non-ECC DIMM.
What do you need me to provide a source for? I normally don't cite for common knowledge and generally accepted facts. However what I consider common knowledge in this community could be wrong.
But the value is not having to go through what he did, right?
And the whole points made above are pretty much the exact political points Linus is referencing. It’s clear in his opinion that ECC should just be normal, or at least not so elite.
It's the policy of CPU manufacturers to separate ECC/Non-ECC Support.
For example, 12th gen intel CPUs that already exist, can now support ECC because a new chipset enables it. It was "policy" to not release ECC support for a period.
Since desktop ECC gets around this by having physically more RAM ICs (usually 9 instead of 8, for example), what is the impediment from having a similar solution to Nvidia? I'd readily take a hit to memory capacity* and performance in exchange for ECC.
Why can't the memory controller already do this?
I should note, I'm mostly thinking of my NAS. I know ZFS can be run without ECC and some consumer solutions do. However, it seems ZFS should be run with ECC. I've already experienced observable bitrot with older images and video files, I'd rather not let it progress.
[*] in this case, 12.5% if we follow typical desktop ECC allocations
[0] https://www.nvidia.com/content/Control-Panel-Help/vLatest/en...
[1] https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/ind...