Once again, I pine for ECC memory on my laptop. I know you can get ECC SODIMMs; I got 16GB worth for a Supermicro ITX motherboard. And while the paper talks about multi-bit errors getting through ECC (which is certainly possible with enough flips), single flips causing alerts and double flips causing halts would really get your attention that something bad was happening, as opposed to silently sitting there while my memory is shredded.
Intel cripples their "consumer grade" processors by locking out the ECC DRAM interface. This forces server vendors to buy "server grade" processors. There, you get all the good error-correction stuff.[1] The fraction of die space devoted to these features is small; there's no reason they couldn't be provided on all x86 family CPUs. It's purely a market positioning thing.
AMD leaves the ECC hardware enabled on most of their parts.
Not just laptops! SQL Azure doesn't use ECC memory[1], which might suggest the rest of the Azure platform doesn't, either. I haven't found citations for AWS using ECC, so perhaps they don't. Maybe this could be used to break out of VMs on those platforms.
That may just be speculation, though: the memory densities used in cloud hosts are rarely achievable with non-ECC memory. I've found that when purchasing RAM for systems, it's fairly common for "server" and "multi-rank" to imply ECC, although I've had to check product sheets to verify that.
Now, I could be wrong, but it would be quite a surprise to find out that any of the cloud services are not using ECC. I suspect they all are, but they don't advertise it.
Well I mean I asked in that thread, and MS replied stating they simply do not need ECC. I quoted a line from Google's study on memory errors, and Azure replied: "In our scenario, we have not seen bit error rates that align with the quote you mention".
I suppose I could just spin up a 56GB instance and let it run a memtest for a week and see, right?
It requires more expensive, compatible DRAM, right? Knowing that, it shouldn't really be super surprising. Enabling it on the die is just one piece of the equation.
The problem is that if you say "fast, cheap, reliable, pick two" people will pick "fast" and "cheap". Even people who might ordinarily worry about whether their workstation is silently corrupting their data.
i3s are dual-core parts that support ECC, aimed at NAS boxes and the like (hence the mini-ITX boards with ECC SODIMMs). But you still need an explicit workstation chipset/mobo to support it.
i5 and i7 have no ECC because there are equivalent Xeons. At this level "Xeon" is just branding implying features like ECC, and it doesn't automatically mean expensive - a single-socket non-E (LGA1150) Xeon build costs only a little more (~20%) than a consumer CPU/mobo/RAM combo (although there are better sales on the consumer stuff).
It used to be the case that all AMD CPUs supported ECC back in the AM2/AM3 era, though apparently this is no longer true. Even then, not all motherboards bothered to route the extra traces required for it and include BIOS support.
All AMD FX and Opteron CPUs support ECC. The APUs do not. For the CPUs that support ECC, it is still up to the motherboard manufacturer to support it on their end as well.
Depends on the laptop. I have bought two laptops in the past 14 months, a $2000 ThinkPad and a $600 Acer. Both came with 4GB soldered on and a single free SODIMM.
(On a different note: the ThinkPad maxes out at 8GB and the Acer at 12GB, whereas previous generations went up to 16GB at least. Intel intentionally nerfed Haswell and newer core i's, presumably to push their Xeons on more people)
They're supposed to be priced around $350 or so, at least that's what I see from last year. How is that too expensive? An X series ThinkPad is like $2300+ with a good config. Adding another few hundred so I can have a decent amount of RAM sounds like a no-brainer.
(Or, Lenovo could put IBM engineering in charge and figure out how to get 2 slots back on the X series.)
Be careful... 5th-gen Intel CPUs can NOT run with standard 16GB modules. There is a technical issue that causes instability with 16GB modules unless the modules are specially made to work around it. The Intelligent Memory modules do that.
There was an older paper discussing using various methods of fault injection (heat, voltage changes, etc) to attack Java smart cards, essentially destroying the type system guarantees and thus opening up an attack surface: "The Sorcerer’s Apprentice Guide to Fault Attacks", https://eprint.iacr.org/2004/100.pdf
Fault injection is also how older Dish Network and DirecTV smart cards were hacked - there used to be a cottage industry selling "voltage glitchers" to reprogram Dish Network smart cards with the keys for additional programming tiers.
I believe some pay TV smartcard hacks also made use of clock glitching, basically sending a shorter-than-usual clock pulse that means some of the internal signals don't make it to their destinations on time. The pay TV hacking industry had some pretty clever tricks a decade or two ago.
From memory, I think one card had an internal startup check that looked at whether its EPROM had been marked by the "Black Sunday" countermeasure and, if it had, hung itself.
The hackers, having a ROM dump and knowing how many clock cycles each instruction took the CPU, knew that this internal check happened at roughly clock cycle 525.
Knowing that the instruction was a "branch if equal" (I think), and that the instruction took 12 cycles, they figured out which of those 12 cycles made the branch take effect, worked out the precise moment to glitch (whether via voltage or a single too-short clock cycle), and caused the CPU to skip updating the instruction pointer and carry on through its ROM code as if the check had passed.
Within a month or two, hundreds of thousands of receivers had a man-in-the-middle device just to glitch reprogrammed cards every time they were started up.
Apparently the North American provider had tested the same countermeasure in their South American division, so the North Americans had advance notice of what they had to do to get back in action.
I recall that, for another system, a small memory chip was required for a pre-existing man-in-the-middle card, and every electronics supplier went out of stock overnight. Digikey sold out of 50k units in a single night.
Other interesting lessons discovered:
1. You could run a >100', >100kbps RS-232 link for over a year without issue. Proper wiring and RS-232 length limitations be damned.
2. You could wire an RS-232 link (-12V and +12V) directly to a TTL input for over a year without issue.
The people exceeding the defined limitations of these things apparently knew better than the limitations did.
Same with the JTAGulator units. 10+ years ago, countermeasures would reprogram the very-difficult-to-desolder TSOP EEPROM on the receiver.
The manufacturers seemed to use an externally accessible JTAG access point to program the receivers in the factory, which was a convenient boon to hackers, who didn't even need a screwdriver to reprogram the units through their PCs' parallel ports.
I noticed the security implications of "memory that doesn't always behave like memory" when that paper came out a few months ago and was discussed briefly on HN:
You know, this makes me wonder. If a car manufacturer or a toy company made a product that was found to be unsafe, there would be a recall. If hardware manufacturers make a product that is insecure, will there be a recall? Unfortunately, I suspect that this is a case where the law hasn't caught up with technology.
A few years ago I built a home PC for myself and bought an i5 Sandy Bridge processor with an appropriate motherboard. A few months later it was found that a huge batch of the SATA controllers shipped on those motherboards was faulty[0]. Back then, Intel issued a recall for all faulty motherboards and shipped out new ones; I just contacted the retailer where I purchased my board, sent it in for RMA, and got a new one (different model, but that's another story). All of this for free.
Intel has a good history of recalls and replacements of their motherboards and processors. The Pentium FDIV bug comes to mind immediately, as does the recall of motherboards with the faulty 820-series memory translation hub.
Actually, Intel's behavior with the FDIV bug was originally anything but good. They downplayed the bug and refused to recall the chips. Then they started offering replacements if you could prove that the bug affected you.
It wasn't until the whole thing turned into a giant PR disaster that they started a generous exchange program. That whole affair is basically the reason that Intel is much more forthcoming with errata these days.
Of the five vendors they mentioned, the only one that did not have vulnerable memory was "DRAM vendor D", which also had only one entry in the table. Given the nature of the problem here, odds strike me as near-1 that "DRAM vendor D" has shipped RAM with this problem.
For that matter, the "no"s on that table really only prove that the exact stick they tested with the exact memory locations they tested did not exhibit detectable bit flips. It doesn't prove that those sticks are "safe", let alone that the product line they come from is safe.
So, basically, what's vulnerable? To a first approximation, everything. What would happen if we tried to recall every bit of DRAM produced in the past X years (where X is also unknown)? Well... you'd bankrupt the industry is what you'd do. That's not a very useful outcome.
In fact this sort of thing happens all the time: new safety tech is constantly developed for cars, but you can't go back and sue the auto companies for not including it before it was invented or before the need for it was discovered [1]. This seems more like that problem than an actual case of negligence or "defects" being produced.
[1]: Well... more or less. I know of cases where this was successfully done, though they tend to get overturned on appeal. Run with me here.
It's not just insecure; this is memory that doesn't work 100% the way memory should.
I use MemTest86+ on every stick of DRAM I buy - if there's even a single error, it goes back as defective. The fact that this memory seems to work for most access patterns doesn't excuse the fact that it is completely broken for others, because good memory should be able to store any data and maintain its integrity for any access pattern.
Unfortunately even MemTest86+ is not exhaustive, as I found out while troubleshooting a very strange issue: a specific file in a specific archive would unpack with corrupted bits (and an "archive damaged" message) on a coworker's computer, but would unpack fine on half a dozen other machines. A hash of the archive file matched across machines, so HDD-based corruption was ruled out. His machine passed an overnight run of MemTest86+ perfectly, and AFAIK no other archive would unpack with corruption. He reported never getting any crashes - and yet that one file in that archive would fail to unpack correctly.
It would always corrupt in the same strange way. On a whim, I decided to swap the RAM out, and the problem went away. Even the "bad" stick seemed to work fine in other machines with the same model of CPU and mobo, running the same OS and unpacking the same archive, but in his extremely specific combination of hardware and software it would always fail. That experience taught me that bad RAM can be extremely difficult to troubleshoot.
This isn't like other storage technologies, e.g. SSDs, whose finite lifespan and sensitivity to access patterns are well documented. It's a case of claiming to sell memory while giving consumers a close approximation of memory that completely breaks in some situations. I think it needs to be treated like the FDIV bug.
In the EU, products have to be fit for purpose. You could then argue that if you bought (for example) a server for hosting virtual machines, then the RAM was not fit for purpose because the flaw made it incapable of isolating separate VMs.
Modern medical technology relies heavily on computers and software. Take an infusion pump, for example: controlled by a microcontroller, running software. Or insulin pumps: some vendors are actually considering adding Bluetooth to insulin pumps, so that patients using such a pump can check its status on their smartphone (or on the upcoming smartwatches). You can also adjust the infusion rate of an insulin pump to account for ingested sugar. Overdosing on insulin can send a person into shock and kill them.
> Modern medical technology relies heavily on computers and software.
Which is why medical devices should all have ECC memory. And for that matter physical separation between any processor that might run attacker-controlled code and the processor responsible for That Which Must Not Fail.
Product defects like this are foreseeable. If bad memory can cause a medical device to kill someone, the party at fault is the one who made a medical device with so little redundancy and error correction that bad memory could cause it to kill someone.
It's an interesting attack vector, recently covered in a Person of Interest episode in which an abusive husband is killed by having his insulin pump wirelessly hacked to make it overdose him. While that's fiction, I'm pretty sure this kind of thing will happen (after all, no one writes bug-free software, and even if someone did, you could always steal the keys...) - and initially it will be very hard to detect because of its uncommon nature.
Laptops are particularly at risk for stuff like this: components are more densely packed, may use smaller process sizes, and have less robust power supplies, all of which may be factors in keeping bits in adjacent rows stable.
That may be the reason why the desktops mentioned are less sensitive, they'll use full size memory modules and will have beefy power supplies.
It'd be interesting to repeat the experiments with the laptops running off their internal battery.
Also, lower refresh rates on DRAM mean less power consumption (and it's an easy change in the BIOS, independent of the OS, so clearly attractive to laptop makers), but also more exposure to this issue.
There's very little information on time scales. In one case they speak of 5 minutes vs. 40 minutes (both might be acceptable for an exploit). There's also no information about how long it took to get a bit flip for each entry in their per-machine table.
And why not name any hardware vendors? I'm guessing they expect people to use the tool they provided and draw their own conclusions, but I don't understand why they'd treat hardware vendors differently from software vendors.
At a guess, to avoid blaming laptop manufacturers and getting sued if it turns out that something else was at fault? The DRAM itself might be the culprit (it probably is), and laptops of a given brand might come with RAM from different manufacturers.
I understood the litigation risk. In an integrated system it's always someone else's fault (DRAM, BIOS, CPU, laptop vendor). IMHO the last integrator (the one selling you the goods) is always the culprit.
Why would they fear litigation from hardware manufacturers more than from software vendors? Especially at a company as big as Google?
They also don't want to say "DellappLenoHP" laptops could not be attacked and turn out to be wrong. Or maybe they're right but only with factory 2GB modules used between May '11 and July '13.
Way too many variables to make any claim that is ethically defensible.
They could specify the detailed system configuration with the CPU, chipset, and DRAM part numbers (including date codes) so others can compare. It's much better than leaving things in the dark completely.
A vulnerability in the Windows kernel is going to exist in all Windows kernels of the same version. One laptop with bad RAM doesn't mean all similar models have bad RAM.
If they did so... the fallout would be interesting. Does anyone know what proportion of modern memory has this flaw? Would it result in tens of thousands of customers returning stick after stick of DRAM until they were able to get a reliable one?
What's the difference between memtest86 and memtest86+?
OK, from WP [1]: "Memtest86 was developed by Chris Brady. After Memtest86 remained at v3.0 (2002 release) for two years, the Memtest86+ fork was created by Samuel Demeulemeester to add support for newer CPUs and chipsets. As of November 2013 the latest version of Memtest86+ is 5.01."
And the original has become a commercial program by PassMark. So I think at this point if anyone is talking about memtest86, they're likely referring to the still open-source '+' version.
On my desktop (DH87RL / i7-4770 / 2x8GB Crucial DDR3L-1600), rowhammer_test reported errors after ~20 iterations (less than a minute).
I went into the BIOS and tried lowering the tREFI value from 6300 to 3150 (not sure what the units are). So far, it's gone 1000 iterations with no problems detected.
Edit: Actually, the units are probably multiples of the cycle time, just like CAS latency. So, for DDR3-1600, that would mean 6300 × 1.25 ns ≈ 7.9 μs, and 3150 × 1.25 ns ≈ 3.9 μs.
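For anyone who wants to sanity-check that conversion, here's a trivial sketch (the assumption that the BIOS value is in multiples of tCK is exactly the guess above; check the board manual to be sure):

    /* Average refresh interval implied by a tREFI setting, assuming tREFI
     * is expressed in multiples of tCK (1.25 ns for DDR3-1600). */
    #include <stdio.h>

    int main(void) {
        const double tck_ns = 1.25;          /* DDR3-1600 clock period */
        const int trefi[] = {6300, 3150};    /* the two BIOS settings tried */

        for (int i = 0; i < 2; i++)
            printf("tREFI = %d tCK -> %.2f us between refresh commands\n",
                   trefi[i], trefi[i] * tck_ns / 1000.0);
        return 0;
    }

which works out to ~7.9 μs and ~3.9 μs between refresh commands, i.e. roughly halving the window an attacker has to hammer a row between its refreshes.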
I tried it and it reported an error in under a second. I had to reboot because gcc started making bash crash, it seems. Then I saw the README (duh!):
Be careful not to run this test on machines that contain important
data. On machines that are susceptible to the rowhammer problem, this
test could cause bit flips that crash the machine, or worse, cause bit
flips in data that gets written back to disc.
**Warning #2:** If you find that a computer is susceptible to the
rowhammer problem, you may want to avoid using it as a multi-user
system. Bit flips caused by row hammering breach the CPU's memory
protection. On a machine that is susceptible to the rowhammer
problem, one process can corrupt pages used by other processes or by
the kernel.
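For the curious, the core of what the test does is roughly this - a sketch of the access pattern, not the tool's actual code, and it assumes x86 with clflush available; the real tool also has to find address pairs that land in different rows of the same bank:

    /* Hammer two addresses: read them over and over, flushing them from the
     * cache each time so every read goes all the way to DRAM and re-opens
     * the row. Bit flips, if any, appear in physically adjacent rows. */
    #include <emmintrin.h>   /* _mm_clflush */
    #include <stdint.h>

    static void hammer(volatile uint8_t *a, volatile uint8_t *b, long reads)
    {
        for (long i = 0; i < reads; i++) {
            (void)*a;                        /* activate the row holding a */
            (void)*b;                        /* activate the row holding b */
            _mm_clflush((const void *)a);    /* evict so the next read hits DRAM */
            _mm_clflush((const void *)b);
        }
    }

The tool then checks memory it previously filled with a known pattern for flipped bits - which is exactly why the README warns about corrupting data that might get written back to disk.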
For example, SECDED (single error-correction, double error-detection) can correct only a single-bit error within a 64-bit word. If a word contains two victims, however, SECDED cannot correct the resulting double-bit error. And for three or more victims, SECDED cannot even detect the multi-bit error, leading to silent data corruption.
Technically, SECDED cannot reliably detect errors involving three or more bits: they might land on a valid codeword, and if they don't, they might be mistaken for (and miscorrected as) a single-bit error, or flagged as a double-bit error, or something else.
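If you want to see that failure mode concretely, here's a toy SECDED - an extended Hamming(8,4) code, nothing like the (72,64) code real ECC DIMMs use, but the same principle: one flip is corrected, two are detected, three can be "corrected" onto a different valid codeword, i.e. silent corruption.

    /* Toy SECDED: extended Hamming(8,4). Layout: bit 0 = overall parity,
     * bits 1,2,4 = Hamming parity, bits 3,5,6,7 = the four data bits. */
    #include <stdio.h>
    #include <stdint.h>

    static int bit(uint8_t w, int i) { return (w >> i) & 1; }

    static uint8_t encode(uint8_t d)
    {
        int d0 = d & 1, d1 = (d >> 1) & 1, d2 = (d >> 2) & 1, d3 = (d >> 3) & 1;
        uint8_t w = (uint8_t)((d0 << 3) | (d1 << 5) | (d2 << 6) | (d3 << 7));
        w |= (uint8_t)((d0 ^ d1 ^ d3) << 1);   /* p1 covers bits 3,5,7 */
        w |= (uint8_t)((d0 ^ d2 ^ d3) << 2);   /* p2 covers bits 3,6,7 */
        w |= (uint8_t)((d1 ^ d2 ^ d3) << 4);   /* p4 covers bits 5,6,7 */
        int p = 0;
        for (int i = 1; i < 8; i++) p ^= bit(w, i);
        return (uint8_t)(w | p);               /* bit 0 = overall parity */
    }

    /* 0 = clean, 1 = "corrected" (silently wrong if 3 bits flipped),
     * 2 = detected as uncorrectable */
    static int decode(uint8_t *w)
    {
        int s = (bit(*w,1) ^ bit(*w,3) ^ bit(*w,5) ^ bit(*w,7))
              | (bit(*w,2) ^ bit(*w,3) ^ bit(*w,6) ^ bit(*w,7)) << 1
              | (bit(*w,4) ^ bit(*w,5) ^ bit(*w,6) ^ bit(*w,7)) << 2;
        int overall = 0;
        for (int i = 0; i < 8; i++) overall ^= bit(*w, i);

        if (s == 0 && overall == 0) return 0;  /* no error                    */
        if (overall == 1) {                    /* odd number of flips         */
            *w ^= (uint8_t)(1u << s);          /* assume single-bit, "fix" it */
            return 1;
        }
        return 2;                              /* even (>=2) flips: detected  */
    }

    int main(void)
    {
        uint8_t clean = encode(0x9);
        uint8_t word  = clean ^ 0x2C;          /* flip three bits: 2, 3, 5 */
        int r = decode(&word);
        printf("3-bit flip: %s, result %s the original codeword\n",
               r == 1 ? "\"corrected\"" : "flagged uncorrectable",
               word == clean ? "matches" : "does NOT match");
        return 0;
    }

The three-bit flip gets "corrected" onto a different valid codeword and sails through; flip only two bits (say clean ^ 0x28) and the same decoder reports an uncorrectable error instead, which is the detected-but-fatal case.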
Yeah, ECC is going to make exploiting this reliably a lot harder - you'd need to flip three or more bits in the right combination, without first hitting a combination of bits that'd be detected as an uncorrectable error. Google's report suggests they haven't even been able to cause uncorrectable two-bit errors yet, let alone undetectable three-bit ones.
"We also tested some desktop machines, but did not see any bit flips on those. That could be because they were all relatively high-end machines with ECC memory. The ECC could be hiding bit flips."
Maybe, though as they say it'd potentially be possible to force cache evictions and attack it that way. I was looking at the associativity of various CPU caches with a vague eye to trying this in JavaScript a few days back, and in theory it shouldn't take many reads to evict a cache line, so long as they're from the right addresses.
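For a rough sense of what clflush-free eviction looks like, here's a sketch. The cache geometry (16-way, 2 MiB per L3 slice, 64-byte lines) is assumed, and real Intel L3s also hash addresses across slices, so building a working eviction set - especially from JavaScript, where you only see array indices - takes considerably more effort than this:

    /* Evict a target cache line by touching more lines that map to the same
     * set than the cache has ways. Ignores L3 slice hashing; buf must extend
     * a couple of MiB past target_off. */
    #include <stddef.h>
    #include <stdint.h>

    #define LINE 64
    #define WAYS 16
    #define SETS (2 * 1024 * 1024 / (LINE * WAYS))  /* sets per slice, assumed */

    static void evict(uint8_t *buf, size_t target_off)
    {
        size_t stride = (size_t)SETS * LINE;   /* same set index every stride */
        for (int i = 1; i <= WAYS + 1; i++)
            (void)*(volatile uint8_t *)(buf + target_off + (size_t)i * stride);
    }

So the read count per eviction is small - the hard part is finding addresses that actually collide in the same set and slice.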
Row access counters in the memory controller would solve this problem: too many activations of a row between refresh cycles -> force an early refresh of that row and the potentially affected neighbouring rows.
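A toy model of that idea (not how any real controller implements it - the row count, threshold, and hook names here are all made up):

    /* Count activations per row within a refresh window; when a row gets
     * hammered past a threshold, refresh its physical neighbours early,
     * since those are the rows whose cells leak. */
    #include <stdint.h>
    #include <string.h>

    #define ROWS      65536
    #define THRESHOLD 50000   /* assumed safe activation count per window */

    static uint32_t acts[ROWS];

    /* hypothetical hook: the controller calls this on every row activation */
    void on_activate(uint32_t row, void (*refresh_row)(uint32_t))
    {
        if (++acts[row] >= THRESHOLD) {
            if (row > 0)        refresh_row(row - 1);
            if (row < ROWS - 1) refresh_row(row + 1);
            acts[row] = 0;
        }
    }

    /* hypothetical hook: called once per refresh window (~64 ms) */
    void end_of_window(void)
    {
        memset(acts, 0, sizeof acts);
    }

The cost is a counter per row (or some probabilistic approximation of one), which is presumably part of why it isn't already standard.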
I haven't seen anything after 375 iterations (600s). So I may still be exploitable, but that means you'd have to keep something running at 100% CPU for > 600s and somehow have me not notice the laptop fans going crazy.
How long did you test it? The tests they did ran fairly long; possibly you'd have to run this for days to really be able to state that a particular machine/RAM combination is not vulnerable.
Download the tool and try it yourself? It supports Mac OS and there is a mailing list to report affected machines (nothing seems to be posted there yet).
Is a memory error actually an exploit? If so then are the unwanted changes that occur with no deliberate action an example of the computer cracking itself?
I think there is a useful distinction between a fault/error and an exploit. A fault is a break from the "desired" or "expected" semantics of a system, while an exploit is an algorithm to predictably utilize a fault (or faults) to access unexpected behaviours in that system. I.e., a buffer overflow is a fault in a program (breaking the expectation that a buffer's contents will remain within a certain bound), while an exploit targeting that overflow will likely allow running arbitrary code in a program not designed to do so.
So, as I'd put it: the memory error can be leveraged in an exploit.
Errors can be used as part or all of an exploit. Exploiting a system requires that ethereal value of "intent", and I don't think anyone would (currently) argue that computers can have intent. Without that intent, it's just an error.