I've been saying this from the start: the well of issues is infinitely deep as soon as you decide that multiple tenants running on the same physical hardware inferring something about another is a vulnerability. I assert, but cannot rigorously prove, that it is not possible to design a CPU such that execution of arbitrary instructions has no observable side-effects, especially if the CPU is speculating.
I don't know what that spells for cloud hosting providers - maybe they have to buy a lot more CPUs so every client can have their own, or commission a special "shared" SKU of CPU that doesn't have any speculative execution - but I know for me, if I have untrusted code running on my CPU, I've already lost. At that point, I couldn't care less about information leakage between threads.
We're going to wind up undoing the last 20 years of performance gains in the name of 'security', and it scares me.
> if I have untrusted code running on my CPU, I've already lost
Don’t forget about JavaScript, a common way for people to run untrusted code on their computers. Not all micro-architectural data sampling attacks are exploitable from JavaScript, but some are.
Yeah, JS is the only hairy part. I considered mentioning it, since I know it was going to come up. But luckily, all I've seen so far are basic demos (like leaky.page) that read data from a carefully-crafted array that the page itself populated. I've yet to be convinced that you could realistically exfiltrate meaningful data at any sort of scale with in-browser JS, especially now that more blatant bugs like Meltdown are fixed.
If anyone can show a proof-of-concept ("this page grabs your password manager extension's data") I'll eat my words. But I feel confident that most of these issues are purely academic and, while interesting, serve more to provide content for PhD theses than represent urgent hazards on the web.
Indeed, I've been feeling indifferent about all these timing side channels ever since the very first ones (Spectre/Meltdown). The PoCs have not been particularly convincing to me, given that they are extremely contrived, rely on knowing the exact details of the system being exploited to such an extent that someone with those details would be better off with other ways in, and assume those details haven't changed at all during the time required to do the attack --- the nature of these side channels is such that even the smallest change in environment can completely change the results.
In other words, if I choose a process on my system at random, and dump a few dozen bytes from it, I can technically claim to have leaked some data; but the use of that data to an attacker likely depends strongly on factors which are outside of the attacker's control. It's somewhat like finding a (real-world) key on the ground: you theoretically now have access to something you shouldn't have, but you have next to no idea what that something is.
> But I feel confident that these issues are purely academic and, while interesting, serve more to provide content for PhD theses than represent urgent hazards on the web.
They also provide content for sensationalist clickbait articles and fuel the paranoia that drives society towards authoritarianism and furthers the war on general-purpose-computing, which IMHO is a much bigger issue to worry about.
> Yeah, JS is the only hairy part. I considered mentioning it, since I know it was going to come up. But luckily, all I've seen so far are basic demos (like leaky.page) that read data from a carefully-crafted array that the page itself populated.
A PoC alone says very little. If I were head of a nation-state APT, I'd look into exploiting this, because the attack surface of JS is huge. I'd only use it in targeted attacks, for example against the Microsoft Azure team as outlined in Darknet Diaries #78.
If they’re targeting Joe and Jane Average, the long history of government tech procurement failures means I expect them to fail — fail dangerously, but fail.
How would statically typed WASM open an even wider hole? I'm assuming you mean that the size of Wasm's hole is larger than JS's, not that their combined holes are larger than either one.
Wasm has more control over time and memory access than JS does. From a capabilities model, it is more secure, but from a threat model due to side channels, Wasm is a more effective tool than JS.
Be afraid. Be very afraid; the hackers are reading this too, as are the malevolent nation-states hell-bent on hacking.
I cannot show you a proof of concept right now, but I am betting that within a year, maybe even as quickly as six months, you will see this in the wild.
It sounds like a dream, but going back towards interpreted JS instead of JIT may finally stem the insanity of bloat that JS has evolved in an environment of increasingly fast implementations.
The problem of Javascript bloat doesn't have a technical solution.
Javascript bloat exists because of a social problem: the guy who fixes the corporate webpage's javascripts is called a "webdesigner", and "webdesigners" are the lowest rung on the corporate IT ladder, maybe only a bit above first-tier techsupport.
If you want to make some sort of career you need to upgrade from "webdesigner" to "frontend developer", and that means cryptic, incomprehensible and pointless "frontend frameworks".
It provides no value to the business or users, but management puts up with it because it fixes the problem of employee churn. (Frontend positions are a big pain in the ass.)
I posit something even simpler. JavaScript bloat exists because it's easy to learn and put something real on a screen for a newb, and it's a pleasure to write in. Writing these frameworks/libraries/websites/whatevers is literally its own reward, and the barrier to sharing tools is low. Couple that with enthusiastic developers across the entire spectrum of naivete and experience finding new tools fun and exciting to develop and use, and you have an ecosystem with endemic bitrot. It feels absurd to have to say this, but the people who are a part of this ecosystem and contribute to the bloat do not despise the ecosystem the way HN people seem to. They don't see it as broken. It's not going away.
>Javascript bloat exists because of a social problem
JS bloat exists because the HTML Working Group, along with the whole industry, believes JS is the solution to everything. They believe everything on the web should be a web app, and they completely neglect web page development. It was only in the past two years that we started seeing discussions about reversing course.
But JavaScript bloat is allowed to stay (by product managers, middle managers, UX designers, etc) because the website is still fast. If the JS bloat actually caused the site to become too slow on fast machines, people with power to change stuff would demand change.
People with the power to change stuff thought Java applets were a good idea in 1995, when most computers in wide circulation could barely run the JVM at any acceptable speed.
Never trust the tech industry to make optimal decisions, you are only in for a bad time.
They can set metrics on quality, but those can be easily gamed. (E.g. measuring average TTFB for a site instead of the real wall time to show visible content for the user.)
Management has the power to say that something isn't good enough and make it a priority. They also have the power to hire employees or consultants if the current team isn't capable of doing it.
Project management _definitely_ has the power to dedicate time to fixing performance issues.
Metrics can be gamed, but certain metrics - such as time to interactive, and time to fully loaded - are fairly well in line with what users actually care about. Even if they're gamed, a project manager can say, "This still feels slow to use. Dedicate the next (sprint|cycle|month|whatever) to performance work."
Once JIT is disabled, webapps are no longer viable. Which means we can start to deprecate features content-based websites don't need, and eventually, JS itself.
No reason to insult web developers in general. The simpler explanation is that webapps exist because of an economic problem: you can make more money (and have lower barriers) by either charging recurring payments for services, or by selling your users' attention, or both.
It is a technical problem. Fix the platform by enabling the writing of modular code with controlled, scoped imports and exports between HTML, CSS and JS, and you'll see the bloat go away.
> We're going to wind up undoing the last 20 years of performance gains in the name of 'security', and it scares me.
This actually excites me. When the foundation is shown to be rotten, it's time for a new foundation.
I'm optimistic, though, that the future holds a fork, with some devices insecure-but-fast and others secure-but-slow. Because there's a market for both. I don't care if my gaming hardware is vulnerable to Spectre because ideally there's nothing worth stealing there anyway. Email/messaging hardware can afford to be a -lot- slower than my gaming rig without any appreciable impact on the experience.
Perhaps the future holds motherboards that look like the physical embodiment of Qubes OS, with secure and insecure chips running compartmentalized features based on their security/speed requirements. We already do something like this for performance with the divide between CPUs and GPUs.
I have a feeling that the wide spectrum of different ways to mitigate CPU bugs will slow down the demand for a new "slow and steady" CPU architecture. Linux already comes with a feature to wipe the L1 cache on every context switch -- simply enabling this option will compete with brand new architectures for a while.
A major advantage of not adopting new CPU designs for a while is that you get to keep insecure-but-fast and secure-but-slow behavior in the same CPU by simply tweaking mitigations.
Good thinking, although I'd be a little wary about making a clear distinction between those two classes of device.
It'd seem both theoretically and practically possible to engineer hardware that could enable and disable certain optimizations and extensions dynamically.
(note that energy consumption may also be a related factor here)
The safe execution of any untrusted Turing complete code is a pipe dream.
You at least need a clean-sheet CPU design starting from the ISA, with basic logic operations formally validated against instruction-level analysis, to have a fighting chance.
But even such chips do get pwned, as shown by key recovery from credit cards in the wild.
> The safe execution of any untrusted Turing complete code is a pipe dream.
I don't think that's true. It's not turing completeness that's the real problem here. It's that software usually has access to accurate timing information, whether it's via RDTSC, gettimeofday(), or sharing memory with another thread that does things that take a predictable amount of time. If a program has no notion of current time and cannot measure how long something takes, then a lot of those side channel attacks no longer work. (Note that this precludes using styles of threading that have nondeterministic results, but it doesn't preclude using styles of threading that are deterministic, like Haskell's parMap.)
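For a sense of how freely available fine-grained timing is, here's a minimal sketch (assuming x86-64 and a GCC/Clang toolchain) of the two obvious sources a sandbox would have to take away:

    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>
    #include <x86intrin.h>   /* __rdtsc() */

    int main(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);    /* nanosecond-resolution OS clock */
        uint64_t cycles = __rdtsc();            /* raw cycle counter, no syscall needed */
        printf("monotonic: %ld.%09ld s, tsc: %llu cycles\n",
               (long)ts.tv_sec, ts.tv_nsec, (unsigned long long)cycles);
        return 0;
    }

And even with both removed, a second thread spinning on a shared counter gives you the same capability back, which is exactly the deterministic-threading caveat above.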
I do think maybe we should move away from the model of "let's let people run programs comprised of arbitrary instructions on their computers, and build all our security around keeping programs from reading and writing things they shouldn't" to a model of "all programs running on this computer were compiled by a trusted compiler, and our security is based on the compiler disallowing certain unsafe constructs". This is sort of analogous to web browsers running javascript in a sandbox, or running eBPF in the Linux kernel.
That's not what the poster is talking about though.
They're talking Von Neumann with a special "blessed" tooling written to not produce behaviors that would let users do nefarious things. They want to reduce the space of possible computations from arbitrary to "only these patterns which are provably safe".
Essentially, they want to hobble the user (malicious or not) and force good behavior by giving them tools that are incompatible with malicious behavior. They want Asimov's Three Laws of Robotics for computation.
The issue being, you run into the halting problem real fast when trying to make that blessed toolset. How does it recognize malicious code, or a series of individually benign but collectively malignant opcodes? Remember, side channels like Spectre and Meltdown boil down to timing how long it takes for a computer to say "no", and then accessing a piece of data you know should only be cached if the conditional that was preempted by an access violation was one value or another.
That is, start timer -> run (expected to raise access violation) conditional, branch speculative load -> access check -> exception -> check for result value in cache -> stop timer -> rinse -> repeat
Each of those is a benign command that could be sprinkled in anywhere. Collectively, they are a side-channel. You could still make variations of the same setup by tossing in junk values in between the necessary steps that would avoid this blessed tooling's (assumed) unwavering pattern recognition. I wouldn't actually use a compiler to stop this. You'd use a static analyzer to recognize these combinations; and even then, there's a lot of timer -> thing -> timer stop -> check programs that aren't malicious at all out there.
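For reference, the victim-side half of what's being described is the well-known Spectre v1 bounds-check-bypass shape. A sketch, not a working exploit: array1, array2 and the 4096-byte stride are the standard illustrative choices, and the timing/probe half is left out.

    #include <stddef.h>
    #include <stdint.h>

    uint8_t array1[16];
    uint8_t array2[256 * 4096];
    volatile size_t array1_size = 16;

    /* Called with in-bounds x many times to train the branch predictor,
       then once with an out-of-bounds x while array1_size is uncached. */
    void victim(size_t x) {
        if (x < array1_size) {              /* mispredicted during speculation */
            uint8_t secret = array1[x];     /* speculative out-of-bounds read  */
            (void)array2[secret * 4096];    /* secret-dependent load: warms exactly
                                               one cache line an attacker can probe */
        }
    }

Each line is individually benign, which is the point: a static analyzer would have to flag an enormous amount of perfectly honest code to catch this pattern.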
The answer with computers has been "if it absolutely must remain secret, implement security at a higher level than just the computer". Everyone should know that if you've got access, the computer will do what it's told to do.
The poster's suggestion is a pipe dream; and a dangerously seductive one at that, since anytime you hear from the "Trusted/Secure Computing" crowd, it almost always means someone wants to sacrifice everyone else's computing freedoms so they can write something they can pretend to guarantee will work.
Sorry, the cynicism leaked in a bit at the end there; but I have yet to see a security initiative that does anything but make life miserable for everyone except security people. I'll put up with some unsafe behavior in order to keep the barrier to entry low for the field in general; and accept the cost of more rigid human centric processes to make up for the indiscretions of the machine. Keep abstraction leakage in check.
>The safe execution of any untrusted Turing complete code is a pipe dream.
The safe execution of any code requires an operating environment that never trusts the code with more than the least privilege required to complete a task. It has worked in mainframes that way for decades.
The IT zeitgeist these days makes me sad. Things can be better, but almost everyone is pushing in counterproductive directions, or has given up hope.
> The safe execution of any code requires an operating environment that never trusts the code with more than the least privilege required to complete a task. It has worked in mainframes that way for decades.
It has nothing to do with any OS-level security features. We are talking about things happening below the level of what software can see.
You just cannot see any sign of such an attack by looking at any register the OS can see.
Timing attacks are only a subset of side channel attacks, though. One can also imagine thermal attacks -- the amount of power you consume leaks information about what you're doing. And if I share a processor with you, there's various ways I can imagine estimating your power usage. On a processor that has dynamic clocking, the clock speed I'm running at is an indicator of the operations you're doing. Even without dynamic clocking, the probability of an ECC error, for example, is likely to change with temperature.
Eliminating timing vulnerabilities is necessary to allow potentially-hostile workloads to share hardware, but it is not sufficient.
Determining what clock speed you're running at seems like it would also require access to timing information though, right? RAM errors are an interesting idea for sure, but I think that can and should be shored up at the RAM level. I think a strong sandbox, WebAssembly and the like, should be pretty reasonable to run untrusted.
A 2013 paper[1] demonstrated exactly that: a side channel detecting thermals, measured without any on-CPU timing.
Instead they measured CPU temperature through clock frequency drift, observed via changes in network packet markers. A bit contrived, but they made it work quite reliably.
I can still determine timing information by measuring how long it takes to execute a program. The only way to prevent this is to enforce constant-time programs by delaying a response until a specific amount of time (see constant time comparison functions in cryptography). That's not feasible for many applications, especially operations on a latency-sensitive critical path.
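The "delay until a fixed deadline" idea looks roughly like this sketch (POSIX; run_with_fixed_latency and the 10 ms budget are hypothetical illustrations, not anyone's real API):

    #define _POSIX_C_SOURCE 200809L
    #include <time.h>

    /* Run op(), then sleep until a fixed deadline so the caller always sees
       the same wall-clock latency no matter what op() actually did. */
    void run_with_fixed_latency(void (*op)(void *), void *arg) {
        struct timespec deadline;
        clock_gettime(CLOCK_MONOTONIC, &deadline);
        deadline.tv_nsec += 10 * 1000 * 1000;          /* 10 ms budget */
        if (deadline.tv_nsec >= 1000000000L) {
            deadline.tv_sec += 1;
            deadline.tv_nsec -= 1000000000L;
        }
        op(arg);
        /* Absolute sleep; if op() blew the budget this returns immediately,
           which is exactly the latency-sensitive-path problem noted above. */
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &deadline, NULL);
    }

The cost, of course, is that every request is as slow as the worst case you budget for.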
You can run algorithms deterministically in a multi-tenant system. Only allow tenants to run deterministic algorithms and side-channels are eliminated. Algorithms with provable time bounds can be run and the output delayed until the known time bound to eliminate timing attacks.
The people operating mainframes have a vastly different mindset and skills from the average computer/smartphone user. The former can and do dedicate 40h/week and more to studying documentation and tweaking the sandboxes of the stuff they run.
Yes, but it wouldn't include code that exploits those vulnerabilities, by design, and it wouldn't trust any other code, so in effect, it would shield the system from it.
Will this even be an issue when we have CPUs with hundreds/thousands of cores that can just sandbox processes to their own set of cores/cache with exclusive unshared memory?
I think this idea could be taken further: just build physical machines with lower capacity (RAM, cores), rather than filling data-centers with top-spec hardware then dividing them up with virtualisation. On the face of it at least, this seems like an idea worth taking seriously. With the right form-factor, I imagine it shouldn't even have much of an impact on space efficiency or power efficiency. Perhaps the CPU companies just aren't interested in making such hardware?
And "lower capacity" isn't even that low any more, just in comparison with top-of-the line. Think a raspberry pi or basically any cellphone's main logic board. My motorola g7, that I got for something like $150 new, has Snapdragon 632 processor with 1.8 GHz octa-core CPU and Adreno 506 GPU, 4 GB of ram, and 64 GB internal storage. A pi4, for under $100 has a Quad core Cortex-A72 (ARM v8) 64-bit SoC @ 1.5GHz and up to 8GB or ram. Those specs far outclass most budget VMs and are more than adequate for the vast majority of workloads. All that either is missing is a proper storage port (i.e. not an sd card but something like sata or m3), but otherwise how many raspberry pis could fit in a 1u enclosure? Even being generous and giving half of the volume to disks, dual power, and cooling it's still quite a few.
Yes, there are definitely workloads that will benefit from better hardware, e.g. video transcoding or pure number crunching, but I would contend that most websites, databases, CI, &c. could be done on something like a Pi replacing a VM or three (of the same customer).
Wouldn't that presumably cause energy costs to skyrocket because of all the overhead you get from going from multi-tenant machines to dedicated ones? Even if the capex is comparable, I'd imagine it would be hard to get the opex to be competitive.
You could very much design and verify a CPU such that execution of arbitrary code has no observable side effects.
However, attempting to do so is a multi-year, and for modern CPUs certainly a multi-decade, project. This is not feasible as long as Moore's Law goes on.
Formal verification of kernels (seL4) and the microprocessors they run on has been performed together to prove properties.
Large ALU blocks and vector units (such as multipliers) can be formally verified.
However, end-to-end formal verification of entire processors won't happen unless there is demand to justify the huge investment of engineering resources, and unless the market is fine with chips many years behind.
Unless everyone is running dozens of clients on all their hardware, it is cheaper to give everyone their own machine than to invest the engineering time into verifying chips. The economic incentive for Intel is clear: selling more machines for isolation brings them more revenue without having to invest in decade-long verification projects. So it is not a mathematical certainty that it can't happen, just an economic one. After all, speculative execution is just extra state and extra logic, which can be formally verified; programs are just bit patterns, and such questions can be posed in the formalism of Quantified Boolean Formulas.
But nobody is gonna undo 20 years of performance. You will just be told to buy more machines to isolate workloads.
The largest systems I'm aware of that meet formal verification -- like CC EAL 7 -- are small smart card (Gemalto) operating systems. Has it been done on anything bigger?
I've been saying since this initially came up that big.LITTLE is the long-term solution for this.
In the grand scheme of things, high-intensity tasks are only infrequently high-security tasks - those two sets of workloads are mostly disjoint. So the long-term solution is to have "fast cores" and "secure cores".
The fast cores can have all the OoO, speculation, all of that good stuff. That's where you run anything that needs to go fast, or anything running "trusted" code. By and large, nobody cares if an ffmpeg process or HPC node might leak data. Databases? You control the queries that are running on them, right? There are some edge cases like video games where leaking data is moderately harmful (could be useful for exploits if you can reliably leak useful data) yet you still want maximum performance, but at the end of the day leaking data at a couple kB/s usually isn't going to be the end of the world especially if the data is rapidly changing.
If the code is untrusted or user-generated, or the data is sufficiently sensitive, then run it on a "secure" core. The "secure" cores have to be in-order, non-speculative, all that crap. Probably non-SMT as that seems to be a bottomless pit of sidechannels as well. But usually, you aren't churning huge workloads in the "secure" situations. You can still have crypto acceleration instructions built into the cores, AVX, whatever, just not speculative. It's probably better to get them fully out of the "normal" cache hierarchy as well.
There are a couple obvious problems here, but much smaller than trying to fix everything for every use-case. In particular web browsers are running untrusted code, and every single website is running 15 MB of shitty JavaScript code. It sucks, but it's basically become an inner platform and you can't trust the code that it's bringing in, so that needs to be permanently isolated on its own secure cores. People will have to start paying attention to the performance of their JavaScript and optimizing out the real shitty bits.
Another big one is shared hosting environments - VPS environments are a prime target for trying to leak data from other clients on the same core/cache hierarchy, so those either need to be moved to "secure" cores, or switched to a model of renting out a whole core (or moved to a "hard time slice" where when the slice goes active you get the whole core for X seconds, then the processor stops, flushes everything, then switches clients). But VPS could conceivably be moved to "arrays of little cores" (to the extent that they aren't already) and that won't pose much problem for a lot of typical "micro" use-cases as long as every instance doesn't hit the server at once. Maybe for people that need faster than a dedicated "little" core the next increment becomes leasing the whole core, or even the whole complex of cores on that cache hierarchy.
Web application servers (not necessarily databases) are another one, unfortunately, since you can time web requests and use that to "leak" data down different code paths. If it's a directly user-facing service, probably best to get it onto a secure core.
The big task for humans is going to be identifying what stuff is allowable to run on the "fast" cores, and then get the schedulers set up so they understand that some stuff can only run in certain processor domains. It's not insurmountable, it just is going to take some time to plug away at it. Perhaps distribute whitelists, and allow the end-user to manually override it if they're really sure.
But yes, I've been saying that too. My suspicion is that basically all of OoO and speculation is fundamentally incompatible with not leaking timing data between processes, and that the harder we tilt at this, the more attacks we're going to turn up. It's going to turn into an endless game of whack-a-mole, and it's going to eat up all the performance gains that we've spent the last 20 years building on the backs of OoO and speculation.
AMD is quite well-placed for this imo since each CCX basically acts like its own NUCA (non-uniform cache architecture) domain and they just happen to share a memory controller. That's pretty much the design you need to make it work right, just with big and little CCXs instead of only big. They just have to come up with their own little cores. Intel is going to be harder because the classic Sandy Bridge architecture (which is largely unchanged today) has all the cores collectively sharing their last-level cache, and I think that's probably a problem in the long term too. I think Skylake-X still works on the principle of cache being attached to each core and them talking to each other to share it.
AMD and Intel engineers, please make your consulting checks out to 'cash'. Thanks! ;)
> It's probably better to get them fully out of the "normal" cache hierarchy as well.
I can’t seem to shake the notion that this idea of transparent, multi-level caching might have to go away too. That cache shared between cores may have to morph into a layer of chip-local memory that you allocate imperatively. It’s possible that languages like Rust or VMs like the BEAM could either adapt to such hardware with fewer problems, or even leverage it.
We keep trying to pretend like memory is flat but now we’re up to 3-4 layers of cache and memory banks. How much longer can you torture that abstraction?
Not sure if you meant this in a disparaging context, but (a) app authors could make use of cryptographic acceleration instructions, (b) this goes double for the "Apple keyring" or whatever the Android equivalent is, which will definitely get acceleration right off the bat, and (c) people overestimate how long those tasks take anyway. I have KeePass set up so that it takes 1 second per attempt on a fast processor, and my J5005-based NUC takes about 3 or 4 seconds to decrypt it. Probably about that long on my iPhone as well. Annoying, a bit, but it's not like you're waiting there for literal minutes either.
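That "one second per attempt" is just a calibration loop over the KDF's work factor; a sketch of the idea (using PBKDF2 via OpenSSL as a stand-in; KeePass's own KDFs and defaults differ):

    #include <time.h>
    #include <openssl/evp.h>

    /* Double the iteration count until one derivation takes >= 1 second
       on this machine; slower hardware ends up with a smaller count. */
    int calibrate_kdf_iterations(void) {
        unsigned char key[32], salt[16] = {0};
        int iters = 100000;
        for (;;) {
            struct timespec a, b;
            clock_gettime(CLOCK_MONOTONIC, &a);
            PKCS5_PBKDF2_HMAC("correct horse", 13, salt, sizeof salt,
                              iters, EVP_sha256(), sizeof key, key);
            clock_gettime(CLOCK_MONOTONIC, &b);
            double secs = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
            if (secs >= 1.0)
                return iters;
            iters *= 2;
        }
    }

Calibrating on the fast machine is also exactly why it takes three or four seconds on the slower one, as described above.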
And ideally that stuff could be moved into an on-processor secure enclave, so it's not executing on general cores at all. That way you straight-up can't even get to the data to try decrypting it, it just stays inside the enclave and the enclave doles out a single password at a time if and only if the password matches.
I think we will eventually see a return to company 'data centers' away from IaaS plays.
That said, its a pretty amazing time if you're a computer architect since you now have the transistors to spend on pretty much any crazy scheme you can dream up. So perhaps we'll see 'code safe' computer architectures emerge.
> I think we will eventually see a return to company 'data centers' away from IaaS plays
I don’t see why. Cloud providers have been offering dedicated hardware for a long time. If this problem isn‘t reliably fixable then more customers will make use of these options.
I think that cloud providers get dedicated SKUs. I can imagine if you give each VM dedicated cores and you can partition L3 cache per user, you could mitigate most of those issues.
It’s perfectly possible; you just can’t take shortcuts, and you’ll lose quite a bit of performance. And of course if you really ‘share’ a resource, the users are going to know a bit about each other. If each user’s share grows and shrinks as others use more or less, they are going to know that. But if you don’t want that, you could just reserve a fixed part of the resource for each user and they wouldn’t know.
It’s also possible that we strip off a ton of complexity and then find new performance directions that are better.
For my money we’d end up going toward many-core with loads of simple in-order cores on a die. It’d almost look like a GPU. With 5nm how many in-order ARM or RISC-V cores could you put on a chip? You’d also probably move away from shared caches toward each core having more cache and processes having stronger core affinity. That would be both faster and less likely to allow cache timing attacks. You’d have so many cores a core per process would be feasible with sharing only happening at saturation.
Another direction would be to go back to trying to crank up clock speed with some new approaches. What could we do with today’s manufacturing techniques if we focused on faster transistors more than smaller ones? AFAIK almost nobody has been working on this since the game has been to use more transistors to implement more features and hacks instead.
I read about 10 GHz parts on the lab bench in the 2000s. That’s an eternity ago in terms of semiconductor process. A 10 GHz in-order core would be like a 4X parallel 2.5 GHz core… roughly… but more secure and broadly faster on code that’s hard to parallelize. Get rid of speculation and instead give it low branch latency and a ton of on-board cache.
> "What could we do with today’s manufacturing techniques if we focused on faster transistors more than smaller ones? AFAIK almost nobody has been working on this since the game has been to use more transistors to implement more features and hacks instead."
Plenty of smart people spent lots of money trying it and as it happens the physics doesn't work out. There are countless articles explaining why processor clock speed isn't increasing, e.g.
The thing is, most projects could successfully run on a single dedicated server plus one or two spares. There is absolutely no need for slow virtual nodes. I always thought of those cloud solutions as a clever scam.
If you see them as a scam you're welcome to not use them. There are still colo facilities out there, and if you don't want to use that there are ISPs that will let you connect to the Internet.
20 years ago, computer magazines wrote about single-core 10GHz CPUs. Billions of transistors. What we got can barely be described as performance gains beyond "add SIMD and more cores, and performance hacks".
>Yeah, because they didn't realize how terribly the power consumption / heat output would scale; a 10GHz CPU will just melt itself.
Tesla, it was claimed, discovered something called "cold electricity" -- that is (according to the claim) -- when you ran it through a circuit -- it cooled rather than heated the circuit!
Now, today we have something sort of like this as thermocouples/Peltier Junctions (see https://en.wikipedia.org/wiki/Thermoelectric_cooling for more/better info on this) -- although it is not known if Tesla's "cold electricity" -- was talking about this effect and/or related -- or not.
Nonetheless -- it seems to me that IF (and it's a big if!) -- IF Tesla's "Cold Electricity" existed, IF it could be rediscovered, and IF it could somehow be integrated on a CPU either as part of or as auxiliary to the main CPU circuitry -- then the CPU cooling problem could be solved(!) -- or at least mitigated somewhat, to the point of allowing/permitting CPUs with higher thermal envelopes/tolerances/CPU speeds...
Again, there are some seriously big IF's there -- but I think it would be a great place for someone to do more research, or for researchers that might have an interest in this area...
It seems to me that Intel and AMD (or heck, any chipmaker for that matter!) -- might (or should!) -- have an interest for more research in this area...
In thinking about it -- It seems to me that there might be a relationship between heat, resistance, and unbalanced capacitance in a circuit...
In other words, you have a wire.
You put amps (at a specific voltage) through this wire.
If the wire diameter can't handle those amps at that voltage, then it gets hot. (Remember that at higher voltage, less current is needed to deliver the same power -- case in point, high-tension electric transmission wires: they usually never melt despite carrying huge amounts of power, the reason being that that power is delivered at high voltage.)
It starts to act less like a conductor -- and more like a resistor...
But wait!
Haven't we also seen this effect with capacitors that are fully charged (well, minus the heating)?
No longer does current pass through them at full capacitance -- as full capacitance is approached, they start to act less and less like conductors, and more like resistors!
They also want to "push back"!
Well, maybe wires which are under electrical stress (heating up, gaining resistance) act sort of like "mini-capacitors"!
That is, their capacitance isn't that much -- but they want to "push back" against the circuit, if only for a microsecond -- to release their micro-capacitative electrical load!
But -- in many places in a CPU -- if a bit needs to stay set to '1' for example -- this cannot happen -- because electricity needs to pass through that circuit constantly!
Solution: First, figure out a way to store bits in capacitively balanced circuits (an LC coil would be an example of this, but there should be other ways to do it); this allows the circuit to "relax" regularly, every millisecond/microsecond/picosecond (relative to CPU speed / transistor switching speed).
Net result is that circuit should not get hot, ever...
Rule of thumb (for future CPU engineers): If you're storing bits in a circuit that gets, or can get hot over time -- you're doing it wrong... (even though humanity's CPU engineering history up until this point in time is that every CPU created thus far -- stores and manipulates bits in circuits that generate heat!) <g>
>"Intel's suggested defense against Spectre, which is called LFENCE, places sensitive code in a waiting area until the security checks are executed, and only then is the sensitive code allowed to execute," Venkat said. "But it turns out the walls of this waiting area have ears, which our attack exploits. We show how an attacker can smuggle secrets through the micro-op cache by using it as a covert channel."
>"In the case of the previous Spectre attacks, developers have come up with a relatively easy way to prevent any sort of attack without a major performance penalty" for computing, Moody said. "The difference with this attack is you take a much greater performance penalty than those previous attacks."
>"Patches that disable the micro-op cache or halt speculative execution on legacy hardware would effectively roll back critical performance innovations in most modern Intel and AMD processors, and this just isn't feasible," Ren, the lead student author, said.
The best part of the new "defense against Spectre" is that the LFENCE instruction has been around for ~20 years. And it's not even a defense against all variants.
So what? The defense relies on it serializing the instruction stream, which was not necessarily guaranteed by the semantics of the instruction (until it was retroactively documented to do so).
The micro-op cache is very small, on the order of ~1.5K uops AFAIK. It can also be repopulated quite fast. So yes, the performance hit should be quite small. You should presumably also be able to reduce the performance hit if you reduce the frequency of context switches, which should get easier the more cores you have, if I'm not mistaken. That is, the OS can have its own dedicated core, and some programs can be more or less pinned to other cores where they are rarely interrupted.
> You should presumably also be able to reduce the performance hit if you reduce the frequency of context switches, which should get easier the more cores you have, if I'm not mistaken.
Context switches don't happen that often due to preemption unless your CPU is oversubscribed. Most context switches are due to syscalls, especially the ones used to wait for contended locks. Reducing those takes a lot more optimization work.
Given the hockey stick number of cores coming at us, I see pinning and better temporal avoidance being solutions. High security code will be pinned to its own core, running in its own memory area.
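The mechanism for that already exists; a sketch of pinning the current process to one core on Linux (core 3 is an arbitrary choice):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(3, &set);                                    /* allow only logical CPU 3 */
        if (sched_setaffinity(0, sizeof set, &set) != 0) {   /* 0 = calling process */
            perror("sched_setaffinity");
            return 1;
        }
        /* ... high-security work stays on core 3 from here on ... */
        return 0;
    }

The part the scheduler can't do for you is keeping the sibling hyperthread and the shared cache levels free of other tenants.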
I’d like to highlight this excellent post about x86 micro-ops “fusion” from three years ago, as it’s the reason I have any idea at all what micro-ops are:
You cannot realistically make a CPU invulnerable to performance analysis
And you don't need to.
There are really very few uses for real multi-system vs. multi-process shared systems.
Take a look at that whole "cloud" thing.
All the people I know who have worked in cloud hosting say that most systems are ridiculously overprovisioned, effectively nullifying any economic justification for a shared system.
I usually end up overprovisioning because I need something that is billed along with CPU; for example, I have super good C or Go code to run proxies on, and it uses like 2% of the CPU when it maxes out the network connection. I add more so the bandwidth goes up.
This. I run into this all the time with WebRTC infrastructure. My SFUs run out of bandwidth long before they’re at 100% CPU. It’d be great if I could easily provision VMs based on bandwidth, but of course cloud providers are always real coy and say things like “this VM size class has Medium bandwidth, but this one has 25Gbps, no we won’t say which of those is bigger.”
It's possible that there are technical reasons related to virtual networks that restrict what kinds of configurations are possible on their infrastructure. I would expect them to disclose it as such, but cloud providers haven't been very open about sharing those details.
There are separate micro-op caches per core; however, they are typically shared between hyperthreads. I wonder if this could be another good reason for cloud vendors to move away from 1 vCPU = 1 hyperthread to 1 vCPU = 1 core for x86 when sharing machines (not that there weren't enough good reasons already).
One sneaky thing I've noticed them doing is slowly switching their licensing over to 1 vCPU = 1 CPU, even though you're now only getting one hyperthread instead of one core.
For Microsoft, this means that they've literally doubled their software licensing revenue relative to the hardware it is licensed to.
This kind of false incentive worries me a lot, because while I like the technical concepts like infrastructure-as-code enabled by the public cloud, I feel like greed will eventually destroy what they've built and we'll all be back to square one.
Ask your cloud sales representative these questions next time you have coffee with them:
- What incentive do you have to make your logging formats efficient, if you charge by the gigabyte ingested?
- If your customers are forced to "scale out" to compensate for a platform inefficiency, what incentive do you have to fix the underlying issue?
- What incentive do you have to make network flows take direct paths if you charge for cross-zone traffic? Or to put it another way: why does the load balancer team refuse to implement same-zone preference as a default?
Etc...
Once you start looking at the cloud like this, you suddenly realise why there are so many user voice feedback posts with thousands of upvotes where the vendor responds with "willnotfix" or just radio silence.
Cloud vendors probably use a hypervisor that schedules the VM time slices in a way that hyperthread siblings are only ever cooccupied by the same guest.
ARM vendors must be feeling pretty good about themselves, yeah, but if you take AMD's cores... SMT might not be a huge win in every benchmark, but you just can't keep that wide backend fed from a single hyperthread (at least I can't!).
So turning SMT off is, at the least, wasted potential for those cores, given the way they've been designed.
Probably because they have an 8-wide decoder and a massive reorder buffer, so they can actually keep the backend fed.
The problem with x86 is decoding is hell and requires increasingly large transistor counts to parallelize, so you end up with a bottleneck there. ARM doesn't have that problem.
Variable-length, overlapping instructions have made x86 instruction decoding intractable. The obvious answer is to make it tractable; the unobvious answer is how to do that and hopefully remain backward compatible.
The performance claims are true for all the worst reasons.
Let's say you can queue up 100 instructions. This yields the following
1 port 100% of the time
2 ports 60% of the time
3 ports 30% of the time
4 ports 10% of the time
5 ports 2% of the time
Increasing the buffer to 200 instructions yields the following
2 ports 80% of the time
3 ports 40% of the time
4 ports 15% of the time
5 ports 4% of the time
As in that made-up example, doubling the window you can inspect doesn't double performance. You really want those extra ports because they offer a few percentage points of IPC uptick, but the cost is too high. So you keep increasing the window size until the extra ports become viable. As an aside, AMD Cayman switched from VLIW5 to VLIW4 because the fifth port was mostly unused. A few applications suffered from the slightly lower theoretical performance, but using that space for more VLIW4 units (along with other changes) meant that for most things the overall performance went up.
Now comes the x86 fly in the ointment -- the decoder's width gives rapidly diminishing returns (I believe an AMD exec mentioned 4 was the hard limit to keep power consumption under control). This limits the size of the reorder buffer that you can keep queued up. Since you have a maximum instruction window size, you have a hard port limit.
So you add a second thread. Sure, it requires its own entire frontend and register set, but in exchange you get a ton more opportunities to use those other ports. There are tradeoffs with the complexity and extra units required for SMT, but that's beyond our scope.
As you can see, SMT performance is DIRECTLY related to how inefficiently the main thread can use the resources. In less interdependent code, SMT performance increases are worse because finding uses for those extra ports on the main thread is easier.
Now, let's consider the M1 and one reason why it doesn't have SMT. Going 5, 6, or even 8-wide on the decoders is trivial compared to x86. Apple's M1 (and even the upcoming V1 or N2) have wider decode. This in turn feeds a much larger buffer which can in turn extract more parallelism from the thread (this seems to take about as many transistors as the extra frontend stuff needed to implement SMT). Because they can keep most of their ports fed with just one thread, there's no need for the complexity of SMT.
IBM POWER does show a different side of SMT though. They go with 8-way SMT. This isn't because they have that many ports. It's so they can hide latency in their supercomputers. It's kind of like MIMT (multiple instruction, multiple thread) in modern GPUs, but even more flexible. It helps ensure that even when other threads are waiting for data, there's still another thread that can be executing.
The memory latency hiding also works with 2-way SMT. I worked on networking software doing per-packet session lookup in large hash tables. SMT on a Sandy Bridge core in this application gave 40% better performance, which is higher than usually mentioned. So for memory-bound (as in cache-miss-bound) applications, SMT is a boon.
The CPU in this case is a Threadripper 3970x, 32 cores, 64 SMT.
My experience is this: When the L3 cache is effective, then the memory latency hiding via memory prefetch works well across SMT threads. If the hashtable load requires a chain walk, the SMT latency hiding is less effective because the calculated prefetch location is not the actual hit. I couldn't get prefetching multiple slots as the load increased to be as effective as prefetching a single slot.
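For readers curious what that looks like in code, the pattern is roughly this sketch (GCC/Clang __builtin_prefetch; the batched lookup and its names are hypothetical, not the actual product code):

    #include <stddef.h>
    #include <stdint.h>

    struct slot { uint64_t key; uint64_t value; };

    /* Issue prefetches for a whole batch of probes first, then do the real
       lookups, so the cache misses overlap instead of serializing. SMT adds
       another source of independent work while those misses resolve. */
    void lookup_batch(const struct slot *table, size_t mask,
                      const uint64_t *hashes, uint64_t *out, size_t n) {
        for (size_t i = 0; i < n; i++)
            __builtin_prefetch(&table[hashes[i] & mask], 0 /* read */, 1);
        for (size_t i = 0; i < n; i++) {
            const struct slot *s = &table[hashes[i] & mask];
            out[i] = (s->key == hashes[i]) ? s->value : 0;
        }
    }

And as noted, once a probe turns into a chain walk the prefetched slot isn't the one you end up touching, so the benefit fades.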
I tested this some years ago on a raytracer, and got a tad over 50% more speed when enabling HT compared to disabling it.
As you say, the ray tracer did a lot of cache missing, interspersed with a fair bit of calculation. I'm guessing this is close to the ideal workload, as far as non-synthetic benchmarks go.
4-way and 8-way SMT is about latency hiding (like MIMT in GPUs, but more flexible). It increases the probability that at least one thread has data it can be crunching.
Because the cloud is designed around people uploading binaries to your machine -- it is a basic principle of how services are allocated. When you go to AWS and spin up an EC2 instance, you don't get a machine to yourself. You get a VM running alongside many other people's VMs on some arbitrary server in one of their data centers.
> You get a VM running alongside many other people's VMs on some arbitrary server in one of their data centers.
Doesn't that make it even harder to do any sort of specific attack on anything? From what I understand, these side-channel attacks depend on being able to predict the addresses you'll read from, an idea of what you're after, and a stable environment in which enough timing information can be collected; any small change in the environment can mean you start reading something completely different without even knowing it. A CPU that could be running literally who-knows-what at any time seems like it wouldn't let you collect much in the way of coherent data, and of course the VM you're doing it from could itself be moving uncontrollably across CPUs.
Question: How relevant are these for the average person? I know these matter for things like shared hosting, but I've yet to hear of an actual exploit in the wild that ordinary people have been attacked by, even with Spectre defenses turned off. Should normal people be worried about this?
I personally disable spectre/meltdown mitigations for performance. I don't think any of it is very important for my use cases and I don't leave sketchy websites open for hours on end to give them a chance to make use of the exploits.
Not really. So far these mostly haven't been able to cross process boundaries, and most browsers have tripled down on process sandboxing by this point (iframe sandboxing was the last major push here: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/if... )
So process-based sandboxing will continue to be the defense here, and process switching will just get a little bit slower as increasingly more caches are flushed (toss the uOp cache into that list now). For basically all consumer usages this will be perfectly fine. On the other hand, things like Cloudflare's Workers are looking a lot more suspect.
> So far these mostly haven't been able to cross process boundaries
Actually, most Spectre vulnerabilities, including this one, do cross process boundaries when they are first discovered, and kernel and microcode patches are needed to implement mitigations against this -- typically flushing some cache or something when switching between kernel and userspace. Often these mitigations hurt performance.
> things like Cloudflare's Workers are looking a lot more suspect.
Cloudflare Workers uses a completely different approach to Spectre mitigation, based on slowing down observability of side channels to the point that an attack isn't practical. More details here:
This approach doesn't target specific forms of speculation and therefore tends to work against the whole class of bugs, including ones that haven't been disclosed yet. The down side is that it requires restricting the programming environment including changes that would be backwards-incompatible for browsers, and it certainly wouldn't work at all with native code. Luckily Cloudflare Workers was able to design for these constraints from the start.
I'm the tech lead of Cloudflare Workers, so I may be biased. But, my honest opinion is that the cloud hosts that accept native code are in a much more precarious position than we are.
This is a threat that won’t be fully mitigated until one’s browser is updated to flush the Micro-Op cache.
Of course in the meantime, safe browsing practices such as avoiding untrusted Javascript will provide protection. But then, we should always be doing that anyway, so it's not as if this should be changing behavior of the average, security-conscious person. It's just another in an unending series of threats.
Spectre could too, but again, my point was that I didn't hear of actual attacks on people in the wild, at least not on any scale that seemed to make the news. Is there a reason to believe this will be different?
This had to come. The only fix will be to add a BIOS setting for speculative execution: on or off. Gamers all turn it on, on a patched machine that runs nothing but their game. Everyone else, e.g. for browsing the web, turns it off. Look for an encoded binary JavaScript exploit that will own any speculation-enabled system. It's coming too, just like this paper was always going to come.
This is not feasible. Not everyone has multiple computers. And not everyone wants to use different computers for different things, or take the effort to muck around in the (often mazelike) bios settings.
I expect this to be just like Spectre. The media seizes on it as a tool to use fear to drive engagement, vendors partially cripple their hardware to guard against it, and literally nobody ever bothers trying to actually use it against innocent people.
I don't understand this at all; I didn't think the micro-op cache was visible to code written for the x86 ISA at all. Can anyone explain to an idiot (me) how something in the micro-op cache can become visible to the outside world?
I'm simplifying a bit (edit: quite a bit =]), but the way these attacks work is generally by exploiting the difference in timing between something being in cache, and something not being in cache. Or some resource being contended vs not contended.
If something is in cache, and you also have access to that cache, accessing that thing will be fast and few CPU resources will be used. If it isn't in cache, the same access is measurably slower.
So you can tell that something is in cache. And you know you didn't put it there. So some other thread that you're sharing a CPU core with must have put it there.
To exploit those attacks, you're going to intentionally watch the other thread as it, for example, (speculatively) takes a branch, and either puts something in cache or doesn't. Now you know whether the other thread (speculatively) took a branch or not! Just from measuring timings of the cache.
From that, you work back to what the branch condition (that was still only speculatively executed) must have been, and if this branch is based on (speculatively loaded) data, you just leaked one or more bits of the data.
Suddenly, things are not speculative anymore. You guessed data that wasn't yours, because speculatively using it had an effect on the cache, and you could measure that effect. Here, they use the micro-op cache (I haven't read the paper, so I don't know the details, but this is broad strokes).
Any mechanism that you can use during speculation, and that you can extract timing information from is potentially a problem. And these are everywhere.
That's why the Spectre problem is so hard to fix now that pandora's box is open.
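To make the "measure the timing" step concrete, the probe side is tiny; a sketch for x86-64 with GCC/Clang (the 100-cycle threshold is an illustrative guess you'd calibrate per machine):

    #include <stdint.h>
    #include <x86intrin.h>   /* _mm_clflush, __rdtscp */

    /* Evict a line before the victim runs, so a later fast access means
       the victim (speculatively) touched it. */
    static void flush_line(const void *p) {
        _mm_clflush(p);
    }

    /* Time one access: a cache hit is far faster than a miss. */
    static int probe_was_cached(const volatile uint8_t *p) {
        unsigned aux;
        uint64_t t0 = __rdtscp(&aux);
        (void)*p;                        /* the timed access */
        uint64_t t1 = __rdtscp(&aux);
        return (t1 - t0) < 100;          /* fast => hit => line was brought in */
    }

Nothing in there is privileged, which is why mitigations end up flushing or partitioning the shared structures instead of simply banning an instruction.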
You don't need the OS to build a very precise clock for the purpose of exploiting a timing attack. That's why SharedArrayBuffer[1] is disabled in all browsers.
Moreover, not every side-channel attack relies on timing.
For interested readers, there is a paper titled "Fantastic Timers and Where to Find Them: High-Resolution Microarchitectural Attacks in JavaScript", which discussed a variety of ways to build high resolution clocks.
SharedArrayBuffer is not disabled in all browsers. It was briefly disabled at disclosure time of the first Spectre vulnerabilities, but as browsers moved to site isolation (one process per origin), they have reenabled it.
Proper multi-level security doesn't allow access to the clock in anything other than the top level. You could just have a monotonic counter that is periodically synced to reality every minute or two.
I must be misunderstanding what you are suggesting, since what it seems like you are suggesting would never work. High-resolution timing information is available to user applications via numerous APIs today. Hyrum's Law, and common sense, tell us we can't just lose this feature of operating systems and expect applications to work.
You can construct a high resolution timer from shared mutable memory and multiple threads. It's simple. One thread increments a counter, and the other thread reads it.
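A sketch of that counter clock (C11 atomics plus pthreads; on a typical desktop the resolution lands in the low nanoseconds):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdint.h>

    static _Atomic uint64_t ticks;

    /* "Clock" thread: spins, incrementing a shared counter as fast as it can. */
    static void *ticker(void *arg) {
        (void)arg;
        for (;;)
            atomic_fetch_add_explicit(&ticks, 1, memory_order_relaxed);
        return NULL;
    }

    void start_clock(void) {
        pthread_t t;
        pthread_create(&t, NULL, ticker, NULL);
    }

    /* Reader side: two loads around an operation give its duration in "ticks",
       with no OS timer API involved at all. */
    uint64_t now(void) {
        return atomic_load_explicit(&ticks, memory_order_relaxed);
    }

This is essentially what SharedArrayBuffer plus a worker gives you in JavaScript, which is why it was the thing browsers pulled first.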
Good to know. This new era of aggressive hardware flaw exploitation has me motivated to leverage my mobility and flexibility to evade. I don't think I have a better strategy.
In a modern processor they are; however, "microcode" as a general term is a catch-all that basically means non-trivial configuration logic stored somewhere not meant to be touched by anyone other than the vendor. I think IBM has millicode.
This quote from the article explains the danger quite well:
"Intel's suggested defense against Spectre, which is called LFENCE, places sensitive code in a waiting area until the security checks are executed, and only then is the sensitive code allowed to execute," Venkat said. "But it turns out the walls of this waiting area have ears, which our attack exploits. We show how an attacker can smuggle secrets through the micro-op cache by using it as a covert channel."
A close reading of the paper “I see dead uOps” would seem to indicate that Intel’s static thread partitioning of their micro-op cache would confer some inherent protection against uOp cache information leakage between threads - as compared to AMD’s dynamic thread partitioning scheme which could theoretically allow threads to spy on each other using the described techniques.
If true, wouldn’t this also imply that an Intel Skylake CPU mitigates against such attempted attacks by one user against another in a shared CPU/ISP/cloud environment, whereas an AMD CPU theoretically would not? If true, this would be a key point that the authors failed to mention in their concluding remarks.
Anyone else read it this way? Or am I missing something?
The act of loading code into memory, be it a hypervisor or a guest OS, should've been gated by sanitization and validation callbacks. Building all of these macro- and micro-op runtime defenses and mitigations into the processor and slowing down the OSes for every possible runtime edge case is a waste of speed that can be avoided by establishing trust in code pages.
The morphing of data into code pages with JITs like JS should also be subject to similar restrictions.
I'm assuming these are instructions for self tests or verification. If so, removing the instructions after they are manufactured wouldn't be easy. You can do it in microcode at the cost of making all execution slightly slower (if instruction not in [a, b, c, d]) or by physically altering the die to remove those instructions. Either way, it doesn't sound fun. It's probably easier to leave them in.
There's no reason to remove them, that's not what I'm asking. By all means leave them in, but why leave them undocumented? Explain their existence, their parameters & capabilities. If not intended for use, explain that too.
Then when something unexpected like Spectre comes along, the people that have to deal with it can say "Oh yeah, those testing instructions provide another vector of attack that our patch has to account for."
Instead we're in this situation, and I'm pretty sure there's at least a half dozen nations that would have already devoted the resources needed to uncover undocumented instructions like this, meaning ample opportunity to have developed various exploits.
That failed for good reasons. Itanium processors ended up using out of order execution and speculation just like everyone else, because the compilers just don't have enough information compared to an out of order execution engine.
It is surprisingly difficult to make timekeeping unavailable. There are many methods, besides the official timer APIs, to get timing - as outlined in the paper "Fantastic Timers and where to find them" from TU Graz:
Out of curiosity, is Apple's M1 processor seemingly faster because it actually followed a normal CPU progression, while all the other common CPUs - x86 - took retroactive performance hits from patching Spectre?
And therefore the M1 seems so much faster than it otherwise would?
ARM CPUs, including Apple's have been found vulnerable to some SPECTRE variants. As far as I understand, M1 already contains hardware mitigations for all known ones, but so do the latest Intel/AMD chips. (Or rather, contain fixes for all but these latest ones.)
The CPU needs to make the overheard signals look just like random noise. A cheap XOR-stream (compare 2FA like Google Authenticator, or the remote in your car keys) should cover that.
Well, some of these attacks exploit the actual values present in the memory, not their stored representations. Therefore it would not matter how you encode them on the way, right?