Homomorphic encryption and zero knowledge proofs are the most exciting technologies for me for the past bunch of years (assuming they work (I'm not qualified enough to know)).
Having third parties compute on encrypted, private data, and return results without being able to know the inputs or outputs is pretty amazing.
The general concept is "I want to keep this data private and confidential, but I want to be able to rent elastic compute for it". Previously, at the end of the day, the processor could only do useful work on unencrypted data.
But something like "analyze this list of customer records for fraud" or "analyze this proprietary data for trends" has previously either required a lot of trust that your cloud provider isn't going to siphon your data, or just required on-prem hardware that you cannot scale as easily.
If "math on encrypted data" works, we could keep our data confidential while still sending of batches of it to be number-crunched at a cloud provider.
Or even start talking about distributed / peer-to-peer computing, where a swarm of users (or, say, businesses that wish to cooperate) can send compute jobs to each other without having to trust that the other members weren't going to go snooping through the data that was sent.
And for that matter if I were a provider of some kind of computational service for hire, I (and my insurers) might feel a great deal of relief at the idea that we’re no longer having to sit on a big attractive ransomable pile of our clients’ data.
The idea is the result is encrypted too such that, when the person that holds the key gets it back they can decrypt it and see the result. So you can run an inference algorithm on data without ever having to decrypt it and see what the data actually is.
It seems to me that this intrinsically is vulnerable to side-channel attacks, but it will be interesting to see if we can avoid those with constant-time algorithms or something.
No, FHE is not intrinsically vulnerable to any side-channel attacks, by definition; the properties hold no matter how the "executor" executes the operations and they include the assumption that obviously they can see (or change!) everything that happens during the execution.
However, that effectively means that any FHE approach has to be "anti-optimized", always touching all the data that might get touched, always taking all the conditional paths, not just the worst case path as in constant time algorithms, since the execution must return the correct result for any encrypted input data.
It's not that you can "see if we can avoid those with constant-time algorithms or something", the basic table stakes (which all FHE methods actually implement) is the constant-time algorithm approach taken to the extreme and then squared.
It does have "some" unavoidable performance impact because of that.
I suspect that you’re right, but my concern is more about unknown unknowns here: new tech like this often has failure modes we can’t imagine until the tech itself becomes commonplace.
Because, if you have a mechanism to run arbitrary computations on encrypted data, I’d be a bit concerned about an attacker running carefully crafted computations on the data and deducing information about the encrypted data based on how the ciphertext changes or the time/memory usage of the computation.
This isn’t really a particularly well-informed suspicion: it’s partly based on a sense that FHE is a “have your cake and eat it too” sort of technology that’s too good to be true and partly based on the recent discoveries that fairly well-understood things like speculative execution in CPUswere vulnerable to serious side-channel attacks.
If running computations on encrypted data is enough to compromise the integrity of the encryption then that encryption algorithm is not even EAV secure, and EAV secure is a pretty low bar.
zk proofs are used in public blockchains to transmit millions in value so I would assume they aren't vulnerable to side channel attacks at least not on a superficial level.
A homomorphism is a function so that f(ab) = f(a) f(b). So you can imagine the cloud customer wants to compute ab, but the use a homomorphic encryption function, f, to ask the cloud provider to computer f(ab) given f(a) and f(b).
Then the customer decrypts f(ab). This doesn’t imply any weakness in encryption.
FHE is a bit stronger than what I’ve described, but that’s the idea.
The issue with this use-case is that the "I want to be able to rent elastic compute for it" is not a fundamental need or desire, but effectively a desire to save costs for hardware and other infrastructure.
Like, for a true need you could imagine someone saying "I really want this, no matter how it costs" - but you wouldn't do that for elastic compute if doing that locally was 100x cheaper; and in a similar way, the "wish to cooperate" by sending compute jobs to each other is meaningful if and only if you can actually save effort this way, but if there's a 100x overhead cost for FHE (and in reality it's far worse than mere 100x), that doesn't make sense, and you'd simply be better off doing it without distributed / peer-to-peer / cloud computing and just buy and maintain your own non-shared hardware even if it's used only 5% of the time.
And I'm not sure if FHE can ever get to overhead rates so low that these use cases would start to make sense, like, there's no good reason to assume that a mere 10x overhead FHE is even theoretically possible.
No because this is the very bleeding edge of technology so there are only very limited (but very cool) tech demos of the technology? Nothing at scale yet.
One paper I used showed how X-rays can be analyzed without exposing the X-ray data itself.
This can be generalized more to any situation where one party owns a proprietary algorithm and another party owns sensitive input data but they don't trust each other.
Modern LLMs are actually the perfect example. You want to use chatgpt but they don't want to give you their model and you don't want them to see your input. If HME was more efficient, you could use it so that the person executing the model never sees your real input.
If it’s just that the parties don’t trust each other then the cost of HME has to be compared to the current “state of the art” which is contracts and enforcement thereof.
In practice, I don’t think those costs are that high because the rate of incident is low and the average damage is also low.
Yes there are outlier instances of large breaches but these seem like high profile aircraft crashes considering how many entities have sensitive data.
I feel like trust is a spectrum, and the promise of these techniques is that they reduce the need for trust in the first place.
We should consider what kinds of computational tasks today’s responsible parties (or their regulators, or their insurers) think of as too risky to casually trust to third parties under the status quo. For example with my block storage provably unintelligible if you don’t have the HSM I keep safely in my corporate dungeon, I’m comfortable not caring whose racks the encrypted blocks sit on. I’d have to vet those vendors a lot harder if they could read all my super secret diaries or whatever.
And, for that matter, it’s on the service provider side too, right? Even the contractual, spit-and-handshake pinky-swear-based mode of enforcement comes with significant compliance costs for service providers, especially ones operating in regulated industries. Perhaps it’s not too much to hope that effective and efficient HME techniques might reduce those service providers’ compliance costs, and lower the barrier to entry for new competitors.
I’m reminded how even non-tech people in my life became much more willing to trust their credit card details to online retailers once they felt like a little green lock icon made it “safe”. Of course a LOT changed over that same period, but still: the underlying contractual boundaries didn’t substantially change—in the US the customer, then as now, has only ever been responsible for a certain amount of fraud/theft loss—but people’s risk attitudes updated when the security context changed, and it opened up vast new efficiencies and lines of business.
It’s not too much to hope that HME reduces those compliance costs. However, I believe it is too much to assume there will be any material adoption before it can demonstrate that reduction.
Reduction of trust is not a value add, it is a cost reduction. Maybe that cost is blocking a valuable product/service but either that product/service’s value is less than the current cost of trust OR trust has to be far more costly in the context of the new product/service.
It’s only the latter that I find interesting which is why tend to be pretty hard on suggestions that this will do much for anything that currently exists. At best, it will improve profits marginally for those incumbents.
What is something where the price of trust is so catastrophically high in modern society AND HME can reduce that cost by orders of magnitude? Let’s talk about that rather than HME.
Data incidents cause more problems than can easily be resolved with a contract lawsuit. Perhaps the data was siphoned by a 3rd party that hacked your vendor, or a malicious insider at your vendor sold it to a competitor. Sure, you can recoup some losses by suing your vendor for breach of contract, but once the data is leaked, it's never secret again.
And then there's the example of businesses that work with lots of confidential customer data, like banks or doctors. Again, you can sue your vendor for breach of contract if they behave irresponsibly with your data, but your customers may not care; you're going to suffer a hit to your reputation regardless of whether or not the breach was your fault.
You can say it’s insufficient but it is what it costs them today.
I guess the better comparison is that cost in a financial statement plus some expected increase in revenue due to a “better” product.
Again, I think you are correct in your analysis of the improvements but that contributes little to the revenue as explaining the benefit to most customers requires framing your existing product as potentially harmful to them. Educating them will be hard and it may result in an offsetting realization that they were unsafe before and as a result were paying too much.
Not really, you would phrase it to your customers or investors as a way of mitigating risk. You can probably apply a price tag to that risk by estimating the impact of a data incident vs. the likelihood of one happening. Different businesses have different risk appetites, and I would hope that a board or C-Suite is thinking about what level of risk is acceptable for their business.
Mitigating risk is covered in the cost reduction side.
Yes the C-Suite is thinking about and mitigating risk. They probably know the exact number for a given class of risk in terms of current mitigation costs. You have to beat that by a margin wide enough for them to take action.
Even if you know their numbers and know you beat it by enough to warrant the deployment you will still get bumped if someone sells them a path to increasing revenue.
The out I gave was to frame it as value added (more revenue) and that is where you risk devaluing your current product.
If you frame it as cost reduction you are capped in both price and interest by the current, necessarily acceptable, levels of risk and cost of mitigations.
> One paper I used showed how X-rays can be analyzed without exposing the X-ray data itself.
Is there any hope of getting the performance to the point where something like that would be feasible? I’d imagine the raw data itself would be so big that the performance for anything non trivial would be unworkable.
It lets you have someone add up your numbers, and give you the sum, without knowing the input numbers or their sum. Basically, any time you want someone else to host the data, while you can also do queries/computations on it.
That now generalizes to any computation (not just addition), because there is a logically complete set of primitive operations for which fully homomorphic encryption (FHE) can be done.
Caveats:
1) You can't actually do e.g. while loops with run time unknown. All such FHE computations are done by generating a fixed size boolean circuit for a given input size. It's "Turing-complete" in the sense that you can size up the circuit to any input size, but it wouldn't directly implement an unbounded while loop -- you have to generate a different "program" (circuit) for each input size.
2) All such computations must have a fixed output size -- else it leaks information about the data inside. So allowable computations would be like, "give me the first 30 bytes of the result of querying for names in the set that begin with A". If there are no results, the output still has to be 30 bytes.
3) For similar reasons, any FHE computation must look at all bits of the input (otherwise, it leaks info about what's in them). "See if this value is in the set" can't just jump to the relevant section.
4) The untrusted third party knows what computation you're doing (at least at a low level), just not the data it's being performed on or the result.
5) As you might have expected from 3), there's a massive blowup in resources to do it. There can be improvements, but some blowups are inherent to FHE.
In addition to the other answers, any situation where you want to prove identity but want to preserve privacy at the same time. For example you could prove you're an adult citizen with ID without revealing your ID number, picture, or any private information whatsoever.
There are examples in cryptocurrency, where ZK proofs are all the rage. One use case is validating ownership of some amount of currency without knowing the actual amount, or verifying that a smart contract was executed correctly without revealing the specifics of the contracts internal state. Some blockchains use this in production, such as Polygons ZkEvm.
There are a few that I've thought about but the first one that comes to mind is proving you're 21+ to an alcohol store cashier without handing over your ID with all your personal info. You could just give them a ZK proof that returns a true/false and they could check it against encrypted data (signed by the DMV).
Why is that preferable over a message attesting “over 21” signed by the DMV?
The hard parts here are retrofitting society to use a digital ID and how to prove that the human in front of you is attached to that digital ID.
The solutions there all seem like dystopias where now instead of a bouncer looking at your ID for a few seconds, technology is taking pictures of you everywhere and can log that with location and time trivially.
It doesn't have to be a digital ID, it can just be encrypted data encoded on a regular ID on a QR code.
Age depends on timestamp. The encrypted data is stored on the ID and signed by the DMV, with a function that can be run by the bouncer's scanning machine that plugs in a now() timestamp, and receives a boolean in return. The DMV doesn't even need to be involved after the issuance of the ID and no network access is needed for this calculation.
No one's location was tracked and no one's picture was taken and now a bouncer who fancies you can't turn up at your house after glancing at your ID.
Age verification (without leaking other PII) was the illustrative scenario for W3C Verified Credentials (it lets you use a validating authority to sign specific subset of your schema).
There’s lots of other ways to solve the problem for verification/signing use cases tbh. Homomorphic encryption shines best when you are looking at more complex calculations than just a Boolean result - such as tax calculations.
Can you submit your financial information and have your taxes calculated without revealing the amounts involved? Can you apply filters to an image without having the image readable by the server? It essentially allows us to “trust a remote server” in scenarios where one wouldn’t usually.
Still doesn’t quite justify homomorphic encryption if the computations involved are mundane (like tax calculations). The true use case is when the computation itself is proprietary, so that users don’t want to reveal their data but the provider also doesn’t want to divulge their algorithm. This is why ML models or finely tuned search engines are the examples most commonly cited, but of course those are also far outside the scale that could be achieved.
> No one's location was tracked and no one's picture was taken
I assume you mean the bouncer didn't take a photo, they just looked at the DMV photo embedded in the ID and did a visual comparison in their meat brain.
If there's no photo anywhere, how does the bouncer know I'm not using someone else's ID?
How do you know that the bouncers scanning machine didn’t log the interaction?
The whole value prop is built on not trusting that bouncer and by extension their hardware.
Everything would have to be encrypted leading to the bouncer also needing to establish that this opaque identifier actually belongs to you. This is where some picture or biometric comes into play and since the bouncer cannot evaluate it with their own wetware you are surrendering more data to a device you cannot trust.
They also cannot trust your device. So, I don’t see a scenario where you can prove ownership of the ID to a person without their device convincing them of it.
All sorts - anything where you today give your data to some SaaS (and they encrypt it at rest, sure), and they then provide you some insight based on it or allow you to do something with it (by decrypting it, crunching numbers, spitting out the answer, and then encrypting the source data again) could potentially be done without the decryption (or the keys to make it possible).
Concretely (and simplistically) - I give you two numbers, 2 & 2, but they're encrypted so you don't know what they are, and you add them together for me, but that's also encrypted so you don't even know that the sum of the inputs I gave was 4. It's 'computation on encrypted data', basically.
For example the Numerai hedge fund's data science tournament for crowdsourced stock market prediction is giving out their expensive hedge fund quality data to their users but it's transformed enough that the users don't actually know what the data is, yet the machine learning models are still working on it. To my knowledge it's not homomorphic encryption because that would be still too computational expensive, but it would be an ideal application for this.
Whatever Numerai has done to their data, there is no indication that they actually transformed it in a way that would hold up to a theoretical cryptographic adversary.
It would come in handy with not requiring any trust of the cloud that is running your servers. For protected health information this could potentially be a major breakthrough in patient privacy.
It would be hideously expensive due to the compute required and the total inability to monetize the data. Some people value their privacy that much; most don’t.
In an extreme case could Google use a mechanism like this to deny themselves direct access to the data they collect while still bidding in ad auctions based on that information?