Hacker News new | past | comments | ask | show | jobs | submit login

From the library readme [1]

> The database is a key value store prepopulated with the english names of countries and their capital cities from the continent of Europe. Selecting the country will perform a search of the matching capital. On a 2019 Macbook Pro laptop, the example searches take under 80 seconds.

Does this have a heavy performance cost for even toy applications?

[1] https://github.com/IBM/fhe-toolkit-macos/blob/master/Getting...




The example is called "Privacy Preserving Search". If the server could see which row was accessed, that would not satisfy the privacy requirement. This implies that each query needs to process all the rows equally (and all in the same complicated encrypted way).

Provable privacy like here or in zero-knowledge proofs is generally extremely expensive because every possible execution path needs to be taken each invocation. This is multiplied by the overhead from each simple operation becoming a complex cryptographic one.


You can use tricks to reduce this requirement. For example, you access a random subset, and at the same time you randomly rearrange a portion of the data. Viewer can't narrow anything down because you are shuffling fast enough to invalidate anything they learn from your access patterns.


You're trying to come up with a homemade ORAM. All I'll say is that this problem is harder than it seems.


You should be able to relax even that assumption: if the executor doesn’t know the contents of the database itself, then it may not even matter to you that they can correlate your query with the row in the database (they can build a book of correlations but they still would not be able to see the contents). It should be even more feasible if the query contains some random nonce too?


Working on privacy tech has taught me that zero knowledge is absolutely critical if you want true privacy.

Statistical correlation is stupidly powerful. Identifying someone uniquely in the world requires about 33 bits of entropy. Which means you have have some statistical analysis that reveals 0.01 bits of information per sample, you only need 3300 samples to uniquely identify the person performing the activity.

It's a bit tough to explain without having actual examples, but the same math that allows a javascript library running in a web browser to extract the private key being computed in a VM elsewhere on the machine can also be used to identify which individual people have diabetes from looking at a stack of "anonymized" health reports, or which individual people are watching porn from looking at an aggregated history of request sizes going through a VPN.


> 33 bits of entropy

For the clueless (including myself--I just looked this up), if doing a brute-force search to identify or reidentify someone using "anonymized" data (which is what this poster is referring to), it only takes 2^33 attempts, or 8,589,934,592 attempts (on a bit scale--binary--0 or 1), to exhaust all other possibilities (such as other people--when reidentifying people via health data that was "anonymized"), when tracing to find a person via data in general.

Obviously some data is more sensitive than others, but that has little relevance: it requires nearly no data (especially in terms of what is being hoarded about each and every one of us) and virtually no effort or computing power, to be found.

This also means that even the most mundane (and always sensitive, too--although some data is more sensitive!) data that has been hoarded about you will be used against you, in ways you would have never imagined. It will be used for processing and inferences (which you are unaware of) will be made on you.

You are already probably part of several unhealthy social experiments that you are unaware of, via correlations from mundane data, using things that are pervasive such as AI.


> the same math that allows a javascript library running in a web browser to extract the private key being computed in a VM elsewhere on the machine

I am not entirely certain whether I was clear about the point I made in my message, however I have asked a very specific technical question and you are making some generalised sweeping statement about privacy in the most widely used case.

We could not decrypt any ancient languages, for example; if we didn’t have an equivalent of a Rosetta Stone for them. Thus you should not be able to do the same for the encrypted database I assume.


"hiding the database" is security-through-obscurity, which is not security.


If they can correlate you with rows in the database, and they can correlate me with rows in the database, they can correlate you with me.

Given enough queries and perseverance in unraveling things, a surprising amount of information can leak, for example whether you and me live in the same country, frequently travel together (indicating a relationship), etc.


I said “relax the condition” and also said “they know nothing about the information stored in the database”.

I am trying to see whether I am missing something and you’re adding things I removed back into the equation.


I think what you are missing is yet your assumptions are unreasonable and unrealistic.

If we could just keep our data secret, we wouldn't need encryption at all.


Right but I would think you'd also need to avoid the executor learning that you were accessing the same record at different times. Merely encrypting the records wouldn't prevent this, so your query (to be homomorphically encrypted) would need to be executed in such a way that you look at more than one record.

You'd need that to stop the risk that, say, an attacker learns (from a side channel) you were looking for person A's details at time T1, and thereby unmasks which other queries/times were about A.


FHE has a lot of properties beyond just those needed to search a db. If you have a specific application in mind there may be more efficient things. E.g. if you want to keep secret what you are looking at but server knows the keys, see research on private information retrieval. See also some of the research on order preserving encryption and the like (although lots of those schemes have questionable security properties). Of course everything has tradeoffs, but often "perfect" security is unnessary in an application.


I'm wondering if this sort of thing would fit well with GPU processing, which tends not to do well with branchy code?


People are working on it. The biggest problem is that cryptography usually requires bignum math and GPUs are not optimized for that. FPGAs are also being researched. The cost/benefit is not as big as you would imagine, CPUs are pretty optimized for bignum math.


I attended the event celebrating the solving of LCS35. LCS35 was a cryptographic challenge set by Ron Rivest (the R in RSA) in 1999. It was intended to be so difficult that only computers made 35 years in the future would be able to solve it.

The challenge was solved by Bernard Fabrot. It was fairly simple code on I believe an intel 4770K. The math libraries and CPU had been optimized so much in the ~20 years since the challenge was set that a consumer CPU could calculate the answer. The only catch is that it took months on end of calculation.

https://en.m.wikipedia.org/wiki/LCS35


NuFHE can do a binary operations in 0.1ms/bit, and mux operations in 0.2ms/bit using FFT, which is hyperoptimized on GPUs. I think the issue is that the keys tend to be rather large.. NuFHE's decryption keys are 110MB! So memory/FFT problem size may be the bottleneck now.


In the algorithm implemented the search key is matched against all rows in the database. There is no short circuit evaluation and there cannot be... you cannot compare an encrypted value to a plaintext value and know they are equivalent without decrypting it first (which the server-side cannot do). This is true even for writing the code to do the search since you cannot do and if style comparison on encrypted values to break out of a loop. In the example, each value stored in the database is compared to the encrypted search term producing a partial result that is itself encrypted and meaningless to the server side. Once ALL value rows have been searched, the partial results (all those encrypted meaningless things) of all those comparisons are combined to yield a single encrypted result which is returned to the user that initiated the search. That result is then decrypted. There is no leakage because ALL code paths and all data must be evaluated for the program every time. However, from what I gather, the database can be shattered across multiple backend servers and each subset could be searched in parallel for a speedup for a practical deployment.


Does it actually have to process all rows equally? Would a bit of randomisation on search + erasure coding of data could allow both reconstruction of message, and preserving secrecy by querying random places for the same information?


I was oversimplifying to convey an intuition on why this is so hard. But yes, this is an active area of research.

In this academic field the standard for privacy is usually extremely high. Even if you only access half of the rows, that still leaks one bit of information and would not be considered zero-knowledge.


Well, the returned value leaks all the bits of information, unless that is randomized too to prevent replay attacks.


In homomorphic systems the server only sees the encrypted result. Since it can not decrypt it, it doesn't gain any information from it. And indeed the encryption is scrambled differently for each query, otherwise you could learn if two values are the same.


Why would you have to process "all" the rows equally? It should be more than sufficient for almost any practical scenario to use some constant factor of "scattering", i.e. access 100 random rows + 1 row that you actually are interested in.

It will probably be more complicated than that, due to repeated queries revealing patterns, but there are other mitigations for that, some of which are sure being developed.


You could trade off some security for speed like that, but 100 rows wouldn’t cut it. Ask for the same information again and you’ll probably only have 1 intersection. Even if you read half of the rows unnecessarily, you could figure out the true row in between 2 and approximately log2(rows) reads. Just isn’t worth the side channel attack.


If it's the same query, it would be the same 100 random rows, wouldn't it? Then repeated queries wouldn't leak.


No, if you run the same query the server must not be able to determine that it was the same query. Repeated invocations of the same question would be transformed with added noise during encryption so that the encrypted query and answer is different each time.

If the same query would access the same 100 random rows, then after seeing your (hidden) query and getting the 100-random-row signature it would be trivial to run a bunch of candidates and see which one of them gets the same 100-random-row signature and thus figure out what you queried.


Seems like the constraints of this basically mean that you might as well just stream the database to the client and let them do the query.


One of the scenarios is that the cloud provider has an algorithm they want to keep secret. Clients want to use the algorithm but dont trust cloud provider with their data.

Honestly though, i mostly find FHE interesting theoretically. The fact that its possible at all is black magic. I'm sure if it ever gets remotely efficient people will come up with more creative applications, but in the meanwhile its resesrch worth it for research sake.


> One of the scenarios is that the cloud provider has an algorithm they want to keep secret. Clients want to use the algorithm but dont trust cloud provider with their data.

That would indeed go a long way towards resolving the issues caused by the way SaaS is done. It would let us decouple compute provider from the service, in a way in which it would be the end users who are free to chose where the computation happens, and they'd own all the data by default, while the business secrets and IP of the service provider would still be protected.

Shame to see it's so computationally expensive, I was really hoping this kind of computing would take off.


You nay also want to limit the client’s access to the data: e.g. limit the number or types of queries, which, depending on the encryption type, may require data to be processed on the server.


Your point alludes to the requirements one must consider before implementing private search: query privacy, and bandwidth efficiency. If I need only privacy, then do as you suggest. If I need only bandwidth efficiency, then reveal the query to the server. Thanks for bringing this up.


I'm getting a kick out of these performance numbers. It's a common refrain on HN condemning $BIG_COMPANY for not using homomorphic encryption for $CLOUD_SERVICE.

With these performance numbers in hand, I'm finding those posts retroactively hilarious.


Do people really condemn companies for not using FHE? Anyone who knows anything even remotely about FHE knows that its not practical in real applications yet, and won't be for a while.


I can't think of any such instances. I suspect GP is thinking of something else.


Is it really that common? I often see HN posts calling for end-to-end encryption, but not homomorphic encryption.

(But even "just" end-to-end encryption is more difficult than many realize.)


Most people asking for mandatory e2ee everywhere to protect their privace do so only because they have heard from someone that it is a good idea.

Most often they do not understand the tradeoffs that come with mandatory e2ee (difficult information transfer, easy loss of data, no server-side search, etc), and are not ready to deal with them.


You should compare them to what the performance numbers where 5 and 10 years ago. Homomorphic crypto systems have become many orders of magnitude faster quite rapidly. While they are still no where near non-homomorphic solutions (and likely won't ever be), they have only recently become practical for applications where performance doesn't matter and privacy is valuable (in the sense that people pay a premium for it, which is often not the case).


Wow, I don't remember the details but I did a thorough review of the state of FHE, as well as playing around with some implementations, ~5-7 years ago and this is orders of magnitude faster than what I would expect from that.

From a quick glance, this sees to be a major enabler (2016): https://tfhe.github.io/tfhe/

The project linked in the submission uses HElib: https://github.com/homenc/HElib/


It doesn't look like HElib implements TFHE, though. Are BGV and CKKS equally fast?


Both BGV and CKKS are so called batched schemes, which allows you to pack multiple values into a single ciphertext (typically thousands) and perform operations on all of the values for the cost of a single homomorphic operation. Batching is one of the larger sources of speedup since the discovery of fully homomorphic encryption.

The values in BGV and CKKS are integers and both schemes allow homomorphic integer multiplication and addition (CKKS gives approximate results). TFHE is actually not a batched scheme and moreover operates mostly on the level of single bits and boolean gate operations. It is quite difficult to directly compare TFHE with batched integer schemes.

Which scheme is best will ultimately depend a lot on the requirements of the application. If you're operating on integers and don't need bitwise operations, then an integer scheme is appropriate. Applications that need bitwise logic and can work with small bitwidths will be more suited for TFHE. CKKS is especially good for numeric and machine learning applications, because the approximation it implies can be managed and it allows using faster encryption parameters than BGV/BFV would.


Reminds me of that time I used a full week to raytrace some logo...


Ah, the 80s...


Yes. Homomorphic encryption databases are totally unrealistic in practice. It’s a common BS justification for research to make it “practical.”

(There are real applications, particularly in privacy preserving financial cryptography. But the database application is just a BS non-niche justification for grant purposes.)


Currently totally unrealistic in practice, but it's proven that the algorithms are in P (polynomial time), and it's just ("just", I know...) a matter of getting the constant factor overheads down. It's definitely desirable, because it means some untrusted host could perform operations on your data without ever knowing what they were operating on e.g. "I added your numbers but I don't know what they are or their sum."


I'm reading this thread and I still can't figure out the point or how it protects things.

What client would want to add some numbers without knowing what the result is?

I can't imagine any interaction with a database where I don't need to do an operation with the resulting data (even if it's just passing it into a different API/DB unchanged).

My interpretation of your example in human terms is "add your day of birth to your bank balance, but don't tell me the answer" and then walking away. What am I missing?

-- edit --

Maybe I'm looking at it in the wrong direction? Is it more like "here are some numbers that, add them up on your calculator but don't look at the result before showing it to me"?


The latter is more like it. It is a case where you want something computed, like the sum of your paychecks over the last year, Or some credit card risk evaluation thing, or if you have markers for cancer in your genome data based on your personal information. Instead of sending these values to the server (perhaps encrypting in flight) where they are processed in plaintext (open to malicious intent on the server-side, or honest-but-curious folks who mine your sensitive personal information) the values, you upload remain encrypted so only you know what they mean. However the cloud side can sum them, perform threshold evaluations, search for things, determine fitness for a loan etc, without knowing anything, they go through the motions of the computation for you, but without being able to decipher anything about the computation result at any step along the way. Then when the server-side is done, it has computed whatever you asked it to do, but knows nothing. The server side reliably and deterministically manipulated symbols in a language it cant read. As it turns out, you can, so when the encrypted server-side results are sent back, you decrypt them and understand if you have been approved or have genetic markers for cancer or something.


“This thing I’m handing you is a $10 bill. Also, I have a proof that the serial number on it is real, and not duplicated which I can share (the proof) with any third party, to prove my payment or validate the ledger of payments, without revealing which serial number it is (thereby preserving your privacy).”

That is essentially what zcash is. So as I said, numerous applications in financial crypto (where we can reasonably spend full seconds on desktop, or minutes on secure hardware grinding away at proof to send funds), but not so practical to spend an even greater amount of time on each database query.


Polynomial time is pretty bad I feel and not due to any constant factor. A search in a regular database with an index is either O(1) or O(log(N)) depending on the design.


Indexing a N-element array (a O(log N) operation frequently approximated as O(1)) inside a homomorphic enclave requires O(N) operations. Constant factors exacerbate the problem, but any usefully large database is wholely unsuitable for homomorphic use.


Hmm... I don't understand this. O(N) is very fast for most realistic purposes I can come up with. It's faster than sorting - O(N log N) - which I was taught is "basically a free operation". (Coursera class with Tim Roughgarden.)


Sorting (or any other O(N log N) operation) is only "basically a free operation" for small N, and if you're not doing it very often. A billion times log base two of a billion is roughly thirty billion. If an operation takes a billionth of a second, your "basically free operation" is taking thirty seconds. Also, you definitely want to avoid doing an O(N log N) operation more often than you have to. For example, doing

  for element of listB:
    listA.add(element)
    listA.sort()
is going to be much, much slower than

  listA.sort()
  for element of listB:
    listA.insertInSortedPosition(element)
since "insert in sorted position" should be O(log N) — just use binary search to find the correct position. The first algorithm is O(N^2 log N) while the second algorithm is O(N log N). For any non-negligible N, you definitely don't want to be running in polynomial time if you can avoid it! And can you guarantee your algorithm will never be run on non-negligible N?

I mean, yeah, sometimes you can. And in that case, "premature optimization is the root of all evil," sure. If you're just pulling a couple hundred product records out of a database and then sorting them in code, then yeah, an O(N log N) sort is "basically a free operation."

But sometimes you want to do more complicated things than that, and then it's not true. So overall I think it's pretty irresponsible to teach that sorting is basically free (without caveats). That's definitely not always true!


O(N) search is not fast.

Hash tables and trees are O(1) or O(lg N) which are vastly faster than O(N).

Imagine a database with billions of records. O(N) search on this database will absolutely murder performance.


I am also an amateur in these things. If indexing is O(N), does that mean it has the same performance as a linked list? Basically that would mean you have to go through each and every item until you reach the index you want.

Does homomorphic encryption allow running operations blindly on an encrypted dataset and return encrypted results without ever decrypting the dataset?


We've had lots of problems over last few decades which went from "practically impossible" to "doable on home computers", whether that's with improvements in software or hardware design. Extreme cases - realtime raytracing and fluid simulations, hardware-assisted encryption, practical NN.

Why do you think HE value map cannot be ever improved?


I'm of the opinion it can be, but i don't know if a comparison to the era of moore's law is really compelling, since we are not in that world anymore.


It has current practical applications in inter-bank fraud detection, where you want to query if someone is on someone else's blacklist but not reveal the blacklist or the person being queried.

These applications can easily handle an 80 sec delay. It's also likely that queries scale well by batching into larger combined queries (since you need to involve every row already anyway).


Because I was confused by this scenario I wrote down the steps:

1. Alice provides Bob with her public and evaluation keys.

2. Bob encrypts each row of the blacklist with her public key.

3. Alice encrypts her query (e.g. for "Eve") and sends to Bob.

4. Bob evaluates some circuit on Alice's query and Bob's rows, and sends her one (encrypted) result per row

5. Alice decrypts the rows and sees if any are ones.

Ways this can be made more efficient:

* Outputs from each row's evaluation can chain into the next evaluation, so Bob only need return O(1) data.

* Bloom filters could cut down on average-case performance, but that would make Alice leak information (whether she follows up her FHE bloom filter query with a more precise query.)


A bunch of the papers I found from a quick Google search use secure MPC rather than FHE for interbank fraud. Which one is more practical these days wrt latency?


80 seconds is not that bad.


Isn't that about 1000x slowdown?


Closer to 1,000,000X, I’d say.


There's only 50 countries in Europe. This is taking over a second per row in the database. I'm fairly sure a 1950s computer could do better using a regular algorithm.




Consider applying for YC's W25 batch! Applications are open till Nov 12.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: