Show HN: Read Wikipedia privately using homomorphic encryption (spiralwiki.com)
331 points by blintz on June 8, 2022 | 116 comments
Hi, creator here.

This is a demo of our recent work presented at Oakland (IEEE S&P): https://eprint.iacr.org/2022/368. The server and client code are written in Rust and available here: https://github.com/menonsamir/spiral-rs. The general aim of our work is to show that homomorphic encryption is practical today for real-world applications. The server we use to serve this costs $35/month!

A quick overview: the client uses homomorphic encryption to encrypt the article number that they would like to retrieve. The server processes the query and produces an encrypted result containing the desired article, and sends this back to the client, who can decrypt and obtain the article. A malicious server is unable to determine which article the client retrieved. All search and autocomplete is done locally. The technical details are in the paper, but the high level summary is that the client creates a large one-hot vector of encrypted bits (0’s except for the index of the desired article, where they place a 1) and then the server computes something like a ‘homomorphic dot product’ between the query and the plaintext articles.
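
To make that concrete, here is a toy sketch of the selection-by-dot-product idea. This is not the spiral-rs API, and enc/dec below are placeholders rather than a real lattice scheme (a real ciphertext hides the bit from the server and carries noise), so only the shape of the server's computation is meaningful:

    // Toy sketch of PIR-by-dot-product. The Ct/enc/dec here are stand-ins:
    // a real scheme (e.g. Regev/LWE as used in Spiral) keeps the query bits
    // hidden from the server, but the server-side arithmetic looks the same.
    #[derive(Clone, Copy)]
    struct Ct(u64); // stand-in for an encrypted scalar

    fn enc(bit: u64) -> Ct { Ct(bit) }   // client-side (hides `bit` in a real scheme)
    fn dec(c: Ct) -> u64 { c.0 }         // client-side
    fn ct_scale(c: Ct, plain: u64) -> Ct { Ct(c.0.wrapping_mul(plain)) } // server-side
    fn ct_add(a: Ct, b: Ct) -> Ct { Ct(a.0.wrapping_add(b.0)) }          // server-side

    fn main() {
        // Server's plaintext database: each "article" is a single u64 for brevity.
        let articles: Vec<u64> = vec![111, 222, 333, 444];

        // Client wants article index 2 and encrypts a one-hot vector.
        let want = 2usize;
        let query: Vec<Ct> = (0..articles.len())
            .map(|i| enc(if i == want { 1 } else { 0 }))
            .collect();

        // Server: homomorphic dot product <query, articles>. It touches every
        // article and never learns which entry of `query` encrypts a 1.
        let mut acc = enc(0);
        for (q, &a) in query.iter().zip(&articles) {
            acc = ct_add(acc, ct_scale(*q, a));
        }

        // Client decrypts the single returned ciphertext.
        assert_eq!(dec(acc), 333);
        println!("retrieved article payload: {}", dec(acc));
    }

The key point is that the server runs the same multiply-and-accumulate loop over every article regardless of which entry of the query encrypts a 1.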

I’d like to caveat that this is an in-browser demo to show it is practical to use homomorphic encryption at this scale. As a real product, you’d probably want to distribute a signed client executable (or Electron app) since otherwise, a malicious server could simply deliver bad client JS on the fly.

Happy to answer any questions!




This is the first thing out of homomorphic encryption I personally have seen that seems to be in the ballpark of useful for some practical use, which is impressive. Have I missed out on any other such things of interest?

(And this is not a criticism; this is a compliment. You start so far behind the eight-ball with homomorphic encryption with regard to the resources it consumes I wasn't convinced it was ever going to be even remotely useful for much of anything. Precisely because I was so skeptical, I am that impressed to see something work this well. It's not the fastest Wikipedia mirror, but... honestly... I've been on slower websites! Websites with far less excuse.)


There has recently been a lot of great work on making homomorphic encryption more practical for particular applications. We definitely stand on the shoulders of all that work!

One reason homomorphic encryption has a reputation as absurdly slow is that people typically talk about “fully” homomorphic encryption, which essentially means that you can compute an arbitrarily large function on the encrypted data. This involves a very expensive process called bootstrapping, which incurs a cost of roughly 15 ms per binary operation evaluated. As you can imagine, that adds up.

This work uses “leveled” homomorphic encryption, where we only perform a function of ‘bounded’ size (that is, the homomorphic dot product). So, we do not have to perform bootstrapping, and thus avoid some of the more extreme costs.

The other reason this work is practical is that we ‘tailored’ our construction to the particular problem of private information retrieval. When people try to apply homomorphic encryption generically to their problem, they typically end up with disappointingly slow and expensive results. Cool work on better ‘FHE compilers’ is in progress, so hopefully that will also help!


> can compute an arbitrarily large function on the encrypted data

What is the meaning of the word "large" here? How are we defining the size of a function?


Either the number of Boolean logic operations or the number of integer additions and multiplications, as these are the operations FHE schemes support.


It's generally multiplicative depth; that is, the number of ciphertext-ciphertext multiplications that feed into each other. The primary way we define and evaluate functions in FHE is as Boolean (or sometimes arithmetic) circuits, so "depth" is in that context.
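
As a toy illustration of why depth, rather than raw multiplication count, is what matters: both expressions below compute x^8, but the chained form needs 7 levels of ciphertext-ciphertext multiplication while repeated squaring needs only 3, and leveled-FHE parameters (and the noise budget) grow with the depth (plain u64 arithmetic here, just to show the evaluation orders):

    fn main() {
        let x: u64 = 3;
        let chained = x * x * x * x * x * x * x * x; // 7 multiplications, depth 7
        let sq1 = x * x;     // depth 1
        let sq2 = sq1 * sq1; // depth 2
        let sq3 = sq2 * sq2; // depth 3
        assert_eq!(chained, sq3);
        println!("3^8 = {}", sq3);
    }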


You missed the widely panned Apple iCloud child sexual abuse imagery detection feature. The private set intersection is basically doing homomorphic encryption. In raising some very valid policy critiques, people forget that it's actually a nifty piece of engineering. (This is not an endorsement of that feature.) https://www.apple.com/child-safety/pdf/Apple_PSI_System_Secu...

I'm also working closely with a team at $WORK that's using a protocol very similar to Apple's, but not doing CSAM detection. We are seeing some severe pushback on this technology. I wouldn't be surprised if there are multiple homomorphic encryption based products at Big Tech that have yet to see the light of day.


I think it is very counterintuitive---even to professional programmers---that you can compute things without letting the computer know them. I literally had to go through the entire paper to understand that this can actually work as long as the human in the loop doesn't screw up (see my summary [1], which was revised multiple times at that time).

[1] https://news.ycombinator.com/item?id=28223141


Apple's Differential Privacy uses homomorphic encryption for parts of its functioning.


Citation?


Interesting! But, it'd be helpful to clarify further the strength of the following claim:

> This demo allows private access to 6 GB (~30%) of English Wikipedia. In theory, even if the server is malicious, it will be unable to learn which articles you request. All article title searches are performed locally, and no images are available.

In this demo, the number of article titles is relatively small – a few million – and enumerable.

If the server is truly malicious, and it issues itself requests for every known title, does it remain true that this "Private Information Retrieval" (PIR) scheme still gives it no hints that subsequent requests from others for individual articles retrieve particular data?

(Presumably: every request touches every byte of the same full 6GB of data, and involves every such byte in constant-run-time calculations that vary per request, and thus have the effect of returning only what each request wanted – but not at all in any way correlatable with other requests for the exact same article, from the same or different clients?)


Indeed! The encryptions of the client queries are fully semantically secure - under relatively solid lattice-based cryptographic assumptions, a server would need to do more than 2^128 work to recover the plaintext of someone’s query. One query for one item in the database is indistinguishable (without the client’s key) from another query for the same item later; in other words, it’s similar to something like the guarantee of CBC or GCM modes, where as long as you use it correctly, it is secure even if the attacker can see many encryptions of its choosing.


> without the client’s key

Thank you, these 4 words really helped with my understanding so I'm calling it out in case it helps others. So I was thinking, what prevents you from replaying the query and getting the same page back? But it seems the answer is: that would only produce a gibberish response because you don't have the key.


How indistinguishable is a Not Found result by the server? It seems like user behavior would be to re-request the article a second time, so the client should probably protect the user against this kind of server (which could bisect on article popularity to find the requested article in ~20 tries) by throwing up a warning about an article in the index not being retrievable.


Indeed, client behavior when an article fails to be retrieved is very important. A malicious server could guess what article is being retrieved, remove it from the database, and see if the client 'reacts' in some observable way.

The mitigations for this are somewhat heuristic; the client should perhaps 'pace'/'debounce' subsequent requests when a retrieval fails. For example, enforce that the re-retrieval will happen at a uniformly random time between 1-5 seconds later.
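
For concreteness, a minimal sketch of that kind of client-side jitter (a hypothetical helper, not something in spiral-rs):

    use std::collections::hash_map::RandomState;
    use std::hash::{BuildHasher, Hasher};
    use std::thread::sleep;
    use std::time::Duration;

    // Hypothetical mitigation: when a retrieval comes back empty/garbled,
    // wait a uniformly random 1-5 seconds before retrying, so the retry
    // timing alone doesn't confirm to a malicious server which article
    // was being fetched.
    fn jittered_retry_delay() -> Duration {
        // std-only randomness: each RandomState is randomly seeded per
        // process; good enough for pacing, not for cryptography.
        let mut h = RandomState::new().build_hasher();
        h.write_u64(0x5eed);
        let millis = 1_000 + (h.finish() % 4_000); // roughly uniform in [1s, 5s)
        Duration::from_millis(millis)
    }

    fn main() {
        let delay = jittered_retry_delay();
        println!("article chunk came back empty; retrying in {:?}", delay);
        sleep(delay);
        // ...re-issue the PIR query here...
    }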

It's good to note that this kind of attack is somewhat unavoidable, since denial-of-service is (generally) always possible. An interesting mitigation might be for the server to produce a ZKP that the full dot product was computed correctly, and have the client always check the ZKP before displaying the article. This is mostly a theoretical fix for now, I believe, since ZKPs over homomorphic computation are still more theory than practice.


If the protocol is that the server returns a specific "index N deleted" result for deleted N then the client can at least trust a valid response from the server as opposed to a DDoS or unmasking attempt. Any server that returns something other than a valid document or "N deleted" should no longer be trusted or communicated with, but retries for communication failures or timeouts should still be safe.

Edit: this assumes that the client gets a trusted index from a set of trusted servers who are at least as up to date as the latest index that is made available, which timestamped signatures or similar should suffice for. Or the client can supply the timestamped index signature and the server can reply with a different message if it's too old.


I think their model doesn't allow you to express such a thing? At implementation level the client requests an id in a fixed range and gets back a 100KB chunk corresponding to that id; if there's no article for that id then it'll just get a chunk of all 0s, and to anyone without the client's key that's indistinguishable from a chunk with content in it.


Would this be vulnerable to a side-channel attack as follows?

1. Record what item was retrieved from disk for a query

2. Run a dictionary through the query system, and see which item matches the record


The server processing has to scan the entire database to answer a query, so the entire database stays in memory, and access patterns tell you nothing about the query.


Notably to some who may misinterpret the "stays in memory" aspect: even the copy in memory is fully scanned by every query. So it's not just disk access patterns that don't give anything away, but RAM access patterns.


Can this be applied usefully to non-public datasets?

Would it be feasible to add some other zero knowledge proof to this that would confirm a user has paid a subscription for access? For example, if this were a news site, the user would have to prove a valid subscription to read articles, but the site would not be able to know which articles any subscriber decided to read?

If that is possible, what could the site do to prevent a paying subscriber from sharing their access with an unreasonable number of others? Would it be possible to impose a rate limit per subscriber?


The simplest approach would be to just charge people per-query (or charge in levels, depending on the number of queries). This could be done in the standard way (you have to log in, pay the site, and then it gives you an API key or just session token, and logs how many queries you make). I think you can avoid having to use a ZKP this way, since that will make things much more complicated and possibly costly.


I'm no cryptographer, but it seems to me you could implement this using the same algorithm Cloudflare uses for Tor, which generates anonymous tokens from an ad-hoc webpage


In another comment you’ve said:

> With a proper implementation of PIR, the server still needs to scan through the entire encrypted dataset (this is unavoidable, otherwise its I/O patterns would leak information)

Is this technique therefore practical only when the server side dataset is relatively small (or full scans for every query are tolerable)?

(edit: sorry, misattributed the quote)


Wasn’t me, but it was accurate!

Indeed, except in some (exciting!) new theoretical constructions, server work is always linear in the database size.

However, I’d emphasize that our work shows that the constant factor on this linear operation can be quite low. We process at 300MB/s to 1.9GB/s on a single core, which is fast enough for relatively large databases. Remember that the computation is embarrassingly parallel, so you can always throw more compute at larger databases. To summarize, we think the speed is now fast enough that it really can be feasible to scan the whole database to answer a single query.
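
A rough back-of-the-envelope using numbers from elsewhere in this thread (6 GB database, 6 cores, roughly the lower end of the quoted per-core rate), so treat it as approximate:

    6 GB / (6 cores x ~0.35 GB/s per core) ≈ 2.9 s per query

which lines up with the ~2.5-3 s the demo server reports; at the upper 1.9 GB/s rate the same scan would take roughly half a second per core-set.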


> except in some (exciting!) new theoretical constructions

Sorry if this is too much of a tangent, but I would love to know what these are!


They are called PIR with sublinear online computation or offline/online PIR.

Key idea: the client issues a query that is independent of what they really want. This is the “offline” query. This query is linear (unavoidable), and the client gets some hints as a result.

Then later, the client can issue one or more queries for the elements they want, and those queries can be sublinear in terms of computation. The client uses the hints to make that happen.

Some papers on this are:

https://eprint.iacr.org/2019/1075.pdf

https://eprint.iacr.org/2021/1438

https://eprint.iacr.org/2022/081.pdf


Sure! This paper has people really excited from a theoretical standpoint: https://eprint.iacr.org/2022/081.pdf. It’s a bit far from practical for single-server, but I’m sure people are working on it.


Clarification: that was my comment, not OP's. I'm not a cryptography expert, just an interested amateur. But my understanding is that O(n) query times are inevitable if you want information-theoretic security. Maybe it's possible to do better with a weaker security property.

And there are clever ways you can make a system like this "scale" even if the overall dataset size is limited. For instance, the authors cite another interesting paper [1] that uses a similar technique to build a fully-private voice chat system. The basic idea seems to be that you build a "database" consisting of the most recent snippet of audio from every ongoing conversation, and let each client privately query for the segment that's addressed to it. And every fraction of a second when new audio data arrives, you just throw away the old database and build a new one, so the amount of data doesn't depend on the length of the conversation.

Even if this doesn't scale to arbitrary numbers of users, it could still be used to run a "cell" of a few thousand people, in which it's not possible for an adversary to reconstruct communication patterns within the cell.

[1]: https://www.usenix.org/conference/osdi21/presentation/ahmad


Maybe the I/O pattern could be hideable using confidential computing, like with a Nitro Enclave.


Maybe, but if you have a secure enclave that can be trusted not to leak data, then you don't really need PIR. You can just have clients encrypt their queries with a key that can only be decrypted by the code running in the enclave.


Could this be used for DNS?


This is a great idea, and we think it would be relatively practical assuming some aggressive caching. However, I couldn’t think of a threat model where this is useful, since presumably your ISP can in the end always see which sites you visit by simply reverse-resolving the IPs you connect to.

Do you think that people would want private DNS? I suppose it would still be an improvement over what we have today, but I’m not sure that it will make it meaningfully harder for ISPs to collect and sell data to advertisers.


On threat models, a malicious DNS server might also be one compromised by a party demanding wiretap access.

Regardless, a person today has a choice of which DNS server to use but they all could track the requests made. Tracking site visits via IP is a different link in that chain.

Would people pay? I don't know, but I could see it being a feature used to differentiate a VPN service from its competitors.


That's a good point, I could see this being a differentiating feature for a VPN provider. The only way to know if people would pay is for someone to offer it, I guess...


With Encrypted Client Hello (ECH, https://en.wikipedia.org/wiki/Server_Name_Indication) and large security proxy networks like Cloudflare, an ISP can no longer identify sites just by IP address or by tapping traffic.


OCSP would be a good target in the similar space: https://en.m.wikipedia.org/wiki/Online_Certificate_Status_Pr...


I do think OCSP and certificate transparency stuff is a clear application for PIR that makes a lot of sense. There, you currently basically have to volunteer your browsing patterns to a 3rd party (beyond your ISP).


For SCT audits on the browser, see https://www.lightest.eu/static/papers/10-Revisiting%20User%2...

Google’s CT design docs also state PIR Audits as the long term goal, so this would be a good area to focus on. https://docs.google.com/document/d/1G1Jy8LJgSqJ-B673GnTYIG4b...


There are ways to hide IP endpoints from relaying nodes, as well.


Yes, but it would be quite expensive infrastructure-wise. Also, I think the number of websites grows faster than processor I/O, meaning one would need more and more processors over time.


Last year, there was a detailed presentation with several speakers on state of the art Secure Multi-Party Computation for practical applications in healthcare, fighting financial crime and machine learning from CWI (Centrum Wiskunde & Informatica) Netherlands. The recording is here (2.5h): https://www.youtube.com/watch?v=gE7-S1sEf6Q


> A malicious server is unable to determine which article the client retrieved.

This sounds like magic :O. How does it behave when new articles (elements) are added, does it need to rebuild the whole database and distribute new parameters?

I wonder how practical it would be for clients to synchronize content without the server being able to deduce which synchronization state the client is at.


It does sound like magic! This is what got me into this field; it seems like something that should intuitively be impossible… but it’s not!

Parameters only need to be changed based on the number of items in the database (not the content of the items). Also, they don’t really need to be modified as long as we are within the same power of two number of items. So, I think client and server agreement on parameters seems feasible. Right now, it’s just hard coded :)


Does homophobic in this case mean that I can edit the content of an article and the diff is directly applied to the crypt?


This has nothing to do with editing Wikipedia. The problem this demo is solving is "private information retrieval" -- that is, you send a query for a particular article, and the server sends back a response containing that data, but the server does not learn which article you asked for. Homomorphic encryption is just one of the building blocks used to achieve this.

A trivial solution to this problem would be for the client to just download a copy of the entire Wikipedia database and pick out the particular article it's interested in. With a proper implementation of PIR, the server still needs to scan through the entire encrypted dataset (this is unavoidable, otherwise its I/O patterns would leak information) but the amount of information that needs to be sent to the client is much smaller.


Ah, I understand. I thought it was the usual presented use case to apply an operation to a crypt directly and was confused since the title already stated otherwise.


Unfortunately, it's not that trivial. I tried a few offline wikis with varying success, but haven't had a ton of success using it day to day.


Oh, certainly, I didn't mean to imply that there would be no implementation difficulties -- just that if you're willing to download the entire dataset, the privacy aspect is trivially solved, without requiring any fancy cryptography.


Quite the curious typo you have there.


I wouldn't be surprised if they originally wrote it correctly and a phone's autocorrect "fixed" it for them


Heh, just saw it. I wrote it on a desktop PC but that pretty surely was autocorrect still.


What is the maximum throughput the server can maintain? Or, in other words, how much does it cost per query?


The $35/month server uses 6 cores to answer a single query in ~2.5-3 seconds. So it’s 0.33 QPS :-)

Not high, which is why it might not be working well for folks right now…

Time scales almost perfectly linearly with cores, since the computation is embarrassingly parallel.

In terms of cost, we’re still talking only 18 CPU-seconds and ~300KB of outgoing bandwidth, which is not a ton at today’s prices.


It's embarrassingly parallel... but you also do N times more work than a non-homomorphic system, so that's not saying much!

This doesn't seem like a particularly compelling application - can you give some practical problems that homomorphic encryption solves? I've always heard vote counting as the example.


Yeah, the computational overhead over no-privacy retrieval will always be high. Something we point out in the paper is that this does not necessarily translate to server monetary costs; for example, on AWS, where outgoing bandwidth is expensive, using our system to stream a movie is less than 2x the monetary cost of direct, no-privacy streaming. This is because the computational costs are just very small in comparison to the bandwidth costs.

As far as more compelling applications, several folks have suggested DNS, OCSP, and certificate transparency as potential examples. Using PIR as a building block, it is also possible to build more complex systems, like metadata-private messaging (systems like ‘Pung’) and even private voice calling (systems like ‘Addra’).


Extremely cool. Now we can serve content without any ability to observe what people are being served exactly. I was hoping that someday soon such technology could be used to serve search results and give us a truly private search engine experience.


Theoretically, can this scheme be turned into a generic O(N) key-value retrieval for non-static content (in this example — supporting adding, removing and replacing articles without re-encrypting the whole database and re-sending the client setup data)?


We never encrypt the database. Only the query is encrypted. The client setup data is only dependent on the client key and the size of the database (not the content). Adding and replacing articles can happen whenever the server wants, and clients do not need to "re-sync" or something like that.

For arbitrary key-value retrieval, a hashing scheme would work pretty well, modulo some imbalances that will occur.
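
For illustration, the client-side half of such a hashing scheme could look roughly like this (hypothetical names, not the spiral-rs API): hash the key to a bucket index, PIR-query that bucket, then scan the decrypted bucket locally for the exact key. All buckets are padded to one fixed size, and an over-full bucket is the kind of imbalance mentioned above.

    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    const NUM_BUCKETS: u64 = 1 << 16;

    // Map a lookup key to the bucket index the PIR query will fetch.
    // Client and server just need to agree on the hash; DefaultHasher
    // is fine for a sketch.
    fn bucket_index(key: &str) -> u64 {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        h.finish() % NUM_BUCKETS
    }

    fn main() {
        let idx = bucket_index("Homomorphic encryption");
        println!("PIR query targets bucket {idx}");
        // The client then scans the decrypted bucket locally for the exact
        // key; the server stores each key-value pair in its bucket, with
        // every bucket padded to the same byte length.
    }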


Not able to read the full paper at the moment, and confused about something:

If the server needs to go pull the article from Wikipedia, how is it blind to which one is being requested?

If you've pre-seeded the server with an encrypted 30% of Wikipedia, how can I trust you haven't retained information that would enable you to derive what I requested?

The only way I understand this works is if the client itself seeded the encrypted data in the first place (or at least an encrypted index if all the server pushes back is article numbers).

Maybe I'm ignorant of something; if so thanks for ELI5.


> If you've pre-seeded the server with an encrypted 30% of Wikipedia, how can I trust you haven't retained information that would enable you to derive what I requested?

With homomorphic encryption, the client sends a series of encrypted numbers. Nobody can decrypt them except the client. The server can do arithmetic with them, making new secret numbers that nobody can decrypt except the client.

There is no usable information to retain.

So the question becomes: what can you calculate using arithmetic on secret numbers?

Well, for this demo, treat every article as a number. Then multiply all the articles you don't want by 0, and the article you want by 1, and add them all together.

The server just sees that it's multiplying every article by a secret number. It can't tell what the number is. It can't tell if the output is "encrypted article" or "encrypted 000000..."

Then the server adds them all up. If the client asked for no articles, the result will be "encrypted 000000..." If the client asked for one article, the result will be that article, encrypted. If the client asked for multiple, the result will be a garbled mush of overlapping articles, encrypted. The server can't tell the difference. It just knows it has an encrypted number.


Thank you. I really wish the paper started with exactly what you wrote as the abstract.


Is revealing that you accessed something from the index, vs. nothing at all, a leak of information?


The server can't tell if or which articles you accessed in a query.

If the server has removed an article, then it's possible it would send back a broken response to you, but the server would not be able to tell if that happened. There would only be an information leak if you reacted to the broken response.

If you mean observing that you made a query at all, versus not making a query, sure that's observable.


The server has a snapshot of Wikipedia, sitting in memory. The server is blind to which article is requested because it computes a homomorphic dot product between an encrypted one-hot vector (encrypted under a key that only the client knows) and the total set of articles in plaintext. The encrypted query does not reveal anything about the requested index (ie it is semantically secure).

The 'magic' of homomorphic encryption is that the server is able to take an encrypted one-hot vector of bits, and a vector of plaintext articles, compute a homomorphic dot product, and produce a single ciphertext encoding the single desired plaintext article. This single output ciphertext is crucially also encrypted under the client's secret key, so it reveals nothing to the server.


Thanks. I knew homomorphic encryption lets you do math on encrypted data without knowing its contents. But it hadn't occurred to me you can keep one side of the arithmetic in plaintext. It also helped when gojomo pointed out that every query touches every byte of the full dataset.

So basically (if I've got this right and at the risk of over-simplifying): Client computes a bit mask and encrypts it, sends that to the server, and says "do a big multiplication of this against Wikipedia content". That leaves only the relevant portion in the encrypted result, which only the client can decrypt.

How does the client know which bits in the initial "mask" (aka one-hot vector) to set? Does it have to download a full index of article titles? (So in a sense this is just a big retrieval system that pulls an array element based on its index, not a freeform search tool).


Yes, all of the mapping of article titles to index happens locally.

If the index got too large, we could use a hash to bucket things, but this incurs some waste and would only support exact article title matches.

Yes, it is definitely not a freeform search tool.


Don't you have to pad the output ciphertext size to match the largest article you could possibly request from the set of articles? Or is a fixed-size output an inherent property of homomorphic encryption schemes? Otherwise it seems to reveal something to the server just by the size of the ciphertext (since WP articles vary in size).


The output is always going to be fixed size.

The nicer way to look at homomorphic encryption is like math, but the more literal way is like a bunch of logic gates. There are N output wires. You control the bits on the input wires. Nothing you do will change the number of wires.


Yes, every item needs to be the same size. We batch the articles into 100KB chunks, and we split articles larger than this into multiple chunks.


Wouldn't you then have to send out multiple ciphertexts (for articles >100 KB)? Which would leak something about the size of the article...


You would. It’s important for the client to pace its requests in a way that does not reveal too much (for example, the client should not just request both chunks of the article at the same time). The best thing to do would probably be to have a ‘load more’ button at the bottom of a very long article that makes a separate request.

If you think about it, the pacing of user queries could always reveal something (if there’s a long gap between queries, perhaps it’s a dense mathematics article?). So the best we can hope for is pacing requests either randomly, or for even further protection, perhaps making dummy requests on a fixed cadence.


The server already has a full copy of (its subset of) Wikipedia.

Every query touches every byte of that snapshot, in the same way... but the homomorphic-encryption math distills that down to just the fragment the client requested.

And it does so without giving even the CPU executing that math any hint of what bytes/ranges/topics survive the math. The much-smaller – but still, at every server step, encrypted – result is then sent to the client, which performs the decryption.

It's moon-math magic.


Can this functionality be implemented as a peer-to-peer (or federated) service?

I'm assuming it'll depend on breaking down questions into hierarchical sub-questions that can either be recomposed locally or in another homomorphic context. But can that sort of thing be done without data-leaks, or prohibitively expensive inter-node communication?

Are there any introductory resources (that you know of) on homomorphic encryption and compute that'll turn this into less of a mind-fuck?


NuCypher is doing some really fun stuff in this space: https://www.nucypher.com

Basically, you upload encrypted blobs to a P2P network, and you can issue special proxy-re-encryption keys to people that allows them to download the encrypted content, without revealing any of the content to people running the nodes (who store and replicate the data). You can also do really interesting things like revoking keys to remove access for people who haven't downloaded the blob yet.


It would be very cool to federate this! I am particularly interested in applications to DHTs/Kademlia, especially for things like IPFS.

Since this has only recently become practical, there is a bit of a dearth of resources on it at the moment. I’m going to try to do my part and write a blog post at some point.


I understand that you do some kind of dot product (with two steps, Regev and GSW). However, it looks to me that those steps involve fixed dimension vectors.

- How do you handle variable length data? Do you need to pad it?

- What is the memory overhead of the storage of encrypted data?

I think that at least for video data, the streaming scheme "leaks" the size of the encrypted data with the number of streaming packets.


Yeah, every record needs to be the same size. For the demo, we batch the articles into 100KB chunks. We do a good job packing, where we put many small articles into a 100KB chunk, and we split large articles into many 100KB chunks. This packing is pretty efficient; roughly 90% of the space is used, so only about 10% is wasted.
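
For anyone curious, a simplified sketch of that packing step might look like the following (hypothetical code; the real encoder in spiral-rs is more careful about minimizing waste): split oversized articles into 100 KB pieces, greedily pack small articles together, and pad every chunk to exactly 100 KB.

    const CHUNK: usize = 100 * 1024; // 100 KB records, all the same size

    // Hypothetical packer: returns fixed-size chunks, splitting large
    // articles across several chunks and packing small ones together
    // (greedy, in arrival order).
    fn pack(articles: &[Vec<u8>]) -> Vec<Vec<u8>> {
        let mut chunks: Vec<Vec<u8>> = Vec::new();
        let mut current: Vec<u8> = Vec::with_capacity(CHUNK);
        for art in articles {
            for piece in art.chunks(CHUNK) {
                if current.len() + piece.len() > CHUNK {
                    current.resize(CHUNK, 0); // pad to exactly 100 KB
                    chunks.push(std::mem::take(&mut current));
                }
                current.extend_from_slice(piece);
            }
        }
        if !current.is_empty() {
            current.resize(CHUNK, 0);
            chunks.push(current);
        }
        chunks
    }

    fn main() {
        let articles = vec![vec![1u8; 250_000], vec![2u8; 40_000], vec![3u8; 70_000]];
        let chunks = pack(&articles);
        assert!(chunks.iter().all(|c| c.len() == CHUNK));
        println!("{} articles packed into {} chunks", articles.len(), chunks.len());
    }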

The memory overhead is significant but not prohibitive… we have to keep something like a 5-8x larger encoded database in memory, but this overhead is from encoding the plaintext in a special format to allow a fast dot product, not from inefficiency in the packing.


Is it not possible to determine which article(s) the user downloaded based on the memory locations read? Of course, multiple small articles within the same 100KB chunk couldn't be told apart, but for any medium to large article, you'd be able to make a good guess (if there are a handful of articles there) or an exact match (if there is <=1 article in that chunk), no?

Or does the server go through a large chunk of its memory (say, at least a quarter of all of Wikipedia) and perform some oblivious computation on all of that data (applying the result modulo this 100KB return buffer)? That sounds very resource-intensive, at least for something large like Wikipedia (a doctor's office with some information pages of a few KB each could more easily do such a thing).

In the latter case, is each request unique (does it involve some sort of IV that the client can xor out of the data again) or could an index be built similar to a list of hashed PIN codes mapped back to plain text numbers?

Edit: I had already read some comments but just two comments further would have been my answer... :) https://news.ycombinator.com/item?id=31669630

> > With a proper implementation of PIR, the server still needs to scan through the entire encrypted dataset (this is unavoidable, otherwise its I/O patterns would leak information)

Now let's see if I can also find the answer regarding the ability to build a lookup table...

Edit #2: found that also! https://news.ycombinator.com/item?id=31669924

> One query for one item in the database is indistinguishable (without the client’s key) from another query for the same item later; in other words, it’s similar to something like the guarantee of CBC or GCM modes, where as long as you use it correctly, it is secure even if the attacker can see many encryptions of its choosing.

That is some cool stuff indeed. I'm going to have to up my game when building or reviewing privacy-aware applications. Sure, a file sharing service is not going to practically allow this, but I'm sure that with this knowledge, I will come across places where it makes sense from both a usefulness (e.g. medical info) and practicality (data set size) perspective.


> Sure, a file sharing service is not going to practically allow this

Well, as the author points out here [0], it doesn't actually translate to exorbitant cost increases when almost all the cost is in bandwidth rather than compute. A file sharing service rather seems like an ideal example (assuming you can derive additional revenue from the privacy promises).

[0]: https://news.ycombinator.com/item?id=31673122


I'd be interested in the calculation there. How many movies does the server store in the first place? Computing over all of Netflix (just their movies, not even the series) for every request, even if you could then obtain the full movie with just one request, seems almost prohibitive. I'm also assuming the author counted on the server being busy the whole time (on-demand pricing) and not having to have spare capacity idling for when multiple requests come in at the same time.

For me, compute has always been by far my biggest expense, and that's without homomorphic encryption! Buying a server or the monthly rental (e.g. VPS) cost is where most of my budget goes. Next on the expense list are a few domain names, and probably about as much in electricity if we're considering the at-home scenario. Bandwidth is usually included, be it with the VPS or with the server I run at home (since I want that internet uplink for normal use anyway).


If you say a malicious server can't determine which article was retrieved, is that private information retrieval (PIR)? Something must be different here. I thought there was a theorem that for single-server PIR to work, the client has to download the entire DB, which is the right way to read Wikipedia privately anyway.


This is PIR. You do have to download the whole database for information-theoretic security, but not for computational security. If you assume the hardness of some problem (in this case, lattices, but it is also possible from RSA, ECC, etc) it is possible to do much better than simply downloading the entire database.


Does the server have to scan the whole database on every query? If not, doesn't the disk access pattern tell you what the query was? I had thought you had to download the whole DB even for computational PIR, but hmm, maybe not in some cases where there is only 1 client with a secret.


Yes, the whole database is scanned to answer a single query.


Do you have a blog or Twitter? I'd like to keep up with any other cool projects you're working on!


Not at the moment, but will probably make a blog post or something to explain how it all works at a slightly higher level than the paper.

When it’s done it’ll be at https://samirmenon.com/.


I wonder if this can be done on sqlite?

http://static.wiki/

See the previous news article. https://news.ycombinator.com/item?id=28012829


You mean distributing 43GB of data to everyone rather than uploading a few megabytes of data one time and getting 250KB answer chunks? I'm not sure it compares.


If you're going the sqlite route, the much easier path is to distribute the whole database and do everything on the client


Fantastic project.

Have you considered running (# of cpus) parallel scanners continuously? An inbound query “hops on” the least-loaded scanner; at each article/chunk the scanner runs all the queries; each query “hops off” and returns after it has completed the cycle through the entire DB.


I do indeed use all 6 cores of my server simultaneously, and the caching on the database part of memory is quite effective. If I had a little spare cash I might just purchase some more CPUs :-)


Well, but then you get into the security space and license your server DB technology for shit like IoT lights. I don't want the company knowing if my lights are on or off, but if they had a homomorphically encrypted backend and app, I might trust it.


To get press and earn legitimacy, make a hacking challenge: put up pcaps and a $1 million prize if anyone can break it.


This is wonderful! I've never seen anything like this in practical form.

I hope it doesn't become standard practice for general websites (as I imagine some would like to see), but it's an amazing tool and there will probably be many wonderful uses.


This kind of stuff gives some of the best arguments for open source software (OSS) to date. Otherwise, it has to be taken completely on faith, which then defeats nearly the entire purpose and makes the performance overhead untenable.


Homomorphic encryption actually causes some interesting asymmetries here. It is very important to have an open source client, since of course a malicious client could trivially leak your desired index. However, the server’s operation is totally untrusted… so it’s actually not important for the server to be open source. It is still nice for all the regular reasons open source is good, but interesting to note.


> As a real product, you’d probably want to distribute a signed client executable (or Electron app) since otherwise, a malicious server could simply deliver bad client JS on the fly.

Arguably, a malicious server could deliver a bad executable too.


Yeah, I mean you generally have to go with trust on first use at some point. You can also do code signing, check hashes, build from source, compare multiple sources, etc. All the standard software supply chain security measures.


Idea: Apply this to personalized advertising. Client sends his interests + habits + personal info encrypted to the server. Server finds and sends back to client the best ad based on clients info.


Can anyone recommend an explanation of this concept geared towards people with only a superficial knowledge of encryption?

This seems to be some kind of search applied on an encrypted dataset, is that right?


It's like, I send the server an encrypted math problem. The server has no idea what the math problem is, but homomorphic encryption allows it to compute an (encrypted) result and send that back to me. I get the result and decrypt it for the correct answer. It's novel because you don't have to trust the server with your math problems.


A fairly laymans explanation was posted elsewhere in the thread: https://news.ycombinator.com/item?id=31671914

Quoting it here for convenience:

> With homomorphic encryption, the client sends a series of encrypted numbers. Nobody can decrypt them except the client. The server can do arithmetic with them, making new secret numbers that nobody can decrypt except the client.

> There is no usable information to retain.

> So the question becomes: what can you calculate using arithmetic on secret numbers?

> Well, for this demo, treat every article as a number. Then multiply all the articles you don't want by 0, and the article you want by 1, and add them all together.

> The server just sees that it's multiplying every article by a secret number. It can't tell what the number is. It can't tell if the output is "encrypted article" or "encrypted 000000..."

> Then the server adds them all up. If the client asked for no articles, the result will be "encrypted 000000..." If the client asked for one article, the result will be that article, encrypted. If the client asked for multiple, the result will be a garbled mush of overlapping articles, encrypted. The server can't tell the difference. It just knows it has an encrypted number.

If you found the explanation useful, you can upvote the original comment linked above


For homomorphic encryption in general I wrote this blog post to make the idea accessible:

http://blog.tyrannyofthemouse.com/2013/05/i-added-your-numbe...


I find it deeply ironic that the page is http not https!


Very nice! Great against snoopers that lack authority, but when they do have some authority (bosses, government), the lack of plausible deniability can mean it does more harm than good.


Can you explain this comment more? Genuinely asking as I don’t understand the implication/downside?


The fact that you are evading surveillance can itself get you in trouble, and on top of that, false conclusions about what you were doing can be drawn. Say you were browsing academic content from Iran: you get in trouble for using the tool, but also for supposedly browsing anti-Islam content, which you can't disprove, since your accusers can make arbitrary false claims by correlating things about you with your evasive actions. Plausible deniability means generating forensic evidence that serves as a cover. In my example, that would be generating traffic that decrypts as harmless content, or using stego to hide the real content inside cover content, and bundling the software as a feature of some other browser or extension that has other purposes, so you can say "I just installed a PDF converter extension, I didn't know it let me read Wikipedia with homomorphic crypto as well".


If someone wants to make false accusations, they can do that now. I can see authorities trying to prevent adoption of homomorphic encryption, but once it's widely adopted I don't think merely using it could be held against someone. Similarly, regular encryption isn't _really_ penalised now that it's widely adopted.


This sounds like the ultimate anti-user profiling and targeted advertising solution. I hope google and other advertising giants can’t stop this. Thoughts?


[deleted]


This is very cool OP! I interviewed to be a privacy engineer with Wikimedia a while back.

I suggested that my goal would be to add a v3 onion service. They had actually listed years of "homomorphic encryption" experience as a requirement. I phoned up the recruiter and basically said it's OK if there is a personality conflict, but the role as written was impossible to fill, and it scared me that very good suggestions for privacy, as well as for the health of the Tor network, were discarded.

(If you set up a dot onion, that frees up traffic on exit nodes, whose capacity is limited.)

Big thanks to the OP for being willing to share this work, it's very cool and I'm about to read your eprint.

I'm excited about the potential of homomorphic encryption, though I worry about things like CPU cost -- I recall when folks had to really be nudged not to encrypt huge blocks of data with PGP, but instead use it to encrypt the passphrase to a Truecrypt volume using a symmetric cipher like AES.

I'd love to know how we got to a point where Twitter added an onion service (then banned me), but Wikipedia continues to not even support MFA for logins -- I recently registered an account intending to eventually upload some art to the commons, but the perpetual refusal to allow folks to make healthy choices disturbs me.

In fact, after reading articles like these[1][2], it makes me question the integrity of the folks I interacted with during the interview process.

On my end, it was especially disturbing since prior to enrolling in my PhD, the alternative path I discussed was becoming an FBI agent focused on counter intelligence in the "cyber" realm.

The agent I spoke with told me I'd serve "at the needs of the bureau," which would probably mean not using my computer skills (which would then languish), and that after a couple of years I might still not get my desired position. He gave me a card, which I eventually lost.

Years later, prior to the insurrection, I had to walk down to Carnegie Mellon and ask if anyone had his contact information, and was shocked that folks refused to even point me at a link to the lecture, which had been listed as open to the public.

I'm someone who reads Wikipedia rather than edits it, but the vast majority of users are readers, not editors. This perpetual pattern of refusing to enable privacy-enhancing technologies, paired with using privileged access to make hiring decisions against folks who lack the physical ability to make good privacy decisions, offended me on a deep, personal level, and is why I often post in a brash, erratic manner.

Because I see zero incentive to stay silent -- if I'm quiet, people will slowly drain my bank account.

If I post, there is a chance someone will see what I say, notice my skills, and offer full time employment. So I have to continue risking offending folks until I find a full time job, which I have not had since I left the Center for Democracy and Technology under duress following a series of electronic and physical attacks, paired with threats and harassment by staffers in the organization.

TL;DR: Great research, but I hope they also add an onion service rather than jump straight to using this :-)

[1] https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@list...

[2] https://slate.com/technology/2021/10/wikipedia-mainland-chin...


I tried, but I simply can't follow your train of thought. You keep going back and forth between criticizing Wikimedia hiring and technology choices, advertising yourself, and deliberating over onion services. And it all seems extremely tangential to the article (which is really not about the future of Wikipedia, or Tor, or your career).


My bad!

I worry they have insider threat issues that remain unsolved.

I hope they add an onion service.

I think the tech is cool, but issues about untested code aside, I worry about CPU overhead.

(However, on the last point, I suspect that, much like earlier worries about the CPU overhead of encryption, it will become less of an issue over time.)

I often feel like I have to speak exhaustively and at length to get my point across, but that may be a side effect of a large chunk of my professional network being K Streeters - they like to misunderstand on purpose then complain you explained things at length.

Is the above better? I can skip marketing myself if that's the issue - I just notice a persistent issue that folks say they need certain skills, I know I have them, but folks disbelieve me. Short of being arrested for a CFAA violation I'm not sure how to prove it to those types at this point, and I don't intend on doing that, LOL.

If you actually care, it's painful to see people destroy the things you care about.


How does the server select the article in a way that we can be sure they don't record the article sent back? Are the articles encrypted on the server too?


Yes, in fact the article number is not even decrypted by the server, so the server doesn't know which article you asked for!


How is this not vulnerable to side-channel attacks like disk-access patterns?

Could I, as a malicious server, request myself a target article and correlate that with legitimate user requests?


> With a proper implementation of PIR, the server still needs to scan through the entire encrypted dataset (this is unavoidable, otherwise its I/O patterns would leak information)

https://news.ycombinator.com/item?id=31669370 (not original poster)



