Machine learning without centralized training data (googleblog.com)
568 points by nealmueller on April 6, 2017 | hide | past | favorite | 99 comments



This is one of those announcements that seems unremarkable on read-through but could be industry-changing in a decade. The driving force behind consolidation & monopoly in the tech industry is that bigger firms with more data have an advantage over smaller firms because they can deliver features (often using machine learning) that users want and that small startups or individuals simply cannot implement. This, in theory, provides a way for users to maintain control of their data while granting permission for machine-learning algorithms to inspect it and "phone home" with an improved model, without revealing the individual data. Couple it with a P2P protocol and a good on-device UI platform and you could in theory construct something similar to the WWW, with data stored locally, but with all the convenience features of centralized cloud-based servers.


I think you could be spot on: there are new applications emerging in deep learning, like self-driving vehicles, where you have a powerful need for mountains of data to train complex models, yet a logistics problem in how to aggregate that data in a single place (I'm making this up, but imagine a car collecting 7 streams of 4K video at 60 fps). I really see a growing need for these types of distributed training models.


I'm a bit confused. Doesn't the data need to be cleaned up, annotated, or labelled before it can be used to train? How would this work if the data doesn't leave the local device?


It sounded like the label for training is the user's action (whether they chose the suggestion provided or not). Then they collect the model updates in bulk from millions of users, which is far more useful for training purposes than a small, very well curated data set.


It did mention a local version of mini TensorFlow, so I would guess the data is cleaned up on local devices and only updates are sent back to the cloud.


There is a branch called unsupervised learning, where no labeling is needed.


There is, but supervised and unsupervised learning mostly don't solve the same problems, so this doesn't necessarily help.


As I understand what I read, the data does leave, but just the user's updates - sort of like only your changes to a git branch are sent when you push.


Not knowing enough about this, my first impression is:

How does Federated Learning cope with ... FAKE NEWS! (for example).


It is mostly applicable to taking data sourced at an endpoint (e.g. a mobile phone) and running what is essentially a refinement of the learning.

A key component is that to analyze the data in the cloud for the same refinement would mean sending the data to the cloud, which the user may not want, and may also be bandwidth intensive.

For fake news, the data is already in the cloud, being pushed down to the users device. A user could mark something as 'fake' (via explicit action, or possibly inaction), and that 'mark' is uploaded and the data is analyzed by central compute for refinement. To be clear, that is NOT what this paper is about, I am saying fake news would be a bad use, because the data is not being sourced by the user, only viewed and possibly marked.

A better example than the gboard gesture learning refinements might be in the form of an app that acts as a dashcam (leaving aside the kludginess of that). The user could mark things like proper recognition of street signs, traffic lights, or brands/models of vehicles. The phone would then analyze those classifications, compute a diff to the model, and send just that diff to the cloud (summarizing a lot). Multiply that by 1,000,000 users, and now you have a refined model that did not require sending 100,000,000,000 images to the cloud for analysis.
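
Roughly like this toy sketch (my own illustration, not anything from the paper - the function names and tiny linear model are made up just to show the shape of the idea): the phone starts from the shared weights, trains on its local examples, and uploads only the weight diff, never the images.

    import numpy as np

    def local_update(global_weights, local_examples, local_labels, lr=0.01):
        # hypothetical client-side step: refine a copy of the shared model
        w = global_weights.copy()
        for x, y in zip(local_examples, local_labels):
            grad = (x.dot(w) - y) * x   # stand-in gradient for a tiny linear model
            w -= lr * grad
        # only this small diff leaves the device, not the raw examples
        return w - global_weights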


For federated learning to truly reduce incumbent advantage, won't newcomers still have to develop a platform attractive enough to entice device users to download their model and consent to data inspection? Or is the idea that Google or another large provider will be the point of entry for various algorithms to connect to the datastream, as a sort of cloud service? How many models can co-exist on one device?


Which is why I am surprised to see this come from Google. Everyone has already admitted they are fine with sending all of their data to them, which benefits Google greatly.


I think this may be a Xerox Alto, IBM PC, or Sun Java moment. In the short term I can see a clear benefit to Google for this. They want to get machine-learning into more aspects of the Android mobile experience, Android customers are justifiably paranoid of sending things like every keystroke on the device back to Google's cloud, and so this gives them a privacy-acceptable way to deliver the features that will make them more competitive in a new market. Remember that the vast majority of Google employees honestly want to do what's best for the user, not what preserves Google's monopoly.


The vast majority of people in any organisation are good people who want what's best in the sense of the greater good - that does not prevent organisations from doing bad things.


>> The vast majority of people in any organisation are good people who want what's best in the sense of the greater good

No. The vast majority of people in any organisation are timid folks who want what's best in the sense of the greater good, unless the greater good involves any courage on their part. These people are congenial, but don't confuse that with being good people.


I have never heard it put so pithily, nice.


I don't hold the leadership at Google with the same contempt as I hold the leadership at say Verizon, GE, Ford, or News Corp (WSJ). I'm sure given enough time all corporations are categorically evil but Google isn't there yet.


Oh god. Has YouTube drama escaped to Hacker News?


Google's internal training emphasizes doing the right thing and competing fairly%, going so far as to avoid terms in PR or even internal email such as 'crush the competition', 'dominate', or 'destroy', and always doing what's good for the user rather than what's bad for the competition.

% and often mentions competition/monopoly laws


There's nothing altruistic about that. Emails with those words will cause problems for the legal department when they come up in discovery during antitrust litigation.


Wittgenstein disagrees.


Google forcing their employees to go through training to avoid bribery, sexual harassment, and antitrust problems for the company is not due to anything other than saving the company money. To be pedantic, the disagreement with GGP is not whether the actions are altruistic but whether the actions were done out of altruism.


It's easy being altruistic when you're the clear leader and have a comfortable margin. Doesn't give me any comfort knowing how benevolent Google is presently.


True, but there are some kinds of data that people are still uncomfortable sending to Google. Medical data is considered especially private, and aggregation of medical data is a huge obstacle to improving treatment using ML. I think this could be really huge in that space.


There is a movement to change consent forms (which patients sign as they enter a medical system) to permit broader sharing of medical data outside of its direct use. I.e., you are offered an opt-in to permit your data to be analyzed beyond an individual visit, possibly for medical issues completely unrelated to any immediate medical problem. The consent forms are transparent and opt-in: the health consumer is informed what their data will be used for, and they explicitly have to say it's OK (blanket consent, which can be revoked).

I think this is a win, because the consumer has the choice, and if enough people do it, the resulting aggregated datasets will have exceptional power to help solve global medical problems.


I'd say the reason they want your data is to make money; if they can do so without sending your data to their datacenters, why not?

If a government wants more power, it makes sense to be able to read and store the data in its own datacenters.


Isn't training a very computation-intensive activity?

Maybe Google could be interested in moving some of the processing load to users' devices.


This maybe also dovetails with the DeepMind (?) or OpenAI (?) article yesterday talking about adapting and updating general models with limited data for significantly different tasks.


Which has already been the case for something like 10 years. That is why no startup ever challenges Google on search. It is basically the Matthew effect, but at web scale.


I'm not too sure. There seems to be a spot for more niche search engines on the market. Take DuckDuckGo for example: it doesn't have that much market share, but it's a good option for people who care a lot about their privacy.

I would also consider WolframAlpha to be in a similar category as Google in the sense that they're "answer machines". I think there's a market opening for a search engine with the wisdom of Google (knowing what you're looking for) combined with the intelligence of WolframAlpha (being able to draw curves that look like Darth Vader).

For example if you search "Darth Vader curve" in WA, you'll get this: http://m.wolframalpha.com/input/?i=Darth+Vader+curve&x=0&y=0

But if you can't remember exactly what it's called and search "Darth Vader graph" you'll just get redirected to the page for Darth Vader. Ideally, a WA would know what I'm looking for ("Darth Vader curve") and show it to me instead.


In a decade? Lol. This is being used today, right now, in many different forms. Trust me on that. Especially in the security field. The idea that Google just magically discovered this is cray cray. That being said, their contributions are hugely welcomed.


Their papers mentioned in the article:

Federated Learning: Strategies for Improving Communication Efficiency (2016) https://arxiv.org/abs/1610.05492

Federated Optimization: Distributed Machine Learning for On-Device Intelligence (2016) https://arxiv.org/abs/1610.02527

Communication-Efficient Learning of Deep Networks from Decentralized Data (2017) https://arxiv.org/abs/1602.05629

Practical Secure Aggregation for Privacy Preserving Machine Learning (2017) http://eprint.iacr.org/2017/281


Reminds me of a talk I saw by Stephen Boyd from Stanford a few years ago: https://www.youtube.com/watch?v=wqy-og_7SLs

(Slides only here: https://www.slideshare.net/0xdata/h2o-world-consensus-optimi...)

At that time I was working at a healthcare startup, and the ramifications of consensus algorithms blew my mind, especially given the constraints of HIPAA. This could be massive within the medical space, being able to train an algorithm with data from everyone, while still preserving privacy.


I think the distinction here between "handing over your data" and "letting a model train on the data on your device" may be more subtle than you might think. There is still no guarantee of privacy - it is trivial to construct objective functions which probe data from your device.


I just skimmed their secure aggregation paper (linked in the post), and while I'm no expert, I believe they can actually guarantee privacy. At least for the strong version they describe (there's also a weak one which requires trust in the server).

Edit, link to the paper: http://eprint.iacr.org/2017/281


The paper: https://arxiv.org/pdf/1602.05629.pdf

The key algorithmic detail: it seems they have each device perform multiple batch updates to the model, and then average all the multi-batch updates. "That is, each client locally takes one step of gradient descent on the current model using its local data, and the server then takes a weighted average of the resulting models. Once the algorithm is written this way, we can add more computation to each client by iterating the local update. "

They do some sensible things with model initialization to make sure weight update averaging works, and show in practice this way of doing things requires less communication and gets to the goal faster than a more naive approach. It seems like a fairly straightforward idea from the baseline SGD, so the contribution is mostly in actually doing it.
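
To make that concrete, here is a minimal FederatedAveraging-style sketch (my reading of the paper, not their code; the toy linear model and the numbers are placeholders): each client runs several local SGD steps from the same starting weights, and the server takes a data-size-weighted average of the resulting models.

    import numpy as np

    def client_update(weights, data, labels, lr=0.05, local_epochs=5):
        w = weights.copy()
        for _ in range(local_epochs):          # multiple local steps per round
            for x, y in zip(data, labels):
                grad = (x.dot(w) - y) * x      # toy squared-error gradient
                w -= lr * grad
        return w, len(data)

    def server_round(global_w, clients):
        results = [client_update(global_w, d, l) for d, l in clients]
        total = sum(n for _, n in results)
        # weighted average of the client models, weighted by local data size
        return sum(w * (n / total) for w, n in results)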


I was told a long time ago that a cluster of raspberry pis would be useless for machine learning due to processing power and I/O constraints.

This paper seems to suggest that this parallelization might actually be feasible. Would you agree?


"Federated Learning enables mobile phones to collaboratively learn a shared prediction model while keeping all the training data on device, decoupling the ability to do machine learning from the need to store the data in the cloud."

So I assume this would help with privacy in the sense that you can train a model on user data without transmitting it to the server. Is this in any way similar to something Apple calls 'Differential Privacy' [0]?

"The key idea is to use the powerful processors in modern mobile devices to compute higher quality updates than simple gradient steps."

"Careful scheduling ensures training happens only when the device is idle, plugged in, and on a free wireless connection, so there is no impact on the phone's performance."

It's crazy what the phones of the near future will be doing while 'idle'.

------------------------

[0] https://www.wired.com/2016/06/apples-differential-privacy-co...


While I think you can definitely draw some parallels, differential privacy seems more targeted at metric collection. You have to be able to mutate the data in a way that it becomes non-identifying, without corrupting the answer in aggregate. Apple would still do all their training in the cloud.

In contrast, what Google's proposing is more like distributed training. In regular SGD, you'd iterate over a bunch of tiny batches, sequentially through your whole training set. Sounds like Google's saying each device becomes its own mini-batch, it beams up the result, and Google averages them all out in a smart way (I didn't read the paper, but this was the gist I got from the article).

Both ideas are in the same spirit, just the implementations are very different.


Differential Privacy is much more than what Apple's PR department says; differentially private SGD is already a thing.


Well, forget Apple for a moment (that was just an example, since parent asked about them specifically): my point was what Google's describing is separate from differential privacy. There's no controlled noise or randomness being applied.

They even say at the end of the paper: "While federated learning offers many practical privacy benefits, providing stronger guarantees via differential privacy, secure multi-party computation, or their combination is an interesting direction for future work." So, the "practical privacy benefits" here is referring to the dimensionality reduction from running the raw data thru the LSTM.


This is different from differential privacy (which, btw, isn't just an Apple thing). Differential privacy essentially says some responses will be lies, but that we can still get truthful aggregate information. The canonical example is the following process: flip a coin; if it's heads, tell me truthfully whether you're a communist. If it's tails, flip another coin, and if that one comes up heads, tell me you're a communist; if it's tails, tell me you're not.

From one run, you can't tell if any individual is telling the truth, but you can still estimate the number of communists from the aggregate responses.
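
A tiny sketch of that coin-flip process (hypothetical code, just to show how the aggregate is recovered): each answer is individually deniable, but since P(yes) = 0.5*p + 0.25, the true rate p can be estimated from the mean response.

    import random

    def respond(is_communist):
        if random.random() < 0.5:        # first coin heads: answer truthfully
            return is_communist
        return random.random() < 0.5     # tails: second coin decides the answer

    def estimate_rate(responses):
        mean = sum(responses) / len(responses)
        return 2 * (mean - 0.25)         # invert P(yes) = 0.5*p + 0.25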

This is doing local model training, and sending the model updates, instead of the raw data that would usually be used for training.


Chrome used differential privacy far before Apple. See the RAPPOR paper.


Here's google doing this in November 2015:

http://download.tensorflow.org/paper/whitepaper2015.pdf


This is fascinating, and makes a lot of sense. There aren't too many companies in the world that could pull something like this off.. amazing work.

Counterpoint: perhaps they don't need your data if they already have the model that describes you!

If the data is like oil, but the algorithm is like gold... then they can still extract the gold without extracting the oil. You're still giving it away in exchange for the use of their service.

For that matter, run the model in reverse, and while you might not get the exact data... we've seen that machine learning has the ability to generate something that simulates the original input...


Laugh out loud. This was the premise of what we were doing 24/7 last year. I really really doubt we were the only ones doing this. Anytime you have highly valuable data that you can't share specifically but want to share the aggregated ML results of, this is how you do it.


I see. My understanding from the article was that it was a novel approach to do the ML on the device but then efficiently transmit and somehow combine the results together into a larger model.

I'm not an expert on ML (clearly). What format does the model actually take when you share it? Is it raw data (like weights for neurons or something) + the configuration of the algorithms?

I understand deanonymizing data and sharing aggregated results, but I thought actually sharing that data by essentially encoding it into an algorithm and then sharing that algorithm is quite different.


Pretty much. It's batch learning; they train small batches on the phones and upload the differences. It looks like they have a fancy way to keep the data compressed, train during ideal times, and encrypt the updates until they get averaged into the model.

I'm not sure this is ready for decentralized P2P yet, but I would love to see someone working towards that.


This is quite amazing, beyond the homomorphic privacy implications being executed at scale in production -- they're also finding a way to harness billions of phones to do training on all kinds of data. They don't need to pay for huge data centers when they can get users to do it for them. They also can get data that might otherwise have never left the phone in light of encryption trends.


I'm not understanding this as homomorphic privacy.

They take pains to say:

> your device downloads the current model, improves it by learning from data on your phone, and then summarizes the changes as a small focused update. Only this update to the model is sent to the cloud, using encrypted communication, where it is immediately averaged with other user updates to improve the shared model. All the training data remains on your device, and no individual updates are stored in the cloud.

(my emphasis on the word stored)

Now there are lots of scholarly articles on reverse-engineering and rule-extraction from neural nets.

So Google, having the diff, can actually get some idea of what it is you are trying to teach the net.

They just promise not to.


>They just promise not to.

Google makes the OS and the keyboard. If they wanted to run a keylogger on every device against the express wish of users they could.

So I think the more important question is if someone else could steal or "legally" request that data from Google and recover my keystrokes.


In the secure aggregation paper (linked in the post: http://eprint.iacr.org/2017/281 ) they indeed hint at the possibility of extracting information from the diffs. So the protocol they propose cryptographically ensures that the individual diffs cannot be learned. They do this by having each pair of clients generate a secret the size of the diff, and having one add it to their diff and the other subtract it. The clients then send the results to the server, where they are summed, which cancels out all the secrets. The bulk of the protocol then deals with setting up the secrets and dealing with client dropouts, which appears to be the real challenge.

Well, that's my fallible summary anyway, go read the paper. :-)
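
To make the cancellation trick concrete, here's a toy sketch of the pairwise-masking idea (my own illustration, not the paper's protocol - it ignores dropouts and key agreement, which are the hard parts the paper actually solves):

    import numpy as np

    def mask_updates(updates, seed=0):
        rng = np.random.default_rng(seed)
        masked = [u.astype(float).copy() for u in updates]
        for i in range(len(updates)):
            for j in range(i + 1, len(updates)):
                secret = rng.normal(size=updates[i].shape)
                masked[i] += secret    # client i adds the pairwise secret
                masked[j] -= secret    # client j subtracts the same secret
        return masked                  # each masked diff looks random on its own

    # The server only ever sees the sum, in which every secret cancels:
    # np.allclose(sum(mask_updates(diffs)), sum(d.astype(float) for d in diffs))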


And having worked at Google, I trust them on that.


They say they have a technique called Secure Aggregation as well, so that an update can only be decrypted once many users have sent their updates. I don't know how possible that would be to reverse engineer.

Plus, this would seem a computationally expensive way of conducting mass surveillance.


This is speculative, but it seems like the privacy aspect is oversold as it may be possible to reverse engineer the input data from the model updates. The point is that the model updates themselves are specific to each user.


Well, you obviously can't fully reverse engineer it, since whatever model update is being sent is far, far smaller than the overall data. Now, could you theoretically extract "some" data? Maybe, but it is still strictly better than sending all of the data.


The 'sentiment neuron' two posts over gives an indication of what this could look like. 'Oh, I see you're giving a strong positive update to the porn neuron...'

More generally, there's the notion that a sufficiently complex model encodes the training data. There's been work on extracting training data (or highly deformed versions of it) from a fully trained neural network. We should be under no illusions that networks offer cryptographically strong protections to their memories. It's simply not a design goal.


Yep. Basically you just reverse the network and predict the source - if not done right.


No, generally models are higher dimensional than their inputs (i.e., deep learning models generally have more parameters than input features).


Yes! You can reverse engineer it if you're not careful. You have to make sure you have a reasonably sufficient baseline that you're training locally on. Otherwise your delta update is extremely revealing. Even then, you have to be cautious, because if not done correctly you can extract the delta.


I would not be surprised if the specific contents of what you wrote was not accessible but I would expect that the general theme would be captured. The weights that are updated would correspond to neurons that would then correspond to specific themes/subjects.


This is an amazing development. Google is in a unique position to run this on truly massive scale.

Reading this, I couldn't shake the feeling that I heard all of this somewhere before in a work of fiction.

Then I remembered - here's the relevant clip from "Ex Machina":

https://youtu.be/39MdwJhp4Xc


While a neat architectural improvement, the cynic in me thinks this is a fig leaf for the voracious inhalation of your digital life they're already doing.


Particularly relevant: "Federated Learning allows for smarter models ... all while ensuring privacy". Reading the paper, Google would still receive model updates, so this statement seems based on assumption that you can't learn anything meaningful about me based on those higher level features (which are far reduced in dimensionality from the raw data). I'm curious how they back up that argument.


At least one person is thinking straight. This is basically a blanket license to learn from everything you do, even if you do it in third party apps. Right now those apps are opaque, and Google can't see what you're doing there. As an ad company (which is what they are, first and foremost) this pisses them off pretty bad. They would like to know which brands you buy on Amazon, what you like on FB and Pinterest, who your friends are, etc. And they want to tie it all to your advertising GUID, preferably across devices and into the real world. The info doesn't even have to be in their cloud, as long as they're the only ones with access to it. It's pretty cool, in a way. It also ensures I'll never buy an Android phone.


Even if this only allowed device-based training and offered no privacy advantages, it's exciting as a form of compression. Rather than sucking up device upload bandwidth, you keep the data local and send the tiny model weight delta!


It would probably train while the phone is charging and upload while using WiFi, so, no problem.


Tangentially related to this - Numerai is a crowdsourced hedge fund that uses structure-preserving encryption to be able to distribute its data, while at the same time ensuring that it can be mined.

https://medium.com/numerai/encrypted-data-for-efficient-mark...

Why did they not build something like this? I'm kind of concerned that my private keyboard data is being distributed without security. The secure aggregation protocol doesn't seem to be doing anything like this.


This is literally non-stochastic gradient descent where the batch update simply comes from a single node and a correlated set of examples. Nothing mind-blowing about it.


For the bits described in https://arxiv.org/abs/1610.02527 you're essentially correct. Though it's still stochastic, and you can have mini-batching on each node.

The interesting technical bits are in https://arxiv.org/abs/1610.05492

To save on update bandwidth, they either restrict the gradient to a lower dimensional space, or compress by quantizing the full gradient (which should effectively add zero-mean noise) before sending it back. (In theory they could do both of these, but they didn't actually test that.)
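
For flavour, here is a toy version of unbiased stochastic quantization (not their exact scheme, just an illustration of the "zero-mean noise" point): each value is rounded up or down at random with probability given by its fractional position, so the expected value of the dequantized gradient equals the original.

    import numpy as np

    def stochastic_quantize(grad, levels=16):
        lo, hi = grad.min(), grad.max()
        scale = (hi - lo) / (levels - 1) or 1.0   # guard against a constant gradient
        x = (grad - lo) / scale
        floor = np.floor(x)
        round_up = np.random.random(grad.shape) < (x - floor)
        q = (floor + round_up).astype(np.uint8)   # small ints instead of floats
        return q, lo, scale                       # ship q plus two scalars

    def dequantize(q, lo, scale):
        return q * scale + lo                     # unbiased estimate of grad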


Just because you shuffle the examples on a single phone/user doesn't make it stochastic.

The entire point of using stochasticity (i.e., random shuffling) is to avoid similar and/or same-ordered runs of examples from redirecting the hill climbing in a globally non-optimal direction all at once.

A single user's examples will be very similar, so you can shuffle all the examples from one user you want - that doesn't make it truly stochastic in the context of gradient descent optimization.

The quantization / compression part is pretty cool though. I suppose that could obfuscate slightly what the original example was for privacy purposes? Seems like you'd lose on accuracy though.


Where is the security model in this? What stops malicious attackers from uploading updates that are constructed to destroy the model?


No different from a centralized approach?

When the rail company in Sweden first offered voice bookings people would hoax it by calling and saying "I want to book a ticket from Göteborg to Stockholm", and a robotic voice would reply "Do you want to book a ticket from Göteborg to Stockholm?"; at this point the hoaxer says "No, I want to book a ticket from Stockholm to Göteborg". And back and forth for hours and hours until the system could no longer distinguish between Stockholm and Göteborg and the system was basically useless.

Systems that try to learn corrections, e.g. the keyboard example in the article, are all vulnerable to poisoning - as anyone who plays with poisoning Google search results delights in.


I guess with the centralized approach you still have the possibility to detect malicious content. With the decentralized updates it is harder as you need to detect malicious updates.


That's hilarious! Do you have a link to an article about it by any chance?


Heard it first-hand from a dev working on the project. Railway company is called SJ.

My Google-fu is weak.


I suppose that is where the Consensus part comes into play, in which an attack as you describe would simply be an outlier and not included in the model.


To be honest, I have thought about this for a long time for distributed computing. If we have a problem which takes a lot of time to compute, but the problem can be computed in small pieces and then combined, then why can't we pay users to subscribe for the computation? This is a major step toward the big goal.


I don't work with ML for my day job but find it exhilaratingly interesting. (true story!)

When I first read this I was thinking: surely we can already do distributed learning, isn't that what, for example, SparkML does?

Is the benefit of this in the outsourcing of training of a large model to a bunch of weak devices?


Just that a few teams at Google need more CPU power and cannot get more budget than their peer teams... perhaps even the publication of a paper like this is rewarded internally in the company for its authors. In spite of encrypted communication between both parties, I am not sure how they will trust clients, especially since there has recently been a public call to generate (and now train on) faked data on the user's client.

Perhaps statistically random tests could ensure that the client's code has not been tampered with.

On the other hand, no one speaks about the energy/battery consumption of the clients (you've got 8 cores and a GPU in your pocket, right? Finally there is an application apart from video games which will make use of them).


I think the implications go even beyond privacy and efficiency. One could estimate each user's contribution to the fidelity gains of the model, at least as an average within a batch. I imagine such attribution being rewarded with money or credibility in the future.


Where is the difference between that and distributed computing? Apart from the specific usage for ML, I don't see many differences; seti@home was an actual revolution made of actual volunteers (I don't know how many Google users will be aware of that).


Huge implications for distributed self-driving car training and improvement.



I had exactly this idea about a year ago!

I know ideas without execution aren't worth anything, but I'm just happy to see my vision is in the right direction.


Are you working in a related area now?


Could we build this into a P2P-like model where there are some supernodes that do the actual aggregation?


I would argue there is no such thing. After the update, the model will incorporate your training data as a seen example; clever use of optimization would enable you to partly reconstruct the example.


How similar is this to multi-task learning?


Google is building a reverse Google Cloud; that is, they try to use the hardware of other people, instead of other people using Google's hardware.


This appears to be mostly about privacy concerns and on-device performance. Google hardly needs the computational power of even millions of phones versus the behemoths that are their data centers.

(and the phones are only part-time, at that!)


Consider that it's not Google as a whole, it's the keyboard team, who obviously have only a small fraction of Google's compute power.

Also, gboard is installed on between 500 million and 1 billion phones.

https://play.google.com/store/apps/details?id=com.google.and...

So I can't confidently dismiss the computational cost away just yet.


Ah, this is really interesting. And I suppose phones will only get more and more powerful going forward. Thanks to you and nudpiedo for the respectful counter!


I am sorry, but I can't disagree more with you... especially after having worked in this kind of corporation, I can tell you that they only care about privacy if it is a selling point in a marketing chart. Big companies are always hungry for more resources, especially the little teams that need to fight their peers to get more budget.


How is that a bad thing? Would you argue that companies using heavy Javascript frameworks on their websites are "using other people's computers" too?


Yes, they moved the cost of rendering from server to client, saving lots of energy.


Yes, as you can see from the battery life savings from running Ghostery.


I wonder whether this can be used as a blockchain proof of work.


So it's Google Wave for machine learning?



