Hacker News
Learning@home hivemind – train large neural networks across the internet (learning-at-home.github.io)
148 points by sebg on Sept 4, 2020 | 38 comments



Lots of talk about fault tolerance, not a lot of talk about trusting peers and preventing them from introducing bad data into your presumably precious model...

So if you're forced to trust all of the peers, how is this better than a cloud? Who out there is training models for purely benevolent reasons (i.e. non-profit seeking) and can trust random nodes? If not for purely benevolent reasons, who out there is going to donate CPU time to training your model, essentially writing you a blank check?



That's a damned timely shameless plug :). I'll add it to the reading list.


Thanks :-)


Generally, from what I've read about SETI@home, the way these systems work is that they run the same calculations on multiple computers. It's still possible to fool the system, but it gets increasingly harder the smaller the fraction of computers on the network you own (assuming everybody else's computer is honest).
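A minimal sketch of that kind of redundancy check (my own illustration, not anything from this project): hand the same work unit to several peers and only accept a result once enough of them agree within a tolerance.

    import numpy as np

    def accept_result(results, quorum=2, atol=1e-5):
        """results: list of np.ndarray answers returned for the same work unit."""
        for candidate in results:
            agreeing = sum(np.allclose(candidate, other, atol=atol) for other in results)
            if agreeing >= quorum:   # enough peers agree with this candidate
                return candidate
        return None                  # no quorum: reissue the work unit to fresh peers

A single dishonest node can then only slip a bad result through if it controls enough of the peers assigned to the same work unit.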


In the case of neural network training, the cost of verifying that a gradient submitted by a peer actually reduces the cost function should be significantly less than the cost of generating that gradient, so you wouldn't even need to burn 2x the effort to catch cheaters.
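A rough sketch of that kind of spot check, assuming plain SGD and a locally held batch (my illustration, not hivemind's actual protocol):

    import torch

    def update_reduces_loss(model, loss_fn, batch, peer_grads, lr=1e-3):
        """Tentatively apply the peer's gradient as one SGD step, compare the loss, then roll back."""
        x, y = batch
        with torch.no_grad():
            loss_before = loss_fn(model(x), y).item()
            for p, g in zip(model.parameters(), peer_grads):
                p.sub_(lr * g)                 # apply the peer's update
            loss_after = loss_fn(model(x), y).item()
            for p, g in zip(model.parameters(), peer_grads):
                p.add_(lr * g)                 # undo it, so the check has no side effects
        return loss_after < loss_before

The two forward passes are cheaper than the forward-plus-backward pass the peer had to do, though a lowered loss on one batch still isn't proof the gradient is honest.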


Is it possible to submit a falsified gradient which still reduces the cost, but less so than the actual gradient would, and which manipulates how the network behaves?

Like, say, if one computed the gradient using a different label for some of the images in the batch, but still used the right label for most of the images in the batch?


Subsequent gradient updates would probably wipe out the manipulation.


For trusting peers, I thought about adding certain probe questions to the dataset whose answers you already know, and then seeing which nodes give you the right answer and which nodes give you garbage.
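Something like this sketch of the idea (entirely hypothetical helpers, not part of the project): mix a few known-answer probes into each job and check whether the node gets them right.

    import random

    def make_job(real_inputs, probe_pool, n_probes=3):
        """Mix known-answer probes into a job and remember where they ended up."""
        probes = random.sample(probe_pool, n_probes)        # [(input, expected_answer), ...]
        inputs = real_inputs + [inp for inp, _ in probes]
        order = list(range(len(inputs)))
        random.shuffle(order)
        shuffled = [inputs[i] for i in order]
        expected = {order.index(len(real_inputs) + k): ans  # final position -> expected answer
                    for k, (_, ans) in enumerate(probes)}
        return shuffled, expected

    def node_looks_honest(outputs, expected, tol=1e-4):
        """outputs: the node's answers, in the same order as the shuffled job."""
        return all(abs(outputs[pos] - ans) <= tol for pos, ans in expected.items())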


You are right. Those concerns are the focus of some recent research (and some of it is the topic of my in-progress thesis).


The resulting model is shared, so everyone who gives their compute time will benefit from it.

Neural networks automatically error-correct, so they will be robust to some amount of corruption.


How well has this been characterised when some participants are actively malicious rather than wrong?

Reason I ask is, you can absolutely bet that states will attempt to cause interesting failure modes in other states’ A.I. — imagine if self-driving cars had a literal blind spot for the fifteen senators most aggressive towards [rolls dice] Agrabah?


Let people create accounts and introduce a reputation system.

There could be a ranking system. Rank I verifies 100% of submitted jobs. Rank II verifies 50% of submitted jobs. Rank III verifies 25% of submitted jobs. Rank IV verifies 12.5% of submitted jobs. Rank V verifies 6% of submitted jobs.

After 250 jobs you go to Rank I, after 500 to Rank II, and so on...

If you submit a job with incorrect results, then you lose your account and all unverified jobs submitted by that account are then verified. If you're an honest person you'll just create a new account; if you're a malicious actor then you just wasted a lot of money for nothing, because pulling a bait and switch will result in your malicious jobs being discarded.

There is still an opportunity for denial of service by creating lots of reputable accounts and then letting them go malicious all at once. You'll have a large backlog of jobs to verify.
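For what it's worth, a quick sketch of that ranking logic (the verification rates are the ones above; the thresholds past Rank II are my own guesses, since it just says "and so on"):

    import random

    RANK_THRESHOLDS = [250, 500, 1000, 2000, 4000]       # jobs needed for Ranks I..V (III+ assumed)
    VERIFY_RATES = [1.0, 1.0, 0.5, 0.25, 0.125, 0.06]    # unranked, then Ranks I..V

    def rank(jobs_completed):
        return sum(jobs_completed >= t for t in RANK_THRESHOLDS)   # 0 = unranked

    def should_spot_check(jobs_completed):
        return random.random() < VERIFY_RATES[rank(jobs_completed)]

    def on_cheating_detected(account, submitted_jobs):
        """Ban the account and queue all of its unverified jobs for re-verification."""
        account["banned"] = True
        return [job for job in submitted_jobs if not job["verified"]]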


Sounds like the kind of thing governments would do, although perhaps I’m listening to too many works of fiction?


Small to moderate groups of friends that all know each other and are excited about deep learning?

edit: Ah, just saw the "what it isn't for" section - apparently not.


Why would an average user want to participate in such a network? The only reasons I can think of are tied to the benevolence and altruism of the participants. We saw this succeed in the SETI project. I doubt paying participants would ever be profitable enough for either the operator (why not just rent a few machines in the cloud?) or the participants (training costs power and CPU time).

How about tying the training and consumption of the model together? An internet-scale tool with a focused goal, like Alexa/Mycroft for speech and intention recognition, that trains a distributed model while pushing improvements back, might be more successful at getting adoption.


Lots of people have machines sitting idle that they could rent for neural network training as long as the money they receive is more than the cost of electricity. Cloud computing is significantly more expensive than just the cost of electricity.
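Back-of-the-envelope, with all numbers assumed (a ~300 W gaming GPU, a $0.12/kWh household rate, and a very rough 2020 on-demand cloud GPU price):

    gpu_watts = 300
    electricity_per_kwh = 0.12                                      # USD, assumed household rate
    electricity_per_hour = gpu_watts / 1000 * electricity_per_kwh   # ~$0.036/hr to run the GPU

    cloud_gpu_per_hour = 2.00                                       # USD, rough on-demand figure
    print(f"electricity: ${electricity_per_hour:.3f}/hr vs cloud: ${cloud_gpu_per_hour:.2f}/hr")

Any payout between those two numbers leaves room for both sides to come out ahead, ignoring hardware depreciation and bandwidth.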


This project will probably be more successful if it can tie into BOINC, which grew out of SETI@Home, and has several hundred thousand computers working on various projects. It'll be a lot easier to get people already donating time/resources to donate to this than starting from the ground up.

But you are right about the "why". I admit I personally run Folding@home most of the time, because EVGA gives you up to $10 a month to spend at EVGA by participating, and Boardgamegeek gives badges and some currency to spend on their site. In EVGA's case, that has made it so I basically only buy EVGA products now, as I have a bunch of EVGA bucks to spend there.


I mean, https://minecraftathome.com/minecrafthome/ is a thing that people do, so I kind of expect people will do this as well.

And tons of people participate in Folding@home, which also uses GPUs these days.


I have no idea how ANNs work, but those GPT-3 numbers make it look like the barrier to better AI is an expense issue (compute/financial), whereas I always just assumed we lacked understanding or some better algorithm.


“The scaling hypothesis” is the name given to the idea that the existing algorithms might be all we need if we just throw more compute at them. GPT-3 is certainly a very interesting data point here. However, we definitely also need better algorithms. It’s a mix of scaling and new algorithms that will get us to AGI.


I think that 'just' scaling today's algorithms is quite a naive approach, as it would imply the need for huge amounts of training samples for simple tasks (simple to humans). Given that humans tend to need an order of magnitude fewer samples before being able to generalize, I think we need more than just scaled-up versions of today's NNs, SVMs, trees, what-have-you.


To an extent.

But a human isn't trained from scratch. Babies go through huge amounts of unsupervised learning to build up a basic vision and language framework.

Training a neural network to recognize dog pictures is like connecting electrodes to your tongue and trying to do the same. Rudimentary "vision" (at very low resolution) has actually been demonstrated this way in human experiments, but you definitely need more than a few examples.

A fairer comparison is: can giant pretrained NNs learn to generalize from few examples? And the answer seems to be yes.


I agree that there are clearly algorithmic improvements remaining to be made. However, a counterpoint to your specific example would be the lottery ticket hypothesis and related weight agnostic neural networks.

https://arxiv.org/abs/1803.03635

https://ai.facebook.com/blog/understanding-the-generalizatio...

https://ai.googleblog.com/2019/08/exploring-weight-agnostic-...


The way I interpret the lottery ticket hypothesis is that you don't actually need the full sized networks (with their structure and parameters) in order to perform well at some tasks (when comparing performance against larger networks). I think everyone agrees that most neural networks are highly overparameterized as successful distillation efforts have shown.

However, this doesn't directly make my point about the sample efficiency of today's algorithms compared to humans less valid. What I'll give you is that with smaller networks the required sample size is expected to shrink (due to the curse of dimensionality). On the other hand, expressiveness is clearly harmed by the reduced parameter count/altered network structure, which possibly reduces the ability of the network to perform well on certain tasks.

I think it's important to clearly make a distinction between the required amount of computation and the number of data samples that are necessary when talking about scaling up existing methods. Compute is "cheap", while data isn't.

As a side note, I think the usefulness of the lottery ticket hypothesis is mostly about the ability of random initialization to already give a hint about the quality of the 'prior' that is encoded by the network structure. That's useful for less computationally intense architecture search, as also suggested by those papers and a paper by Andrew Ng on this topic.


> The way I interpret the lottery ticket hypothesis is that you don't actually need the full sized networks (with their structure and parameters) in order to perform well at some tasks

Actually that's not the point. Pruning typically results in networks that still perform well but are harder to train. The idea is to explicitly search for a subnetwork (via pruning) that is easy to train.
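For concreteness, a rough sketch of one lottery-ticket round (train, prune the smallest-magnitude weights, rewind the survivors to their original initialization, retrain). This is my own simplification of the paper's procedure; a real implementation would also re-apply the mask during retraining so pruned weights stay at zero.

    import copy
    import torch

    def lottery_ticket_round(model, train_fn, prune_frac=0.2):
        init_state = copy.deepcopy(model.state_dict())    # remember the original initialization
        train_fn(model)                                   # 1. train the dense network

        masks = {}
        for name, p in model.named_parameters():
            if p.dim() < 2:                               # leave biases / norm params alone
                continue
            k = max(1, int(prune_frac * p.numel()))
            threshold = p.abs().flatten().kthvalue(k).values
            masks[name] = (p.abs() > threshold).float()   # 2. keep only large-magnitude weights

        with torch.no_grad():
            model.load_state_dict(init_state)             # 3. rewind to the original init
            for name, p in model.named_parameters():
                if name in masks:
                    p.mul_(masks[name])                   # zero out the pruned weights

        train_fn(model)                                   # 4. retrain the sparse "winning ticket"
        return model, masks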

> Although what I'll give you is that with smaller networks the required sample size is expected to shrink (due to curse of dimensionality).

I'm not so sure about that either. From (https://arxiv.org/abs/2001.08361):

> Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.

My second and third links are also important! The second talks about generalizing "winning" tickets across other datasets and optimizers. The third talks about weight agnostic neural networks, which in a nutshell are still capable of more-or-less performing a task even with _randomized_ weights.

Weight agnostic networks have a lot of parallels to wildlife that is capable of certain behaviors required for survival effectively immediately, before there's been a chance for significant learning to take place. This is the counterpoint I was referring to - an equivalent phenomenon could explain (at least partially) why humans require so much less data when learning.


> Actually that's not the point. Pruning typically results in networks that still perform well but are harder to train. The idea is to explicitly search for a subnetwork (via pruning) that is easy to train.

They state "smaller network, same test accuracy, with similar number of iterations". So it seems the original network size wasn't necessary for the best test accuracy, and the compute requirement is reduced only because it's a smaller network. Sample efficiency isn't increased, according to https://arxiv.org/abs/1803.03635.

Good performance with random weights seems to indicate good 'priors' encoded in the network structure. Like how convolutional networks encode the prior of translational invariance and are hence naturally good performers on image inputs/tasks.

I think the parallel to "wildlife that is capable of certain behaviors ... before there's been a chance for significant learning to take place" is that priors are also part of biological intelligence, i.e. brain structure at birth enabling certain survival-oriented behaviors.

Hence, I'm optimistic about transfer learning, which could happen through _both_ better models (priors that generalize well) and pretrained weights (possibly partially pretrained, i.e. just the initial feature extraction). Either could potentially provide a better starting point from the 'how many samples are necessary for good performance on a variety of tasks' perspective.

The point is that either way, information needs to be added for performance on tasks to increase. Doing that in a task-specific way by using today's algorithms and a billion samples doesn't seem like the right approach. Finding algorithms, models, or specifically neural network architectures (including training procedures, regularizers, loss functions, weight tying) that generalize across tasks without needing many samples, thanks to their informative priors, seems like the way forward to me. That's _not_ a naive scaling of today's algorithms to larger and larger training sets, which was the point I was trying to make.


I think you are not appreciating the difference between a "commercial" NN and the human brain. NNs are usually designed for specific tasks that are simply a subset of human capabilities. The human brain is huge, and therefore an equivalent NN would also be huge. Instead, we have lots of small networks, and many of them are even competing to solve the same problem.

You need a lot of samples because you're starting from scratch with each network. If you had one super NN that is as powerful as a bunch of small networks, then you would have a network that can easily generalize, because it can use existing data as a starting point. The amount of existing data that is useful to an unknown task grows with the size of the NN.

An NLP NN for English could be combined with an image recognition NN. Since the NLP NN already has a concept of "car", it only has to associate its already learned definition of "car" with images of cars. If you have separate NNs, then you have to teach both NNs what a car is, i.e. twice. With small NNs there will always be some redundancy, and that redundancy is a fixed cost.


That's an interesting hypothesis. Is there objective evidence that humans need an order of magnitude fewer samples? One anecdote that possibly challenges it: it takes a toddler several months to learn to walk.


There's actually a paper from early this year that attempts to quantify the scaling characteristics of NLP models (https://arxiv.org/abs/2001.08361).

Even if scaling alone ends up solving everything (I doubt it), I'd still feel that very significant improvements ought to be possible from an algorithmic perspective. (I realize that's largely baseless, but for some reason I just can't escape the feeling that current algorithms leave a huge amount of potential on the table.)




One of my friends has a startup where individuals can sell computer time to the highest bidder. I've told him that I didn't think it was a good idea, but this library could change that. I wonder what the performance overhead is like.

He is focused on the gaming space, but with this, the data science space might make more sense.


Chessbase actually does something similar, where you can rent other users' computers to run analysis on positions[1]. The users offering up their machines set the price, though, as opposed to it being an auction.

[1] https://en.chessbase.com/post/tutorial-how-does-the-engine-c...


Curious how this deals with moving training data around... If your dataset is a few GBs, moving it around is a good bit of overhead, and a decent chunk of local disk space for the host system. Probably not bad if there's a consistent task, but it seems like a big problem if the tasks change often.


/* hypothesizing */ If you're using it for NLP, your dataset (token IDs) typically weighs much less than the intermediate tensors. So, I see two scenarios here:

(1) distribute data chunks as you train, using more conventional BitTorrent-style systems (e.g. https://academictorrents.com, but internal)
(2) since you most likely use raw unlabeled data (e.g. just text), peers can crawl it straight from the web


Yeah, it's probably less of a concern for text tasks, where the data per example is relatively light (though there is a whole internet's worth of text data...)

I mostly work with audio, where individual examples are ~2MB, so the dataset sizes get very heavy quickly.


If Facebook had a community-sourced network for training combat drones, you guys would trip over each other to volunteer your computing resources.


I probably would, yeah.



