Lots of talk about fault tolerance, not a lot of talk about trusting peers and preventing them from introducing bad data into your presumably precious model...
So if you're forced to trust all of the peers, how is this better than a cloud? Who out there is training models for purely benevolent reasons (i.e. non-profit seeking) and can trust random nodes? If not for purely benevolent reasons, who out there is going to donate CPU time to training your model, essentially writing you a blank check?
Generally, from what I've read about SETI@Home, the way these systems work is that they run the same calculations on multiple computers. It's still possible to fool the system, but it gets increasingly harder the smaller the fraction of computers on the network you control (assuming everyone else's computer is honest).
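In code, that redundancy boils down to a quorum check. A minimal sketch, where the 3-replica count and the idea of comparing result hashes are my own assumptions rather than anything SETI@Home actually documents:

    import random
    from collections import Counter

    def assign_replicas(job, peers, replicas=3):
        """Pick several random peers to run the same job independently."""
        return {peer: job for peer in random.sample(peers, replicas)}

    def accept_result(results):
        """Accept a result only if a strict majority of replicas agree.

        results maps peer_id -> hash of the returned payload."""
        winner, votes = Counter(results.values()).most_common(1)[0]
        return winner if votes > len(results) / 2 else None  # None = re-run the job

For gradient jobs you'd realistically compare results with a numeric tolerance instead of exact hashes, since floating-point math on different hardware rarely matches bit-for-bit.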
In the case of neural network training, the cost of verifying that a gradient submitted by a peer reduces the cost function should be significantly less than the cost of generating that gradient, so you wouldn't even need to burn 2x the effort to guard against cheaters.
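Roughly what I mean, as a sketch only (the model, loss function, batch and learning rate are all placeholders): a single extra forward pass tells you whether a submitted gradient actually lowers the loss, which is cheaper than redoing the forward and backward passes yourself.

    import torch

    def gradient_reduces_loss(model, loss_fn, batch, peer_grads, lr=1e-3):
        """Spot-check a peer's gradient: apply it temporarily and see if the loss drops."""
        inputs, targets = batch
        with torch.no_grad():
            loss_before = loss_fn(model(inputs), targets)
            # Apply the peer's update in place.
            for param, grad in zip(model.parameters(), peer_grads):
                param.sub_(lr * grad)
            loss_after = loss_fn(model(inputs), targets)
            # Roll the update back so the check has no side effects.
            for param, grad in zip(model.parameters(), peer_grads):
                param.add_(lr * grad)
        return (loss_after < loss_before).item()

Of course "the loss went down" is a much weaker property than "this is the honest gradient", so a check like this only catches crude cheating.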
Is it possible to submit a falsified gradient which still reduces the cost, just less than the true gradient would, and in such a way that the network's behavior is manipulated?
Like, say, if one used a different label for some of the images in the batch when computing the gradient, while still using the right label for most of them?
For trusting peers, I thought about adding certain questions to the dataset that you already know the answer to, then seeing which nodes give you the right answer and which give you garbage.
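Something like this sketch, where the canary set and the 95% threshold are made-up placeholders:

    def canary_score(peer_answers, canaries):
        """Fraction of hidden known-answer samples the peer got right.

        peer_answers: {sample_id: answer}, canaries: {sample_id: expected_answer}"""
        hits = [peer_answers.get(sid) == expected for sid, expected in canaries.items()]
        return sum(hits) / len(hits)

    def is_trustworthy(peer_answers, canaries, threshold=0.95):
        return canary_score(peer_answers, canaries) >= threshold

For gradient-style jobs the "answer" isn't a single label, so in practice a canary would probably be a whole batch whose correct result you've precomputed.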
How well has this been characterised when some participants are actively malicious rather than wrong?
Reason I ask is, you can absolutely bet that states will attempt to cause interesting failure modes in other states’ A.I. — imagine if self-driving cars had a literal blind spot for the fifteen senators most aggressive towards [rolls dice] Agrabah?
Let people create accounts and introduce a reputation system.
There could be a ranking system. Rank I verifies 100% of submitted jobs. Rank II verifies 50% of submitted jobs. Rank III verifies 25% of submitted jobs. Rank IV verifies 12.5% of submitted jobs. Rank V verifies 6% of submitted jobs.
After 250 jobs you go to Rank I, after 500 to Rank II, after 1,000 to Rank III and so on...
If you submit a job with incorrect results then you lose your account, and all unverified jobs submitted by that account are then verified. If you're an honest person you'll just create a new account; if you're a malicious actor then you've just wasted a lot of money on nothing, because doing a bait and switch results in your malicious jobs being discarded.
There is still an opportunity for denial of service by creating lots of reputable accounts and then letting them go malicious all at once. You'll have a large backlog of jobs to verify.
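To make that concrete, here's a tiny sketch of the rank table as a sampling probability. The verification rates and the first thresholds are the ones above; the Rank IV and V thresholds are just my extrapolation of the pattern:

    import random

    # (minimum jobs completed, fraction of that account's jobs that get re-verified)
    RANKS = [
        (0,    1.0),    # new account: verify everything
        (250,  1.0),    # Rank I
        (500,  0.5),    # Rank II
        (1000, 0.25),   # Rank III
        (2000, 0.125),  # Rank IV   (threshold extrapolated)
        (4000, 0.06),   # Rank V    (threshold extrapolated)
    ]

    def verification_rate(jobs_completed):
        rate = RANKS[0][1]
        for threshold, r in RANKS:
            if jobs_completed >= threshold:
                rate = r
        return rate

    def should_verify(jobs_completed):
        """Randomly decide whether this particular submission gets re-checked."""
        return random.random() < verification_rate(jobs_completed)

On a failed check you'd then ban the account and queue all of its unverified jobs for re-verification, per the rule above.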
Why would an average user want to participate in such a network? The only reasons I can think of are tied to the benevolence and altruism of the participants. We saw this succeed with the SETI project. I doubt paying participants would ever be profitable enough for either the operator (why not just rent a few machines in the cloud?) or the participants (training costs power and CPU time).
How about tying the training and consumption of the model together? An internet-scale tool with a focused goal, like Alexa/Mycroft for speech and intent recognition, that trains a distributed model while pushing improvements back might be more successful in getting adoption.
Lots of people have machines sitting idle that they could rent out for neural network training, as long as the money they receive is more than the cost of electricity. Cloud computing is significantly more expensive than just the cost of electricity.
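Back-of-the-envelope, with numbers I'm assuming rather than citing (a ~300 W GPU, $0.12/kWh electricity, roughly $2/hr for a comparable on-demand cloud GPU):

    gpu_watts = 300              # assumed draw of a consumer GPU under load
    electricity_per_kwh = 0.12   # assumed residential price in USD
    cloud_gpu_per_hour = 2.00    # assumed on-demand price for a comparable cloud GPU

    electricity_per_hour = gpu_watts / 1000 * electricity_per_kwh  # ~$0.036/hr

    print(f"host breaks even above ${electricity_per_hour:.3f}/hr, "
          f"renter saves money below ${cloud_gpu_per_hour:.2f}/hr")

Any price inside that band leaves both the host and the renter better off than the cloud.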
This project will probably be more successful if it can tie into BOINC, which grew out of SETI@Home, and has several hundred thousand computers working on various projects. It'll be a lot easier to get people already donating time/resources to donate to this than starting from the ground up.
But you are right to ask why. I admit I personally run Folding@Home most of the time, because EVGA gives you up to $10 a month to spend at EVGA by participating, and BoardGameGeek gives badges and some currency to spend on their site. In EVGA's case, this has made it so I basically only buy EVGA products now, as I have a bunch of EVGA bucks to spend there.
I have no idea how ANNs work, but those GPT-3 numbers make it look like the barrier to better AI is an expense issue (compute/financial), whereas I always just assumed we lacked understanding or some better algorithm.
“The scaling hypothesis” is the name given to the idea that existing algorithms might be all we need if we just throw more compute at them. Certainly GPT-3 is a very interesting data point here. However, we definitely also need better algorithms. It's a mix of scaling and new algorithms that will get us to AGI.
I think that 'just' scaling today's algorithms is quite a naive approach, as it would imply the need for huge amounts of training samples for simple tasks (simple to humans). Given that humans tend to need an order of magnitude fewer samples before being able to generalize, I think we need more than just scaled-up versions of today's NNs, SVMs, trees, what have you.
But a human isn't trained from scratch. Babies go through huge amounts of unsupervised learning to build up a basic vision and language framework.
Training a neural network to recognize dog pictures is like connecting electrodes to your tongue and trying to do the same. Rudimentary "vision" (at very low resolution) has actually been demonstrated this way in human experiments, but you definitely need more than a few examples.
A fairer comparison is: can giant pretrained NNs learn to generalize from few examples? The answer seems to be yes.
I agree that there are clearly algorithmic improvements remaining to be made. However, a counterpoint to your specific example would be the lottery ticket hypothesis and related weight agnostic neural networks.
The way I interpret the lottery ticket hypothesis is that you don't actually need the full sized networks (with their structure and parameters) in order to perform well at some tasks (when comparing performance against larger networks). I think everyone agrees that most neural networks are highly overparameterized as successful distillation efforts have shown.
However, this doesn't directly make my point about sample efficiency of today's algorithms compared to humans less valid. Although what I'll give you is that with smaller networks the required sample size is expected to shrink (due to curse of dimensionality). At the same time, expressiveness is clearly harmed by the reduced parameter count/altered network structure, which possibly reduces the network's ability to perform well on certain tasks.
I think it's important to clearly make a distinction between the required amount of computation and the number of data samples that are necessary when talking about scaling up existing methods. Compute is "cheap", while data isn't.
As a side note, I think the usefulness of the lottery ticket hypothesis is mostly about the ability of random initialization to already give a hint about the quality of the 'prior' encoded by the network structure. That's useful for less computationally intense architecture search, as also suggested by the papers and a paper by Andrew Ng on this topic.
> The way I interpret the lottery ticket hypothesis is that you don't actually need the full sized networks (with their structure and parameters) in order to perform well at some tasks
Actually that's not the point. Pruning typically results in networks that still perform well but are harder to train. The idea is to explicitly search for a subnetwork (via pruning) that is easy to train.
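For anyone following along, the procedure is roughly iterative magnitude pruning with a rewind to the original initialization. A sketch, not code from the paper; train() is a placeholder assumed to keep masked-out weights at zero:

    import copy
    import torch

    def find_winning_ticket(model, train, rounds=5, prune_frac=0.2):
        """Iteratively prune the smallest weights, rewinding survivors to their init."""
        init_state = copy.deepcopy(model.state_dict())  # remember the original initialization
        masks = {name: torch.ones_like(p) for name, p in model.named_parameters()}

        for _ in range(rounds):
            train(model, masks)  # train the current subnetwork to completion
            with torch.no_grad():
                for name, param in model.named_parameters():
                    alive = param[masks[name].bool()].abs()
                    cutoff = alive.quantile(prune_frac)  # drop the smallest surviving weights
                    masks[name] *= (param.abs() > cutoff).float()
            model.load_state_dict(init_state)  # rewind surviving weights to their initial values
        return masks  # the "winning ticket": the original init restricted to unpruned weights

The claim is that this subnetwork, trained from its original initialization, reaches the full network's accuracy in a comparable number of iterations.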
> Although what I'll give you is that with smaller networks the required sample size is expected to shrink (due to curse of dimensionality).
> Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
My second and third links are also important! The second talks about generalizing "winning" tickets across other datasets and optimizers. The third talks about weight agnostic neural networks, which in a nutshell are still capable of more-or-less performing a task even with _randomized_ weights.
Weight agnostic networks have a lot of parallels to wildlife that is capable of certain behaviors required for survival effectively immediately, before there's been a chance for significant learning to take place. This is the counterpoint I was referring to - an equivalent phenomenon could explain (at least partially) why humans require so much less data when learning.
> Actually that's not the point. Pruning typically results in networks that still perform well but are harder to train. The idea is to explicitly search for a subnetwork (via pruning) that is easy to train.
They state "smaller network, same test accuracy, with similar number of iterations". So it seems the original network size wasn't necessary for best test accuracy, and compute requirement is reduced only because it's a smaller network. Sample efficiency isn't increased according to https://arxiv.org/abs/1803.03635.
Good performance with random weights seems to indicate good 'priors' encoded in the network. Like how convolutional networks encode the prior of translational invariance, and hence are naturally good performers on image inputs/tasks.
I think the parallel to "wildlife that is capable of certain behaviors ... before there's been a chance for significant learning to take place" is that priors are also part of biological intelligence. I.e. brain structure at birth enabling certain survival-oriented behaviors.
Hence, I'm optimistic about transfer learning which could happen through _both_ better models (priors that generalize well) and pretrained weights (possibly partially pretrained, i.e. just initial feature extraction). Either could potentially provide a better starting point from the 'how many samples are necessary for good performance on a variety of tasks' perspective.
The point is that either way, information needs to be added for performance on tasks to increase. Doing that in a task-specific way by using today's algorithms and a billion samples doesn't seem like the right approach. Finding algorithms, models, or perhaps specifically neural network architectures (including training procedures, regularizers, loss functions, weight tying) that generalize across tasks without needing many samples, thanks to their informative priors, seems the way forward to me. That's _not_ a naive scaling of today's algorithms to larger and larger training sets, which was the point I was trying to make.
I think you are not appreciating the difference between a "commercial" NN and the human brain. NNs usually are designed for specific tasks that are simply a subset of the capability of humans. The human brain is huge and therefore an equivalent NN would also be huge. Instead we have lots of small networks and many of them are even competing and trying to solve the same problem.
You need a lot of samples because you're starting from scratch with each network. If you had one super NN that is as powerful as a bunch of small networks, then you would have a network that can easily generalize, because it can use existing data as a starting point. The amount of existing data that is useful for an unknown task grows with the size of the NN.
An NLP NN for English could be combined with an image recognition NN. Since the NLP NN already has a concept for "cars", it only has to associate its already learned definition of "car" with images of cars. If you have separate NNs, then you have to teach what a car is twice, once to each network. With small NNs there will always be some redundancy, and that redundancy is a fixed cost.
That's an interesting hypothesis. Is there objective evidence that humans need an order of magnitude fewer samples? One anecdote to possibly challenge you: it takes a toddler several months to learn to walk.
There's actually a paper from early this year that attempts to quantify the scaling characteristics of NLP models (https://arxiv.org/abs/2001.08361).
Even if scaling alone ends up solving everything (I doubt it), I'd still feel that very significant improvements ought to be possible from an algorithmic perspective. (I realize that's largely baseless, but for some reason I just can't escape the feeling that current algorithms leave a huge amount of potential on the table.)
One of my friends has a startup where individuals can sell computer time to the highest bidder. I've told him that I didn't think it was a good idea, but this library could change that. I wonder what the performance overhead is like.
He is focused on the gaming space, but with this, the data science space might make more sense.
Chessbase actually does something similar, where you can rent other users' computers to run analysis on positions[1]. The users offering up their machines set the price, though, as opposed to an auction.
Curious how this deals with moving training data around... If your dataset is a few GBs, moving it around is a good bit of overhead, and a decent chunk of local disk space for the host system. Probably not bad if there's a consistent task, but it seems like a big problem if the tasks change often.
/* hypothesizing */
If you're using it for NLP, your dataset (token ids) typically weighs much less than intermediate tensors. So, I see two scenarios here:
(1) distribute data chunks as you train using more conventional bittorrent systems (e.g. https://academictorrents.com but internal)
(2) since you most likely use raw unlabeled data (e.g. just text), peers can crawl it straight from the web
Yeah, it's probably less of a concern for text tasks, where the data per example is relatively light (though there is a whole internet worth of text data...)
I mostly work with audio, where individual examples are ~2MB, so the dataset sizes get very heavy quickly.