AI training method exceeds GPT-3 performance with fewer parameters (infoq.com)
247 points by kordlessagain on Oct 7, 2020 | 81 comments



GPT3's strength is in language generation, so using *GLUE to evaluate it (where encoder-type models are just better) and claiming 99.9% fewer parameters is sensationalism.


That's kind of like saying that this super cheap SUV from 2003 is a better off-roader than the latest Ferrari. Like, true statement, but vacuous nonetheless.


Powerful analogy, but analogies are dangerous. They can obscure what's really happening.

In this case, by analogy, Ferrari made the comparison to the super cheap SUV from 2003. That is, OpenAI compared GPT3 to BERT on the SuperGLUE benchmark, in the paper announcing GPT3 [1].

They did so to demonstrate GPT3's ability to learn a new task given only a few examples of the task ("few-shot learning"). The limited amount of task-specific training data was a significant handicap that GPT3 was able to overcome, like a Ferrari towing a two ton trailer outperforming an old SUV towing nothing.
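
(For readers unfamiliar with the setup, here is a toy sketch of what a "few-shot" prompt looks like; the task and examples are my own, not from the GPT3 paper.)

    # The task is specified entirely in the input text, with a handful of
    # labeled examples and no gradient updates. The model is asked to continue
    # the text, and the continuation ("negative") is read off as its prediction.
    few_shot_prompt = """Classify the sentiment of each review.

    Review: The battery died after two days.
    Sentiment: negative

    Review: Crisp screen and great sound.
    Sentiment: positive

    Review: Shipping took forever and the box was crushed.
    Sentiment:"""

    print(few_shot_prompt)  # fed as-is to the language model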

What this paper claims is that encoder type models can also achieve few-shot learning. The headline should be "AI training method achieves few-shot learning with 99.9% fewer parameters than GPT3." That's the innovation here, not outperforming GPT3 on a benchmark that GPT3 isn't particularly good at.

1: https://arxiv.org/pdf/2005.14165.pdf


Unfortunately, your correction is still misleading because it fails to capture what really sets GPT-3 apart. Ironically, GPT-3's limitation also highlights its strength. Since it is not capable of learning (few-shot or otherwise) in the strict sense of permanently changing its parameters based on examples, all its demonstrated capabilities happen entirely at inference time. It somehow configures itself at inference time so that state machines which produce plausible continuations of whatever pattern it was fed are the ones most probably generated. This means that whenever it succeeds, it is much more flexible in how it produces its responses: it generalizes on and continues the implicit patterns in the provided input.

This paper, however, is not replicating that flexibility. Their proposed model is much closer to expectation maximization than it is to what GPT3 does. The novelty of their work, what makes it genuinely useful, is that they provide a practical and fairly general way to leverage pre-trained language models to produce classifiers for specific tasks from a very small amount of labeled data, with less effort than would go into ordinary fine-tuning. This approach to distillation is an instance of https://en.wikipedia.org/wiki/Semi-supervised_learning.
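
(To make that concrete, here is a minimal sketch of the cloze-pattern-plus-verbalizer idea the paper builds on. This is not the authors' PET code; the model name, template, and label words are my own illustrative choices, and PET additionally fine-tunes such models on ~32 labeled examples and distills them using unlabeled data.)

    # Turn an off-the-shelf masked LM into a classifier by phrasing the task
    # as fill-in-the-blank (the "pattern") and mapping label words (the
    # "verbalizer") back to classes.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="roberta-base")

    def classify_review(review: str) -> str:
        # Pattern: wrap the input in a cloze template ending in a masked label word.
        prompt = f"{review} All in all, it was <mask>."
        candidates = fill_mask(prompt, targets=[" great", " terrible"])
        # Verbalizer: map the higher-scoring label word back to a class.
        best = max(candidates, key=lambda c: c["score"])
        return "positive" if "great" in best["token_str"] else "negative"

    print(classify_review("The plot was gripping and the acting superb."))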

Compared to GPT-3, this approach remains at a severe disadvantage when the amount of effort and time required to gather data and train something useful is accounted for. On the other hand, if you can fit your problem into the format proposed by the paper, you will likely have more control over the final model's behavior using a small amount of labeled examples (for your specific task), at a significantly lower cost of computation at inference time. Focusing on parameters, however, is not even wrong.


Number of parameters remains an important practical concern, and an important goal for research (see [1] for a recent example).

It is true that this model requires a large amount of unlabeled data, in addition to a small amount of labeled data, but gathering unlabeled data is often easy.

So I think it's potentially misleading to say "this approach remains at a severe disadvantage when amount of effort and time required to gather data and train something useful is accounted for".

I'd say this approach actually has a serious advantage compared to GPT3, which is locked inside the walls of OpenAI, can only be used with their permission, and is too big for most people to use (let alone train) anyway. The cost and effort to use this approach on large real world problems is probably less than using GPT3.

And it may also have an advantage in terms of the total amount of data and training required, when all of the data and training in the original pretrained models is included. That is, it may be a more efficient way to train new models from scratch for tasks for which few labeled examples are available.

1: https://ai.googleblog.com/2020/09/advancing-nlp-with-efficie...


Let me preface this by saying I generally agree with what you're saying. My focus below is on the delta.

> It is true that this model requires a large amount of unlabeled data, in addition to a small amount of labeled data, but gathering unlabeled data is often easy.

Gathering unlabeled data is easier than gathering labeled data, but it can still be challenging. You'll often need careful thought in assembling a distribution of examples. Being able to skip that step yields significant savings, even if not as much as going from labeled to largely unlabeled data.

> I'd say this approach actually has a serious advantage compared to GPT3, which is locked inside the walls of OpenAI, can only be used with their permission, and is too big for most people to use (let alone train) anyway.

I fully agree and said as much too.

> The cost and effort to use this approach on large real world problems is probably less than using GPT3.

I'd say it depends. Most of the effort with GPT3 will involve edge cases. Having a system in front to handle these might eat into the labor savings, but you could still end up net positive. It's difficult to say without real-world data; you might be correct.

> That is, it may be a more efficient way to train new models from scratch for tasks for which few labeled examples are available.

You're right in general, I think. But it's still worth pointing out GPT3's advantage. It combines a lot of general capabilities, which together with its generative ability and flexibility of input means the level of expertise required to get something useful will be much lower than with this semi-supervised learning approach. And there are some capabilities it has displayed, one example of many being discussing, querying, and pattern matching on computer code, that seem hard to replicate with this method.


> It somehow configures itself at inference time so that state machines which produce plausible continuations of whatever pattern it was fed, are most probably generated.

What is this sentence supposed to convey? I'm an NLP practitioner/researcher and this isn't even true - GPT isn't a "state machine", as the latent space is continuous and not finite.

Moreover, there is nothing that makes GPT-3 "not capable of learning." It has had very exciting results from treating language modeling as a zero-shot task at inference time, but there's nothing (besides compute) precluding fine-tuning of it in principle.

I agree with the rest of your comment.


> What is this sentence supposed to convey?

There are examples where it is able to recognize and continue patterns in strings which, if manually generated, would have required an FSM. In fact, some of the more impressive examples would require a stack of some sort, so I thought I was rather underselling its capabilities in that arena.

> as GPT isn't a "state machine" as the latent space is continuous and not finite.

Technically speaking, that is impossible, since these models work with floating point numbers and are limited in memory to whatever the hidden and self-attention layers provide.

Practically speaking, in order to generate strings based on patterns as mentioned above, there must be abstract states which correspond to states and state changes, such that thinking in terms of at least state machines is useful.
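
(A toy illustration of the kind of pattern I mean; this is my own example, not taken from any paper.)

    # A two-state FSM generating the sort of alternating string pattern that
    # GPT-3 can pick up and continue from a few in-context examples.
    # Continuing such a string correctly requires tracking which abstract
    # state you are in, which is the sense in which state machines are a
    # useful lens here.
    def fsm_generate(n_steps: int) -> str:
        transitions = {"S0": ("a", "S1"), "S1": ("b", "S0")}  # state -> (symbol, next state)
        state, out = "S0", []
        for _ in range(n_steps):
            symbol, state = transitions[state]
            out.append(symbol)
        return " ".join(out)

    print(fsm_generate(6))  # a b a b a b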

Studying LMs in terms of automata is not strange; there have been papers which do this for specific trained RNNs (such as https://arxiv.org/abs/1711.09576). I contend GPT-3 is capable, to a certain extent, of generating these dynamically at inference time.

As far as I know this way of extracting what LMs are doing hasn't been done for Transformers, but you can also frame Transformers in terms of RNNs, so there's no reason why such methods wouldn't readily apply to them too.

> Moreover, there is nothing that makes GPT-3 "not capable of learning."

I specifically addressed that: to count as learning, without diluting the utility of the term, it has to be capable of remembering. Without permanent changes to its weights, the use of the term learning stretches the word beyond utility.


> Without permanent changes to its weights, the use of the term learning stretches the word beyond utility.

Yes, my claim is that there is nothing that makes its weights incapable of being fine-tuned and thus changed.


Oh okay, then in that case we don't disagree and are talking about different things. My issue is that few-shot learning done by a model performing gradient updates should be distinguished from whatever GPT-3 is doing when it continues context patterns.


Oh I love me a great analogy!!


Welcome to Benchmarklandia, where you find the top algorithm in an area, write a stripped-down version that outcompetes it on a different set of data, and get good results.

See: hadoop -> spark -> spark with InfiniBand -> spark with NVMe -> timely dataflow on a laptop; CSV parsers that don't support scientific-notation floats, etc. I'm sure others can give more interesting examples.


Out of (self-interested) curiosity, what are you referring to with the "nvme->timely dataflow on a laptop" reference?

The closest benchmark I could find was the "FASTER State Management for Timely Dataflow" paper from ETH Zurich, but that wasn't run on a laptop.



It's a bit different, right, because that paper presents better algorithms on the same datasets as the work it cites.


Yes, it's different. Another part of Benchmarklandia is where researchers make custom algorithms for the same dataset. Benchmarks targeting MNIST are a great example. :)


This isn't strictly true. This is comparing to "GPT3 as a few-shot learner"[1] as opposed to the fine-tuned models that everyone else uses.

Few-shot GPT3 outperforms a BERT-based baseline.

[1] https://github.com/openai/gpt-3


Yeah, but they actually do fine-tune their model and compare it to in-context GPT3 "learning". I called them out on it on Reddit and they claim that they use the same amount of data, i.e. GPT3 uses X examples in the prompt and they use the same X examples to fine-tune. However, I am still not convinced that it is a fair comparison.


Yes, it can do simple yes/no questions but can it write dad jokes like GPT-3?


For me another question is crucial: is it open (i.e. is there an open source reference implementation) or closed like the so-called "OpenAI" products/services?


You can find the code for the paper here: https://github.com/timoschick/pet


Can you please explain to me (I don't understand it) why you think GPT-3 is closed? Yes, they won't share the trained model, but they're sharing the research here[0][1] so you can reproduce it easily, aren't they? As I understand it, this is very fair - training the model is a separate thing from doing (and sharing) the research, it is very costly, and it would not happen if they were forced to open that too - I also don't understand why they should be.

[0] https://arxiv.org/abs/2005.14165

[1] https://github.com/openai/gpt-3


I submitted a request for access to the GPT-3 API a couple months ago and still haven't been approved.

I don't find that to be very open at all.


The API does not have to be open, the research does. This is like saying cloud services should be free and open for all because Linux is open.


For quite some time maddog has been promoting Subutai, a kind of P2P cloud (https://subutai.io/). Voluntary computing is an old concept; even BOINC is almost 20 years old. PeerTube has already reached the stage where it's actually usable and people can watch movies smoothly. So it's not unimaginable that people who care could organize somehow, creating a platform where you could use a GPT-3-scale service by contributing your GPU time in exchange.


To expand on that point, they're basically selling pre-executed computing to you; resolving the query itself is comparatively nothing.


> so you can reproduce easily

That's like saying you can look at the Eiffel Tower and its schematics, so what's so hard about getting a spare million dollars and building it yourself?


Well, the rest of us can buy a ticket and go see it. Same with GPT-3.


Yes they tell you how to do it.

But training a model of this size requires you to use thousands of GPUs, or wait forever. That will add up to millions of dollars in rental and electricity costs.


Looks like a great opportunity for someone to step in and organize a project aimed at recreating that. I for one would be more than glad to donate my GPUs' time for something that would be useful for all humanity, for free.


That seems like the reason why they won't share the trained model. Somebody needs to pay for that; how would they fund it if they just shared it?

As far as my thinking goes, they are open more than enough. Thank you for your input!


They claim to be a well funded nonprofit.


Yeah, and that nonprofit probably wouldn't really be doing the mission it got funding for if it were handing out computing instead of research.


Well, they are withholding the information needed to verify their research. That is generally frowned upon in science.

Plus, they are called OpenAI but producing a closed source product...


It's not a product; it all boils down to raw computing work. Their mission is to do open research, not to provide computing time that they paid for, for free. Similarly, Red Hat does not have to provide free cloud computing just because they also contribute to Linux and use it.


This is a very good article on the subject of GPT3's 'market position' - https://bdtechtalks.com/2020/09/21/gpt-3-economy-business-mo...


I had this discussion with Timo Schick, the first author, on Reddit. My final comment is copied below, with the relevant context.

Note that GPT-3's approach to *GLUE involved no training on the task, just a good choice of prompt, whereas PET and iPET also use fine-tuning. Also, because distillation takes large amounts of training data, they use ensembles in the true few-shot regime, so their parameter efficiency is significantly worse than they advertise.

https://www.reddit.com/r/slatestarcodex/comments/itrcac/smal...

---

Timo Schick:

Finally, I do not really agree with your last two paragraphs, especially "One is about semi-supervised learning, that says by exploiting task-specific architectures you can do fairly well with low amounts of labelled data.": If you leave out the final distillation step (which is not required for good performance), we use the exact same architecture for all tasks. In what sense is this more task-specific than GPT-3? I would not consider "exploiting task-specific architectures" to be a (fundamental) part of the paper.

My reply:

So what I mean here is that masked training and bidirectional transformer models like BERT have always been designed as a way to get good scores on analysis tasks like Q&A, even if they are pretrained on general text, whereas unidirectional generative transformer models are now basically only relevant for generative tasks. You can say, well, both architectures can do both tasks, so is it really task-specific? But ultimately, yes: we've selected ALBERT because it's better for Q&A tasks, and we select unidirectional transformers for other things because they're better for generative tasks.

So I guess the problem I have is with the merits of your thesis, “Can we achieve similar few-shot performance to GPT-3 without requiring billions of parameters?” OpenAI didn't present few-shot learning as if it were an optimal method; their headline achievement was not “here's the best way to...” but “I bet you never expected that this could...”. And so while it's definitely true that a BERT-derived model will outperform a GPT-derived model even at lower parameter counts on these sort of tasks, nothing new or interesting is being said by it. Everyone already knows that a bidirectional GPT-3 would be better at Q&A, and so that's what a smaller bidirectional model should be competing against. GPT-3 is only interesting in this context because it's not the optimal model (or training routine).

So while it's also true that if your aim is SOTA in few-shot learning then you should definitely use a bidirectional transformer with all the new tricks, if your goal is to understand PET in a context that includes GPT-3, doing so merely makes it harder to see what's going on.


A shorter reply would be: It would be great to compare PET not only to GPT-3, but also to other models, especially ones geared towards few-shot learning.

Do you know of any other models that should be used for such a comparison, or are there already any relevant results on SuperGLUE that should be mentioned?


This appears to be SOTA on SuperGLUE with few-shot learning.

PET (well, a version called iPET from the same author) is at #9 on the SuperGLUE leaderboard [1], and none of the models above it mention being evaluated by few-shot learning.

1: https://super.gluebenchmark.com/leaderboard/


The results reported there are what most people would call ‘semi-supervised learning’, not ‘few-shot’. The true few-shot results are in a few places in the paper, https://arxiv.org/abs/2009.07118, labeled with ‘- dist’.


There are many BERT-based models that would have made for a good numeric comparison, had they tested on few-shot learning, but I'm not aware of any that have.


Well, in table 1 they compare to RoBERTa trained in a vanilla supervised fashion?


We should not be surprised to see new language models achieving GPT-3 performance with fewer parameters. The purpose (I assume, as an outsider) of the GPT-* project is to try to find the upper bound of how good a language model can be, without caring too much about efficiency. GPT-3 essentially scaled up the known good architecture of GPT-2.

If you have a really big compute cluster, it makes sense to do experiments like this. It would be foolish to constantly try new methods without occasionally checking to see how far you can push current methods.

A similar thing happened with VGGnet in image classification. It achieved SoTA with a huge number of parameters, using the standard techniques of the time but increasing the network depth. Later, people discovered a lot of tricks to get similar accuracy with fewer parameters.


> A similar thing happened with VGGnet in image classification

I know I am at risk of a "Nobody will ever need more than 640KB of memory" comment, but the model size has exploded far in excess of our ability to improve GPU cards.

GPT-3 is two orders of magnitude larger than VGGnet-16. Back when the VGG paper was published a Titan Z (2014 gen) had 12gb of RAM, while a Titan RTX (current/previous gen) has just doubled to 24gb.


We can now reach the performance of the original VGG16/VGG19 with models that are much smaller. On ImageNet, MobileNetV2 matches the performance with roughly 1/30th the weights, and a similar order-of-magnitude reduction in inference time and RAM usage. The same will probably happen with GPT-3.
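
(If you want to check the sizes yourself, here is a quick sketch; it assumes PyTorch and torchvision are installed and simply counts parameters.)

    # Compare parameter counts of the two architectures mentioned above.
    import torchvision.models as models

    def count_params(model) -> int:
        return sum(p.numel() for p in model.parameters())

    print(f"VGG16:       {count_params(models.vgg16()):,}")        # roughly 138M
    print(f"MobileNetV2: {count_params(models.mobilenet_v2()):,}") # roughly 3.5M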

Of course we will probably also get models that surpass GPT-3 at the cost of even larger and more expensive models. The two are not at all exclusive.


Your point is still valid, but yesterday I learned about the newest workstation cards from Nvidia - the A6000 series with 48GB:

https://www.engadget.com/nvidia-rtx-a6000-a40-gpu-profession...


If you want to read the source, I recommend skipping to their follow-up paper (same authors): https://arxiv.org/abs/2009.07118

Edit: tangentially related but for those who like to have a glance from their phone, arxiv-vanity is great instead of squinting at a pdf: https://www.arxiv-vanity.com/papers/2009.07118/


What is wrong with opening the PDF on a phone?

Arxiv Vanity has issues. Figure 2 in the paper you listed doesn't display correctly. I am not sure the author would agree to, or be happy with, Arxiv Vanity reproducing their work as a webpage with sub-par rendering.


Nice. We're building a machine learning platform[0] where you can schedule a training notebook job and watch it on your phone. Examples[1][2]. Docs[3].

- [0]: https://iko.ai

- [1]: https://twitter.com/jugurthahadjar/status/130762501783286989...

- [2]: https://twitter.com/jugurthahadjar/status/131380072874244915...

- [3]: https://iko.ai/docs/notebook/#long-running-notebooks


Just to clarify: this is about the fact that it's cool to be able to read/consult from a phone what is usually better suited to a larger monitor, such as content from arXiv or ML workloads. The links I shared are relevant because of that.


The headline is clickbait (GPT-3 is about language modeling/generation, and there's ZERO mention of that in this work), but the actual work is interesting and the paper is worth a read.

To the OP: please consider changing the headline to "Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference" and link to the original paper at https://arxiv.org/abs/2001.07676 instead of this PR puff piece.


No one except people working on cloze questions will be interested in this with that title. A badly written PR puff piece is bad, but no one will read that abstract unless they are in that field.


Exactly. "We made GPT-3 with 99.9% less overhead" is a titillating and pants-wetting event. This is not.


> but no one will read that abstract unless they are in that field

Which is fine?

What's the point of getting people outside of the field to read a paper by essentially lying about the content? They come expecting A, and if they read it they realize it's actually about B. A loss of time for everyone except the author, who wants to make a little buzz.


I think few-shot learning (or priming) is actually the main selling point of GPT-3 for most practical applications (rather than merely entertaining language generation). So if there is a method that achieves the same goal with a model that is simple enough to be used by normal developers and researchers without OpenAI-scale infrastructure, that does seem buzz-worthy.


> So if there is a method that achieves the same goal with a model that is simple enough to be used by normal developers and researchers without OpenAI-scale infrastructure, that does seem buzz-worthy.

That's trivial and has already been done. GPT-3 didn't even get SOTA on SuperGLUE.

These are the sort of misunderstandings that could have been avoided if the title was better.

In general, papers proposing "a new variation of the cloze pre-training task for this specific task" form a section of the literature that is rapidly becoming mundane and uninteresting because there are so many papers doing small variations of the same basic idea.


> GPT-3 didn't even get SOTA on SuperGLUE.

Of course, neither GPT-3 nor the PET paper claims SOTA on SuperGLUE. They used a few-shot learning setup with 32 examples per task. The normal SuperGLUE setup has hundreds or thousands of examples per task [1].

> In general, paper with "new variation of cloze pre-training task for this specific task" is a new section of the literature that is rapidly becoming sort of mundane and uninteresting because there are so many papers doing small variations of the same basic idea.

Could you please link to some of the work you are referring to?

[1] Table 1 in https://w4ngatang.github.io/static/papers/superglue.pdf


I was sloppy in my skimming of the paper - upon closer reading it does actually seem quite different from the literature I mentioned (examples: RoBERTa, XLNet). I'll be reading it more carefully, but I can now better understand the comparison to GPT-3.


I'm just saying that there is a difference between a PR fluff piece that lies about the content and a PR fluff piece that tells the truth and makes research accessible or interesting to the lay public.


Am I the only one who finds it strange that a preprint gets that kind of news coverage and the authors of the preprint are the only ones who are interviewed? Even if it wasn't a preprint, I suppose it's good practice to ask independent experts for a second and third opinion. (Edit: a problem I see with such news coverage is that it really encourages researchers to rush their work out and make headline-grabbing claims even before they have any idea of what the scientific community thinks about the work.)


> although [the new technique] produced better results for NLP benchmarks, GPT-3 appeared more flexible.


Slightly OT: Will GPT-3 itself have any impact IRL? I ask because of the lack of options/competition on the hosting side (there's just Azure). This implies high prices, hindering most use cases from breaking even or being profitable.


Mostly more effective spam. I see it as having access to the text generation capability of an infinite number of 15 year olds.


I think it's already good enough to generate marketing copy (at a glance, not my area of expertise).

Currently, AI is still very energy intensive though so it won't set the world on fire just yet - i.e. for many tasks the human brain still reigns king and it runs on something like 20W.


This title is somewhat misleading; it's more that you can use semi-supervised learning to improve fine-tuning of a small model that's already been pretrained.

I think it’s impressive that this technique beats gpt-3 in any setting with such a small model, but for example, what stops you from applying this to gpt-3 itself and getting an equivalent gain?

Saying that it exceeds gpt-3 performance sounds like it does better during pretraining, which would be extremely impressive.


Sure, they outperformed GPT-3... but can they DDoS my Twitter account via a bunch of software engineers that won't stop talking about their model???


... even more, is the new model "so goddamn powerful" that it can't even be trusted with normal humans, and must be doled out (for money) a little bit at a time so that some normie doesn't accidentally create skynet?


It is starting to become a meme, but it sure achieved its PR goal?


I'm not surprised, a space with 175B parameters will be very sparse.


It's not obvious to me that this is the case. Chris Olah and others talk about "superposition" as a mechanism to explain "polysemantic" neurons that arise in image classifiers. To me, that suggests (using very vague, hand-wavy terms) that the optimization process is attempting to pack in as many concepts as possible into the finite parameter space. Certainly the scaling of GPT-3 suggests that these larger models are not necessarily any sparser than smaller ones.

[1]: https://distill.pub/2020/circuits/zoom-in/


That is fascinating, thanks for the link.


What a weird article.

Semi-supervised training is always interesting, but they lead with the fact that it outperforms GPT-3 on SuperGLUE in the few-shot setting, where GPT-3 isn't training/fine-tuning but PET is.

To make it clear: they fine-tune PET using synthetic data and then compare the results to GPT-3, where GPT-3 is merely initialized to prepare for querying.

I would like to see how much overlap the synthetic data has with the queries used.


Well, if an end task is constrained to one thing, then sure, a model with special tricks to optimize for that goal performs better. GPT-3 is about finding out how deep and wide a language model can be made so that it learns general tasks from a small number of examples.


I just wish I had the vocabulary to truly appreciate what's going on here.


Don’t worry, one day soon you’ll be able to ask an AI agent to explain to you what’s going on, in simple terms.


GPT-3 is also being used to make hedge funds money by generating investing themes https://medium.com/@492727ZED/dexamethasone-announcement-cou...


Just don't allow it to make itself better or we'll be in trouble....


You only do better than gpt3 with fewer parameters if you can achieve the same or better performance across the range of tasks gpt3 can do well. Testing a general model like gpt3 against something that can address a single problem does not make sense.


It does make sense if your aim is to address that single problem. They're comparing against GPT-3 because that was the previous record holder (in the few-shot setting), despite being a more general model.


Good, "Open-AI" can suck it, the pricks.


NLP is not "natural language processing" as mentioned in the article, it's "neuro linguistic programming".



