This is a surprisingly unintelligent move from OpenAI. It adds corporate inertia to something as mundane as choice of DL framework.
Imagine you worked at OpenAI. Imagine you wanted to experiment with Jax, and that it turned out to be the best solution for the problem. Now you can't ship without a solid technical justification.
Except, it's not really a technical justification that you need. You need corporate clout. You can't just be a junior engineer and make a decision that goes against corporate policy. That's the point of having a corporate policy.
I can hear a thousand people about to type "C'mon, OpenAI isn't a normal corporation." But it is. Every corporation is a normal corporation. And having policies against specific tech should make productive programmers pause.
People get jobs at companies based on whether they use React or Vue, for example. And in DL, a programming library is basically a programming language, so it's one step more powerful than that.
Here's an example. Pytorch, as far as I can tell, doesn't support running code on a TPU's CPU. (I could be wrong about this!) When you enumerate the list of accelerators available after connecting to a TPU, you get a list of 8 entries. That means they only support executing code on the cores of a TPU, not the TPU's CPU. This is a huge difference. It means you're restricted to 8GB on TPUv2-8's (which you get on Colab) instead of 300GB.
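For concreteness, here's roughly what that enumeration looks like with the current PyTorch/XLA support (just a sketch; it assumes the torch_xla package on a Colab TPU runtime):

    import torch_xla.core.xla_model as xm

    # On a TPUv2-8 this typically prints 8 XLA devices: the TPU cores.
    # There's no entry for the TPU host's CPU as a compute target.
    print(xm.get_xla_supported_devices())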
Does that count as a solid technical justification to use Tensorflow for a research project instead of Pytorch? Who knows. But who wants to be the odd one out on corporate politics? Especially if a project doesn't generate any tangible results, which is often the case for research.
Or maybe they saw this problem coming, and that's why the policy is sanely phrased as follows:
"Going forward we’ll primarily use PyTorch as our deep learning framework but sometimes use other ones when there’s a specific technical reason to do so."
It never works out this way in practice. You need corporate clout to go against corporate policy. That's the point of having a corporate policy.
Of course they added that caveat. That's probably how this idea got through in the first place. Just point at the caveat and say "But we're not really throwing all the other frameworks under the bus. If everyone decides it's a good idea to use something else, we'll use something else."
Except that likely won't happen, because now as a junior engineer you need to convince N other people that using Jax was a decent choice. And it's against your company's culture to use anything but Pytorch.
This battle of Tensorflow vs Pytorch is bad for everybody involved. OpenAI released a lot of cool and important code related to Tensorflow. They released GPT-2 (Tensorflow 1.x), blocksparse (also Tensorflow), and memory saving gradients (Tensorflow 1.x), and now they're announcing they'll likely never release such tooling again. Memory saving gradients have been hugely helpful to us for scaling our models beyond the normal limits.
What you’re ignoring is that the switch isn’t from nothing to Pytorch, it’s from Tensorflow to Pytorch. It’s only favoring one library over another. Your scenario with Jax hasn’t changed, and such tooling is going to be released for Pytorch instead of for Tensorflow. I suspect you’re only against this because you prefer Tensorflow to Pytorch.
OpenAI always had an "official" framework - it just used to be Tensorflow. That's why many of its public packages (baselines, spinningup, blocksparse, etc.) were built with Tensorflow.
OpenAI has researchers, and it has people who work on infrastructure for the researchers. Researchers are free to use whatever they want, but if the infrastructure developers want to build something for the researchers, it's beneficial to have a "standard".
OpenAI researchers will still be able to use other frameworks for their own research. All this means is that their major infrastructural projects will be released in PyTorch.
My point is that we probably won't be seeing more awesome projects from OpenAI written in Tensorflow. And that's unfortunate. Memory saving gradients were particularly helpful.
1. GPT hasn't really been about model/architectural experimentation, just scale. GPT-2 and GPT were architecturally very similar. Scale, especially at the scale of GPT-*, is one area where TensorFlow does have an edge over PyTorch.
2. Work on GPT-3 probably started quite a while ago.
AFAIK, the problems with running Pytorch on TPUs have mostly been ironed out.
Also, this move makes a lot of sense for OpenAI. TF is a nightmare of different modules kludged on top of one another, many of which do the same thing. The API has changed so much even in a few years that code from previous versions won't run without -- in some cases -- significant modification. Finally, it's always been horrible to debug, since it obfuscates the actual workings of the network behind a sess.run().
Pytorch is not only far more productive (largely because it's far easier to debug), it also has a better ecosystem now because old code still runs. For students, it's also far easier to look at a Pytorch implementation and figure out what the author has actually done. If it's a choice between getting your hands dirty with low-level TF, bending Keras to your will, or putting something together in Pytorch, the last option is just the better choice. It works on TPUs, and it has Tensorboard, a C++ API for robotics, and (afaik) recently developed deployment tools.
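A toy illustration of the debugging point (my example, not anything from OpenAI's code):

    import torch
    x = torch.arange(6.0).reshape(2, 3)
    print((x * 2).mean())              # values are right there; print/debug freely

    import tensorflow as tf            # TF 1.x, as in the older OpenAI repos
    y = tf.reduce_mean(tf.constant([[0., 1., 2.], [3., 4., 5.]]) * 2)
    print(y)                           # a symbolic Tensor, no numbers yet
    with tf.Session() as sess:
        print(sess.run(y))             # the value only appears via sess.run()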
The cost of switching from TF to Pytorch is vastly outweighed by the cost of not switching: the loss of momentum OpenAI would experience, simply because everyone else is using a toolkit that they don't support.
I agree with this to some extent, but there are real advantages to having all your code in same framework, and PyTorch is significantly easier to iterate on and debug compared to TensorFlow (in my experience). Hopefully PyTorch will start offering better support for tensor processors like google’s TPUs, but from the sound of it, OpenAI is primarily using Azure for their training infrastructure and I don’t think Microsoft currently offers anything except GPUs.
Why is TPU support important to OpenAI? They run their code on Microsoft servers.
Only a tiny percentage of people use Colab for deep learning.
Also, if you search for "PyTorch TPU" you can find details of preliminary support from Google.
PyTorch will make their engineers and scientists a decent amount more productive. I don't see how that's unintelligent at all.
Because TPUs are the only way to fit 300GB backprop onto a single device.
You literally can't train models on GPUs when they require 300GB for backprop. Not unless you do model parallelization, which isn't always possible (and is significantly more engineering effort than "just run the model").
When you have policies like this, you lose out on such advantages. Especially for infrastructure purposes.
Sure, you can network together a bunch of TPUs to get access to more memory (in either a data parallel or model parallel way), but that doesn't give you more memory on the same chip. It's basically the same way you would do things on a GPU cluster.
Think of a TPU as a box with a CPU, RAM, and eight GPUs. In the same way that you can run code on either the GPUs or the CPU, you can run code on the TPU's CPU.
When you run code on the TPU's CPU, you have access to up to 300GB before OOMing. It's distinct from running on the TPU cores, which gives you only 8GB for TPUv2 and 16GB for TPUv3, as you say.
I use this technique regularly. All you have to do is place your ops inside a with tf.device(None): block; there's a sketch below.
The TPU's CPU is pretty fast. Normally it's only used for input pipeline transformations. I have no idea why. We use it for actual backprop on massive models.
(I call this "coreless mode" because "TPU's CPU" is a confusing mouthful.)
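Here's roughly what that looks like, assuming it behaves as I described (the TPU address is a placeholder; on Colab it comes from the COLAB_TPU_ADDR environment variable):

    import tensorflow as tf  # TF 1.x

    tpu_address = "grpc://10.0.0.2:8470"   # placeholder TPU worker address

    graph = tf.Graph()
    with graph.as_default():
        with tf.device(None):              # no device constraint: ops land on the TPU host's CPU
            x = tf.random.normal([1024, 1024])
            y = tf.matmul(x, x)            # uses the host's ~300GB of RAM, not a core's 8GB

    with tf.Session(tpu_address, graph=graph) as sess:
        print(sess.run(y).shape)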
I don't get your excitement. How is this different from using an 8x GPU box? If you use eight Quadro 8000 cards you have access to 384GB of memory to train your models.
The TPU equivalent of 8x Quadro 8000 would be something between TPU v2-32 and TPU v3-32, and the monthly cost of TPU v2-32 is ~$8k. Plus the cost of a beefy VM. Assuming the GPU build sets you back ~$60k, it will start saving you ~$8k/mo after 6 months.
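Back-of-the-envelope version of that claim (the VM figure is my own guess):

    gpu_build = 60_000        # ~$60k upfront for the 8x Quadro 8000 box
    tpu_monthly = 8_000       # ~$8k/mo for a TPU v2-32
    vm_monthly = 1_500        # assumed cost of the beefy VM
    print(gpu_build / (tpu_monthly + vm_monthly))   # ~6.3 months to break even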
In the link I posted, TPU v2-8 has 64GB of total memory, and v2-32 has 256GB.
As for the beefy VM: can you do heavy data preprocessing on TPUs? For example, elastic distortions or scaling for images? Probably not, because usually that involves OpenCV or similar libraries.
The link is talking about per-core memory. A TPUv2-8 has 300GB system memory, which you can use for training. You can verify this using the notebooks above.
(If a TPUv2-8 has 64GB memory, how can it fine tune GPT-2 1.5B using Adam with batch size 4? That requires almost 300GB.)
A TPUv3 pod is actually a bunch of individual TPUv3-8's linked together. There's 8 cores per device, so a TPUv3-512 has 512 cores divided by 8 cores per device = 64 individual TPUs. (You can get each individual TPU's IP address using `gcloud compute tpus list`: https://imgur.com/Qym4l17)
The big question is, since there are 64 individual TPUs, does that mean we have access to 300GB * 64 = 19.2 TB of memory?
I haven't tested that, but I would bet the answer is yes, for two reasons. 1. I've seen allocations of up to 7TB according to memory usage logs, so 19TB doesn't seem far fetched in comparison. 2. If you create 64 individual TPUv3-8's, then you definitely will have access to 300GB of memory on each TPU, so it's the same engineering problem either way.
Right now, people only seem to use the TPU's CPU for infeed processing / input pipeline transformations. But the CPU is quite fast – it's almost as fast as an actual TPU core.
Also, if you want to play around with a few TPUv3-8's and you have a GCE project, feel free to DM me on twitter. We just figured out how to forward TPUs to VMs in different projects: https://twitter.com/theshawwn/status/1221241517626445826
Is there an official specification clarifying this somewhere?
So you are saying the system memory is 300GB and you can train your model on the CPU instead? Well yeah, you can always do that, but training will be slow because your model is not trained on the GPU. What's the point?
If that were the case, I wonder why anyone would buy GPUs. I invite you to retrain a state-of-the-art model of your choice on a CPU and see how far you get.
I wrote code to repeat the wpe variable N times along the context axis during model load time.
Specifically, the code checks whether the model's shape is greater than the shape of the snapshot on disk. If so, it tiles the variable from the snapshot N times to fill the expected larger shape.
At that point, you can just set the context window to a larger value, then train.
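A minimal sketch of the idea (the names here are illustrative, not the actual loader code):

    import numpy as np

    def expand_wpe(saved_wpe, target_len):
        """Tile a saved (saved_len, n_embd) position embedding along the
        context axis until it covers target_len rows, then truncate."""
        saved_len, n_embd = saved_wpe.shape
        if target_len <= saved_len:
            return saved_wpe[:target_len]
        reps = -(-target_len // saved_len)          # ceil(target_len / saved_len)
        return np.tile(saved_wpe, (reps, 1))[:target_len]

    # e.g. stretch GPT-2's 1024-position table to 3072 positions before training:
    # wpe = expand_wpe(checkpoint_wpe, 3072)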
Is that essentially repeating the position embedding? I'm surprised that works, since the model should have no way to distinguish between the (e.g.) 1st and 513th token. (If I'm understanding this correctly.)
Yeah, it seems to "work" in the sense that if you repeat wpe size (1024, 768) three times, so that it becomes (3072, 768), then the model can successfully generate up to 1024 tokens.
Generating more tokens seems to work up to a point -- you can probably generate up to 1050 tokens with this technique. But at a certain point, more tokens = gibberish.
The cure is to train the new wpe layer the same way you'd train the smaller one. But this also means you don't have to start training from scratch.
A single TPU v3 has 8 cores, so that’s 128GB memory total, which is more than any single GPU currently.
The TPU software does data parallelism (in Tensorflow) transparently, and it’s somewhat easier to do model parallelism because the chip-to-chip interconnect is built in and requires no special setup / drivers. You’ll still get an OOM from XLA if you have a tensor that won’t fit in the 16GB of a single core.
TPU pods are easier to use than clusters of infiniband-linked volta boxes. For TPUs you just give GCE money and make some small changes to your use of the TPU API. For the volta cluster you’d probably need to bring your own orchestration (e.g. Horovod). So a TPU pod is easier for one person to use and admin currently.
You're overlooking the strong reasons why OpenAI might want everyone on the same page. (Code reuse, accumulation of expertise, specialization of infrastructure, ...).