This is a surprisingly unintelligent move from OpenAI. It adds corporate inertia to something as mundane as choice of DL framework.
Imagine you worked at OpenAI. Imagine you wanted to experiment with Jax, and that it turned out to be the best solution for the problem. Now you can't ship without a solid technical justification.
Except, it's not really a technical justification that you need. You need corporate clout. You can't just be a junior engineer and make a decision that goes against corporate policy. That's the point of having a corporate policy.
I can hear a thousand people about to type "C'mon, OpenAI isn't a normal corporation." But it is. Every corporation is a normal corporation. And having policies against specific tech should make productive programmers pause.
People get jobs at companies based on whether they use React or Vue, for example. And in DL, a programming library is basically a programming language, so it's one step more powerful than that.
Here's an example. Pytorch, as far as I can tell, doesn't support running code on a TPU's CPU. (I could be wrong about this!) When you enumerate the list of accelerators available after connecting to a TPU, you get a list of 8 entries. That means they only support executing code on the cores of a TPU, not the TPU's CPU. This is a huge difference. It means you're restricted to 8GB on TPUv2-8's (which you get on Colab) instead of 300GB.
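For concreteness, here's roughly what that enumeration looks like with the current PyTorch/XLA support (just a sketch; it assumes the torch_xla package on a Colab TPU runtime):

    import torch_xla.core.xla_model as xm

    # On a TPUv2-8 this typically prints 8 XLA devices: the TPU cores.
    # There's no entry for the TPU host's CPU as a compute target.
    print(xm.get_xla_supported_devices())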
Does that count as a solid technical justification to use Tensorflow for a research project instead of Pytorch? Who knows. But who wants to be the odd one out on corporate politics? Especially if a project doesn't generate any tangible results, which is often the case for research.
Or maybe they saw this problem coming, and that's why the policy is sanely phrased as follows:
"Going forward we’ll primarily use PyTorch as our deep learning framework but sometimes use other ones when there’s a specific technical reason to do so."
It never works out this way in practice. You need corporate clout to go against corporate policy. That's the point of having a corporate policy.
Of course they added that caveat. That's probably how this idea got through in the first place. Just point at the caveat and say "But we're not really throwing all the other frameworks under the bus. If everyone decides it's a good idea to use something else, we'll use something else."
Except that likely won't happen, because now as a junior engineer you need to convince N other people that using Jax was a decent choice. And it's against your company's culture to use anything but Pytorch.
This battle of Tensorflow vs Pytorch is bad for everybody involved. OpenAI released a lot of cool and important code related to Tensorflow. They released GPT-2 (Tensorflow 1.x), blocksparse (also Tensorflow), and memory saving gradients (Tensorflow 1.x), and now they're announcing they'll likely never release such tooling again. Memory saving gradients have been hugely helpful to us for scaling our models beyond the normal limits.
What you’re ignoring is that the switch isn’t from nothing to Pytorch, it’s from Tensorflow to Pytorch. It’s only favoring one library over another. Your scenario with Jax hasn’t changed, and such tooling is going to be released for Pytorch instead of for Tensorflow. I suspect you’re only against this because you prefer Tensorflow to Pytorch.
OpenAI always had an "official" framework - it just used to be Tensorflow. That's why many of its public packages (baselines, spinningup, blocksparse, etc.) were built with Tensorflow.
OpenAI has researchers, and it has people who work on infrastructure for the researchers. Researchers are free to use whatever they want, but if the infrastructure developers want to build something for the researchers, it's beneficial to have a "standard".
OpenAI researchers will still be able to use other frameworks for their own research. All this means is that their major infrastructural projects will be released in PyTorch.
My point is that we probably won't be seeing more awesome projects from OpenAI written in Tensorflow. And that's unfortunate. Memory saving gradients were particularly helpful.
1. GPT hasn't really been about model/architectural experimentation, just scale. GPT-2 and GPT were architecturally very similar. Scale, especially at the scale of GPT-*, is one area where TensorFlow does have an edge over PyTorch.
2. Work on GPT-3 probably started quite a while ago.
AFAIK, the problems with running Pytorch on TPUs have mostly been ironed out.
Also, this move makes a lot of sense for OpenAI. TF is a nightmare of different modules kludged on top of one another, many of which do the same thing. The API has changed so much even in a few years that code from previous versions won't run without -- in some cases -- significant modification. Finally, it's always been horrible to debug, since it obfuscates the actual workings of the network behind a sess.run().
Pytorch is not only far more productive (largely because it's far easier to debug), it also has a better ecosystem now because old code still runs. For students, it's also far easier to look at a Pytorch implementation and figure out what the author has actually done. If it's a choice between getting your hands dirty with low-level TF, bending Keras to your will, or putting something together in Pytorch, the last option is just the better choice. It works on TPUs, and it has Tensorboard, a C++ API for robotics, and (afaik) recently developed deployment tools.
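A toy illustration of the debugging point (my example, not anything from OpenAI's code):

    import torch
    x = torch.arange(6.0).reshape(2, 3)
    print((x * 2).mean())              # values are right there; print/debug freely

    import tensorflow as tf            # TF 1.x, as in the older OpenAI repos
    y = tf.reduce_mean(tf.constant([[0., 1., 2.], [3., 4., 5.]]) * 2)
    print(y)                           # a symbolic Tensor, no numbers yet
    with tf.Session() as sess:
        print(sess.run(y))             # the value only appears via sess.run()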
The cost of switching from TF to Pytorch is vastly outweighed by the cost of not switching: the loss of momentum OpenAI would experience, simply because everyone else is using a toolkit that they don't support.
I agree with this to some extent, but there are real advantages to having all your code in same framework, and PyTorch is significantly easier to iterate on and debug compared to TensorFlow (in my experience). Hopefully PyTorch will start offering better support for tensor processors like google’s TPUs, but from the sound of it, OpenAI is primarily using Azure for their training infrastructure and I don’t think Microsoft currently offers anything except GPUs.
Why is TPU support important to OpenAI? They run their code on Microsoft servers.
Only a tiny percentage of people use Colab for deep learning.
Also, if you search for "PyTorch TPU" you can find details of preliminary support from Google.
PyTorch will make their engineers and scientists a decent amount more productive. I don't see how that's unintelligent at all.
Because TPUs are the only way to fit 300GB backprop onto a single device.
You literally can't train models on GPUs when they require 300GB for backprop. Not unless you do model parallelization, which isn't always possible (and is significantly more engineering effort than "just run the model").
When you have policies like this, you lose out on such advantages. Especially for infrastructure purposes.
Sure, you can network together a bunch of TPUs to get access to more memory (in either a data parallel or model parallel way), but that doesn't give you more memory on the same chip. It's basically the same way you would do things on a GPU cluster.
Think of a TPU as a box with a CPU, RAM, and eight GPUs. In the same way that you can run code on either the GPUs or the CPU, you can run code on the TPU's CPU.
When you run code on the TPU's CPU, you have access to up to 300GB before OOMing. It's distinct from running on the TPU cores, which gives you only 8GB for TPUv2 and 16GB for TPUv3, as you say.
I use this technique regularly. All you have to do is place your ops inside a with tf.device(None): block; there's a sketch below.
The TPU's CPU is pretty fast. Normally it's only used for input pipeline transformations. I have no idea why. We use it for actual backprop on massive models.
(I call this "coreless mode" because "TPU's CPU" is a confusing mouthful.)
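Here's roughly what that looks like, assuming it behaves as I described (the TPU address is a placeholder; on Colab it comes from the COLAB_TPU_ADDR environment variable):

    import tensorflow as tf  # TF 1.x

    tpu_address = "grpc://10.0.0.2:8470"   # placeholder TPU worker address

    graph = tf.Graph()
    with graph.as_default():
        with tf.device(None):              # no device constraint: ops land on the TPU host's CPU
            x = tf.random.normal([1024, 1024])
            y = tf.matmul(x, x)            # uses the host's ~300GB of RAM, not a core's 8GB

    with tf.Session(tpu_address, graph=graph) as sess:
        print(sess.run(y).shape)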
I don't get your excitement. How is this different from using an 8x GPU box? If you use eight Quadro 8000 cards you have access to 384GB of memory to train your models.
The TPU equivalent of 8x Quadro 8000 would be something between TPU v2-32 and TPU v3-32, and the monthly cost of TPU v2-32 is ~$8k. Plus the cost of a beefy VM. Assuming the GPU build sets you back ~$60k, it will start saving you ~$8k/mo after 6 months.
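Back-of-the-envelope version of that claim (the VM figure is my own guess):

    gpu_build = 60_000        # ~$60k upfront for the 8x Quadro 8000 box
    tpu_monthly = 8_000       # ~$8k/mo for a TPU v2-32
    vm_monthly = 1_500        # assumed cost of the beefy VM
    print(gpu_build / (tpu_monthly + vm_monthly))   # ~6.3 months to break even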
In the link I posted, TPU v2-8 has 64GB of total memory, and v2-32 has 256GB.
As for the beefy VM: can you do heavy data preprocessing on TPUs? For example, elastic distortions or scaling for images? Probably not, because usually that involves OpenCV or similar libraries.
The link is talking about per-core memory. A TPUv2-8 has 300GB system memory, which you can use for training. You can verify this using the notebooks above.
(If a TPUv2-8 has 64GB memory, how can it fine tune GPT-2 1.5B using Adam with batch size 4? That requires almost 300GB.)
A TPUv3 pod is actually a bunch of individual TPUv3-8's linked together. There's 8 cores per device, so a TPUv3-512 has 512 cores divided by 8 cores per device = 64 individual TPUs. (You can get each individual TPU's IP address using `gcloud compute tpus list`: https://imgur.com/Qym4l17)
The big question is, since there are 64 individual TPUs, does that mean we have access to 300GB * 64 = 19.2 TB of memory?
I haven't tested that, but I would bet the answer is yes, for two reasons. 1. I've seen allocations of up to 7TB according to memory usage logs, so 19TB doesn't seem far fetched in comparison. 2. If you create 64 individual TPUv3-8's, then you definitely will have access to 300GB of memory on each TPU, so it's the same engineering problem either way.
Right now, people only seem to use the TPU's CPU for infeed processing / input pipeline transformations. But the CPU is quite fast – it's almost as fast as an actual TPU core.
Also, if you want to play around with a few TPUv3-8's and you have a GCE project, feel free to DM me on twitter. We just figured out how to forward TPUs to VMs in different projects: https://twitter.com/theshawwn/status/1221241517626445826
Is there an official specification clarifying this somewhere?
So you are saying the system memory is 300GB and you can train your model on the CPU instead? Well yeah, you can always do that, but training will be slow because your model is not trained on the GPU. What's the point?
If that were the case, I wonder why anyone would buy GPUs. I invite you to retrain a state-of-the-art model of your choice on a CPU and see how far you get.
I wrote code to repeat the wpe variable N times along the context axis during model load time.
Specifically, the code checks whether the model's shape is greater than the shape of the snapshot on disk. If so, it tiles the variable from the snapshot N times to fill the expected larger shape.
At that point, you can just set the context window to a larger value, then train.
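A minimal sketch of the idea (the names here are illustrative, not the actual loader code):

    import numpy as np

    def expand_wpe(saved_wpe, target_len):
        """Tile a saved (saved_len, n_embd) position embedding along the
        context axis until it covers target_len rows, then truncate."""
        saved_len, n_embd = saved_wpe.shape
        if target_len <= saved_len:
            return saved_wpe[:target_len]
        reps = -(-target_len // saved_len)          # ceil(target_len / saved_len)
        return np.tile(saved_wpe, (reps, 1))[:target_len]

    # e.g. stretch GPT-2's 1024-position table to 3072 positions before training:
    # wpe = expand_wpe(checkpoint_wpe, 3072)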
Is that essentially repeating the position embedding? I'm surprised that works, since the model should have no way to distinguish between the (e.g.) 1st and 513th token. (If I'm understanding this correctly.)
Yeah, it seems to "work" in the sense that if you repeat wpe size (1024, 768) three times, so that it becomes (3072, 768), then the model can successfully generate up to 1024 tokens.
Generating more tokens seems to work up to a point -- you can probably generate up to 1050 tokens with this technique. But at a certain point, more tokens = gibberish.
The cure is to train the new wpe layer the same way you'd train the smaller one. But this also means you don't have to start training from scratch.
A single TPU v3 has 8 cores, so that’s 128GB memory total, which is more than any single GPU currently.
The TPU software does data parallelism (in Tensorflow) transparently, and it’s somewhat easier to do model parallelism because the chip-to-chip interconnect is built in and requires no special setup / drivers. You’ll still get an OOM from XLA if you have a tensor that won’t fit in the 16GB of a single core.
TPU pods are easier to use than clusters of infiniband-linked volta boxes. For TPUs you just give GCE money and make some small changes to your use of the TPU API. For the volta cluster you’d probably need to bring your own orchestration (e.g. Horovod). So a TPU pod is easier for one person to use and admin currently.
You're overlooking the strong reasons why OpenAI might want everyone on the same page. (Code reuse, accumulation of expertise, specialization of infrastructure, ...).