Standardizing OpenAI’s deep learning framework on PyTorch (openai.com)
242 points by pesenti on Jan 30, 2020 | 116 comments



At work, we switched over from TensorFlow to PyTorch when 1.0 was released, both for R&D and production... and our productivity and happiness with PyTorch noticeably, significantly improved.

Back when we were using TensorFlow, whenever we wanted to try something new that wasn't already provided out-of-the-box by existing APIs, sooner or later we would find ourselves wrestling with its machinery, especially for models with more complex control flow.

TensorFlow feels like it was built from the ground up to scale up to billions of users and all kinds of devices, with developer productivity and happiness a secondary priority. PyTorch feels like it was built the other way around, prioritizing developer productivity and happiness; other considerations were secondary.

That said, we are keeping an eye on Swift + MLIR + TensorFlow. We think it could unseat PyTorch for R&D and eventually, production, due to (a) the promise of automatic creation of high-performance GPU/TPU kernels without hassle, (b) Swift's easy learning curve, and (c) Swift's fast performance and type safety. Jeremy Howard has a good post about this: https://www.fast.ai/2019/03/06/fastai-swift/


If you are looking for more performance and type safety versus python, PyTorch does have a C++ frontend.

https://pytorch.org/cppdocs/frontend.html#end-to-end-example


> TensorFlow feels like it was built from the ground up to scale up to billions of users and all kinds of devices, with developer productivity and happiness a secondary priority. PyTorch feels like it was built the other way around, prioritizing developer productivity and happiness; other considerations were secondary.

I recently moved from Google to Facebook and this is how I'd characterize most of the differences I see: Facebook optimizes for your ability to make progress above everything else. Google, not so much.


Seems to hold up for most google stuff I tried. E.g. kubernetes and angular.


Chris Lattner left Google a few days ago. Swift for TensorFlow is likely dead.


I'd be surprised if it's dead. At least two of the people on the Swift for Tensorflow team were building a deep learning library in Swift before Swift for Tensorflow was a thing. Much of their work is a part of Swift for Tensorflow.


That's good to hear, but S4TF has failed to gain any real traction among ML users and was started over two years ago. I've heard not many care about it apart from the core team and now that Lattner is gone it's likely Google will send it to the graveyard soon.


https://twitter.com/clattner_llvm/status/1222032740897284097

Jeff Dean seems to disagree.

Also, I think it hasn't picked up steam because it just isn't mature enough yet


That's a very non-committed response from Jeff Dean!

I suspect it will drift along like the Java TensorFlow wrapper. Useful in some specific circumstances, but unused outside that.



I think Lattner saying that just confirms there are concerns.

The only people who seem to be saying it will continue are people outside Google. Crucially there's no one within Google saying they use it for internal projects.


And that's good news: Swift is a niche, vendor-locked language. Using something like Rust for production and something like Julia for the R&D stage is a better fit.


What about autodiff? Is that a separate effort?


I first learned TensorFlow and built a lot of very custom stuff with its "Slim" API and have been very happy about it. The only thing I always struggle with is the grunt work of wrangling tensor shapes and indices, e.g. broadcasting, tf.tile, tf.gather, tf.scatter_nd, tf.where etc. These are a real headache to use, while numpy for example has its so called "fancy indexing" and mask-based indexing, which are not available in TF.

Other than that, I'm very happy about TF's graph structure. It means, I can just build the graph (a bit like a declarative style) and in each iteration (session run) I can specify which tensors I'd like to fetch. Then TF takes into account the data dependencies and lazily computes only what is needed. Very convenient and I got very used to this. Maybe someone who just uses PyTorch doesn't miss this because they never used it. My other problem with pytorch is that you have to define every model / layer twice: once in the constructor and once in the forward() function. That's not the best design in terms of DRY. I understand why it's technically needed due to the working principles of PyTorch (especially for weight sharing in different parts of the model, which in TF may take some getting used to, with its variable scopes and reuse flags), but it's just inconvenient.
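
The "build the graph once, then fetch only what you need" behaviour described above can be sketched in plain Python. This is a toy illustration, not TensorFlow's implementation: Node, run, and the trace argument are hypothetical names invented for this sketch, and the real session API differs.

```python
# Toy deferred-execution graph: NOT TensorFlow, just the idea behind
# "build the graph once, then fetch only the tensors you need".

class Node:
    """A graph node: a function plus the nodes it depends on."""

    def __init__(self, name, fn, *inputs):
        self.name = name
        self.fn = fn
        self.inputs = inputs


def run(fetches, feed, trace=None):
    """Evaluate only the nodes that the requested fetches depend on."""
    cache = {}

    def evaluate(node):
        if node.name in feed:          # value supplied by the caller
            return feed[node.name]
        if node.name not in cache:     # compute each node at most once per run
            args = [evaluate(dep) for dep in node.inputs]
            cache[node.name] = node.fn(*args)
            if trace is not None:
                trace.append(node.name)
        return cache[node.name]

    return [evaluate(node) for node in fetches]


# Build the graph once (declarative step, nothing is computed here).
x = Node("x", None)                                   # placeholder, must be fed
hidden = Node("hidden", lambda v: 2 * v, x)
loss = Node("loss", lambda h: (h - 10) ** 2, hidden)
debug_stat = Node("debug_stat", lambda h: h + 1000, hidden)

# Fetch only the loss: debug_stat is never evaluated.
computed = []
(loss_value,) = run([loss], feed={"x": 3}, trace=computed)
print(loss_value, computed)   # 16 ['hidden', 'loss']
```

Asking for a different fetch list on the next call reuses the same graph and evaluates only that fetch's dependencies.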


> Other than that, I'm very happy about TF's graph structure. It means, I can just build the graph (a bit like a declarative style) and in each iteration (session run) I can specify which tensors I'd like to fetch. Then TF takes into account the data dependencies and lazily computes only what is needed. Very convenient and I got very used to this.

That's one of the things we realized we disliked most about TensorFlow: Whenever we tried to do anything that was out-of-the-box, we would have to write Python code that would then construct the TensorFlow graph that would actually specify and eventually run the computations. It felt like metaprogramming from a nice, general-purpose language (Python) in an inflexible, non-general purpose one (TF ops).

> My other problem with pytorch is that you have to define every model / layer twice: once in the constructor and once in the forward() function.

That is not true. You can create graphs, compute losses, and backpropagate them on the fly, because autograd records all operations on the fly.[a] The nn.Module class is a convenient, nice, Pythonic wrapper with lots of functionality, but you don't have to use it. You can create your own wrapper if you'd like.

[a] https://pytorch.org/docs/stable/notes/autograd.html
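
To make "records all operations on the fly" concrete, here is a minimal scalar sketch of define-by-run differentiation in plain Python. It is only an illustration of the idea and assumes nothing about PyTorch's internals; the Value class and its method names are hypothetical.

```python
# Toy define-by-run autodiff on scalars. This illustrates the idea only;
# it is not PyTorch's autograd implementation.

class Value:
    """A scalar that remembers how it was computed."""

    def __init__(self, data, parents=(), backward_fn=None):
        self.data = data
        self.grad = 0.0
        self.parents = parents
        self.backward_fn = backward_fn   # pushes self.grad to the parents

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def backward_fn():
            self.grad += out.grad
            other.grad += out.grad

        out.backward_fn = backward_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def backward_fn():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out.backward_fn = backward_fn
        return out

    def backward(self):
        # Topologically order the recorded graph, then apply the chain rule.
        order, seen = [], set()

        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v.parents:
                    visit(p)
                order.append(v)

        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            if v.backward_fn is not None:
                v.backward_fn()


w = Value(3.0)
x = Value(2.0)
b = Value(1.0)

# Ordinary Python control flow decides the graph shape at run time.
y = w * x + b
if y.data > 5:
    y = y * y

y.backward()
print(y.data, w.grad, x.grad, b.grad)   # 49.0 28.0 42.0 14.0
```

No layer or module class is involved: the graph exists only because the arithmetic ran, which is the point being made about not needing nn.Module.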


> You can create graphs, compute losses, and backpropagate them on the fly

Sure, but you'd have to define the weight variables for your conv layer first outside the train loop, then use them within the train loop. With TF I can just create the weights and the ops in one line.

Also, constantly doing .cuda() and .cpu() is not much nicer than feed dicts and fetches.

The real advantage of pytorch is the debuggability, you can set a breakpoint and see every value.


I think I understand your point about declaring and then later using your layers.

Are you aware of the Sequential module? It allows you to chain together layers into a single variable, making this repetition disappear into a single forward/__call__ on the Sequential.
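
A rough sketch of the container idea, in plain Python rather than PyTorch: the Scale and Shift layers, the build_model helper, and this Sequential class are all hypothetical stand-ins, and torch.nn.Sequential is the real thing. It only covers pure chains of layers.

```python
# Toy stand-in for the container idea; torch.nn.Sequential is the real API.

class Scale:
    """Hypothetical layer: multiplies its input by a constant."""

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, x):
        return x * self.factor


class Shift:
    """Hypothetical layer: adds a constant to its input."""

    def __init__(self, offset):
        self.offset = offset

    def __call__(self, x):
        return x + self.offset


class Sequential:
    """Holds layers in order and applies them one after another."""

    def __init__(self, *layers):
        self.layers = list(layers)

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


def build_model(use_shift):
    # The design flag is checked once, where the layers are created;
    # the forward pass is just "call the container".
    layers = [Scale(2.0)]
    if use_shift:
        layers.append(Shift(1.0))
    layers.append(Scale(0.5))
    return Sequential(*layers)


plain = build_model(use_shift=False)
shifted = build_model(use_shift=True)
print(plain(3.0), shifted(3.0))   # 3.0 3.5
```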


Thanks, that does improve it for simple cases of pure chains of layers without any skip connections etc.

Unfortunately it's mostly an issue with more complicated architectures. When testing different designs I often forget to modify it in both places, and sure, it's not a huge deal, but having one place or two does feel like a qualitative difference in cognitive burden. For example, if I introduce a bool flag which decides whether I do a particular alternative design, I have to check the flag and do an if/else in two places: once to create the layers, once to use them.


The problem is not that Tensorflow is based around python instead of swift. The Tensorflow API is a mess. The programming language is not the problem, and switching to swift will only add to the API mess.


Thanks for your comments. I have been using TF just about since the first public version and it is what I am used to. I have been thinking of switching to PyTorch, and I think I will take the plunge and tackle the PyTorch learning curve.


Having transitioned from TF to PyTorch for both research and production, I can tell you the learning curve is pretty mild.

At first, PyTorch will feel like "Numpy on accelerated hardware with built-in backprop and DL/ML facilities." And like Numpy, it will quickly become second nature and get out of your way.

Good luck with the transition!


Swift for tensorflow is probably dead in the water though, now that Chris Lattner left Google this week: https://www.businesswire.com/news/home/20200127005141/en/Goo...


I'm kinda looking forward to seeing Bazel gone as well, in a few years. The Tensorflow of build systems ;).

It also seems there is something strange going on with open source from Google. Maybe it is as simple as a mismatch of impedance between their monorepo and the outside world, which they don't maintain. Or maybe it is something else.


If you want type safety on top of PyTorch, check out this project: https://github.com/NVIDIA/NeMo


It'd be nice to have the same pytorch abstractions but in a more statically typed language than pytorch.


You can use PyTorch Rust bindings [1] or OCaml bindings [2].

[1] https://github.com/LaurentMazare/tch-rs

[2] https://github.com/LaurentMazare/ocaml-torch


there's a C++14 frontend

https://pytorch.org/cppdocs/frontend.html#end-to-end-example

though the typing doesn't dive into tensor dimensions etc.


I've started working with Flux [1] in Julia, and it's so elegant and such a great experience :). Just look at this definition of a U-net model for image segmentation: https://gist.github.com/haampie/bceb1d59fd9a44f092f913062e58.... Apart from that, you can write your own custom loss functions in pure Julia that run efficiently on the GPU, language level automatic differentiation, proper integration with other packages. If people are moving away from Tensorflow, then Flux could be a solid alternative as well.

[1] https://github.com/FluxML/Flux.jl


IMO, these kinds of functional abstractions look nice on paper, but are a pain in the ass to actually use. In practice, you'll want to print out things in between each layer, you'll want to log each layer's activations, you might want to redirect a layer into another network, etc.

Both PyTorch and Tensorflow have purely functional abstractions, but they're relegated to super basic functionalities.


Julia gives you the best of both worlds. And more.

All those pretty function like things you see above are actually callable objects that can be introspected, intercepted and dispatched on...so you can mix and match pure object abstractions, pure function abstractions and objects with function like properties depending on the usecase.

This is because Julia's philosophy is to make Differentiable Programming a completely seamless and normal programming paradigm inter-operable with all standard code patterns.

And all this is only possible because of a unique mix of amazing reflection, code gen (including hooking into the compiler from third party packages, allowing source to source autodiff and GPU codegen), fast generic/parametric polymorphism even across packages, multiple dispatch and macros, among other technologies.

It's not quite at the stage of "write any normal Julia code and it just works", as there are some rough edges being worked out, but that's the vision, and even now it's leaps and bounds above PyTorch.


Sorry, can you clarify on what you think is a purely functional abstraction?

Flux is incredibly flexible and can do all sorts of things that are not limited to purely functional code and Flux is capable of many things that are straight up impossible or infeasible in PyTorch or TensorFlow (with or without their 'purely functional' abstractions).


Super late reply, so it's likely you won't see this... (Too bad HN doesn't notify on replies).

I'm not complaining about Flux in general, I'm talking about the specific example (the UNet) he brought up that he uses to claim that Julia is so elegant.

Can you elaborate on what Flux can do that Pytorch can't?


At this point, the DiffEqFlux neural differential equation library fits the neural ODE example from the original paper in 29 seconds [1]. The forward pass of torchdiffeq on trivial ODEs without neural networks takes 47 seconds [2] (and of course adding in neural networks makes it a lot more expensive). This is a massive real-world difference. It means that the Julia packages are building animations by looking at real-time fitting plots, while it's an hours-long ordeal in PyTorch. Being able to use optimized packages instead of hardcoding a simple version of things really pays off in the long run, and here using a real ODE solver suite is not a small difference; it's multiple orders of magnitude. That's the real benefit of differentiable programming.

  [1] https://github.com/JuliaDiffEq/DiffEqFlux.jl#training-a-neural-ordinary-differential-equation
  [2] https://gist.github.com/ChrisRackauckas/cc6ac746e2dfd285c28e0584a2bfd320


yes but you can't pay me enough to get me to learn Julia


Any particular reason why?


As someone who has used both PyTorch and TensorFlow for a couple of years now, I can attest to the faster research iteration times for PyTorch. TensorFlow has always felt like it was designed for some mythical researcher who could come up with a complete architecture ahead of time, based on off-the-shelf parts.


Indeed, no wonder PyTorch has beaten Tensorflow so thoroughly in the last 3 years, going up from 1% of the papers to ~50% of the papers (TensorFlow is now down to only 23% of the papers):

https://paperswithcode.com/trends


The methodology on that page would classify the standalone version of Keras (using from keras.models imports, as recommended by the Keras docs) as "Other". (I tried finding the source code to verify this, but couldn't.)

And if that is correct, then I'd be astonished if the vast majority of the "Other" papers aren't Keras. I work in ML and I don't think I've seen a paper that didn't use PyTorch, TensorFlow or Keras in years.

And if that's the case, then almost certainly more papers use TF than PyTorch: PyTorch is 42%, TF is 23%, but Other is 36%.

(In terms of biases, I hate working in Tensorflow, and much prefer PyTorch and Keras. But numbers are numbers).
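
The arithmetic behind that claim can be checked in a few lines. This sketch assumes, as the comment does, that Keras papers in the "Other" bucket should count toward TensorFlow; the function name and the example fractions are illustrative, not from the source page.

```python
# Shares quoted in the comment above (percent of papers). They sum to 101,
# presumably because of rounding on the source page.
pytorch, tensorflow, other = 42, 23, 36


def tf_family_share(keras_fraction_of_other):
    """TensorFlow share if this fraction of 'Other' is Keras on a TF backend."""
    return tensorflow + other * keras_fraction_of_other


# Fraction of "Other" that must be Keras for the TF family to match PyTorch.
break_even = (pytorch - tensorflow) / other

print(round(break_even, 3))    # 0.528
print(tf_family_share(1.0))    # 59.0
print(tf_family_share(0.5))    # 41.0
```

So under these numbers the claim holds only if a little over half of "Other" is Keras.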


Jax?


Are there any papers that use it for things other than demonstrating Jax? I can't think of one off the top of my head.

Perhaps I should have specified "papers outside those introducing new frameworks, or around speed benchmarking".

There are a bunch of interesting papers using custom libraries for distributed training, and ones targeted at showing off the performance of specific hardware (NVidia has a bunch of interesting work in this space, and Intel and other smaller vendors have done things too).


It's still early days for JAX, but there's neural tangents https://arxiv.org/abs/1912.02803 and reformer https://arxiv.org/abs/2001.04451 from iclr.


I agree about it being early days.

Reformer is a good example that I'd missed.

Neural Tangents is another paper demoing a framework.


Keras is pretty good unless you hit some custom loss function that needs to do operations that aren't defined in Keras' backend, then you suddenly have to switch over to write them in TensorFlow with some ugly consequences (sometimes you don't know which operations will be GPU-accelerated; slicing vectors to compute and aggregate partial loss functions with some complicated math formulas might force computation onto a CPU).


Happy to see PyTorch get some love. The company I am at made the same switch and everyone has loved PyTorch. It has more expressive power than Tensorflow 1.x (there are models that cannot be done with static graphs) and is simultaneously much easier to use.


Is there any equivalent of TF Serving for PyTorch? We have been thrilled with how robust and easy it is to deploy our models to production on the TF stack, and it worries me that the inertia in the deep learning community seems to be toward PyTorch.


Have you checked out Cortex? It's an open source platform for deploying PyTorch models easily. I wrote an article for the PyTorch blog about it: https://medium.com/pytorch/how-to-build-production-software-...

GitHub: https://github.com/cortexlabs/cortex

Full disclosure/shameless plug: I work on Cortex


Thanks! I was not aware of this and it looks fantastic. Is it in the roadmap to target GCP, or even just a generic Kubernetes cluster?


GCP is on the immediate short-term roadmap, and we're investigating on-premise, but we don't have a firm timeline on it quite yet (we're still a small team).


There's also fastapi, which is well-respected in Python. Someone wrapped fastapi / rabbitRPC to serve PyTorch models (with auto-batching to increase serving efficiency) in https://github.com/catalyst-team/reaction


You can convert the checkpoints over for deployment. I agree the serving infra on TF is much better.


This is the second large framework making the switch to PyTorch.

https://medium.com/syncedreview/japanese-unicorn-preferred-n...


If PyTorch had a viable way to convert models to run on a mobile GPU or DSP, that's all I'd ever use. Currently I have to do my research in PyTorch and then laboriously port to TF to convert to TFLite, which kinda sucks because TF is full of bugs, and there are gotchas due to differences in how ops are implemented.


It's somewhat disappointing that research is the primary motivator for the switch. PyTorch still has a ways to go in tooling for toy usage of models and deployment of models to production compared to TensorFlow (incidentally, GPT-2, the most public of OpenAI's released models, uses TensorFlow 1.X as a base). For AI newbies, I've seen people recommend PyTorch over TensorFlow just because "all the big players are using it," without listing the caveats.

The future of AI research will likely be interoperability between multiple frameworks to support both needs (e.g. HuggingFace Transformers which started as PyTorch-only but now also supports TF 2.X with relative feature parity).


OpenAI is fundamentally a research laboratory: their top priority is pushing the envelope on machine learning research and releasing their progress responsibly. Anything that reduces experimentation and research speed should be avoided, so the switch to PyTorch makes sense.

As long as OpenAI is open sourcing their work, there will always be others who will port it over to other frameworks.


> It's somewhat disappointing that research is the primary motivator for the switch.

They are a research organization, how is that disappointing?


Making AI more open is synonymous with making AI more accessible, which (IMO) is much better facilitated with TensorFlow/Keras versus PyTorch.

Many AI tutorials imply that the more complicated an AI approach is, the more effective it is, which isn't practical, especially for newbies without a deep background.


Accessible to whom? What makes tensorflow/keras more accessible than pytorch?


Accessible to non-researchers, especially those with a programming background but not an AI background.

The TF/Keras approach advocates the minimum amount of code necessary and effort needed to make model changes, with sensible default configurations and layer architectures.


STRONGLY disagree. I’m just a hobbyist, but trying to read Keras models can be a god damn nightmare if the author has to do anything even slightly non-standard. Keras seems to REALLY want you to believe that you can just throw a bunch of layers together and call .fit and everything will just work, but it never seems to be that simple unless you’re training on MNIST or ImageNet.


I disagree with you because we switched from TF 1/Keras to PyTorch and our codebase shrank to half its size. Our team is a bunch of developers with little AI background. The problem with TensorFlow is that it is mostly not readable, and onboarding new people to a project is really hard. In contrast, PyTorch is much more readable, and people with a Python background can easily adapt to a PyTorch project after a small machine learning lesson.


Minimum=/=lowest effort.

Especially with the caveat of "with a programming background", it is far easier to reason and debug through PyTorch with just Python knowledge, compared to TensorFlow/Keras, which sooner or later requires you to learn a condensed history of TensorFlow/Keras development to understand why things are the way they are.

In my opinion,

  import lib
  lib.train("imagenet", "resnet50", epochs=10)
  lib.eval()
is NOT a good example of a beginner friendly library. It's a thin wrapper facade that hides all of the actual complexity behind "Train ImageNet in 3 lines of code!"


Fair; maybe minimum isn't the right word. More like "minimum without full abstraction."

The Keras examples are a good reference (e.g. https://www.tensorflow.org/tutorials/keras/classification ); even without an AI background, you have a sense of both what's going on and how to tweak the model to improve it.


The reason why Keras became so popular is that it borrowed a lot of concepts from Lua Torch (which predates even Theano). Anyone who has worked with Torch immediately sees it when reading Keras code. But Torch was Lua, and naturally it received less recognition than it deserved. You will not lose anything by simply moving to PyTorch.


Check out the fastai library. It's something like Keras is for Tensorflow.

As a non-researcher, mostly a programmer, who has spent a lot of time delving into this ecosystem, I find PyTorch the most like "standard programming", with fastai giving you models for working three-liners.

I haven't used tensorflow interactive execution though, it supposedly is closer to PyTorch than the graph building model.


I am a non-researcher with a programming background working in ML for the past ~2 years, pytorch was a godsend for me and felt much more programmatic and pythonic than TensorFlow. Keras is also good, but claiming that PyTorch makes it harder for non-researchers is wrong IMO.


The idea that Keras would be the framework of choice for a research organization is laughable. If you're talking TF internals, those aren't really any more "accessible" than pytorch, which to many (including me) feels quite idiomatic


OpenAI doesn't have any products right? So research is a lot more important than production for them. Surprises me it took them this long.


The aforementioned GPT-2 and Spinning Up were released, open-sourced tools using TF; however, other recent models such as the OpenAI Five Dota 2 bot, the robot dexterity demo, and the hide-and-seek simulation were not open-sourced.

If OpenAI is signaling that they are changing their open-source strategy, they should be more explicit about that.


This is a surprisingly unintelligent move from OpenAI. It adds corporate inertia to something as mundane as choice of DL framework.

Imagine you worked at OpenAI. Imagine you wanted to experiment with Jax, and that it turned out to be the best solution for the problem. Now you can't ship without a solid technical justification.

Except, it's not really a technical justification that you need. You need corporate clout. You can't just be a junior engineer and make a decision that goes against corporate policy. That's the point of having a corporate policy.

I can hear a thousand people about to type "C'mon, OpenAI isn't a normal corporation." But it is. Every corporation is a normal corporation. And having policies against specific tech should make productive programmers pause.

People get jobs at companies based on whether they use React or Vue, for example. And in DL, a programming library is basically a programming language, so it's one step more powerful than that.

Here's an example. Pytorch, as far as I can tell, doesn't support running code on a TPU's CPU. (I could be wrong about this!) When you enumerate the list of accelerators available after connecting to a TPU, you get a list of 8 entries. That means they only support executing code on the cores of a TPU, not the TPU's CPU. This is a huge difference. It means you're restricted to 8GB on TPUv2-8's (which you get on Colab) instead of 300GB.

Does that count as a solid technical justification to use Tensorflow for a research project instead of Pytorch? Who knows. But who wants to be the odd one out on corporate politics? Especially if a project doesn't generate any tangible results, which is often the case for research.


Or they see this problem and that's why the policy is sanely phrased as follows:

    Going forward we’ll primarily use PyTorch as our 
    deep learning framework but sometimes use other 
    ones when there’s a specific technical reason 
    to do so.


It never works out this way in practice. You need corporate clout to go against corporate policy. That's the point of having a corporate policy.

Of course they added that caveat. That's probably how this idea got through in the first place. Just point at the caveat and say "But we're not really throwing all the other frameworks under the bus. If everyone decides it's a good idea to use something else, we'll use something else."

Except that likely won't happen, because now as a junior engineer you need to convince N other people that using Jax was a decent choice. And it's against your company's culture to use anything but Pytorch.

This battle of Tensorflow vs Pytorch is bad for everybody involved. OpenAI released a lot of cool and important code related to Tensorflow. They did GPT-2 (tensorflow 1.x), blocksparse (also tensorflow), memory saving gradients (tensorflow 1.x), and now they're announcing they'll likely never be releasing such tooling again. Memory saving gradients have been hugely helpful to us for scaling our models beyond the normal limits.


What you’re ignoring is that the switch isn’t from nothing to Pytorch, it’s from Tensorflow to Pytorch. It’s only favoring one library over another. Your scenario with Jax hasn’t changed, and such tooling is going to be released for Pytorch instead of for Tensorflow. I suspect you’re only against this because you prefer Tensorflow to Pytorch.


PyTorch already has TPU support via XLA and has been used in research at the scale of billions of parameters.

https://github.com/pytorch/xla


OpenAI always had an "official" framework - it just used to be Tensorflow. That's why many of its public packages (baselines, spinningup, blocksparse, etc.) were built with Tensorflow.

OpenAI has researchers, and it has people who work on infrastructure for the researchers. Researchers are free to use whatever they want, but if the infrastructure developers want to build something for the researchers, it's beneficial to have a "standard".

OpenAI researchers will still be able to use other frameworks for their own research. All this means is that their major infrastructural projects will be released in PyTorch.


I'll bet you $20 that GPT-3 will be in Pytorch.

My point is that we probably won't be seeing more awesome projects from OpenAI written in Tensorflow. And that's unfortunate. Memory saving gradients were particularly helpful.


I mean, this starts to sound like your complaint isn't that OpenAI is standardizing on a framework but that the framework is PyTorch :)


That is usually the case with framework wars.


My counter-arguments (as a huge PyTorch fan) are:

1. GPT hasn't really been about model/architectural experimentation, just scale. GPT-2 and GPT were architecturally very similar. Scale, especially at the scale of GPT-*, is one avenue where TensorFlow does have an edge over PyTorch.

2. Work on GPT-3 probably started quite a while ago.


AFAIK, the problems with running Pytorch on TPUs have mostly been ironed out.

Also, this move makes a lot of sense for OpenAI. TF is a nightmare of different modules kludged on top of one another, many of which do the same thing. The API has changed so much even in a few years that code from previous versions won't run without -- in some cases -- significant modification. Finally, it's always been horrible to debug, since it obfuscates the actual workings of the network behind a sess.run().

Pytorch is not only a far more productive language (by virtue of the fact that it's far easier to debug), it also has a better ecosystem now because old code still runs. For students, it's also far easier to look at a Pytorch implementation and figure out what the author has actually done. If it's a choice between getting your hands dirty with low-level TF, bending Keras to your will, or putting something together in Pytorch, the latter is just the better choice. It works on TPUs, and it has Tensorboard, a C++ API for robotics, and (afaik) recently developed deployment tools.

The cost of switching from TF to Pytorch is vastly outweighed by the loss of inertia that OpenAI will experience if they don't, simply because everyone else is using a toolkit that they don't support.


I agree with this to some extent, but there are real advantages to having all your code in the same framework, and PyTorch is significantly easier to iterate on and debug compared to TensorFlow (in my experience). Hopefully PyTorch will start offering better support for tensor processors like Google's TPUs, but from the sound of it, OpenAI is primarily using Azure for their training infrastructure and I don't think Microsoft currently offers anything except GPUs.


Why is TPU support important to OpenAI? They run their code on Microsoft servers. Only a tiny percentage of people use Colab for deep learning. Also, if you search for "PyTorch TPU" you can find details of preliminary support from Google.

PyTorch will make their engineers and scientists a decent amount more productive. I don't see how that's unintelligent at all.


Because TPUs are the only way to fit 300GB backprop onto a single device.

You literally can't train models on GPUs when they require 300GB for backprop. Not unless you do model parallelization, which isn't always possible (and is significantly more engineering effort than "just run the model").

When you have policies like this, you lose out on such advantages. Especially for infrastructure purposes.


I think TPUs are great, but I don't understand what you mean by saying they can support "300 GB backprop" on a single device.

A TPU v3 has 16 GB of high-bandwidth memory per TPU core: https://cloud.google.com/tpu/docs/system-architecture

Sure, you can network together a bunch of TPUs to get access to more memory (in either a data parallel or model parallel way), but that doesn't give you more memory on the same chip. It's basically the same way you would do things on a GPU cluster.


Proof that a TPUv2-8 can do 300GB of backprop: https://twitter.com/theshawwn/status/1196183733755355138

Think of a TPU as a box with a CPU, RAM, and eight GPUs. In the same way that you can run code on either the GPUs or the CPU, you can run code on the TPU's CPU.

When you run code on the TPU's CPU, you have access to up to 300GB before OOMing. It's distinct from running on the TPU cores, which gives you only 8GB for TPUv2 and 16GB for TPUv3, as you say.

I use this technique regularly. All you have to do is:

  with tf.device(None):
    # ops go here

The TPU's CPU is pretty fast. Normally it's only used for input pipeline transformations. I have no idea why. We use it for actual backprop on massive models.

(I call this "coreless mode" because "TPU's CPU" is a confusing mouthful.)

For example, right now we're training GPT-2 117M with a 25k context window on 47 TPUv3-8's: https://tensorboard.dev/experiment/idXs4PGOTEe1Jl6g3tq4qA/

25k context window is far, far out of reach of any GPU for GPT-2.

You can verify this is true by fine-tuning GPT-2 1.5B on Colab using a TPUv2-8: https://colab.research.google.com/drive/1BXry0kcm869-RVHHiY6...

If a TPUv2-8 only had access to 8GB, it would be impossible to train GPT-2 1.5B, let alone using Adam with a batch size > 1.
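
A back-of-the-envelope check of that claim, as a sketch under stated assumptions: float32 storage throughout, and only the weights, gradients, and Adam's two moment buffers are counted. Activation memory, which grows with batch size and context length, is deliberately left out, so this is a lower bound rather than a measurement.

```python
# Lower bound on training state for GPT-2 1.5B with Adam.
# Assumptions (not from the thread): float32 everywhere, no activation memory.
PARAMS = 1_558_000_000        # the "1558M" model named in the thread
BYTES_PER_FLOAT32 = 4

# Per parameter: the weight, its gradient, and Adam's two moment estimates.
COPIES = 4

state_bytes = PARAMS * BYTES_PER_FLOAT32 * COPIES
state_gb = state_bytes / 1e9

print(f"{state_gb:.1f} GB")   # 24.9 GB
```

Even this lower bound is roughly three times the 8GB available to a single TPUv2 core.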

EDIT: Here's a simpler notebook: https://colab.research.google.com/drive/1ohuxvB7nuvcjpLLIF1L...

  !git clone https://github.com/shawwn/gpt-2 /content/gpt-2
  %cd gpt-2
  !pip3 install -r requirements.txt
  !python3 download_model.py 1558M
  !python3 train.py --dataset train.py --model_name 1558M --optimizer adam --batch_size 4

GPT-2 1.5B with Adam + batch size 4 works great on a TPUv2-8. https://i.imgur.com/w8T5CQI.png


I don't get your excitement. How is this different from using an 8xGPU box? If you use eight Quadro 8000 cards you have access to 384GB of memory to train your models.


Mostly because TPUs are in reach of hobbyists. After all, it runs on Colab for free.

In a business context, TPUs seem far cheaper. A preemptible TPUv2-8 only costs $1.35/hr. It looks like 8x Quadro 8000's would cost >$40k.


Colab is great, can’t argue with free, but in a business context if you look here https://cloud.google.com/tpu/pricing#pricing_example_using_a...

the TPU equivalent of 8x Quadro 8000 would be something between a TPU v2-32 and a TPU v3-32, and the monthly cost of a TPU v2-32 is ~$8k. Plus the cost of a beefy VM. Assuming the GPU build sets you back ~$60k, it will start saving you $8k/mo after 6 months.
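The break-even arithmetic can be sketched as follows, using the ~$60k build cost and ~$8k/month rental quoted above. The VM cost is not quantified in the comment, so it is left as a parameter; the $2k/month value below is purely hypothetical, as are the function and parameter names. Power, hosting, and resale value are ignored.

```python
def break_even_months(upfront_cost, monthly_rental, vm_monthly=0):
    """Months of cloud rental after which buying the hardware is cheaper."""
    return upfront_cost / (monthly_rental + vm_monthly)


# Figures from the comment: ~$60k GPU build vs ~$8k/month for a TPU v2-32.
print(break_even_months(60_000, 8_000))

# Hypothetical: add a VM at $2k/month (not a figure from the thread).
print(break_even_months(60_000, 8_000, vm_monthly=2_000))
```

With the rental alone the crossover is 7.5 months. It only reaches 6 months if the VM and other recurring costs add about $2k/month on top.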


A single TPUv2-8 matches 8x Quadro 8000 in terms of available memory. (Sort of; the available memory is 300GB, whereas for 8x Quadro 8000 it's 384GB.)

TPU pods actually don't require a beefy VM; I'm using a 2GB RAM one.


In the link I posted: TPU v2-8 has 64GB of total memory, v2-32 has 256GB.

As for the beefy VM - can you do heavy data preprocessing on TPUs? For example elastic distortions or scaling for images? Probably not, because usually it involves OpenCV or similar libraries.


The link is talking about per-core memory. A TPUv2-8 has 300GB system memory, which you can use for training. You can verify this using the notebooks above.

(If a TPUv2-8 has 64GB memory, how can it fine-tune GPT-2 1.5B using Adam with batch size 4? That requires almost 300GB.)


This is interesting. Is there an official specification clarifying this somewhere? Where’s this 300GB of memory physically located?

Are you paying on-demand or preemptible prices? Have you tried larger pod slices to see if they have even more of this “system memory”?


Yeah, I've seen pod slices allocate 7TB.

A TPUv3 pod is actually a bunch of individual TPUv3-8's linked together. There are 8 cores per device, so a TPUv3-512 has 512 cores divided by 8 cores per device = 64 individual TPUs. (You can get each individual TPU's IP address using `gcloud compute tpus list`: https://imgur.com/Qym4l17)

The big question is, since there are 64 individual TPUs, does that mean we have access to 300GB * 64 = 19.2 TB of memory?

I haven't tested that, but I would bet the answer is yes, for two reasons. 1. I've seen allocations of up to 7TB according to memory usage logs, so 19TB doesn't seem far-fetched in comparison. 2. If you create 64 individual TPUv3-8's, then you definitely will have access to 300GB of memory on each TPU, so it's the same engineering problem either way.
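The pod arithmetic above, written out as a small sketch. The 300GB per-device figure is the one claimed in this thread, not an official specification, and the constant and function names are made up for illustration.

```python
CORES_PER_DEVICE = 8    # a TPUv3-8 exposes 8 cores
HOST_MEMORY_GB = 300    # per-device figure claimed in this thread, not an official spec


def pod_host_memory_tb(total_cores):
    """Return (number of TPU devices, combined host memory in TB) for a pod slice."""
    devices = total_cores // CORES_PER_DEVICE
    return devices, devices * HOST_MEMORY_GB / 1000


devices, tb = pod_host_memory_tb(512)
print(devices, tb)
```

For a TPUv3-512 this gives 64 devices and 19.2 TB, assuming every device really does expose the same 300GB.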

Right now, people only seem to use the TPU's CPU for infeed processing / input pipeline transformations. But the CPU is quite fast – it's almost as fast as an actual TPU core.

I wrote up some more about this in a tweet chain if you're interested: https://twitter.com/theshawwn/status/1223395022814339073

Also, if you want to play around with a few TPUv3-8's and you have a GCE project, feel free to DM me on twitter. We just figured out how to forward TPUs to VMs in different projects: https://twitter.com/theshawwn/status/1221241517626445826

> Is there an official specification clarifying this somewhere?

Not that I've seen. I stumbled across it by accident. https://twitter.com/theshawwn/status/1163799288771698688


So you are saying the system memory is 300GB and you can train your model on the CPU instead? Well yeah, you can always do that, but training will be slow because your model is not trained on the GPU. What’s the point?


It's not that slow. And you can use many TPUs together to make up the speed difference.


If that were the case I am wondering why anyone would buy GPUs? I invite you to retrain a state of the art model of your choice on a CPU and see how far you get.


We fine-tuned GPT-2 1.5B for subreddit simulator using this technique. https://www.reddit.com/r/SubSimulatorGPT2Meta/comments/entfg...


How are you using GPT-2 with an expanded context window? I was under the impression that the maximum context window was fixed.


I wrote code to repeat the wpe variable N times along the context axis during model load time.

Specifically, the code checks whether the model's shape is greater than the shape from the snapshot on disk. If so, it repeats the shape from the snapshot on disk N times to fill the expected greater shape.

At that point, you can just set context window to a larger value, then train.
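A minimal NumPy sketch of that idea, not the actual loading code. It uses a (1024, 768) table as the example shape; the function name is hypothetical, and trimming the tiled result when the target is not an exact multiple is an assumption added here.

```python
import numpy as np


def expand_wpe(saved_wpe, target_ctx):
    """Tile a saved position-embedding table along the context axis."""
    saved_ctx, _ = saved_wpe.shape
    if target_ctx <= saved_ctx:
        # The model is not larger than the snapshot: nothing to expand.
        return saved_wpe
    # Repeat enough times to cover target_ctx, then trim any excess rows.
    repeats = -(-target_ctx // saved_ctx)  # ceiling division
    return np.tile(saved_wpe, (repeats, 1))[:target_ctx]


# Random stand-in for the wpe variable loaded from the snapshot on disk.
saved = np.random.randn(1024, 768).astype(np.float32)
expanded = expand_wpe(saved, 3072)
print(expanded.shape)
```

Each block of 1024 rows in the expanded table is an exact copy of the original, so positions 0 and 1024 start out with the same embedding until further training separates them.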


Is that essentially repeating the position embedding? I'm surprised that works, since the model should have no way to distinguish between the (e.g.) 1st and 513th token. (If I'm understanding this correctly.)


Yeah, it seems to "work" in the sense that if you repeat wpe size (1024, 768) three times, so that it becomes (3072, 768), then the model can successfully generate up to 1024 tokens.

Generating more tokens seems to work up to a point -- you can probably generate up to 1050 tokens with this technique. But at a certain point, more tokens = gibberish.

The cure is to train the new wpe layer the same way you'd train the smaller one. But this also means you don't have to start training from scratch.


A single TPU v3 has 8 cores, so that’s 128GB memory total, which is more than any single GPU currently.

The TPU software does data parallelism (in TensorFlow) transparently, and it’s somewhat easier to do model parallelism because the memory link is solid and requires no special setup / drivers. You’ll still get an OOM from XLA if you have a tensor that won’t fit in the 16GB of a single core.

TPU pods are easier to use than clusters of InfiniBand-linked Volta boxes. For TPUs you just give GCE money and make some small changes to your use of the TPU API. For the Volta cluster you’d probably need to bring your own orchestration (e.g. Horovod). So a TPU pod is easier for one person to use and admin currently.


You're overlooking the strong reasons why OpenAI might want everyone on the same page. (Code reuse, accumulation of expertise, specialization of infrastructure, ...).


Just FYI I looked at PyTorch for the first time now, and unfortunately they require Mac OS users to build it from source in order to get CUDA support:

https://pytorch.org/get-started/locally/

Please if someone at PyTorch is reading this, put in a request to make CUDA support the default on Mac OS.

Also, it looks like PyTorch doesn't currently support OpenCL:

https://github.com/pytorch/pytorch/issues/488

I can't tell by the issue comments if it's been added yet or if they plan to use Intel's oneAPI or similar.

To me, these are prerequisites for switching to PyTorch. Hopefully someone can clarify the state of these. Thanks!


Hi, I am a PyTorch maintainer.

NVIDIA has dropped CUDA support for macOS: http://www.cgchannel.com/2019/11/nvidia-drops-macos-support-...

This was pretty evident for a few years, and it's one of the top reasons for us to not provide official binaries with CUDA support -- the maintainer overhead was way too much. We did work to make sure it still builds with CUDA support from source (with a contbuild), but once CUDA 10.3 or 11 is released, we will have to drop that too.


Ah, thanks for that. One of my biggest concerns right now is that, since SIMD won out in the performance wars and has come to be dominated by the video game industry and proprietary players like NVIDIA, we are missing out on a whole possible tree of evolution in computer science.

For one, we don't have easy access to MIMD, so we can't easily/cheaply experiment with our own simulations for things like genetic algorithms.

20 years ago I wanted to go into AI research and make a multicore FPGA (say 1000+ cores) where each one could run its own instance of an OS, or at the very least an isolated runtime for something like Lisp. But the world has gone a completely different direction, and that's great and everything with all the recent advances in machine learning, but it's like comparing rasterization (what we have) to ray tracing (what we could have had). Current implementations are orders of magnitude more complex than they need to be. I've written about this a bunch:

https://news.ycombinator.com/item?id=17759391

https://news.ycombinator.com/item?id=17419917

So I guess short of this, I hope that PyTorch can at least provide a cross-platform performant SIMD implementation. Which I had hoped OpenCL would be, but maybe it's too much like OpenGL and we need something a level of abstraction higher for easier vector processing without all the worrying about buffers and moving between CPU and GPU.


> Please if someone at PyTorch is reading this, put in a request to make CUDA support the default on Mac OS.

It's unlikely this will ever happen. Apple doesn't officially support NVIDIA drivers anymore and even TensorFlow no longer lists macOS as having official GPU support[0].

Don't hold your breath.

[0]: https://www.tensorflow.org/install/gpu


Are you really GPU training on your home laptop? I absolutely get why CUDA support for macOS isn't a priority.


I would if I could - I have an external GPU at home. Unfortunately Apple is (not without reason) angry at NVIDIA, so they dropped support for NVIDIA in macOS. I’d have to use Windows, which is a big no-no for me. Obviously PyTorch can’t support it.


I think it was just a matter of time till TF would get superseded by PyTorch. The only reason we kept TF on prod is the Java API, which allowed us to quickly load and serve TF models. I spent so many sleepless nights trying to port a Torch model to TF back in the day and make it work the same as the Lua-based prototype. The whole TF "experience" made us switch to a plain Python services model, throwing away all the boilerplate Scala/Java code for TF. It doesn't happen often in tech that a better engineered product eventually gets more traction and recognition, and I am glad that PyTorch did.


PyTorch actually got an experimental Java API in version 1.4 (about two weeks ago), if you're interested.


I believe these days one has to know both TensorFlow (Keras) and PyTorch; most new research is in PyTorch and most deployments are in TensorFlow. Academia can afford to run on PyTorch only, and stable businesses on TensorFlow only, but individual developers need to know both.


For folks interested in Julia and RL, I've been involved in https://www.lyceum.ml/ a set of tools for continuous control problems like robotics.

It's pretty quick.


Yeah!! Let's switch to Lyceum!


Has anyone taken the course mentioned, "Spinning Up in Deep RL"? I've been meaning to learn some Deep RL and I was wondering if this is the best first step.


The lecture series and reference material by David Silver: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html. You can then supplement that with "Spinning up in Deep RL" for more hands-on experiments.




