At work, we switched over from TensorFlow to PyTorch when 1.0 was released, both for R&D and production... and our productivity and happiness with PyTorch noticeably, significantly improved.
Back when we were using TensorFlow, whenever we wanted to try something new that wasn't already provided out-of-the-box by existing APIs, sooner or later we would find ourselves wrestling with its machinery, especially for models with more complex control flow.
TensorFlow feels like it was built from the ground up to scale up to billions of users and all kinds of devices, with developer productivity and happiness a secondary priority. PyTorch feels like it was built the other way around, prioritizing developer productivity and happiness; other considerations were secondary.
That said, we are keeping an eye on Swift + MLIR + TensorFlow. We think it could unseat PyTorch for R&D and eventually, production, due to (a) the promise of automatic creation of high-performance GPU/TPU kernels without hassle, (b) Swift's easy learning curve, and (c) Swift's fast performance and type safety. Jeremy Howard has a good post about this: https://www.fast.ai/2019/03/06/fastai-swift/
> TensorFlow feels like it was built from the ground up to scale up to billions of users and all kinds of devices, with developer productivity and happiness a secondary priority. PyTorch feels like it was built the other way around, prioritizing developer productivity and happiness; other considerations were secondary.
I recently moved from Google to Facebook and this is how I'd characterize most of the differences I see: Facebook optimizes for your ability to make progress above everything else. Google, not so much.
I'd be surprised if it's dead. At least two of the people on the Swift for Tensorflow team were building a deep learning library in Swift before Swift for Tensorflow was a thing. Much of their work is a part of Swift for Tensorflow.
That's good to hear, but S4TF was started over two years ago and has failed to gain any real traction among ML users. I've heard not many care about it apart from the core team, and now that Lattner is gone it's likely Google will send it to the graveyard soon.
I think Lattner saying that just confirms there are concerns.
The only people who seem to be saying it will continue are people outside Google. Crucially there's no one within Google saying they use it for internal projects.
And that's good news: Swift is a niche, vendor-locked language. Using something like Rust for production and something like Julia for the R&D stage is a better fit.
I first learned TensorFlow and built a lot of very custom stuff with its "Slim" API, and I have been very happy with it. The only thing I always struggle with is the grunt work of wrangling tensor shapes and indices, e.g. broadcasting, tf.tile, tf.gather, tf.scatter_nd, and tf.where. These are a real headache to use, while numpy, for example, has its so-called "fancy indexing" and mask-based indexing, which are not available in TF.
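As a rough illustration of the gap (toy shapes; these are just the graph-friendly ops I reach for, not any official recommended pattern):

```python
import numpy as np
import tensorflow as tf

x_np = np.arange(12).reshape(3, 4)
idx = np.array([2, 0])

# NumPy: fancy and boolean indexing are one-liners.
rows_np = x_np[idx]        # pick rows 2 and 0
big_np = x_np[x_np > 5]    # flat array of elements > 5

# TF: the same selections go through dedicated ops.
x_tf = tf.constant(x_np)
rows_tf = tf.gather(x_tf, idx)            # row selection
big_tf = tf.boolean_mask(x_tf, x_tf > 5)  # mask-based selection
```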
Other than that, I'm very happy with TF's graph structure. It means I can just build the graph (a bit like a declarative style) and in each iteration (session run) specify which tensors I'd like to fetch. Then TF takes the data dependencies into account and lazily computes only what is needed. Very convenient, and I got very used to this. Maybe someone who only uses PyTorch doesn't miss this because they never had it.

My other problem with PyTorch is that you have to define every model / layer twice: once in the constructor and once in the forward() function. That's not the best design in terms of DRY. I understand why it's technically needed given the working principles of PyTorch (especially for weight sharing in different parts of the model, which in TF takes some getting used to, with its variable scopes and reuse flags); it's just inconvenient.
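To make the fetch mechanism from the first point concrete, here's roughly what that workflow looks like (a toy, TF 1.x-style sketch; layer sizes are made up):

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

# Build the graph once, declaratively.
x = tf.placeholder(tf.float32, shape=[None, 4], name="x")
hidden = tf.layers.dense(x, 8, activation=tf.nn.relu)
logits = tf.layers.dense(hidden, 2)
probs = tf.nn.softmax(logits)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch = [[0.0, 1.0, 2.0, 3.0]]
    # Fetch only what you need; TF computes just that subgraph.
    only_logits = sess.run(logits, feed_dict={x: batch})
    both = sess.run([logits, probs], feed_dict={x: batch})
```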
> Other than that, I'm very happy about TF's graph structure. It means, I can just build the graph (a bit like a declarative style) and in each iteration (session run) I can specify which tensors I'd like to fetch. Then TF takes into account the data dependencies and lazily computes only what is needed. Very convenient and I got very used to this.
That's one of the things we realized we disliked most about TensorFlow: Whenever we tried to do anything that wasn't available out-of-the-box, we would have to write Python code that would then construct the TensorFlow graph that would actually specify and eventually run the computations. It felt like metaprogramming from a nice, general-purpose language (Python) in an inflexible, non-general-purpose one (TF ops).
> My other problem with pytorch is that you have to define every model / layer twice: once in the constructor and once in the forward() function.
That is not true. You can create graphs, compute losses, and backpropagate them on the fly, because autograd records all operations as they happen. The nn.Module class is a convenient, Pythonic wrapper with lots of functionality, but you don't have to use it. You can create your own wrapper if you'd like.
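For instance, a bare-bones model with no nn.Module at all looks roughly like this (a minimal sketch with toy data and a hand-rolled SGD step):

```python
import torch

# Plain tensors as parameters; autograd tracks whatever you do with them.
w = torch.randn(4, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

x = torch.randn(16, 4)
y = torch.randn(16, 1)

for _ in range(100):
    pred = x @ w + b                 # the graph is built on the fly
    loss = ((pred - y) ** 2).mean()
    loss.backward()                  # gradients land in w.grad, b.grad
    with torch.no_grad():
        w -= 0.1 * w.grad
        b -= 0.1 * b.grad
        w.grad.zero_()
        b.grad.zero_()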
> You can create graphs, compute losses, and backpropagate them on the fly
Sure, but you'd have to define the weight variables for your conv layer first outside the train loop, then use them within the train loop. With TF I can just create the weights and the ops in one line.
Also, constantly doing .cuda() and .cpu() is not much nicer than feed dicts and fetches.
The real advantage of pytorch is the debuggability, you can set a breakpoint and see every value.
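For example (a trivial sketch), a plain pdb breakpoint inside forward() lets you poke at every intermediate tensor like any other Python object:

```python
import pdb
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 8)
        self.fc2 = nn.Linear(8, 2)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        pdb.set_trace()   # inspect h.shape, h.mean(), any layer's weights, etc.
        return self.fc2(h)

Net()(torch.randn(3, 4))
```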
I think I understand your point about declaring and then later using your layers.
Are you aware of the Sequential module? It allows you to chain together layers into a single variable, making this repetition disappear into a single forward/__call__ on the Sequential.
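Something like this (a minimal sketch), where the chain is declared once and used with a single call:

```python
import torch.nn as nn

class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        # The chain is declared once...
        self.features = nn.Sequential(
            nn.Linear(784, 256),
            nn.ReLU(),
            nn.Linear(256, 10),
        )

    def forward(self, x):
        # ...and used with a single call, instead of layer-by-layer repetition.
        return self.features(x)
```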
Thanks, that does improve it for simple cases of pure chains of layers without any skip connections etc.
Unfortunately it's mostly an issue with more complicated architectures. When testing different designs I often forget to modify it in both places, and sure, it's not a huge deal, but having one place or two does feel like a qualitative difference in cognitive burden. For example, if I introduce a bool flag which decides whether I use a particular alternative design, I have to check the flag and do an if/else in two places: once to create the layers, once to use them.
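Concretely, something like this (a sketch; the flag and layer choices are hypothetical):

```python
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim, use_residual_proj):  # hypothetical design flag
        super().__init__()
        self.use_residual_proj = use_residual_proj
        self.conv = nn.Conv2d(dim, dim, 3, padding=1)
        if use_residual_proj:          # the flag is consulted here...
            self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        out = self.conv(x)
        if self.use_residual_proj:     # ...and again here
            out = out + self.proj(x)
        return out
```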
The problem is not that Tensorflow is based around Python instead of Swift. The Tensorflow API is a mess. The programming language is not the problem, and switching to Swift will only add to the API mess.
Thanks for your comments. I have been using TF just about since the first public version and it is what I am used to. I have been thinking of switching to PyTorch, and I think I will take the plunge and tackle the PyTorch learning curve.
Having transitioned from TF to PyTorch for both research and production, I can tell you the learning curve is pretty mild.
At first, PyTorch will feel like "Numpy on accelerated hardware with built-in backprop and DL/ML facilities." And like Numpy, it will quickly become second nature and get out of your way.
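Concretely, the mental model is something like this (toy example):

```python
import torch

a = torch.linspace(0, 1, steps=5, requires_grad=True)
loss = (a ** 2).sum()    # reads like NumPy...
loss.backward()          # ...with backprop built in
print(a.grad)            # tensor([0.0000, 0.5000, 1.0000, 1.5000, 2.0000])
# and a.to("cuda") is all it takes to move onto accelerated hardware
```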
I'm kinda looking forward to seeing Bazel gone as well, in a few years. The Tensorflow of build systems ;).
It also seems there is something strange going on with open source from Google. Maybe it is as simple as an impedance mismatch between their monorepo and the outside world, which they don't maintain. Or maybe it is something else.
I've started working with Flux [1] in Julia, and it's so elegant and such a great experience :). Just look at this definition of a U-net model for image segmentation: https://gist.github.com/haampie/bceb1d59fd9a44f092f913062e58.... Apart from that, you can write your own custom loss functions in pure Julia that run efficiently on the GPU, and you get language-level automatic differentiation and proper integration with other packages. If people are moving away from Tensorflow, then Flux could be a solid alternative as well.
IMO, these kinds of functional abstractions look nice on paper, but are a pain in the ass to actually use. In practice, you'll want to print out things in between each layer, you'll want to log each layer's activations, you might want to redirect a layer into another network, etc.
Both PyTorch and Tensorflow have purely functional abstractions, but they're relegated to super basic functionalities.
Julia gives you the best of both worlds. And more.
All those pretty function-like things you see above are actually callable objects that can be introspected, intercepted, and dispatched on, so you can mix and match pure object abstractions, pure function abstractions, and objects with function-like properties depending on the use case.
This is because Julia's philosophy is to make Differentiable Programming a completely seamless and normal programming paradigm inter-operable with all standard code patterns.
And all this is only possible because of a unique mix of amazing reflection, code gen (including hooking into the compiler from third party packages, allowing source to source autodiff and GPU codegen), fast generic/parametric polymorphism even across packages, multiple dispatch and macros, among other technologies.
It's not quite at the stage of "write any normal Julia code and it just works", as there are some rough edges being worked out, but that's the vision, and even now it's leaps and bounds above PyTorch.
Sorry, can you clarify on what you think is a purely functional abstraction?
Flux is incredibly flexible; it is not limited to purely functional code, and it is capable of many things that are straight-up impossible or infeasible in PyTorch or TensorFlow (with or without their 'purely functional' abstractions).
Super late reply, so it's likely you won't see this... (Too bad HN doesn't notify on replies).
I'm not complaining about Flux in general, I'm talking about the specific example (the U-net) he brought up to claim that Julia is so elegant.
Can you elaborate on what Flux can do that Pytorch can't?
At this point, the DiffEqFlux neural differential equation library fits the neural ODE example from the original paper in 29 seconds [1]. The forward pass of torchdiffeq on trivial ODEs without neural networks takes 47 seconds [2] (and of course adding in neural networks makes it a lot more expensive). This is a massive real-world difference. It means that the Julia packages are building animations by looking at real-time fitting plots, while it's an hours-long ordeal in PyTorch. Being able to use optimized packages instead of hardcoding a simple version of things really pays off in the long run, and here using a real ODE solver suite is not a small difference but rather multiple orders of magnitude. That's the real benefit of differentiable programming.
As someone who has used both PyTorch and TensorFlow for a couple of years now, I can attest to the faster research iteration times for PyTorch. TensorFlow has always felt like it was designed for some mythical researcher who could come up with a complete architecture ahead of time, based on off-the-shelf parts.
Indeed, no wonder PyTorch has beaten Tensorflow so thoroughly in the last 3 years, going up from 1% of the papers to ~50% of the papers (TensorFlow is now down to only 23% of the papers):
According to the methodology on that page, that would classify the standalone version of Keras (using "from keras.models" imports, as recommended by the Keras docs) as "Other". (I tried finding the source code to verify this, but couldn't find it.)
And if that is correct, then I'd be astonished if the vast majority of the "Other" papers aren't Keras. I work in ML and I don't think I've seen a paper that didn't use PyTorch, TensorFlow or Keras in years.
And if that's the case, then almost certainly more papers use TF than PyTorch: PyTorch is 42%, TF is 23%, but Other is 36%.
(In terms of biases, I hate working in Tensorflow, and much prefer PyTorch and Keras. But numbers are numbers).
Are there any papers that use it for things other than demonstrating Jax? I can't think of one off the top of my head.
Perhaps I should have specified "papers outside those introducing new frameworks, or around speed benchmarking".
There are a bunch of interesting papers using custom libraries for distributed training, and ones targeted at showing off the performance of specific hardware (NVidia has a bunch of interesting work in this space, and Intel and other smaller vendors have done things too).
Keras is pretty good, unless you hit some custom loss function that needs operations that aren't defined in Keras' backend; then you suddenly have to switch over to writing them in TensorFlow, with some ugly consequences (sometimes you don't know which operations will be GPU-accelerated; slicing vectors to compute and aggregate partial loss functions with some complicated math formulas might force computation onto a CPU).
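For anyone who hasn't hit this: the jump looks roughly like the following (a sketch; the particular slicing and weighting are made up for illustration):

```python
import tensorflow as tf
from tensorflow.keras import backend as K

def weighted_partial_mse(y_true, y_pred):
    # Keras only hands you y_true/y_pred; anything fancier means backend/TF ops.
    first_half = K.mean(K.square(y_true[:, :2] - y_pred[:, :2]))
    second_half = K.mean(K.abs(y_true[:, 2:] - y_pred[:, 2:]))
    return first_half + 0.5 * second_half

model = tf.keras.Sequential([tf.keras.layers.Dense(4, input_shape=(8,))])
model.compile(optimizer="adam", loss=weighted_partial_mse)
```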
Happy to see PyTorch get some love. The company I am at made the same switch and everyone has loved PyTorch. It has more expressive power than Tensorflow 1.x (there are models that cannot be done with static graphs) and is simultaneously much easier to use.
Is there any equivalent of TF Serving for PyTorch? We have been thrilled with how robust and easy it is to deploy our models to production on the TF stack, and it worries me that the inertia in the deep learning community seems to be toward PyTorch.
There's also fastapi, which is well-respected in Python.
Someone wrapped fastapi / rabbitRPC to serve PyTorch models (with auto-batching to increase serving efficiency) in https://github.com/catalyst-team/reaction
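A hand-rolled version without the batching is only a few lines (a sketch; the model path and request schema here are hypothetical):

```python
from typing import List

import torch
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = torch.jit.load("model.pt").eval()  # hypothetical TorchScript export

class PredictRequest(BaseModel):
    inputs: List[List[float]]

@app.post("/predict")
def predict(req: PredictRequest):
    # No auto-batching here; each request is run through the model as-is.
    with torch.no_grad():
        out = model(torch.tensor(req.inputs))
    return {"outputs": out.tolist()}
```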
If PyTorch had a viable way to convert models to run on a mobile GPU or DSP, that's all I'd ever use. Currently I have to do my research in PyTorch and then laboriously port to TF to convert to TFLite, which kinda sucks because TF is full of bugs, and there are gotchas due to differences in how ops are implemented.
It's somewhat disappointing that research is the primary motivator for the switch. PyTorch still has a ways to go in tooling for toy usage of models and deployment of models to production compared to TensorFlow (incidentally, GPT-2, the most public of OpenAI's released models, uses TensorFlow 1.X as a base). For AI newbies, I've seen people recommend PyTorch over TensorFlow just because "all the big players are using it," without listing the caveats.
The future of AI research will likely be interoperability between multiple frameworks to support both needs (e.g. HuggingFace Transformers which started as PyTorch-only but now also supports TF 2.X with relative feature parity).
OpenAI is fundamentally a research laboratory - their top priority is pushing the envelope on machine learning research and releasing their progress responsibly. Anything that reduces experimentation and research speed should be avoided, thus the switch to PyTorch makes sense.
As long as OpenAI is open sourcing their work, there will always be others who will port it over to other frameworks.
Making AI more open is synonymous with making AI more accessible, which (IMO) is much better facilitated with TensorFlow/Keras versus PyTorch.
Many AI tutorials imply that the more complicated an AI approach is, the more effective it is, which isn't practical, especially for newbies without a deep background.
Accessible to non-researchers, especially those with a programming background but not an AI background.
The TF/Keras approach advocates the minimum amount of code necessary and effort needed to make model changes, with sensible default configurations and layer architectures.
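For example, a complete classifier really can be this short (toy sketch; data loading omitted):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=5)
```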
STRONGLY disagree. I’m just a hobbyist, but trying to read Keras models can be a god damn nightmare if the author has to do anything even slightly non-standard. Keras seems to REALLY want you to believe that you can just throw a bunch of layers together and call .fit and everything will just work, but it never seems to be that simple unless you’re training on MNIST or ImageNet.
I disagree with you, because we switched from TF 1/Keras to PyTorch and our codebase shrank to half of its size. Our team is a bunch of developers with little AI background. The problem with TensorFlow is that it is mostly not readable, and onboarding new people to a project is really hard. In contrast, PyTorch is much more readable, and people with a Python background can easily adapt to a PyTorch project after a small machine learning lesson.
Especially with the caveat of "with a programming background", it is far easier to reason and debug through PyTorch with just Python knowledge, compared to TensorFlow/Keras, which sooner or later requires you to learn a condensed history of TensorFlow/Keras development to understand why things are the way they are.
is NOT a good example of a beginner friendly library. It's a thin wrapper facade that hides all of the actual complexity behind "Train ImageNet in 3 lines of code!"
The reason why Keras became so popular is that it borrowed a lot of concepts from Lua Torch (which predates even Theano). And anyone who worked with Torch immediately sees it when reading Keras code. But Torch was Lua, and naturally it received less recognition than it deserved. You will not lose anything by simply moving to PyTorch.
Check out the fastai library. It's for PyTorch something like what Keras is for Tensorflow.
As a non-researcher, mostly a programmer who has spent a lot of time delving into this ecosystem, PyTorch is the most like "standard programming", with fastai giving you working three-liners for standard models.
I haven't used TensorFlow's eager execution though; it supposedly is closer to PyTorch than the graph-building model.
I am a non-researcher with a programming background working in ML for the past ~2 years, pytorch was a godsend for me and felt much more programmatic and pythonic than TensorFlow. Keras is also good, but claiming that PyTorch makes it harder for non-researchers is wrong IMO.
The idea that Keras would be the framework of choice for a research organization is laughable. If you're talking TF internals, those aren't really any more "accessible" than PyTorch, which to many (including me) feels quite idiomatic.
The aforementioned GPT-2 and Spinning Up were released as open-sourced tools using TF; however, other recent projects such as the OpenAI Five Dota 2 bot, the robot dexterity demo, and the hide-and-seek simulation were not open-sourced.
If OpenAI is signaling that they are changing their open-source strategy, they should be more explicit about that.
This is a surprisingly unintelligent move from OpenAI. It adds corporate inertia to something as mundane as choice of DL framework.
Imagine you worked at OpenAI. Imagine you wanted to experiment with Jax, and that it turned out to be the best solution for the problem. Now you can't ship without a solid technical justification.
Except, it's not really a technical justification that you need. You need corporate clout. You can't just be a junior engineer and make a decision that goes against corporate policy. That's the point of having a corporate policy.
I can hear a thousand people about to type "C'mon, OpenAI isn't a normal corporation." But it is. Every corporation is a normal corporation. And having policies against specific tech should make productive programmers pause.
People get jobs at companies based on whether they use React or Vue, for example. And in DL, a programming library is basically a programming language, so it's one step more powerful than that.
Here's an example. Pytorch, as far as I can tell, doesn't support running code on a TPU's CPU. (I could be wrong about this!) When you enumerate the list of accelerators available after connecting to a TPU, you get a list of 8 entries. That means they only support executing code on the cores of a TPU, not the TPU's CPU. This is a huge difference. It means you're restricted to 8GB on TPUv2-8's (which you get on Colab) instead of 300GB.
Does that count as a solid technical justification to use Tensorflow for a research project instead of Pytorch? Who knows. But who wants to be the odd one out on corporate politics? Especially if a project doesn't generate any tangible results, which is often the case for research.
Or they see this problem and that's why the policy is sanely phrased as follows:
> Going forward we’ll primarily use PyTorch as our deep learning framework but sometimes use other ones when there’s a specific technical reason to do so.
It never works out this way in practice. You need corporate clout to go against corporate policy. That's the point of having a corporate policy.
Of course they added that caveat. That's probably how this idea got through in the first place. Just point at the caveat and say "But we're not really throwing all the other frameworks under the bus. If everyone decides it's a good idea to use something else, we'll use something else."
Except that likely won't happen, because now as a junior engineer you need to convince N other people that using Jax was a decent choice. And it's against your company's culture to use anything but Pytorch.
This battle of Tensorflow vs Pytorch is bad for everybody involved. OpenAI released a lot of cool and important code related to Tensorflow. They did GPT-2 (tensorflow 1.x), blocksparse (also tensorflow), memory saving gradients (tensorflow 1.x), and now they're announcing they'll likely never be releasing such tooling again. Memory saving gradients have been hugely helpful to us for scaling our models beyond the normal limits.
What you’re ignoring is that the switch isn’t from nothing to Pytorch, it’s from Tensorflow to Pytorch. It’s only favoring one library over another. Your scenario with Jax hasn’t changed, and such tooling is going to be released for Pytorch instead of for Tensorflow. I suspect you’re only against this because you prefer Tensorflow to Pytorch.
OpenAI always had an "official" framework - it just used to be Tensorflow. That's why many of its public packages (baselines, spinningup, blocksparse, etc.) were built with Tensorflow.
OpenAI has researchers, and it has people who work on infrastructure for the researchers. Researchers are free to use whatever they want, but if the infrastructure developers want to build something for the researchers, it's beneficial to have a "standard".
OpenAI researchers will still be able to use other frameworks for their own research. All this means is that their major infrastructural projects will be released in PyTorch.
My point is that we probably won't be seeing more awesome projects from OpenAI written in Tensorflow. And that's unfortunate. Memory saving gradients were particularly helpful.
1. GPT hasn't really been about model/architectural experimentation, just scale. GPT-2 and GPT were architecturally very similar. Scale, especially at the scale of GPT-*, is one area where TensorFlow does have an edge over PyTorch.
2. Work on GPT-3 probably started quite a while ago.
AFAIK, the problems with running Pytorch on TPUs have mostly been ironed out.
Also, this move makes a lot of sense for OpenAI. TF is a nightmare of different modules kludged on top of one another, many of which do the same thing. The API has changed so much even in a few years that code from previous versions won't run without -- in some cases -- significant modification. Finally, it's always been horrible to debug, since it obfuscates the actual workings of the network behind a sess.run().
Pytorch is not only a far more productive framework (by virtue of the fact that it's far easier to debug), it also has a better ecosystem now because old code still runs. For students, it's also far easier to look at a Pytorch implementation and figure out what the author has actually done. If it's a choice between getting your hands dirty with low-level TF, bending Keras to your will, or putting something together in Pytorch, the latter is just the better choice. It works on TPUs, and it has Tensorboard, a C++ API for robotics, and (afaik) recently developed deployment tools.
The cost of switching from TF to Pytorch is vastly outweighed by the loss of momentum that OpenAI will experience if they don't switch, simply because everyone else is using a toolkit that they don't support.
I agree with this to some extent, but there are real advantages to having all your code in same framework, and PyTorch is significantly easier to iterate on and debug compared to TensorFlow (in my experience). Hopefully PyTorch will start offering better support for tensor processors like google’s TPUs, but from the sound of it, OpenAI is primarily using Azure for their training infrastructure and I don’t think Microsoft currently offers anything except GPUs.
Why is TPU support important to openAI? They run their code on Microsoft servers.
Only a tiny percentage of people use colab for deep learning.
Also, if you search PyTorch tpu you can find details of preliminary support from Google.
PyTorch will make their engineers and scientists a decent amount more productive. I don't see how that's unintelligent at all.
Because TPUs are the only way to fit 300GB backprop onto a single device.
You literally can't train models on GPUs when they require 300GB for backprop. Not unless you do model parallelization, which isn't always possible (and is significantly more engineering effort than "just run the model").
When you have policies like this, you lose out on such advantages. Especially for infrastructure purposes.
Sure, you can network together a bunch of TPUs to get access to more memory (in either a data parallel or model parallel way), but that doesn't give you more memory on the same chip. It's basically the same way you would do things on a GPU cluster.
Think of a TPU as a box with a CPU, RAM, and eight GPUs. In the same way that you can run code on either the GPUs or the CPU, you can run code on the TPU's CPU.
When you run code on the TPU's CPU, you have access to up to 300GB before OOMing. It's distinct from running on the TPU cores, which gives you only 8GB for TPUv2 and 16GB for TPUv3, as you say.
I use this technique regularly. All you have to do is wrap the ops in a with tf.device(None): block.
The TPU's CPU is pretty fast. Normally it's only used for input pipeline transformations. I have no idea why. We use it for actual backprop on massive models.
(I call this "coreless mode" because "TPU's CPU" is a confusing mouthful.)
I don't get your excitement. How is this different from using an 8x GPU box? If you use eight Quadro 8000 cards you have access to 384GB of memory to train your models.
the TPU equivalent of 8x quadro 8000 would be something between tpu v2-32 and tpu v3-32, and the monthly cost of tpu v2-32 is ~$8k. Plus the cost of a beefy VM. Assuming the GPU build sets you back ~$60k, it will start saving you $8k/mo after 6 months.
In the link I posted: tpu v2-8 has 64GB of total memory, v2-32 has 256GB.
As for the beefy vm - can you do heavy data preprocessing on tpus? For example elastic distortions or scaling for images? Probably not, because usually it involves OpenCV or similar libraries.
The link is talking about per-core memory. A TPUv2-8 has 300GB system memory, which you can use for training. You can verify this using the notebooks above.
(If a TPUv2-8 has 64GB memory, how can it fine tune GPT-2 1.5B using Adam with batch size 4? That requires almost 300GB.)
A TPUv3 pod is actually a bunch of individual TPUv3-8's linked together. There's 8 cores per device, so a TPUv3-512 has 512 cores divided by 8 cores per device = 64 individual TPUs. (You can get each individual TPU's IP address using `gcloud compute tpus list`: https://imgur.com/Qym4l17)
The big question is, since there are 64 individual TPUs, does that mean we have access to 300GB * 64 = 19.2 TB of memory?
I haven't tested that, but I would bet the answer is yes, for two reasons. 1. I've seen allocations of up to 7TB according to memory usage logs, so 19TB doesn't seem far fetched in comparison. 2. If you create 64 individual TPUv3-8's, then you definitely will have access to 300GB of memory on each TPU, so it's the same engineering problem either way.
Right now, people only seem to use the TPU's CPU for infeed processing / input pipeline transformations. But the CPU is quite fast – it's almost as fast as an actual TPU core.
Also, if you want to play around with a few TPUv3-8's and you have a GCE project, feel free to DM me on twitter. We just figured out how to forward TPUs to VMs in different projects: https://twitter.com/theshawwn/status/1221241517626445826
Is there an official specification clarifying this somewhere?
So you are saying the system memory is 300GB and you can train your model on the cpu instead? Well yeah you can always do that but training will be slow because your model is not trained on the GPU. What’s the point?
If that were the case I am wondering why anyone would buy GPUs? I invite you to retrain a state of the art model of your choice on a CPU and see how far you get.
I wrote code to repeat the wpe variable N times along the context axis during model load time.
Specifically, the code checks whether the model's shape is greater than the shape from the snapshot on disk. If so, it repeats the shape from the snapshot on disk N times to fill the expected greater shape.
At that point, you can just set context window to a larger value, then train.
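In numpy terms it's roughly this (a sketch; the function name is made up, and GPT-2's wpe is [n_ctx, n_embd]):

```python
import numpy as np

def expand_position_embedding(wpe_from_snapshot, target_n_ctx):
    """Tile a [n_ctx, n_embd] position embedding along the context axis
    until it covers target_n_ctx rows (the load-time trick described above)."""
    n_ctx, n_embd = wpe_from_snapshot.shape
    if target_n_ctx <= n_ctx:
        return wpe_from_snapshot
    repeats = -(-target_n_ctx // n_ctx)        # ceiling division
    tiled = np.tile(wpe_from_snapshot, (repeats, 1))
    return tiled[:target_n_ctx]

# e.g. (1024, 768) -> (3072, 768)
wpe = np.random.randn(1024, 768).astype(np.float32)
wpe_big = expand_position_embedding(wpe, 3072)
```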
Is that essentially repeating the position embedding? I'm surprised that works, since the model should have no way to distinguish between the (e.g.) 1st and 513th token. (If I'm understanding this correctly.)
Yeah, it seems to "work" in the sense that if you repeat wpe size (1024, 768) three times, so that it becomes (3072, 768), then the model can successfully generate up to 1024 tokens.
Generating more tokens seems to work up to a point -- you can probably generate up to 1050 tokens with this technique. But at a certain point, more tokens = gibberish.
The cure is to train the new wpe layer the same way you'd train the smaller one. But this also means you don't have to start training from scratch.
A single TPU v3 has 8 cores, so that’s 128GB memory total, which is more than any single GPU currently.
The TPU software does data parallelism (in Tensorflow) transparently, and it’s somewhat easier to do model parallelism because the memory link is solid and requires no special setup / drivers. You’ll still get an OOM from XLA if you have a tensor that won’t fit in the 16GB of a single core.
TPU pods are easier to use than clusters of infiniband-linked volta boxes. For TPUs you just give GCE money and make some small changes to your use of the TPU API. For the volta cluster you’d probably need to bring your own orchestration (e.g. Horovod). So a TPU pod is easier for one person to use and admin currently.
You're overlooking the strong reasons why OpenAI might want everyone on the same page. (Code reuse, accumulation of expertise, specialization of infrastructure, ...).
This was pretty evident for a few years, and it's one of the top reasons for us to not provide official binaries with CUDA support -- the maintainer overhead was way too much. We did work to make sure it still builds with CUDA support from source (with a contbuild) but once CUDA 10.3 or 11 releases, we have to drop that too.
Ah, thanks for that. One of my biggest concerns right now is that since SIMD won out in the performance wars, and has come to be dominated by the video game industry and proprietary players like NVIDIA, we are missing out on a whole possible tree of evolution in computer science.
For one, that we don't have easy access to MIMD, so we can't easily/cheaply experiment with our own simulations for things like genetic algorithms.
20 years ago I wanted to go into AI research and make a multicore FPGA (say 1000+ cores) where each one could run its own instance of an OS, or at the very least an isolated runtime for something like Lisp. But the world has gone a completely different direction, and that's great and everything with all the recent advances in machine learning, but it's like comparing rasterization (what we have) to ray tracing (what we could have had). Current implementations are orders of magnitude more complex than they need to be. I've written about this a bunch:
So I guess, short of this, I hope that PyTorch can at least provide a cross-platform, performant SIMD implementation, which is what I had hoped OpenCL would be; but maybe it's too much like OpenGL, and we need something a level of abstraction higher for easier vector processing without all the worrying about buffers and moving between CPU and GPU.
> Please if someone at PyTorch is reading this, put in a request to make CUDA support the default on Mac OS.
It's unlikely this will ever happen. Apple doesn't officially support NVIDIA drivers anymore and even Tensorflow no longer lists MacOS as having official GPU support[0].
I would if I could - I have an external GPU at home. Unfortunately Apple is (not without reason) angry at Nvidia, so they dropped support for Nvidia in Mac OS. I'd have to use Windows, which is a big no-no for me. Obviously PyTorch can't support it.
I think it was just a matter of time till TF got superseded by PyTorch. The only reason we kept TF in prod is the Java API, which allowed us to quickly load and serve TF models. I spent so many sleepless nights trying to port a Torch model to TF back in the day and make it behave the same as the Lua-based prototype. The whole TF "experience" made us switch to plain Python services, throwing away all the boilerplate Scala/Java code for TF. It doesn't happen often in tech that a better-engineered product eventually gets more traction and recognition, and I am glad that PyTorch did.
I believe these days one has to know both TensorFlow (Keras) and PyTorch; most new research is in PyTorch and most deployments are in TensorFlow. Academia can afford to run on PyTorch only, and stable businesses on TensorFlow only, but individual developers need to know both.
Has anyone taken the course mentioned, "Spinning Up in Deep RL"? I've been meaning to learn some deep RL and I was wondering if this is the best first step.