The thing I'm looking forward to most is having Flash Attention built-in. Right now you have to use xformers or similar, but that dependency has been a nightmare: it breaks, it requires specific concoctions of dependency installs or else conda will barf, and it's impossible to pin because I have to use -dev releases, which they constantly drop from the repositories.
PyTorch 2.0 comes with a few different efficient transformer implementations built-in. And unlike 1.13, they work during training and don't require specific configurations. Seemed to work just fine during my pre-release testing. Also, having it built into PyTorch might mean more pressure to keep it optimized. As-is xformers targets A100 primarily, with other archs as an afterthought.
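For reference, the built-in kernels are exposed through `torch.nn.functional.scaled_dot_product_attention`, which picks a fused implementation (FlashAttention, memory-efficient attention, or a plain math fallback) based on device, dtype, and inputs. A minimal sketch:

```python
import torch
import torch.nn.functional as F

# Shapes: (batch, num_heads, seq_len, head_dim)
q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)

# Dispatches to a fused kernel when hardware/dtype allow,
# otherwise falls back to the straightforward math implementation.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
assert out.shape == q.shape
```

On CPU with fp32 this will usually take the math path; the fused paths kick in on supported GPUs.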
And, as promised, `torch.compile` worked out of the box, providing IIRC a nice ~20% speed up on a ViT without any other tuning.
I did have to do some dependency fiddling on the pre-release version. Been looking forward to the "stable" release before using it more extensively.
Anyone else seeing nice boosts from `torch.compile`?
I really wish compiling cuda extensions worked better out of the box. Is there a reason they can't bundle nvcc alongside pytorch outside of complexity/expense?
I work on xFormers and we definitely appreciate the candid feedback:
- We partnered with our PyTorch colleagues, and some of the PyTorch 2.0 kernels for efficient attention actually originated from xFormers, so we're glad to read that having this built into PyTorch is something users are eager to use.
- While xFormers was originally targeting a pure researcher audience, we were aware of the installation problems: starting at the end of last year, we have been gradually making the library easier to set up and use (both internally and externally). We have recently introduced non-dev conda packages and pip wheels, and are also trying to release more often.
- We very much welcome hearing about any issue with the library and would certainly love discussing more the specifics of your experience (or others' who read this) if you have time (maybe via our GitHub to start with). Thanks again for the feedback here!
What size of ViT? I’ve tried it with both a unet and an LM and didn’t see any benefit with the default args (and got a CUDA error after 30 mins of processing trying to compile an AR generation routine with all optimization turned on).
>Due to lack of Python 3.11 support for packages that PyTorch depends on, including NumPy, SciPy, SymPy, Pillow and others on the Anaconda platform. We will not be releasing Conda binaries compiled with Python 3.11 for PyTorch Release 2.0. The Pip packages with Python 3.11 support will be released, hence if you intend to use PyTorch 2.0 with Python 3.11 please use our Pip packages.
It really sucks that anaconda always lags behind. I know the reasoning*, and I know it makes sense for what a lot of teams use it for... but on our side we are now looking more and more into dropping it since we are more of an R&D team. We already use containers for most of our pipelines, so just using pip might be viable.
*Though I guess Anaconda bit off more than it could chew w.r.t. managing an entire Python universe and keeping it up to date. Conda-forge is already almost a requirement, but using the official package (with pip, in this case) has its own benefits for very complex packages like pytorch.
The Arch Linux PyTorch 2.0 packages are great if you are looking for "cutting edge," as they are compiled against CUDA 12.1 now, instead of 11.8 like the official nightly releases. You can also get AVX2-patched Python and optimized CPython packages through CachyOS or ALHP.
These are all legitimate reasons, but my personal experience (and perhaps preference?) is to use Docker for anything that is more complex than pip can handle.
> conda-forge managing builds of some really flaky binary python packages that are sometimes a nightmare to build locally
Yeah this is fair. Fortunately it's becoming rarer.
Well. To each their own, use cases differ so wildly it's hard to compare them.
The key audience for conda is the ML/DS space, where most if not all packages come from C/C++/Rust/Fortran and have to be compiled, while also requiring a consistent set of external C libraries like libblas, etc. As I said, some of those packages are a complete nightmare to build locally. Conda simplifies this a lot: you can just `conda create -n myenv some=1.0 crazy=2.0 deps=2.0` and in a few seconds (if you use mamba and not conda) you have a working Python environment, so off you go; no Docker, no local builds, etc.
Honestly, I've found that conda has made operationalizing code very difficult. We've found it much easier to simply switch back to using pip, poetry, docker, and the standard OS package management tools rather than conda. Conda's dependency resolution is also quite slow and causes our builds & CI to timeout unless we drop in mamba.
Seems docker is going from the frying pan to the fire. Have they added ‘resume download’ yet to docker? Over my slow DSL I can’t stand how docker makes me download 7G of images when I want to install something very simple, frequently it fails to do the download and I have to do it several times so it adds up to more like 28G of downloading and all that waiting.
I worked at one place where management was shocked when I told them the image build process would take 20 minutes on gigabit fiber up in Canada and we agreed to time it and I measured 18 minutes. Docker slows down “dev” to the speed of “ops.”
I don’t know how they did it but the data scientists could always find f-ed up Python images, you never got the same default character encoding twice, one time the default character set was Hungarian and I wonder how that happens…
pip with wheels doesn't deal with non-python packages. I used to be in a horrible locked down corpo laptop. Conda was invaluable in getting stuff to run, like chromedriver, etc.
I honestly don't remember which one I've used for chromedriver when I needed it for my project, but I've surely installed all the stuff with "just" pip/poetry. Larger projects are typically packaged like this, with setup.py performing the downloads, while wheels solve the problem with Python libraries with native dependencies (e.g. how psycopg-binary works).
Maybe Conda makes it slightly more convenient, but I've always treated pip as the standard Python package management tool (it ships with every Python via ensurepip these days, after all) and Conda was always "that weird non-standard thing some folks use for some odd reason" for me.
Ah, yeah, they do have a Python 3.11 release, just not on anaconda. Okay, yeah, there hasn't been a good reason to use anaconda for a couple of years now anyway.
For one of my projects, conda ‘just works’ to get working with the GPU but following the instructions for pip doesn’t work. On the other hand there’s another package I am interested in using where I need to build out of GitHub and it’s a very different story.
I see myself as interested in commercial exploitation of transformers right now and I am delighted with the results. The first time I tried clustering all the Ukraine articles lumped together, all the sports were lumped, it runs 5x faster than my LDA-based clustering system and I think does a better job. With results like this I am happy to trade ‘cutting edge’ for convenience.
I have thought about a ‘path less followed’ in Python which is a truly sound package manager like maven for Python (as opposed to Poetry which I’m not sure is sound but it sure is slow) and I can say I like the way conda works I just would rather do it with wheels. One beef I have with conda is that the bzip2 files are slow to decompress and even over a DSL line I would trade a little more downloading for faster installs.
Yes that's the issue! Most of the software is already ready, usable and just works... unless you use anaconda. Now that I think about it, is there some technical reason for that? I always thought it was mostly about stability, but I can't imagine python 3.11 being so unstable as to warrant waiting a whole year before even porting.
> It really sucks that anaconda always lags behind.
I usually just go for virtualenv (if python library versions are the only issue) or go for docker (if it's more than that). Both let you just use the latest and greatest without any friction. conda sits in a weird middle ground that I hate.
Anaconda automatically handles things like making sure the correct version of cuDNN for your graphics card is installed. When I tried doing this myself with venv it was really painful.
I use venv this way. I download and compile specific Python versions and install them in a non-system dir alongside all the other versions, then just run the specific binary to create a venv, and it seems to work as expected.
Exactly my setup. I tried `conda install` a few times, but after just a few globally installed packages the conda SAT solver always struggles, so I now live with the assumption that if an incompatible package combination doesn't throw an error in the dev environment, it is likely fine.
There's nothing wrong with this. IMO, Conda is a general-purpose "system environment" and package manager that happens to be written in Python. The fact that its package ecosystem is oriented towards machine learning with Python is almost an historical coincidence.
That's basically where we are at for tons of our pipelines, but it kind of defeats the purpose since a dockerfile with a proper base image is basically equivalent at that point.
In general you shouldn't need to "activate" a Conda environment in the shell. Things generally "just work" if you use absolute paths. Something like this:
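(The Conda paths below are hypothetical; the same trick is shown with a stdlib venv too, since the principle is identical.)

```shell
# Call a Conda env's binaries by absolute path -- no `conda activate` needed.
# With a hypothetical env at /opt/conda/envs/myenv:
#   /opt/conda/envs/myenv/bin/python train.py
#   /opt/conda/envs/myenv/bin/pip list

# The same principle, demonstrated with a stdlib venv:
python3 -m venv /tmp/demo-env
/tmp/demo-env/bin/python -c 'import sys; print(sys.prefix)'   # the env's own prefix
```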
What is a little funny is installing a consistent version of Conda inside a container, because the official Miniconda installers are rolling-release only. However you might be able to downgrade to your desired version of Conda after installation.
> *Though I guess Anaconda bit off more than it could chew w.r.t. managing an entire Python universe and keeping it up to date. Conda-forge is already almost a requirement but using the official package (with pip, in this case) has its own benefits for very complex packages like pytorch.
Yeah, I absolutely adore conda, but they really need support.
I'm hoping torch.compile is a gateway to "easy" non-Nvidia accelerator support in PyTorch.
Also, I have been using torch.compile for the Stable Diffusion unet/vae since February, to good effect. I'm guessing similar optimizations will pop up for LLaMA.
But I also compile the VAE and some other modules; I will reply again later when I can look at my local code. Some modules (like face restoration or the scheduler) still don't like torch.compile.
I tried changing the options in the config dict one by one, but TBH nothing seems to make a significant difference beyond the default settings in my benchmarks.
I haven't messed with compiling LORA training yet, as I don't train much and it is sufficiently fast, but I'm sure it could be done.
That's been my experience. However, when fallback to CPU happens, it sometimes ends up making a specific graph execution slower. But that's explicitly mentioned in the warning and pretty much expected.
Yes, this is my experience. Many off the shelf models still don't work, but several of my own models work great as long as they don't use unsupported operators.
Yes. I am not sure to what extent MPS is a viable alternative to CUDA. You seem to write a lot about ML models. Do you have a detailed write-up on this subject?
> As an underpinning technology of torch.compile, TorchInductor with Nvidia and AMD GPUs will rely on OpenAI Triton deep learning compiler to generate performant code and hide low level hardware details. OpenAI Triton-generated kernels achieve performance that’s on par with hand-written kernels and specialized cuda libraries such as cublas.