For improvements I commented: Remove setup.py files and mandate wheels. This is ...

dalke · on Sept 7, 2022

> Remove setup.py files and mandate wheels

What alternative is there for me?

My package has a combination of hand-built C extensions and Cython extensions, as well as a code generation step during compilation. These are handled through a subclass of setuptools.command.build_ext.build_ext.

Furthermore, I have compile-time options to enable/disable certain configuration options, like enabling/disabling support for OpenMP, via environment variables so they can be passed through from pip.

OpenMP is a compile-time option because the default C compiler on macOS doesn't include OpenMP. You need to install it, using one of various approaches. Which is why I only have a source distribution for macOS, along with a description of the approaches.

I have not found a non-setup.py way to handle my configuration, nor to provide macOS wheels.

Even for the Linux wheels, I have to patch the manylinux Docker container to whitelist libomp (the OpenMP library), using something like this:

  RUN perl -i -pe 's/"libresolv.so.2"/"libresolv.so.2", "libgomp.so.1"/'
    /opt/_internal/pipx/venvs/auditwheel/lib/python3.9/site-packages/
    auditwheel/policy/manylinux-policy.json

Oh, and if compiling where platform.machine() == "arm64" then I need to not add the AVX2 compiler flag.

The non-setup.py packaging systems I've looked at are for Python-only code bases. Or, if I understand things correctly, I'm supposed to make a new specialized package which implements PEP 518, which I can then use to boot-strap my code.

Except, that's still going to use effectively arbitrary code during the compilation step (to run the codegen) and still use setup.py to build the extension. So it's not like the evil disappears.

orf · on Sept 7, 2022

To be clear, I’m not suggesting we remove the ability to compile native extensions.

I’m suggesting we find a better way to build them, something a bit more structured, and decouple that specific use case from setup.py.

It would be cool to be able to structure this in a way that means I can describe what system libraries I may need without having to execute setup.py and find out, and express compile time flags or options in a structured way.

Think of it like cargo.toml va build.rs.

dalke · on Sept 7, 2022

I agree it would be cool and useful.

But it appears to be such a hard problem that modern packaging tools ignore it, preferring to take on other challenges instead.

My own attempts at extracting Python configuration information to generate a Makefile for personal use (because Makefile understand dependencies better than setup.py) is a mess caused by my failure to understand what all the configuration options do.

Given that's the case, when do you think we'll be able to "Remove setup.py files and mandate wheels"?

I'm curious on what evils you're thinking of? I assume the need to run arbitrary Python code just to find metadata is one of them. But can't that be resolved with a pyproject.toml which uses setuptools only for the build backend? So you don't need to remove setup.py, only restrict when it's used, yes?

infogulch · on Sept 7, 2022

The closest thing I've seen to a solution in this space is Riff, discussed yesterday [1], which solves the external dependency problem for rust projects.

[1]: https://news.ycombinator.com/item?id=32739954

dalke · on Sept 8, 2022

Agreed.

In my answers to the survey, I mentioned "nix" was the technology most likely to affect the future of Python packaging, in part because of reading that same article on Riff.

I think now I should have mentioned Riff too.

dec0dedab0de · on Sept 7, 2022

The ability to create a custom package that can run any custom code you want at install time is very powerful. I think a decent solution would be to have a way to mark a package as trusted, and only allow pre/post scripts if they are indeed trusted. Maybe even have specific permissions that can be granted, but that seems like a ton of work to get right across operating systems.

My specific use cases are adding custom CA certs to certifi after it is installed, and modifying the maximum version of a requirement listed for an abandoned library that works fine with a newer version.

I think the best solutions would be an official way to ignore dependencies for a specific package, and specify replacement packages in a project's dependencies. Something like this if it were a Pipfile:

  public-package = {version = "~=1.0",replace_with='path/to/local-package'}
  abandoned-package = {version = "~=*",ignore_dependencies=True}

But the specific problem doesn't matter, what matters is that there will always be exceptions. This is Python, we're all adults here, and we should be able to easily modify things to get them to work the way we want them to. Any protections added should include a way to be dangerous.

I know your point is more about requiring static metadata than using wheels per se. I just believe that all things Python should be flexible and hack-able. There are other more rigid languages if you're into that sort of thing.

edit:

before anyone starts getting angry I know there are other ways to solve the problems I mentioned.

forking/vendoring is a bit of overkill for such a small change, and doesn't solve for when a dependency of a dependency needs to be modified.

monkeypatching works fine, however it would need to be done at all the launch points of the project, and even then if I open a repl and import a specific module to try something it won't have my modifications.

modifying an installed package at runtime works reasonably well, but it can cause a performance hit at launch, and while it only needs to be run once, it still needs to be run once. So if the first thing you do after recreating a virualenv is to try something with an existing module we have the same problem as monkey patching.

'just use docker' or maybe the more toned down version: 'create a real setup script for developers' are both valid solutions, and where I'll probably end up. It was just very useful to be able to modify things in a pinch.

Spiritus · on Sept 8, 2022

Well we're almost there I think. You can define dependencies and other metadata in pyproject.toml nowadays:

https://setuptools.pypa.io/en/latest/userguide/pyproject_con...

dalke · on Sept 8, 2022

How do I specify that I need gfortran installed?

Spiritus · on Sept 8, 2022

You can't. But is that possible with any programming language specific package manager? How would that even work given that every flavour of OS/distro have their own way of providing gfortran?

dalke · on Sept 8, 2022

You can't. But my g'parent comment in this thread was because my Python module needs the OpenMP library, or compile-time detection that it wasn't there, to skip OpenMP support. The latter is done by an environment variable which my setup.py understands.

Then orf dreamed of a day where you could "describe what system libraries I may need without having to execute setup.py and find out, and express compile time flags."

The link you pointed doesn't appear to handle what we were talking about. By specifying "gfortran", I hoped to highlight that difference.

riff, building on nix, seems an intriguing solution for this.

korijn · on Sept 7, 2022

I emphatize with your situation and it's a great example. As crazy as this may sound, I think you would have to build every possible permutation of your library and make all of them available on pypi. You'd need a some new mechanism based on metadata to represent all the options and figure out how to resolve against available system libraries. Especially that last part seems very complicated. But I do think it's possible.

pabs3 · on Sept 8, 2022

> not add the AVX2 compiler flag

It is a better idea to do instruction selection at runtime in the code that currently uses AVX2. I recently wrote some docs for Debian contributors about the different ways to achieve this:

https://wiki.debian.org/InstructionSelection

dalke · on Sept 8, 2022

I do that, using manual CPUID tests, along with allowing environment variables to override the default path choices.

But if the compiler by default doesn't enable AVX2 then it will fail to compile the AVX2 intrinsics unless I add -mavx2.

Even worse was ~10 years ago when I had an SSSE3 code path, with one file using SSSE3 intrinsics.

I had to compile only that file for SSSE3, and not the rest of the package, as otherwise the compiler would issue SSSE3 instructions where it decided was appropriate. Including in code that wasn't behind a CPUID check.

Thus crash on hardware without SSSE3.

See https://stackoverflow.com/questions/15527611/how-do-i-specif... for more info about my solution. Someone last year contributed a solution for MS Windows.

pabs3 · on Sept 8, 2022

See the wiki page, the function multi-versioning stuff means you can use AVX2 in select functions without adding -mavx2. And using SIMD Everywhere you can automatically port that to ARM NEON, POWER AltiVec etc.

dalke · on Sept 8, 2022

EDIT: after I wrote the below I realize I could use automatic multi-versioning solely to configure the individual functions, along with with a stub function indicating "was compiled for this arch?" I think that might be more effective should I need to revisit how I support multiple processor architecture dispatch. I will still need the code generation step.

Automatic multi-versioning doesn't handled what I needed, at least not when I started.

I needed a fast way to compute the popcount.

10 years ago, before most machines supported POPCNT, I implemented a variety of popcount algorithms (see https://jcheminf.biomedcentral.com/articles/10.1186/s13321-0... ) and found that the fastest version depended on more that just the CPU instruction set.

I ended up running some timings during startup to figure out the fastest version appropriate to the given hardware, with the option to override it (via environment variables) for things like benchmark comparisons. I used it to generate that table I linked to.

Function multi-versioning - which I only learned about a few month ago - isn't meant to handle that flexibility. To my understanding.

I still have one code path which uses __popcountll built-in intrinsics and another which has inline POPCNT assembly, so I can identify when it's no longer useful to have the inline assembly.

(Though I used AVX2 if available, I've also read that some of the AMD processors have several POPCNT execution ports, so may be faster than using AVX2 for my 1024-bit popcount case. I have the run-time option to choose which to use, if I ever have access to those processors.)

Furthermore, my code generation has one path for single-threaded use and one code path for OpenMP, because I found single-threaded-using-OpenMP was slower than single-threaded-without-OpenMP and it would crash on multithreaded macOS programs, due to conflicts between gcc's OpenMP implementation and Apple's POSIX threads implementation.

The AVX2 popcount is from Muła, Kurz, and Lemire, https://academic.oup.com/comjnl/article-abstract/61/1/111/38... , with manually added prefetch instructions (implemented by Kurz). It does not appear that SIMD Everywhere is the right route for me.

pabs3 · on Sept 8, 2022

If you implement your own ifunc instead of using the compiler-supplied FMV ifunc, you could do your benchmarks from your custom ifunc that runs before the program main() and choose the fastest function pointer that way. I don't think FMV can currently do that automatically, theoretically it could but that would require on additional modifications to GCC/LLVM. From the sounds of it, running an ifunc might be too early for you though, if you have to init OpenMP or something non-stateless before benchmarking.

SIMD Everywhere is for a totally different situation; if you want to automatically port your AVX2 code to ARM NEON/etc without having to manually rewrite the AVX2 intrinsics to ARM ones.

blibble · on Sept 7, 2022

> The mission statement they are proposing, “a packaging ecosystem for all”, completely misses the mark. How about a “packaging ecosystem that works” first?

I think at the point a programming language is going on about "mission statements" for a packaging tool, you know they've lost the plot

copy Maven from 2004 (possibly with less XML)

that's it, problem solved

ziml77 · on Sept 7, 2022

I tend to just give up on a package if it requires a C toolchain to install. Even if I do end up getting things set up in a way that the library's build script is happy with, I'll be inflicting pain on anyone else who then tries to work with my code.

cycomanic · on Sept 7, 2022

I know this is unpopular opinion on here, but I believe all this packaging madness is forced on us by languages because Windows (and to a lesser degree osx) have essentially no package management.

Especially installing a tool chain to compile C code for python is no issue on Linux, but such a pain on Windows.

chlorion · on Sept 8, 2022

It may be unpopular but it's correct!

Every language tries to re-implement the package manager, but it ends up breaking down as soon as you need to interact with anything outside of that specific language's ecosystem. The only solution for interacting with the "outside" (other languages, toolchains, etc) is a system level, language agnostic package manager of some kind.

Linux distros package management is far from perfect but it's still miles ahead of the alternatives!

I very highly recommend people to learn how to write and create Linux packages if they need to distribute software. On Arch for example this would be creating PKGBUILDs, Gentoo has ebuilds, and other distros have something similar to these things.

Const-me · on Sept 8, 2022

> it ends up breaking down as soon as you need to interact with anything outside of that specific language's ecosystem.

It works OK in nuget on Windows, due to the different approach taken by the OS maintainer.

A DLL compiled 25 years ago for Windows NT stills work on a modern Windows, as long as the process is 32-bit. A DLL compiled 15 years ago for 64-bit Vista will still work on a modern Windows without issues at all.

People who need native code in their nuget packages are simply shipping native DLLs, often with static link to C runtime. Probably the most popular example of such package is SQLite.

> I very highly recommend people to learn how to write and create Linux packages if they need to distribute software.

I agree. When working on embedded Linux software where I control the environment and don’t care about compatibility with different Linuxes or different CPU architectures, I often package my software into *.deb packages.

humanrebar · on Sept 7, 2022

C tends to work in those cases because there aren't a significant number of interesting C dependencies to add... because there is no standard C build system, packaging format, or packaging tools.

When juggling as many transitive dependencies in C as folks do with node, python, etc., there's plenty of pain to deal with.

tux3 · on Sept 7, 2022

It feels so suboptimal to need the C toolchain to do things, but having no solid way to depend on it as a non-C library (especially annoying in Rust, which insists on building everything from source and never installing libraries globally).

I make a tool/library that requires the C toolchain at runtime. That's even worse than build time, I need end users to have things like lld, objdump, ranlib, etc installed anywhere they use it. My options are essentially:

- Requiring users to just figure it out with their system package manager

- Building the C toolchain from source at build time and statically linking it (so you get to spend an hour or two recompiling all of LLVM each time you update or clear your package cache! Awesome!),

- Building just LLD/objdump/.. at build-time (but user still need to install LLVM. So you get both slow installs AND have to deal with finding a compatible copy of libLLVM),

- Pre-compiling all the C tools and putting them in a storage bucket somewhere, for all architectures and all OS versions. But then not have support when things like the M1 or new OS versions right away, or people on uncommon OSes. And now need to maintain a build machine for all of these myself.

- Pre-compile the whole C toolchain to WASM, build Wasmtime from source instead, and just eat the cost of Cranelift running LLVM 5-10x slower than natively...

I keep trying to work around the C toolchain, but I still can't see any very good solution that doesn't make my users have extra problems one way or another.

Hey RiiR evangelism people, anyone want to tackle all of LLVM? .. no? No one? :)

gwenzek · on Sept 8, 2022

I feel Zig could help here. The binaries ship with LLVM statically linked. You could rely on them to provide binaries for a variety of architecture / OS, and use it to compile code on the target machine. I'll probably explore this at some point for Pip.

korijn · on Sept 7, 2022

...and ensure _all_ package metadata required to perform dependency resolution can be retrieved through an API (in other words without downloading wheels).

orf · on Sept 7, 2022

Yeah, that’s sort of what I meant by my suggestion. Requirements that can only be resolved by downloading and executing code is a huge burden on tooling

LukeShu · on Sept 7, 2022

If the package is available as a wheel, you don't need to execute code to see what the requirements are; you just need to parse the "METADATA" file. However, the only way to get the METADATA for a wheel (using PyPA standard APIs, anyway) is to download the whole wheel.

For comparison, pacman (the Arch Linux package manager) packages have fairly similar ".PKGINFO" file in them; but in order to support resolving dependencies without downloading the packages, the server's repository index includes not just a listing of the (name, version) tuple for each package, it also includes each package's full .PKGINFO.

Enhancing the PyPA "Simple repository API" to allow fetching the METADATA independently of the wheel would be a relatively simple enhancement that would make a big difference.

----

As I was writing this comment, I discovered that PyPA did this; they adopted PEP 658 in March of this year! https://github.com/pypa/packaging.python.org/commit/1ebb57b7...

jbylund · on Sept 8, 2022

Pip can use range requests to fetch just a part of the wheel, and lift the metadata out of that. So it can sometimes avoid downloading the entire wheel just to get the deps. Some package servers don't support this though.

Also, there's a difference between a pep being adopted and that pep being implemented (usually a bunch of elbow grease). That said there are a couple exciting steps towards 658 being implemented: https://github.com/pypa/pip/pull/11111 (just approved yesterday, not yet merged) https://github.com/pypi/warehouse/issues/8254 (been open for forever, but there has been incremental progress made. Warehouse seems to not attract the same amount of contribution as pip)

korijn · on Sept 7, 2022

Yeah. Well, mandating wheels and getting rid of setup.py at least avoids having to run scripts, and indeed enables the next step which would be indexing all the metadata and exposing it through an API. I just thought it wouldn't necessarily be obvious to all readers of your comment.

orf · on Sept 7, 2022

Just to be clear, package metadata already is sort of available through the pypi json api. I've got the entire set of all package metadata here: https://github.com/orf/pypi-data

  $ gzcat release_data/c/d/cdklabs.cdk-hyperledger-fabric-network.json.gz | jq '. | to_entries | .[].value.info.requires_dist' | head
  [
    "typeguard (~=2.13.3)",
    "publication (>=0.0.3)",
    "jsii (<2.0.0,>=1.63.2)",
    "constructs (<11.0.0,>=10.0.5)",
    "aws-cdk-lib (<3.0.0,>=2.33.0)"
  ]

It's just not everything has it, and there isn't a way to differenciate between "missing" and "no dependencies". And it's also only for the `dist` releases. But anyway, poetry uses this information during dependency resolution.

korijn · on Sept 8, 2022

I'm aware! The issue is the mixed content (wheels, and... not wheels) on pypi. If the data is incomplete, it's useless in the sense that you're never going to be able to guarantee good results.

dalke · on Sept 7, 2022

What if I have a dependency on a commercial third-party Python package which is on Conda but not on PyPI?

mistrial9 · on Sept 7, 2022

you are placing open code in a vendor lock-in, to start

dalke · on Sept 7, 2022

Yes, I understand that.

I see I misunderstood korijn's comment. My earlier reply is off-topic, so I won't continue further off the track.

progval · on Sept 7, 2022

> For improvements I commented: Remove setup.py files and mandate wheels.

This would make most C extensions impossible to install on anything other than x86_64-pc-linux-gnu (or arm-linux-gnueabihf/aarch64-linux-gnu if you are lucky) because developers don't want to bother building wheels for them.

mathstuf · on Sept 7, 2022

I think it'd make other things impossible too. One project I help maintain is C++ and is mainly so. It optionally has Python bindings. It also has something like 150 options to the build that affect things. There is zero chance of me ever attempting to make `setup.py` any kind of sensible "entry point" to the build. Instead, the build detects "oh, you want a wheel" and generates `setup.py` to just grab what the C++ build then drops into a place where `build_ext` or whatever expects them to be using some fun globs. It also fills in "features" or whatever the post-name `[name]` stuff is called so you can do some kind of post-build "ok, it has a feature I need" inspection.

urschrei · on Sept 7, 2022

cibuildwheel (which is an official, supported tool) has made this enormously easier. I test and generate wheels with a compiled (Rust! Because of course) extension using a Cython bridge for all supported Python versions for 32-bit and 64-bit Windows, macOS x86_64 and arm64, and whatever manylinux is calling itself this week. No user compilation required. It took about half a day to set up, and is extremely well documented.

kevin_thibedeau · on Sept 7, 2022

Setup.py can do things wheels can't. Most notably it's the only installation method that can invoke 2to3 at runtime without requiring a dev to create multiple packages.

orf · on Sept 7, 2022

It’s lucky Python 2 isn’t supported anymore then, and everyone has had like a decade to run 2to3 once and publish a package for Python 3, so that use case becomes meaningless.

coldtea · on Sept 7, 2022

You'd be surprised at how many billions lines of production code are still at 2 (and could not care less whether it's end-of-lined)

orf · on Sept 7, 2022

I'm not surprised at all, but regardless they also should not be similarly surprised if people could not care less about that use case.

mistrial9 · on Sept 7, 2022

very unfortunately the direct burden of python2 is placed on the packagers.. users of Python 2 like their libs (me) and have no horse in this demonization campaign

orf · on Sept 7, 2022

Pay for support for Python 2 then? At which point it’s a burden on the person you are paying. Or don't, in which case you're complaining that people are demonizing you because they are not doing your work for free?

mistrial9 · on Sept 8, 2022

this is battle-fatigue in action! I did not complain, in fact I am faced with the serious burden that is placed on packagers of python, on a regular basis, and have put many cycles of thought into it.. It is obvious that packagers and language designers are firmly in one end of a sort of spectrum.. while many users, perhaps engineering managers in production with sunk costs, are firmly at the other.. with many in between .. more could be said.. no one is an enemy on this, it is complicated to solve it with many tradeoffs and moving parts.. evidence in the topic today.

cozzyd · on Sept 8, 2022

Many "Python" packages include native code in some form either as bindings or to workaround Python being agonizingly slow. Which means you often need to call make, or cmake or some other build system anyway... unless you want to build wheels for every possible configuration a user might have (which is virtually impossible, considering every combination of OS, architecture, debug options, etc. you may want to support). Plus you need a build system to build the wheels anyway...

x3n0ph3n3 · on Sept 8, 2022

I suggested "one packaging system to rule them all." The fragmentation in this space is frustrating.

baggiponte · on Sept 7, 2022

I recommend PDM over poetry!