I haven't personally tried the specific tool that you have, but I have tried a variety of other tools and have had pretty negative experiences with them. I have received a lot of feedback telling me that if I tried out an agentic tool (or a different model, or etc etc etc; as I covered in the post, the goalposts are endlessly moving) I would like it, because the workflow is different.
I was deliberately vague about my direct experiences because I didn't want anyone to do… well, basically this exact reply, "but you didn't try my preferred XYZ workflow, if you did, you'd like it".
What I saw reflected in your repo history was the same unpleasantness that I'd experienced previously, scaled up into a production workflow to be even more unpleasant than I would have predicted. I'd assumed that the "agentic" stuff I keep hearing about would have reduced this sort of "no, you screwed up" back-and-forth. What made it particularly jarring was that it came from someone for whom I have a lot of respect (I was a BIG fan of Sandstorm, and really appreciated the design aesthetic of Cap'n Proto, although I've never used it).
As a brutally ironic coda about the capacity of these tools for automated self-delusion at scale, I believed the line "Every line was thoroughly reviewed and cross-referenced with relevant RFCs, by security experts with previous experience with those RFCs.", and in the post I accepted the premise that it worked. You're not a novice here; you're on a short list of folks with world-class appsec chops that I would select for a dream team in that area. And yet, as others pointed out to me post-publication, CVE-2025-4143 and CVE-2025-4144 call into question the efficacy of "thorough review" as a mechanism for spotting the sort of common errors this kind of workflow is likely to generate, errors that 0xabad1dea called out 4 years ago now: https://gist.github.com/0xabad1dea/be18e11beb2e12433d93475d7...
Having hand-crafted a few embarrassing CVEs myself with no help from an LLM, I want to be sure to contextualize the degree to which this is a "gotcha" that proves anything. The main thrust of the post is that it is grindingly tedious to conclusively prove anything at all in this field right now. And even experts make dumb mistakes; this is why the CVE system exists. But it does very little to disprove my general model of the likely consequences of scaled-up LLM use for coding, either.
I do feel that the agentic thing is what made all the difference to me. The stuff I tried before that seemed pretty lame. Sorry, I know you were trying to avoid that exact comment, but it is true in my case. To be clear, I am not saying that I think you will like it. Many people don't, and that's fine. I am just saying that I didn't think I would like it, and I turned out to be wrong. So it might be worth trying.
The CVE is indeed embarrassing, particularly because the specific bug was on my list of things to check for... and somehow I didn't. I don't know what happened. And now it's undermining the whole story. Sigh.
I appreciate your commitment to being open to the possibility of being surprised. And I do wish I _could_ find a context in which I could be comfortable doing this type of personal experiment. But I do remain confident in the particular course of action I've chosen in the face of incomplete information.
Again, it's tough to talk about this while constantly emphasizing that the CVE is at best a tiny little data point, not anywhere close to a confirmation bullseye, but my model of this process would account for it. And the way it accounts for it is with what I guess I need to coin a term for: "vigilance decay". Sort of like alert fatigue, except there are no alerts; or hedonic adaptation, except you're not actually happy. You need to keep doing the same kinds of checks, over and over, at the same level of intensity, forever, to use one of these tools, and humans are super bad at that; so, at some point in your list, you developed the learned behavior "hey, this thing is actually getting most of this stuff right, I am going to be a little less careful". Resisting this is nigh impossible. The reason it's less of a problem with human code review is that as the human seems to be getting better at not making the mistakes you've spotted before, they actually are getting better at not making those mistakes, so your relaxed vigilance is warranted.
Martin writes on Mastodon: “So apparently dang and the HN crowd are so upset I wrote some messages for HN visitors to our website, that they now banned my home IP address ”
Also, if you really want a "fancy checksum" for verifying package installs, there's already a better feature in pip that actually works and is well-supported: https://pip.pypa.io/en/stable/topics/secure-installs/
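For anyone who wants the concrete mechanics, this is roughly what pip's hash-checking mode looks like: you pin each requirement along with its expected digest and then install with --require-hashes, and pip refuses anything unpinned or with a mismatched digest. A minimal sketch (the sha256 value below is a placeholder, not a real digest):

    # requirements.txt -- each pinned version carries its expected digest
    requests==2.32.3 \
        --hash=sha256:0000000000000000000000000000000000000000000000000000000000000000

    # install, refusing anything unpinned or whose digest doesn't match
    pip install --require-hashes -r requirements.txt

    # pip can compute the digest of a downloaded artifact for you
    pip hash requests-2.32.3-py3-none-any.whl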
There are other efforts underway to mitigate these threats (which could be subject to their own critiques, but let's not get into that here) but PGP has had 20 years to prove its utility in this area and it has resoundingly proved that it (A) does not address the threats it purports to and (B) introduces tons of confusing complexity into processes which are not benefiting from it.
Let me restate that: it is not free to continue supporting PGP. It has a tremendous cost both in its own maintenance and its opportunity cost. Every moment spent attempting to mitigate its fundamentally broken design is a moment that could instead be put into designing something new, that works properly and doesn't require dragging around the massively bloated corpse of 1999-era cryptographic engineering.
To start with, the non-`@dataclass` version here doesn't tell you what types `name` and `age` are (interesting that it's an int, I would have guessed float!). So right off the bat, not only have you had to type every name 3 times, you've also provided me with less information.
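To make the comparison concrete, here's a minimal sketch of the two versions being discussed (field names taken from the example; the class names are mine):

    from dataclasses import dataclass

    # Plain class: every field name gets typed three times (parameter,
    # attribute, value), and nothing tells the reader what type `age` is.
    class PersonPlain:
        def __init__(self, name, age):
            self.name = name
            self.age = age

    # Dataclass: each field named once, with its type visible, plus
    # __init__, __repr__, and __eq__ generated for you.
    @dataclass
    class Person:
        name: str
        age: int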
> I'm not writing an eq method or a repr method in most cases, so it just doesn't add much for the cost.
That's part of the appeal. With vanilla classes, `__repr__`, `__eq__`, `__hash__` et al. are each an independent, complex choice that you have to intentionally make every time. It's a lot of cognitive overhead. If you ignore it, the class might be fit for purpose for your immediate needs, but later, when debugging, inspecting logs, etc., you will frequently have to incrementally add these features to your data structures, often in a haphazard way. Quick, what are the invariants you have to verify to ensure that your `__eq__`, `__ne__`, `__gt__`, `__le__`, `__lt__`, `__ge__` and `__hash__` methods are compatible with each other? How do you verify that an object is correctly usable as a hash key? The testing burden for all of this stuff is massive if you want to do it correctly, so most libraries that try to add all these methods after the fact for easier debugging and REPL usage usually end up screwing it up in a few places and having a nasty backwards compatibility mess to clean up.
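To pick just one of those invariants (this is standard Python behavior, not anything library-specific): the moment you define `__eq__` on a vanilla class without also defining `__hash__`, Python sets `__hash__` to None, and your instances silently stop working as dict keys or set members.

    class Point:
        def __init__(self, x, y):
            self.x, self.y = x, y

        # Defining __eq__ without __hash__ makes instances unhashable:
        # Python sets __hash__ to None behind your back.
        def __eq__(self, other):
            if not isinstance(other, Point):
                return NotImplemented
            return (self.x, self.y) == (other.x, other.y)

    p = Point(1, 2)
    {p: "origin-ish"}   # TypeError: unhashable type: 'Point'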
With `attrs`, not only do you get this stuff "for free" in a convenient way, you also get it implemented consistently and correctly by default, along with an API that lets you enumerate the fields on your value types, serialize them in ways that are much more reliable and predictable than e.g. Pickle, emit schemas for interoperation with other programming languages, automatically generate documentation, provide type hints for IDEs, etc.
Fundamentally attrs is far less code for far more correct and useful behavior.
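A minimal sketch of what that looks like in practice (the class and field names here are mine, purely for illustration):

    import attrs

    @attrs.frozen                  # immutable; __init__, __repr__, __eq__, __hash__ generated
    class User:
        name: str
        age: int

    u = User("Alice", 30)
    print(u)                       # User(name='Alice', age=30)
    print(u == User("Alice", 30))  # True, and hashing is consistent with __eq__
    print([f.name for f in attrs.fields(User)])   # ['name', 'age'] -- fields are enumerable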
Not that attrs or dataclasses has particularly significant attack surface, but when considering stdlib vs. 3rd-party you also have to consider the amount of maintenance and the release cadence. Attrs can release every few months if the rate of change demands it, whereas the stdlib has a fixed yearly release schedule that is tied to interpreter versions. Attrs has a small, focused development team whereas the stdlib is maintained by developers who are stretched very thin, and many packages within it are effectively abandoned. Upgrading dataclasses means upgrading everything in the stdlib at the same time, whereas attrs can be upgraded independently, by itself.
Supply chain attacks are a complex and nuanced topic so there are plenty of reasons to be thoughtful about adopting new dependencies, but it's definitely not as simple as "just use the stdlib for everything".
To the runtime-validation point: our team used attrs with runtime validation enforced everywhere (we even wrote our own wrapper to make it always use validation, with no boilerplate) and this ended up being a massive performance hit, to the point where it was showing up close to the top of most profile stats from our application. Ripping all that out made a significant improvement to interactive performance, with zero algorithmic improvements anywhere else. It really is very expensive to do this type of validation, and we weren't even doing "deep" validation (i.e. validating that a `list[int]` really did contain only `int` objects), which would have been even more expensive.
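For context, the kind of validation I mean is attrs' per-field validators, which run on every construction (and, with `attrs.define`, on every assignment too), so hot paths pay for them over and over. A minimal sketch (the class and field names are mine):

    import attrs

    @attrs.define
    class Order:
        # Each validator runs on every instantiation and every assignment.
        item_ids: list = attrs.field(validator=attrs.validators.instance_of(list))
        quantity: int = attrs.field(validator=attrs.validators.instance_of(int))

    Order(item_ids=[1, 2, 3], quantity=2)    # fine
    # Order(item_ids="oops", quantity=2)     # would raise TypeError from the validator

    # Note that this "shallow" check still wouldn't catch Order(item_ids=[1, "2"], quantity=2).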
Python can be used quite successfully in high-performance environments if you are judicious about how you use it: set performance budgets, measure continuously, make sure to have vectorized interfaces, and have a tool on hand, like PyO3, Cython, or mypyc (you should probably NOT be using C these days, even if "rewrite in C" is how this advice was historically phrased), ready to push very hot loops into something with higher performance when necessary. But if you redundantly validate everything's type on every invocation at runtime, with any significant volume of data it eventually becomes untenable for anything but slow batch jobs.
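By "vectorized interfaces" I mean designing the boundary so that conversion, validation, and Python-level call overhead are paid once per batch rather than once per element. A toy sketch (using numpy purely for illustration; the function names are mine):

    import numpy as np

    # Per-item API: Python call overhead (and any validation) paid per element.
    def scale_one(x: float, factor: float) -> float:
        return x * factor

    # Vectorized API: convert/check once, then do the work in one compiled pass.
    def scale_many(xs, factor: float) -> np.ndarray:
        arr = np.asarray(xs, dtype=float)   # one conversion for the whole batch
        return arr * factor

    values = list(range(1_000_000))
    slow = [scale_one(v, 2.0) for v in values]   # ~1M Python-level calls
    fast = scale_many(values, 2.0)               # one call, one C loop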
This is why consensus is a collection. Sometimes dialectical analysis is good; thesis, antithesis, synthesis is a tried and true formula that often works. But not always! Sometimes one side is just horseshit that gets mindlessly repeated in the interest of “balance”. In either case I’m not suggesting that the consensus ruthlessly censor dissenting views, rather that in order to make sense out of dissenting views, the strongest forms of each argument need to be presented together alongside accountability: editorial moderation and fact-checking.
Even in the cases where the truth really is somewhere in the middle between two opposing camps, reading a sequence of side A #1, side B #1, side A #2, side B #2, in disconnected stories gives you a very skewed view subject to recency bias. For example you can’t easily check the history to see if a claim B is making in their second story was already debunked by A in their first one, and it’s a huge waste of time and energy for A to have to spend all their media budget just refuting that claim over and over because B keeps bringing it up every time there’s no fact checker right in front of them to call them on it (and even sometimes if there is).
In other words the current media environment rewards being loud, wrong, simple and repetitive far over and above even the normal human bias for such things. It reinforces our worst cognitive habits.
The results of the poll are in, and 59.6% of the 699 respondents said that they're either using it already or will be using it within the year. This isn't necessarily a representative sample (my followers are probably on the more advanced end of the Python spectrum), but 700 people (including myself) isn't nothing, either. I think it's likely that type annotations will be available for the majority of Python libraries.