Fascinating case showing how LLM promoters will happily take "verified" benchmarks at their word.

It's easy to publish "$NEWMODEL received an X% bump in SWE-Bench Verified!!!!".

Proper research means interrogating the traces, like these researchers did (the Gist shows Claude 4 Sonnet): https://gist.github.com/jacobkahn/bd77c69d34040a9e9b10d56baa...

Commentary: https://x.com/bwasti/status/1963288443452051582, https://x.com/tmkadamcz/status/1963996138044096969


The best benchmark is the community vibe in the weeks following a release.

Claude benchmarks poorly but vibes well. Gemini benchmarks well and vibes well. Grok benchmarks well but vibes poorly.

(Yes, I know, you're all gushing with anecdotes; the vibes are simply the approximate color of gray born from the countless black-and-white remarks.)


the vibes are just a collection of anecdotes

"qual"

Yes, often you see huge gains in some benchmark, then the model is run through Aider's polyglot benchmark and doesn't even hit 60%.

One level up => https://learn.microsoft.com/en-us/windows/win32/msi/roadmap-...:

>Package Validation discusses using Internal Consistency Evaluators (ICEs) to test the internal consistency of installation packages that are under development.

See also => https://docs.flexera.com/adminstudio2021r2/Content/helplibra...:

>The internal consistency evaluators (ICEs) are tests that you can run to check whether Windows Installer packages are valid databases that perform as expected. These tests validate the data in each table of a package, as well as the data among tables.
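
If you've never run these: the Windows SDK ships MsiVal2, a small tool that applies a .cub file of ICE rules to a package. A minimal invocation looks roughly like this (the .msi name is a placeholder; darice.cub is the SDK's full ICE suite):

    msival2.exe MyInstaller.msi darice.cub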


>Importantly, we never intentionally degrade model quality as a result of demand or other factors, and the issues mentioned above stem from unrelated bugs.

Sure. I give it a few hours until the prolific promoters start to parrot this apologia.

Don't forget: the black box nature of these hosted services means there's no way to audit for changes to quantization and model re-routing, nor any way to tell what you're actually getting during these "demand" periods.


What offends me is a "security scanner" for "ground truth" using fake checksums to verify integrity of its dependencies ;-)

https://github.com/TheAuditorTool/Auditor/commit/f77173a5517...


Yeah, I don't use nix, so when asked to follow the link? It didn't work as it should. And because I don't use nix? It was hard to catch until my friend did...

That said? Did the hash fail? Yes it did, security working as intended... Anything more to add? :)
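
For anyone following along: the check that fired is plain pinned-checksum verification, same idea as this generic Python sketch (not the actual Auditor or nix code; the path and digest are made up):

    import hashlib

    EXPECTED_SHA256 = "0" * 64  # hypothetical pinned digest

    def verify(path: str) -> None:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                h.update(chunk)
        if h.hexdigest() != EXPECTED_SHA256:
            # A fabricated or stale pin ends up here and the fetch aborts,
            # which is the "hash fail" described above.
            raise RuntimeError(f"checksum mismatch for {path}")

    verify("dependency.tar.gz")  # hypothetical dependency archive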


Describing the commercial offerings as "weird and unintuitive" is a weak criticism palatable to corporate comms teams. It suggests a fault in the user ("you're holding it wrong") rather than deficiencies inherent to LLM architecture. No amount of marketing can fix the lethal trifecta or the hallucination problem, can it?

https://www.anthropic.com/solutions/code-modernization:

    Generate dependency graphs, identify dead code, and prioritize refactoring based on code complexity metrics and business impact.
    Transform legacy codebases systematically while maintaining business continuity.
    Claude Code preserves critical business logic while modernizing to current frameworks.
    Claude Code can seamlessly create unit tests for refactored code, identify missing test coverage, and help write regression tests.
    Identify and patch vulnerabilities while maintaining regulatory compliance patterns embedded in legacy systems.
    Create modern documentation from undocumented legacy code, capturing institutional knowledge before it's lost.

OpenAI actually put out an interesting paper on addressing hallucination yesterday, but I've not spent enough time with it to judge how credible it is: https://openai.com/index/why-language-models-hallucinate/

I don't particularly care how these companies market their software - what I care about is figuring out what these things can actually do and what they're genuinely useful for, then helping other people use them in as productive a way as possible given their inherent flaws.


Sounds like HackerOne Managed Triage Services dropped the ball again and closed both reports without even flagging to Cloudflare's security engineers.

This happened in a high-profile way with the Zendesk situation (https://news.ycombinator.com/item?id=41818459) and it's not the first time:

    1. Bug bounty report received from knowledgeable person who isn't a "celebrity" (top x performer on H1 leaderboard, social media influencer, H1 event invitee)

    2. with novel impact to the company, open source ecosystem, or wider Internet

    3. which doesn't fall neatly into an OWASP Top 10 (Web) box

    4. so Triage close it in the pre-queue before the company get eyes on it, replying with a zero-effort CR (Common Response aka Canned Response)

    5. the company doesn't see the report unless they go digging for it in the thousands of spam/bullshit/Acunetix copypaste reports that are also closed
---

Timeline of events:

https://blog.cloudflare.com/unauthorized-issuance-of-certifi...

>2025-09-02 04:50:00: Report shared with us on HackerOne, but was mistriaged

>2025-09-03 02:35:00: Second report shared with us on HackerOne, but also mistriaged.

>2025-09-03 10:59:00: Report sent on the public mailing [list] picked up by the team.

---

The canned response in question:

https://groups.google.com/g/certificate-transparency/c/we_8S...

>"after reviewing your submission it appears this behavior does not pose a concrete and exploitable risk to the platform in and on itself.

>If you're able to demonstrate any impact please let us know, and provide an accompanying working exploit."


My HackerOne dismissal reads

"Although your finding might appear to be a security vulnerability, after reviewing your submission it appears this behavior does not pose a concrete and exploitable risk to the platform in and on itself. If you're able to demonstrate any impact please let us know, and provide an accompanying working exploit."

I was disappointed, and as far as I'm concerned, HackerOne is 2/2 dismissals.


The linked page explains:

    Magic Lantern is a free software add-on that runs from the SD/CF card and adds a host of new features to Canon EOS cameras that weren't included from the factory by Canon.
I also found this concise, human-written readme on the project page. Since it's not AI slop churned out by a startup, it's worth reading! :-)))

https://github.com/reticulatedpines/magiclantern_simplified/...

    Magic Lantern
    =============

    Magic Lantern (ML) is a software enhancement that offers increased
    functionality to the excellent Canon DSLR cameras.
      
    It's an open framework, licensed under GPL, for developing extensions to the
    official firmware.

    Magic Lantern is not a *hack*, or a modified firmware, **it is an
    independent program that runs alongside Canon's own software**. 
    Each time you start your camera, Magic Lantern is loaded from your memory
    card. Our only modification was to enable the ability to run software
    from the memory card.

    ML is being developed by photo and video enthusiasts, adding
    functionality such as: HDR images and video, timelapse, motion
    detection, focus assist tools, manual audio controls, and much more.

    For more details on Magic Lantern please see [http://www.magiclantern.fm/](http://www.magiclantern.fm/)

    There is a sibling repo for our patched version of Qemu that adds support
    for emulating camera ROMs. This allows testing without access to a physical
    camera, and automating tests across a suite of cameras.  
    https://github.com/reticulatedpines/qemu-eos  
    https://github.com/reticulatedpines/qemu-eos/tree/qemu-eos-v4.2.1 (current ML team supported branch)

>Notable because often when people complain of degraded model quality it turns out to be unfounded - Anthropic in the past have emphasized that they don't change the model weights after releasing them without changing the version number.

It's almost as if companies whose bottom line depends on shilling the [CURRENT HOT THING] will lie to you!

Instead of using appeals to authority to silence critics as "conspiracy theorists" spreading "misinformation", the influencers should maybe apply some reasoning and thinking.


Of course it does. Do you really think you have full control over API output? Do you really think "the system prompt you can specify in an API call" is the system prompt and not the developer instructions prompt?
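
To make that concrete with the OpenAI Python SDK (a sketch, not gospel: the model name is illustrative, and the remapping of "system" to "developer" is per OpenAI's own docs for their newer models):

    from openai import OpenAI  # official OpenAI SDK

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative
        messages=[
            # You control this message, but it sits *below* the provider's
            # own hidden instructions in the instruction hierarchy:
            {"role": "system", "content": "Answer in one short sentence."},
            {"role": "user", "content": "Why is the sky blue?"},
        ],
    )
    print(resp.choices[0].message.content)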


I think the decisive point is that it should be documented. Security through obscurity all by itself isn't security.


    No secret. Just vibes.
Since you know the tells of LLM-generated text, you'll know that this is a classic: No X. Just Y.

    Proxyman -- pick your poison.

    And if you're from PureGym reading this—let's talk.
There's a mixture of em dashes joining words and double hyphens spaced between words, suggesting the former were missed in a find-and-replace job.

"And if you're from [COMPANY] reading this[EM DASH]let's talk" is a classic GPT-ism.

    It's like the API is saying "Hey buddy, I know this is odd, but can you poll me every minute? Thanks, love you too."

    Shame Notifications: "You were literally 100 meters from the gym and walked past it"

    It's just a ZIP archive with delusions of grandeur
Clear examples of fluff. Not only do these fail to "add facts or colour to the story", they actually detract from it.

I agree with you that em dashes in isolation are not indicative, but the prose here is dripping with GPT-speak.


OP here! Appreciate you actually pulling examples instead of just dropping "this is AI".

> There's a mixture of em dashes joining words and double hyphens spaced between words, suggesting the former were missed in a find and replace job.

The em dash conspiracy in the comments today is amazing -- I type double hyphens everywhere, and some apps (e.g. a Telegram bot I made for drafts, or macOS's built-in auto-correct) replace them with em dashes automatically–I never bother to edit those out (ok, now this one I put here on purpose).

> It's just a ZIP archive with delusions of grandeur

> Clear examples of LLM fluff that don't "add facts or colour to the story".

Yeah, no that's fair enough, should've known better than to attempt humour on HN.

I've got to say though, pkpass is a ZIP archive, and no ZIP archive should require one to spend 3 hours to sign it.
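
To be concrete about the "just a ZIP" bit, here's the entire inspection step in Python (the filename is a placeholder):

    import zipfile

    # A .pkpass opens with any ZIP reader; the signing is the hard part.
    with zipfile.ZipFile("pass.pkpass") as z:
        print(z.namelist())  # e.g. pass.json, manifest.json, signature, icon.png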


I enjoyed the humour. (We’re heading towards a sad world if any attempt at levity in an article is interpreted as evidence of LLM usage by critical killjoys.)

Edit: totally random thought: something in your prose shouted ‘Brit’ to me very quickly. Is it possible that part of this is simply cultural differences in humour and writing, and over-interpretation of subtle differences as evidence of LLM use?

Or do LLMs just write in a subtly more British style because, well, Shakespeare and Dickens and Keats and Milton? Or does ChatGPT just secretly channel PG Wodehouse?


Authors use humour as a form of connection with their audience. It's a way of saying hey I'm a human and I have the same human experiences as you dear reader. Take the first paragraph for example:

> Wednesday, 11:15 AM. I'm at the PureGym entrance doing the universal gym app dance. Phone out, one bar of signal that immediately gives up because apparently the building is wrapped in aluminum foil

It says, "Hey I'm a human who goes to the gym and experiences the same frustrations as you do". Now imagine for a second this paragraph was written by AI. The AI has never been to the gym, the AI doesn't feel impatience trying to pass through the turnstile, the AI has never experienced the anxiety of a dodgy internet connection in a large commercial building. The purpose of any humour in this paragraph is completely undermined if you assume it was actually written by AI.

So please don't conflate being anti-LLM with being anti-humour. It's just the opposite. We want humour because we want to feel a connection with our fellow humans and for the same reason we should also want writing that comes from a human, not a machine.


> So please don't conflate being anti-LLM with being anti-humour. It's just the opposite.

I'm not.

I'm trying to analyse, or hypothesise, why this author's particular writing style seemed to trigger people's nascent LLM warning heuristics.

I considered the humour, because, well, other people brought it up. From the surrounding discussion, it seemed that the jocular writing style was one of the points generating suspicion.


Does sound like some people just don't get the humour, which is fine; personally I liked it (but then I am British).

British people do tend to have a fairly humorous, indirect way of communicating that can take some getting used to for people from other cultures, but that doesn't mean we're all secretly LLMs.


FWIW, I found "It's just a ZIP archive with delusions of grandeur" pretty funny and for me it was an example of a human adding (relevant) colour to the content.

I swear some folks have just been normalised to the shit writing that AI does, so much so that they look for tricks like punctuation rather than just reading the damn text. Although maybe they're just blatting the whole thing into ChatGPT and asking it to summarise, or to determine if it's AI-generated.


FWIW I enjoyed the article and the humour, and I don't know where the AI conspiracy is coming from – I wish I could get the AI to write copy this good. So thanks, that was a fun read!


> I don't know where the AI conspiracy is coming from

It has become a trope to label any text that includes an em-dash as AI writing.


not sure why you're being downvoted here, you're completely right

