I don't know, your experience doesn't match mine. NotebookLM by Google is in a cl...

mike_hearn · 2025-04-12T11:23:56 1744457036

Gemini with coding seems to be a bit of a mixed bag.

The article claims Gemini is acing the Aider Polyglot benchmark. At the moment this is the only benchmark that really matters to me, because Aider is actually a useful tool and performance on it translates directly to real-world impact, although Claude Code is even better. If you look closely, though, Gemini is at the top only in the "percent correct" category, not in "percent correct using the right edit format". Cost is marked as ? because it's not entirely available yet (I think?). Output in the wrong edit format is pretty useless because the changes won't apply and the tool has to try again.
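To make the "changes won't apply" point concrete, here is a minimal sketch of how a tool might reject a malformed edit. The SEARCH/REPLACE block syntax and the `apply_edit` helper are assumptions for illustration only, loosely in the style of search/replace edit blocks; this is not a claim about how Aider actually parses model output.

```python
import re
from typing import Optional

# Hypothetical edit format (an assumption for this sketch): a single
# SEARCH/REPLACE block delimited by conflict-marker style lines.
EDIT_BLOCK = re.compile(
    r"<<<<<<< SEARCH\n(?P<search>.*?)\n=======\n(?P<replace>.*?)\n>>>>>>> REPLACE",
    re.DOTALL,
)


def apply_edit(source: str, model_output: str) -> Optional[str]:
    """Return the edited source, or None if the edit cannot be applied."""
    match = EDIT_BLOCK.search(model_output)
    if match is None:
        # Wrong format: there is nothing the tool can apply, so it must retry.
        return None
    search, replace = match["search"], match["replace"]
    if source.count(search) != 1:
        # The text to replace is missing or ambiguous in the file.
        return None
    return source.replace(search, replace, 1)
```

A caller would loop on `None` and ask the model again, which is where the extra latency and cost of a wrong-format answer comes from, even when the underlying code change was right.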

Claude, in contrast, almost never makes a mistake when emitting the right format. It's at 97%+ in the benchmark, and in practice it's ~100% in my experience. This tracks: Claude is really good at following instructions. Gemini is at ~90%. This makes a big difference to how frustrating a tool is to use in practice.
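As a rough back-of-the-envelope sketch of what those percentages mean per session: the two helpers below are hypothetical, and they assume each attempt succeeds or fails independently at the quoted format rate, which real sessions may not do.

```python
def expected_attempts(format_success_rate: float) -> float:
    """Mean tries until one well-formed edit, assuming independent attempts."""
    if not 0.0 < format_success_rate <= 1.0:
        raise ValueError("rate must be in (0, 1]")
    return 1.0 / format_success_rate


def expected_failures(format_success_rate: float, edits: int) -> float:
    """Mean number of malformed first attempts across `edits` requests."""
    return edits * (1.0 - format_success_rate)
```

Under that assumption, 100 edit requests at a 90% format rate mean about 10 malformed first attempts, versus about 3 at 97%, so roughly three times as many interruptions.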

They might get that fixed, but my experience has been that Google's models are consistently much more likely to refuse instructions for dumb reasons. Google is the company with by far the biggest purity spiral problem and it does show up in their output even when doing apparently ordinary tasks.

I'm also concerned by this event: https://news.sky.com/story/googles-ai-chatbot-gemini-tells-u...

Given how obsessed Google claimed to be with AI safety, I expected an SRE-style postmortem after that, and there was bupkis. An AI that can suffer a psychotic break out of nowhere like that is one I wouldn't trust unless it's behind a very strong sandbox and being supervised very closely, but none of the AI tools today offer much in the way of sandboxing.

ramraj07 · 2025-04-12T10:23:59 1744453439

Time for my next round of evals then. I had a 40-PR coding streak last weekend with mostly o3-mini-pro, will test the latest 2.5 now.

chasd00 · 2025-04-12T10:52:53 1744455173

PR = pull request? So every bit of garbage from the LLM, over and over, resulted in an individual pull request? Why not just do one when your branch is finally right?

Philpax · 2025-04-12T11:11:43 1744456303

Presumably because they were discrete changes (i.e. new features), and it didn't make sense to group them together.

egeozcan · 2025-04-12T11:18:58 1744456738

Or it could be just microservices. One larger feature affecting 100 repositories.

ramraj07 · 2025-04-12T22:49:32 1744498172

A pull request in my workplace is an actual feature/enhancement/bug-fix. That many PRs means I shipped that many features or enhancements.

I suppose you don't know what a PR is because you likely still work in an environment without modern version control, probably just now migrating your rants from vim vs emacs to crapping on vibe coding.

In my experience, AI today is an intelligence multiplier. A lot of folks just need to look back at the zero they keep multiplying and getting zero back to understand why they don't get the hype.

fuzzy_biscuit · 2025-04-12T12:51:20 1744462280

I would assume they don't like that style, like if they needed to see a specific diff and make changes or remove a commit outright.

fragmede · 2025-04-12T11:58:28 1744459108

Why would you assume that?

retinaros · 2025-04-12T10:30:39 1744453839

In what world isn't NotebookLM RAG as well?

epolanski · 2025-04-12T13:09:17 1744463357

I thought it leveraged a much larger context than classical RAG.