NotebookLM by Google is in a class of its own for the "provide documents, then chat or ask questions about them" use case for personal use. ChatGPT and Claude are nowhere near. ChatGPT uses RAG, so it "understands" less about the topic and sometimes hallucinates.
When it comes to coding, Claude 3.5/3.7, embedded in Cursor or standalone, kept giving better results in real-world coding, and even there Gemini 2.5 blew it away in my experience.
Antirez, creator of hping and Redis among many other things, releases a video on AI pretty much every day (albeit in Italian), and in his tests where the models review his PRs for Redis, Gemini's reviews are by far the best of all the models available.
Gemini with coding seems to be a bit of a mixed bag.
The article claims Gemini is acing the Aider Polyglot benchmark. At the moment this is the only benchmark that really matters to me, because Aider is actually a useful tool and performance on it translates directly to real-world impact, although Claude Code is even better. If you look closely, Gemini is in fact at the top only in the "percent correct" category, not in "percent correct using the right edit format". Cost is marked as ? because it's not entirely available yet (I think?). Output that isn't in the correct edit format is pretty useless, because the changes won't apply and the tool has to try again.
Claude, in contrast, almost never makes a mistake with emitting the right format. It's at 97%+ in the benchmark, and in practice it's ~100% in my experience. This tracks: Claude is really good at following instructions. Gemini is at ~90%. This makes a big difference to how frustrating a tool is to use in practice.
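To put rough numbers on why 97% vs ~90% feels so different, here's a back-of-the-envelope sketch. It assumes (my simplification, not something the benchmark reports) that the format rate applies per edit and that edits are independent; the function names are made up for illustration.

```python
def all_edits_apply(p_format_ok: float, n_edits: int) -> float:
    """Probability that n independent edits all come back in the right format."""
    return p_format_ok ** n_edits


def expected_attempts(p_format_ok: float) -> float:
    """Mean number of attempts until one well-formed edit (geometric distribution)."""
    return 1.0 / p_format_ok


# Rates are the approximate figures quoted above, treated as per-edit
# probabilities (an assumption for illustration only).
for name, p in [("Claude (~97%)", 0.97), ("Gemini (~90%)", 0.90)]:
    print(
        f"{name}: 10 clean edits in a row {all_edits_apply(p, 10):.0%}, "
        f"attempts per edit {expected_attempts(p):.2f}"
    )
```

Under those assumptions, a session of ten edits goes through without a single retry roughly 74% of the time at 97%, but only about 35% of the time at 90%. The per-edit gap looks small; the per-session gap doesn't.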
They might get that fixed, but my experience has been that Google's models are consistently much more likely to refuse instructions for dumb reasons. Google is the company with by far the biggest purity spiral problem and it does show up in their output even when doing apparently ordinary tasks.
Given how obsessed Google claimed to be with AI safety, I expected an SRE-style postmortem after that, and there was bupkis. An AI that can suffer a psychotic break out of nowhere like that is one I wouldn't trust unless it's behind a very strong sandbox and being supervised very closely, but none of the AI tools today offer much in the way of sandboxing.
PR = pull request? So every bit of garbage from the LLM, over and over, resulted in an individual pull request? Why not just do one when your branch is finally right?
A pull request in my workplace is an actual feature/enhancement/bug-fix. That many PRs means I shipped that many features or enhancements.
I suppose you don't know what a PR is because you likely still work in an environment without modern version control, probably just now migrating your rants from vim vs emacs to crapping on vibe coding.
In my experience, AI today is an intelligence multiplier. A lot of folks just need to look back at the zero they keep multiplying and getting zero back to understand why they don't get the hype.