Over the last week or so I have put probably close to 70 hours into playing around with Cursor, Claude Code, and a few other tools (it's become my new obsession), and I've been blown away by how good and reliable it is now. That said, the reality is that in my experience the only models that actually work in any sort of reliable way are Claude models. I don't care what any benchmark says, because the only thing that actually matters is actual use. I'm really hoping this new GPT model actually works for this use case, because competition is great and the price is also great.
I think some of this might come down to stack as well. I watched a t3.gg video[1] recently about Convex[2] and how the nature of it leads to the AI getting it right the first time more often. I've been playing around with it over the last few days and I think I agree with him.
I think the dev workflow is going to fundamentally change, because to maximise productivity out of this you need multiple AIs working in parallel. Rather than jumping straight into coding, we're going to end up writing a bunch of tickets out in a PM tool (Linear[3] looks like it's winning the race atm), working out (or using the AI to work out) which ones can be run in parallel without causing merge conflicts, then pulling multiple tickets into your IDE/terminal and cycling through the tabs, jumping in as needed.
Atm I'm still not really doing this, but I know I need to make the switch, and I'm thinking Warp[4] might be best suited for this kind of workflow, with the occasional switch over to an IDE when you need to jump in and make some edits.
Oh also, to achieve this you need to use git worktrees[5,6,7].
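For anyone who hasn't used them: the idea is one checkout per ticket, each on its own branch, so agents working in separate terminals never step on each other's files. A minimal sketch (repo path and branch names are just placeholders):

    # one worktree per ticket, each on its own new branch
    git worktree add ../myapp-ticket-123 -b ticket-123
    git worktree add ../myapp-ticket-124 -b ticket-124

    # see what's checked out where
    git worktree list

    # clean up once a ticket's branch has been merged
    git worktree remove ../myapp-ticket-123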
On a desktop browser, click YouTube's "show transcript" and "hide timecodes", then copy-paste the whole transcript into Claude or ChatGPT and tell it to summarize at whatever resolution you want: a couple of sentences, 400 lines, whatever. You can also tell it to focus on certain subject material.
This is a complete game changer for staying on top of what's being covered in local government meetings. Our local bureaucrats are astoundingly competent at talking about absolutely nothing 95% of the time, but hidden in there are three minutes of "oh btw we're planning on paving over the local open space preserve to provide parking for the local business".
1.5x and 2x speed help a lot; slow down or repeat segments as needed, and don't be afraid to fast-forward past irrelevant-looking bits (just be eager to backtrack).
If it can produce something you can read in 20 minutes, it means there was a lot of... 'fluff' isn't quite the right word, but material that could be removed without losing meaning.
Adding yet another comment: you can also call agents from Linear directly, which will create pull requests in GitHub, but they seem pretty expensive for what they are. They don't seem to offer any real benefit over setting up the MCP server, opening a terminal window, and typing "create a pr for $TICKET NUMBER in Linear", other than shaving off a few seconds.
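For what it's worth, the MCP setup itself is tiny. In Claude Code it's roughly a one-liner like the below (the exact transport flag and Linear's MCP endpoint are assumptions from memory, so check the docs):

    # register Linear's hosted MCP server with Claude Code (flag/URL may differ)
    claude mcp add --transport sse linear https://mcp.linear.app/sse

    # then in a session, just type:
    #   create a pr for $TICKET NUMBER in Linear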
> That said the reality is in my experience the only models that actually work in any sort of reliable way are claude models.
Anecdotally, the tool updates in the latest Cursor (1.4) seem to have made tool usage in models like Gemini much more reliable. Previously it would struggle to make simple file edits, but now the edits work pretty much every time.
How much of the product were you able to build to say it was good/reliable? IME, 70 hours can get you to a PoC that "works", but does it do well once you start layering on features beyond the initial set, like say a first draft of all the APIs?
It depends on how you use it. The "vibe-coding" approach, where you give the agent naive prompts like "make new endpoint", often doesn't work.
When you break the problem of "create new endpoint" down into its sub-components (which you can do with the agent) and then work on one part at a time, with a new session for each part, you generally do have more success.
The more boilerplate-y the part is, the better it is. I have not really found one model that can yet reliably one-shot things in real-life projects, but they do get quite close.
For many tasks, the models are slower than I am, but IMO at this point they are helpful and should definitely be part of the toolset.
> The more boilerplate-y the part is, the better it is. I have not really found one model that can yet reliably one-shot things in real-life projects, but they do get quite close.
This definitely feels right from my experience. Small tasks that are present in the training data = good output with little effort.
Infra tasks (something that isn't in the training data as often) = sad times and lots of spelunking. To be fair, Gemini has done a good job for me eventually, even though it told me to nuke my database (which, sadly, was a good solution).
I've been trying it out with OpenAI Codex over the last day and a half and I have been incredibly impressed; it has been working quite well. I also had it look over some code that Claude produced for me, and it said it would be better to approach it another way and completely rewrote it in a way that actually was significantly better. The UX for Codex is quite a bit worse than Claude Code's, but the model has been good enough to justify the switch for now. I'm hopeful that the Cursor CLI will eventually have a good enough UX that I can switch to it and have access to all of the models, rather than needing to use disparate tools for everything. I would strongly suggest you check out GPT-5 for agentic stuff if you are interested.
I find that OpenAI's reasoning models write better code and are better at raw problem solving, but Claude Code is a much more useful product, even if the model itself is weaker.