Hey, yeah, this is a fun idea. I built a little toy llm-tdd loop as a Saturday morning side project a little while back: https://github.com/zephraph/llm-tdd.
This doesn't actually work out that well in practice though because the implementations the llm tended to generate were highly specific to pass the tests. There were several times it would cheat and just return hard coded strings that matched the expects of the tests. I'm sure better prompt engineering could help, but it was a fairly funny outcome.
Something I've found more valuable is generating the tests themselves. Obviously you don't wholesale rely on what's generated. Tests can have a certain activation energy just to figure out how to set up correctly (especially if you're in a new project). Having an LLM take a first pass at it and then ensuring it's well structured and testing important codepaths instead of implementation details makes it a lot faster to write tests.
This is awesome! I built a similar tool as an experiment while at Recurse: https://github.com/zephraph/webview. Didn’t really do any heavy lifting though, just reused some of Tauri’s crates. Does Bun run on the same process as the GUI binding? OSX steals the main thread when rendering a native window which made me lean towards separating the processes. Still wonder if there’s a better way.
I’m using bun for the main process. Bun runs a zig binary which can call objc/c methods. So the “main native application thread” is technically the zig process.
Then there’s all kinds of fancy rpc between bun and zig and between bun and browser contexts.
The calendar is available in the sidebar (on desktop) which I tend to use quite often. On mobile if you swipe down it should hide the keyboard and you can switch to the calendar while keeping your draft open on the email tab.
I generally agree that the workflow could be improved though.
I've been wondering the same! Haven't really had the time to dig into stable content addressing (and I assume the loose semantics of something like JavaScript would make that exceedingly hard).
Maybe? At the AST level, it can might be complicated I guess, but not really. At runtime JIT though… yeah sure. The various expressions of the same AST are bountiful.
But I would love to run an analysis on every npm module published, and find the same AST subexpressions, functions, etc. Do the same thing: remove the identifiers and hash the AST parts. Even go back and see how people named the same function in different ways!
I chatted with Eric on my podcast recently. It’s essentially just a special prompting syntax. The thing I found surprising is that it’s quite good at making chatbot like command interfaces. Hallucinations are still a problem but it still does a surprisingly good job of storing state between commands.
I watched the interview and think I get what's going on. In essence, he's been exploring prompt engineering since before it was cool, starting back in 2020. He and others have discovered some of the 'rough edges' of LLMs and have figured out a way to sand them down via prompting. Additionally, they've discovered ways to maximize their abilities, e.g., inference.
The demos are impressive. I'm excited to give it a try, as I have a lot of ideas for personal software tools where I'm the only customer, but not enough time or skill to build them myself.
Anytime Lua comes up I'm reminded of John Earnest's Lil scripting language[1]. It's inspired by Lua and Q, built for Decker[2] which is a re-imagined version of HyperCard. Generally though, I love lua for its embeddability and am extremely happy anytime I see someone chatting about integrating it. Modding and scripting in games was a tremendous motivation for me to dig more into programming and these approachable languages are a core aspect of that.
This is incredible. I’ve been thinking a lot about personal software systems and a hypothesis that I have is that simpler code sharing/versioning mechanisms is key to greater agency in programming environments. I’m already a huge fan of unison (we had Rúnar on devtoolsfm) so I’m eager to explore scrap further.
This doesn't actually work out that well in practice though because the implementations the llm tended to generate were highly specific to pass the tests. There were several times it would cheat and just return hard coded strings that matched the expects of the tests. I'm sure better prompt engineering could help, but it was a fairly funny outcome.
Something I've found more valuable is generating the tests themselves. Obviously you don't wholesale rely on what's generated. Tests can have a certain activation energy just to figure out how to set up correctly (especially if you're in a new project). Having an LLM take a first pass at it and then ensuring it's well structured and testing important codepaths instead of implementation details makes it a lot faster to write tests.