Every time I see an article like this, it's always missing the key question --- but is it any good, is it correct? They always show you the part that is impressive - "it walked the tricky tightrope of figuring out what might be an interesting topic and how to execute it with the data it had - one of the hardest things to teach."
Then it goes on, "After a couple of vague commands (“build it out more, make it better”) I got a 14 page paper." I hear... "I got 14 pages of words". But is it a good paper, one that another PhD would think is good? Is it even coherent?
When I see the code these systems generate within a complex system, I think okay, well that's kinda close, but this is wrong and this is a security problem, etc etc. But because I'm not a PhD in these subjects, am I supposed to think, "Well of course the 14 pages on a topic I'm not an expert in are good"?
It just doesn't add up... Things I understand, it looks good at first, but isn't shippable. Things I don't understand must be great?
You could trust the expert analysis of people in that field. You might hit personal ideologies or outliers, but asking several people tends to surface a degree of consensus.
You could try a variety of tasks that do something complex but produce results that are easy to test.
When I started trying chatbots for coding, one of my test prompts was
Create a JavaScript function edgeDetect(image) that takes an ImageData object and returns a new ImageData object with all direction Sobel edge detection.
That was about the level where some models would succeed and some would fail.
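For reference, here is a minimal sketch of what a passing answer to that prompt might look like (my own sketch, not any model's output): grayscale the input, convolve with the two Sobel kernels, and write out the gradient magnitude. It accepts any ImageData-like `{width, height, data}` object so it isn't browser-only.

```javascript
// Sobel edge detection sketch. In a browser you'd return
// new ImageData(out, w, h); kept generic here so it runs anywhere.
function edgeDetect(image) {
  const { width: w, height: h, data: src } = image;
  const out = new Uint8ClampedArray(src.length);
  // 1. Luminance grayscale of the RGBA input.
  const gray = new Float32Array(w * h);
  for (let i = 0; i < w * h; i++) {
    gray[i] = 0.299 * src[4 * i] + 0.587 * src[4 * i + 1] + 0.114 * src[4 * i + 2];
  }
  // 2. Both Sobel kernels (horizontal and vertical gradients).
  const gxK = [-1, 0, 1, -2, 0, 2, -1, 0, 1];
  const gyK = [-1, -2, -1, 0, 0, 0, 1, 2, 1];
  for (let y = 1; y < h - 1; y++) {
    for (let x = 1; x < w - 1; x++) {
      let gx = 0, gy = 0, k = 0;
      for (let dy = -1; dy <= 1; dy++) {
        for (let dx = -1; dx <= 1; dx++, k++) {
          const v = gray[(y + dy) * w + (x + dx)];
          gx += gxK[k] * v;
          gy += gyK[k] * v;
        }
      }
      // 3. Gradient magnitude, clamped to one byte, written as gray RGBA.
      const mag = Math.min(255, Math.hypot(gx, gy));
      const o = 4 * (y * w + x);
      out[o] = out[o + 1] = out[o + 2] = mag;
      out[o + 3] = 255;
    }
  }
  return { width: w, height: h, data: out }; // border pixels left black
}
```

The nice property of a prompt like this is that the checker is trivial: feed it a hard vertical edge and flat regions, and the output either lights up the edge or it doesn't.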
Recently I found
Can you create a webgl glow blur shader that takes a 2d canvas as a texture and renders it onscreen with webgl boosting the brightness so that #ffffff is extremely bright white and glowing,
This produced a nice demo with a slider for the parameters. After a few refinements (a hierarchical scaling version), I got it to produce the same interface as a module I had written myself, and it worked as a drop-in replacement.
These things are fairly easy to check because if it is performant and visually correct then it's about good enough to go.
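For the curious, the core of that effect is simple to state. A real implementation does this in a GLSL fragment shader over the canvas texture; below is the same math on the CPU so it can be checked directly (the function name and parameters are my own, not from the actual demo): threshold out the bright pixels, blur them, and add them back with a boost so near-white values bloom.

```javascript
// CPU sketch of a glow/bloom pass on a grayscale image in [0, 1],
// stored as a Float32Array of length w * h.
function glow(pixels, w, h, { threshold = 0.7, radius = 1, boost = 2.0 } = {}) {
  // 1. Keep only the bright pixels (these are what will "glow").
  const bright = pixels.map(v => (v > threshold ? v : 0));
  const out = new Float32Array(pixels.length);
  for (let y = 0; y < h; y++) {
    for (let x = 0; x < w; x++) {
      // 2. Box blur of the thresholded image (a shader would use
      //    a separable Gaussian, but the idea is the same).
      let sum = 0, n = 0;
      for (let dy = -radius; dy <= radius; dy++) {
        for (let dx = -radius; dx <= radius; dx++) {
          const yy = y + dy, xx = x + dx;
          if (yy >= 0 && yy < h && xx >= 0 && xx < w) {
            sum += bright[yy * w + xx];
            n++;
          }
        }
      }
      // 3. Add the blurred brightness back, boosted, and clamp.
      out[y * w + x] = Math.min(1, pixels[y * w + x] + boost * (sum / n));
    }
  }
  return out;
}
```

This is exactly why the task is easy to verify: a single white pixel should stay white and its dark neighbors should pick up a visible halo, which you can check numerically or just look at.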
It's also worth noting that as they attempt more and more ambitious tasks, they are quite probably testing around the limit of capability. There is both marketing and science in this area. When they say they can do X, it might not mean it can do it every time, but it has done it at least once.
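That distinction is easy to make concrete: "can do X" in an announcement is often closer to pass@k than pass@1. A trivial harness makes the difference measurable (hypothetical `attempt` callback standing in for "run the prompt and check the result"; my own sketch):

```javascript
// Run the same task k times against a pass/fail checker and report
// the fraction that pass. attempt(i) should return true on success.
function passRate(attempt, k) {
  let passes = 0;
  for (let i = 0; i < k; i++) {
    if (attempt(i)) passes++;
  }
  return passes / k; // pass@1 is just this with k = 1
}
```

A model with a 50% pass rate has "done it at least once" after a couple of tries, which is enough for a demo but not for every run.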
> You could trust the expert analysis of people in that field
That’s the problem - the experts all promise stuff that can’t be easily replicated. The promises the experts make don’t match the model’s behavior. The same request might succeed or fail, and might fail in such a way that subsequent prompts might recover or might not.
That's how working with junior team members or open source project contributors goes too. Perhaps that's the big disconnect. Reviewing and integrating LLM contributions slotted right into my existing workflow on my open source projects. Not all of them work. They often need fixing, stylistic adjustments, or tweaking to fit a larger architectural goal. That is the norm for all contributions in my experience. So the LLM is just a very fast, very responsive contributor to me. I don't expect it to get things right the first time.
But it seems lots of folks do.
Nevertheless, style tweaks and adjustments are a lot less work than banging out a thousand lines of code by hand. And whether an LLM or a person on the other side of the world did it, I'd still have to review it. So I'm happy to take increasingly common and increasingly sophisticated wins.
It's gotten more and more shippable, especially with the latest generation (Codex 5.1, Sonnet 4.5, now Opus 4.5). My metric is "wtfs per line", and it's been decreasing rapidly.
My current preference is Codex 5.1 (Sonnet 4.5 as a close second, though it got really dumb today for "some reason"). It's been good to the point where I shipped multiple projects with it without a problem (with eg https://pine.town being one I made without me writing any code).
It's very good but it feels kind of off-the-rails in comparison to Sonnet 4.5 - at least with Cursor it does strange things like putting its reasoning in comments that run about 15 lines, deleting 90% of a file for no real reason (especially when context is reaching capacity), and making the same error I just told it not to make.
I think they get to that a couple of paragraphs later:
> The idea was good, as were many elements of the execution, but there were also problems: some of its statistical methods needed more work, some of its approaches were not optimal, some of its theorizing went too far given the evidence, and so on. Again, we have moved past hallucinations and errors to more subtle, and often human-like, concerns.
I think the point is we’re getting there. These models are growing up real fast. Remember, 54% of US adults read at or below the equivalent of a sixth-grade level.
Education is not just a funding issue. Policy choices, like making it impossible for students to fail (which means they have no incentive to learn anything), can be more impactful.
As far as I understand it, the problem isn’t that teachers are shit. Giving more money would bring in better teachers, but I don’t know that they’d be able to overcome the other obstacles.
Because education alone in a vacuum won't fix the issues.
Even if the current model were working, just continuing to invest money in it while ignoring other issues like early childhood nutrition, a good and healthy home environment, environmental impacts, etc. will just continue to fail people.
Schooling alone isn't going to help the kid with a crappy home life, with poor parents who can't afford proper nutrition, and without the proper tools to develop the mindset needed to learn (because these tools were never taught by the parents, and/or they are too focused on simply surviving).
We, as a society, need to stop allowing people to be in a situation where they can't focus on education because they are too focused on working and surviving.
New Mexico (where I live) is dead last in education out of all 50 states. They are currently advertising for elementary school teachers at $65-85K per year. Summers off. Nice pension. In this low cost of living state that is a very good salary, particularly the upper bands.
In WA they always pass levies for education funding at the local and state level, yet the results are not there.
Mississippi is doing better on reading, the biggest difference being that they use a phonics approach to teaching reading, which is proven to work, whereas WA uses whole language theory (https://en.wikipedia.org/wiki/Whole_language), which is a terrible idea; I don't know how it got traction.
So the gist of it: yes, spend on education, but ensure that you are using the right tools; otherwise it's a waste of money.
First time hearing of whole language theory, and man, it sounds ridiculous. Sounds similar to the old theory that kids who aren't taught a language at all will simply speak perfect Hebrew.
In my own social/family circle, there’s no correlation between net worth and how someone leans politically. I’ve never understood why given the pretty obvious pros/cons (amount paid in taxes vs. benefits received)
The people most vociferously for conservative values are middle class, small business owners, or upper class, though the true upper class are libertine (notice who participated in the Epstein affair). The working class is filled with all kinds of very diverse people united by the fact they have to work for a living and often can't afford e.g. expensive weddings. Some of them are religious, a whole bunch aren't. It's easy to be disillusioned with formal institutions that seem to not care at all about you.
Unfortunately, a lot of these people have either concluded it is too difficult to vote, can't vote, or that their votes don't matter (I don't think they're wrong). Their unions were also destroyed. Some of them vote against their interests, but it's not clear that their interests are ever represented, so they vote for change instead.
It's not just investing in education, it's using tools proven to work.
WA spends a ton of money on education, and yet on reading Mississippi, the worst state by almost every metric, has beaten them.
The difference?
Mississippi went hard on supporting students and on phonics, which is proven to work. WA still uses the hippie theory of guessing words from pictures (https://en.wikipedia.org/wiki/Whole_language) for learning how to read.
Unfortunately, people are born with a certain intellectual capacity and can't be improved beyond that with any amount of training or education. We're largely hitting people's capacities already.
We can't educate someone with an 80 IQ to be you; we can't educate you (or me) into being Einstein. The same way we can't just train anyone to be an amazing basketball player.
You don't need an educated workforce if you have machines that can do it reliably. The more important question is: who will buy your crap if your population is too poor due to lack of well paying jobs? A look towards England or Germany has the answer.
For what it's worth I have been using Gemini 2.5/3 extensively for my masters thesis and it has been a tremendous help. It's done a lot of math for me that I couldn't have done on my own (without days of research), suggested many good approaches to problems that weren't on my mind and helped me explore ideas quickly. When I ask it to generate entire chapters they're never up to my standard but that's mostly an issue of style. It seems to me that LLMs are good when you don't know exactly what you want or you don't care too much about the details. Asking it to generate a presentation is an utter crap shoot, even if you merely ask for bullet points without formatting.
Truth is you still need a human to review all of it, fix it where needed, guide it when it hallucinates, and write correct instructions and prompts.
Without knowing how to use this “PROBABILISTIC” slot machine to get better results, you are only wasting the energy those GPUs need to run and answer questions.
The majority of people use LLMs incorrectly.
The majority of people selling LLMs as a panacea for everything are lying.
But we need hype or the bubble will burst, taking the whole market with it, so shush me.
Child Mind Institute | Full-Time | Remote within US or NYC Hybrid | https://childmind.org/about-us/careers/
Join us in transforming the lives of children struggling with mental health or learning disorders.
These positions are focused on the development of MindLogger (soon to be renamed "Curious"), a data collection platform for mental health research. MindLogger is an established platform and we are looking to build out an internal engineering team to support and enhance it. It's a great opportunity to use your engineering skills for a great cause!
Email me with questions jody dot brookover at childmind dot org
Child Mind Institute | New York City or Remote | Full Time | Multiple Roles
The Science and Engineering team at the Child Mind Institute is dedicated to transforming the lives of children with mental health and learning disorders through the power of scientific discovery.
Our product development group is working on a number of products including interventions and data gathering tools.
CareRev (YC S16) | Software Engineers | Fully Remote within the US | Full-Time
We are hiring backend, frontend, and android engineers. CareRev’s mission is to seamlessly connect healthcare facilities and professionals. Through our marketplace platform, we offer efficiency, flexibility, and opportunities for growth. Our stack is currently Ruby/Rails, React and Elm, Swift, Kotlin, and Postgres deployed on Heroku.
CareRev was recently named a Ycombinator "Top Company".
Backend API Engineer (Ruby on Rails) - Mid, Senior, Staff, Sr Staff+ (2+ years exp) Not all levels are posted rn, but just apply to the closest.
Android Engineer (Kotlin) - Senior, Staff (5+ years)
Frontend Engineer (ELM) - Senior+ (5+ years exp) We are committed to using ELM for a significant portion of our web frontend. Come join the fun.
Backend Engineer (Marketing) - Senior (3+ years exp)
Data Engineer (Kafka) - Senior (5+ years exp)
Product Managers - Principal (doesn't manage others), Director (manages others), and Technical PM (7+ years PM exp, 2yrs mgmt exp)
Find our careers page with all our postings here - https://grnh.se/072b12f63us
Feel free to email me if you have questions: jody[at]carerev[dot]com