Exactly. It's getting to the point where the output of the top AI labs is either not ground-breaking (except Google's Gemini Diffusion) or rushed out as an underwhelming announcement. Llama is an example.
Now over the next 6 months, you'll see all the AI labs moving to diffusion models and boasting about their speed.
People seem to forget that Google Deepmind can do more than just "LLMs".
I mean, I'm gonna say this with the hype settling down: visually it's pretty much on par with Kling 2 and Veo 2. It happens to output sound pretty OK, but having it be one unified output along with the visuals is the gamechanger. Beyond that, eh. I've kinda seen people try to take it to the limit and it's pretty much what you'd still expect from their last model.
I think Veo 2 and Kling are very strong models, but the fact that Veo 3 is end-to-end video/audio, including lipsync and all other sound, is definitely a step change from what came before, and I think you're underselling it.
I also expect Google to drive Veo forward quite significantly, given the absurd amount of video training data they sit on.
And compared to the cinemagraph level of video generation we were at just 1-2 years ago, boy, we've come a long way in a very short amount of time.
Lastly, absurd content like this https://youtu.be/jiOtSNFtbRs crosses the threshold for me on what I would actually watch more of.
Veo 3-level tech alone will decimate production houses, and if the trajectory holds, a lot of people working in media production are in for a rude awakening.
I'm not underselling it. I'm reminding people who get swept up by headlines to actually use the products and be objective judges of quality when it comes to these things. Because when you lose that objectivity, you start saying things like what you just said. Veo 3-level tech is basically Kling 2/Veo 2 fidelity with native sound generation, so was the last generation of these things already decimating production houses? Be for real. With the tech they had 6 months ago, all they needed to do was add sound manually, which they could have pretty much also generated. A new layer of abstraction isn't "decimating" anything. I'd really ease up on proclaiming things like that. These things are great for what they are, but let's be objective consumers and not fall for these talking points of "oh, industries are gonna change in x months".
You are underselling it, because you make it sound like all the model adds is some foley, when in fact it adds facial animations that are in line with the dialogue spoken. Go ahead and create a Kling render that I only need to add VO to; you can't, because Kling doesn't do that. You need an Omnihuman-level model (or Veo 3) for that, and it makes all the difference.
Happy to agree to disagree, but imo this absolutely is a step change.
Dude. Have you been paying attention to even the first Veo, or even the first few iterations of Kling? They've HAD facial expressions that follow the prompt pretty well. You're being fooled by your own senses now, because you think they didn't exist before speech and sound effects were integrated into the output. They've been there. You just couldn't hear what they were saying. You're only paying attention now because lipsync adds relevant context to the output, so the words they're speaking make sense. But people have been making similar outputs with a different workflow prior to this.
I don't need to create anything for you. Go visit r/aivideo and look at the Kling or even the Hailuo Minimax (admittedly worse in fidelity) attempts. Some of them have even been made to sing or do podcasts. Again: they've been around for at least 6-10 months; this just happens to generate it as one output. It's not nothing, but it really exposes a lot of the people who aren't familiar with this space when they keep overestimating things they've probably seen months ago. Somewhat accurate expressions? Passable lipsyncing? All there. Even with the weaker models like Runway and Hailuo.
Again: use the products. You'll know. Hobbyists have been on it for quite some time already. Also, I didn't say they were just adding foley; though I could argue about the quality of the sound they're adding, that's not my point. My point is that every time something like this comes out, there are always people ready to speak on "what industries such a thing can destroy right now" before using the thing. It's borderline deranged.
I just ran a few experiments through Kling 2.0 Pro, and none of the generations align with the prompt to the degree that you could easily lipsync them, at all. "Pretty well" doesn't cut it for that, and I've been following the aivideo sub since its inception. There are two models right now that can do convincing lipsync that doesn't look like trash or merely align with the prompt "pretty well": Omnihuman/Dreamina and Veo 3. That's it. At most you could run a second pass with something like LivePortrait, but even that is a rung below SOTA quality.
That said, I don't need to convince you; you go ahead and see what you want to see.
This latest generation will trigger a seismic shift, not "maybe in the future when the models improve", but right now.
Good job, buddy. You compared your first few prompt attempts on the RNG machine against the cherrypicked outputs of other people. But also, I genuinely think you're pulling this argument away from what it was. Tell me if you can see a fidelity improvement over the last few videogen products that came out. I can link pretty much two videos off-rip from that sub regarding this lipsync thing you seem to be homing in on.
The things you profess "will change the world" have been here. It takes maybe one extra step, but the quality's been comparable. Yet they didn't change the world 6 months ago. Or a month ago. Why's that? Is it perhaps that people have a habit of overestimating how much use they can get out of these things in their current state, like you are?