How OpenAI's Sora Model Works (factorialfunds.com)
79 points by mplappert 9 months ago | 23 comments



A great write-up.

The elephant in the room, of course, is "where did Sora's dataset come from?"


This is only a made-up issue for a few that are looking for something to criticize. Almost nobody cares, in the sense that appears to be meant here about "ownership" of the training data. And this unfortunately hampers research and understanding of models because companies are reluctant to talk about training lest the trolls start jumping on. We're all worse off because of this.


Is it really so hard to imagine that someone might not like the idea of their work being used to train a machine to imitate their work, with no compensation to them? And that the machine is then used to benefit the shareholders of large corporations?

The position that this is a made-up issue when there are multiple large pending lawsuits about exactly this thing is pretty bizarre.


What a short-sighted view.

Having open access to the training data is how you prevent poisoning/biasing of the dataset. People complaining about bad data in the dataset improve the quality of the dataset. That's in addition to the benefit of creators being labeled in the dataset.

Hiding the data from public view seems to only help nefarious actors.


Pretty sure we're saying the same thing


> And this unfortunately hampers research and understanding of models because companies are reluctant to talk about training lest the trolls start jumping on

Respecting artists and being open about training data should go hand in hand. That companies feel the need to hide the training data from public scrutiny should immediately be suspect.

It seems like you are saying no one cares about copyright; I can tell you that is not the case. I disagree with (most current forms of) copyright, but I do respect artists and their need to feed themselves. Proper attribution, along with labeling and scrutiny of the dataset, is imperative.

>This is only a made-up issue for a few that are looking for something to criticize. Almost nobody cares, in the sense that appears to be meant here about "ownership" of the training data

So it's not just 'trolls' who want the data to be open and labeled, is my point. If the companies are hurting artists (or their economic output), that should be examined and fixed (the harm stopped and the companies' attention redirected).

'Trolling (bothering)' a company into being 'good (not against human interests)' isn't a bad thing.


What's the status of companies building AI models that produce an actual 3D backend behind these generative videos? Anyone working on something similar? I imagine that'd be far more productive. For example, ML-assisted lookdev is pretty low-hanging fruit. Not sure why we don't already have models from Autodesk, Epic, or even Adobe (who have the resources, i.e. A100s/H100s) where you upload an image/video and the model spits out workable 3D scaffolds.


This is a good question, and the answer is that, from a technical standpoint, it is surprisingly easier to solve the problem in the reverse direction.

As in, making workable 3D models is harder than making video.

And it is easier to make a 3D model by generating a video of the object instead.

Why is that? I don't know. But that's the current state of the industry. 3D model generation is simply harder.


I am thinking reinforcement learning on top of Blender would be straightforward, with unlimited synthetic-data potential. I've come across people incorporating SD into their rendering workflows, so the tools are all there.
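Roughly the kind of scripted data generation I mean (a minimal sketch meant to run inside Blender; the object name "Model", the rotation ranges, and the output path are made-up placeholders):

    import math
    import random
    import bpy  # Blender's Python API, only available inside Blender

    obj = bpy.data.objects["Model"]      # placeholder object name
    scene = bpy.context.scene

    # Render the same object under random rotations to mass-produce
    # synthetic views with known ground-truth geometry.
    for i in range(10):
        obj.rotation_euler = (
            math.radians(random.uniform(0, 360)),
            math.radians(random.uniform(0, 360)),
            math.radians(random.uniform(0, 360)),
        )
        scene.render.filepath = f"/tmp/synthetic_{i:04d}.png"
        bpy.ops.render.render(write_still=True)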


Probably also helps that there's way more image/video data to train on than 3D data.


If I’m not mistaken, Stability just released something like that a few days ago.


Yes, but it works by generating a video first and doing photogrammetry on it to produce a 3D model.
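That general recipe is easy to try yourself (a rough sketch assuming ffmpeg and COLMAP are installed; the filenames are placeholders, and this is generic video-to-photogrammetry, not necessarily Stability's actual pipeline):

    import pathlib
    import subprocess

    frames = pathlib.Path("frames")
    workspace = pathlib.Path("recon")
    frames.mkdir(exist_ok=True)
    workspace.mkdir(exist_ok=True)

    # 1. Slice the generated video into still frames.
    subprocess.run(
        ["ffmpeg", "-i", "generated.mp4", str(frames / "%04d.png")],
        check=True,
    )

    # 2. Run structure-from-motion / multi-view stereo over the frames
    #    to recover camera poses and a 3D reconstruction.
    subprocess.run(
        ["colmap", "automatic_reconstructor",
         "--workspace_path", str(workspace),
         "--image_path", str(frames)],
        check=True,
    )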


Looks like I completely overlooked threestudio, released last year. Thank you for pointing it out.


It's something I've been interested in too. I do a bunch of CNC woodworking, so I would love the ability to at least generate close-enough 3D models I can then refine.


NeRFs and splatting


I don't understand, from the article, how Sora works when handling a rotation of an object on another object (the leaves on the leaf-covered elephant, for example). The explanation only covers the diffusion model, not how a correct geometric deformation is derived from that model at each step.


I don't get how transformers can replace convolutional networks. My understanding is that patches get fed in, and the transformer does the same thing that a convolution layer does. But transformers deal with sequential data, and I don't see any of that here?


Transformers are not limited to sequential data. They can process any form of data you can tokenize, as long as they have enough of it to learn the patterns and structures it contains.
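For intuition, here's a minimal sketch of how an image becomes a token sequence ViT-style (the shapes and patch size are illustrative, not Sora's actual configuration):

    import numpy as np

    H, W, C, P = 224, 224, 3, 16             # image height/width/channels, patch size
    image = np.random.rand(H, W, C)           # stand-in for a real frame

    # Split into non-overlapping P x P patches and flatten each into a vector.
    patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    tokens = patches.reshape(-1, P * P * C)   # (num_patches, patch_dim) = (196, 768)

    # A learned linear projection then maps each flattened patch to the model's
    # embedding size, and positional embeddings record where each patch came
    # from -- after that it's an ordinary token sequence for the transformer.
    print(tokens.shape)                       # (196, 768)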


A transformer model was probably chosen because of its scaling properties and because it's easy to mask the attention layers so that you can fit multiple videos and images of different lengths and dimensions into the same batch. That matters because every example needs to be the same size to fit into the same mini-batch, and for performance reasons.
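Roughly what that masking looks like (a toy sketch with made-up sequence lengths, not Sora's actual batching code):

    import numpy as np

    lengths = [196, 392, 98]                  # tokens per clip/image (made up)
    max_len = max(lengths)

    # Pad every sequence to max_len and mark which positions are real tokens.
    mask = np.zeros((len(lengths), max_len), dtype=bool)
    for i, n in enumerate(lengths):
        mask[i, :n] = True                    # True = real token, False = padding

    # Inside attention, padded positions get -inf before the softmax so they
    # contribute nothing, letting clips of different sizes share one mini-batch.
    attn_bias = np.where(mask[:, None, :], 0.0, -np.inf)   # (batch, 1, max_len)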


From the fine article:

> Now for the second point, both DiT and Sora replace the commonly-used U-Net architecture with a vanilla Transformer architecture. This matters because the authors of the DiT paper observe that using Transformers leads to predictable scaling: As you apply more training compute (either by training the model for longer or making the model larger, or both), you obtain better performance. The Sora technical report notes the same but for videos and includes a useful illustration.

> This scaling behavior, which can be quantified by so-called scaling laws, is an important property and it has been studied before in the context of Large Language Models (LLMs) and for autoregressive models on other modalities. The ability to apply scale to obtain better models was one of the key drivers behind the rapid progress on LLMs. Since the same property exists for image and video generation, we should expect the same scaling recipe to work here, too.
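For reference, these scaling laws are typically written as a power law in training compute, something along the lines of:

    L(C) \approx a \cdot C^{-\alpha}

where L is the validation loss, C the training compute, and a and alpha are fitted constants; more compute predictably buys lower loss.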


I think it just treats the patches as if they were laid out sequentially in memory or on disk, but each patch also has coordinates. And they have overlapping patches at an offset to catch features that would span a patch boundary and be missed at that level.
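To illustrate what I mean by overlapping patches (a generic sliding-window sketch; the sizes are made up and this isn't from the Sora report):

    import numpy as np

    H, W, P, stride = 64, 64, 16, 8           # stride < P gives overlapping patches
    img = np.random.rand(H, W)

    # Sliding window of P x P patches, then subsample every `stride` pixels.
    windows = np.lib.stride_tricks.sliding_window_view(img, (P, P))
    patches = windows[::stride, ::stride]
    print(patches.shape)                      # (7, 7, 16, 16)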


> Total Nvidia H100 needed to support the creator community on TikTok & YouTube: 10.7M / 120 ≈ 89k

If an H100 is $40k worst case, that's a one-time cost of roughly $3.6B! I could definitely see the FAANGs throwing money at this.
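Spelling out the arithmetic with the thread's own numbers:

    \frac{10.7\,\mathrm{M}}{120} \approx 89{,}000 \ \text{H100s}, \qquad 89{,}000 \times \$40{,}000 \approx \$3.56\,\mathrm{B}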


Once again we see that scaling laws are the way to better output.

This is why Sama said compute is the currency of the future.



