This is only a made-up issue for a few who are looking for something to criticize. Almost nobody cares, in the sense that appears to be meant here, about "ownership" of the training data. And this unfortunately hampers research and understanding of models, because companies are reluctant to talk about training lest the trolls start piling on. We're all worse off because of this.
Is it really so hard to imagine that someone might not like the idea of their work being used to train a machine to imitate their work, with no compensation to them? And that the machine is then used to benefit the shareholders of large corporations instead?
The position that this is a made-up issue when there are multiple large pending lawsuits about exactly this thing is pretty bizarre.
Having open access to the training data is how you prevent poisoning or biasing of the dataset. People complaining about bad data improve its quality. That's in addition to the benefit of creators being labeled in the dataset.
Hiding the data from public view seems to help only nefarious actors.
> And this unfortunately hampers research and understanding of models, because companies are reluctant to talk about training lest the trolls start piling on
Respecting artists and being open about training data should go hand in hand. That companies feel the need to hide the training data from public scrutiny should immediately be suspect.
It seems like you are saying no one cares about copyright; I can tell you that is not the case. I disagree with (most current forms of) copyright, but I do respect artists and their need to feed themselves. Proper attribution, along with labeling and scrutiny of the dataset, is imperative.
> This is only a made-up issue for a few who are looking for something to criticize. Almost nobody cares, in the sense that appears to be meant here, about "ownership" of the training data
My point is that it's not just 'trolls' who want the data to be open and labeled. If the companies are hurting artists (or their economic output), that should be examined and fixed (i.e., the harm stopped and the attention of said companies redirected).
'Trolling' (bothering) a company into being 'good' (not acting against human interests) isn't a bad thing.
What's the status on companies building AI models that produce an actual 3D backend behind these generative videos? Is anyone working on something similar? I imagine that would be far more productive. For example, lookdev mlops is pretty low-hanging fruit. I'm not sure why we don't already have models from Autodesk, Epic, or even Adobe (with resources, i.e. A100s/H100s) where you upload an image/video and the model spits out workable 3D scaffolds.
I am thinking reinforcement learning on top of Blender would be straightforward, with unlimited synthetic-data potential. I've come across people incorporating SD into their rendering workflows, so the tools are all there. A rough sketch of the synthetic-data part is below.
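To make the synthetic-data point concrete, here's a minimal sketch using Blender's Python API (bpy), meant to be run headless (blender --background --python gen.py). The scene setup, output path, and label format are all made-up assumptions, not anyone's actual pipeline; the point is that pose labels come for free when you generate the scenes yourself.

    # Minimal, illustrative synthetic-data generator for Blender (bpy).
    # All names/paths here are assumptions for the sketch.
    import json
    import os
    import random
    import bpy

    OUT_DIR = "/tmp/blender_synth"  # hypothetical output location
    os.makedirs(OUT_DIR, exist_ok=True)

    def make_sample(i):
        # Start from an empty scene each time.
        bpy.ops.wm.read_factory_settings(use_empty=True)

        # Camera looking toward the origin.
        bpy.ops.object.camera_add(location=(0.0, -6.0, 2.0), rotation=(1.2, 0.0, 0.0))
        bpy.context.scene.camera = bpy.context.object

        # A light so the render isn't black.
        bpy.ops.object.light_add(type='SUN', location=(0.0, 0.0, 5.0))

        # A cube with a random pose; the pose is the ground-truth label.
        loc = tuple(random.uniform(-1.5, 1.5) for _ in range(3))
        rot = tuple(random.uniform(0.0, 3.14159) for _ in range(3))
        bpy.ops.mesh.primitive_cube_add(location=loc, rotation=rot)

        # Render the image to disk.
        path = f"{OUT_DIR}/sample_{i:04d}.png"
        bpy.context.scene.render.filepath = path
        bpy.ops.render.render(write_still=True)

        return {"image": path, "location": loc, "rotation": rot}

    samples = [make_sample(i) for i in range(10)]
    with open(f"{OUT_DIR}/labels.json", "w") as f:
        json.dump(samples, f, indent=2)

From there you could train or reward a model against the recorded poses, and randomize materials, lighting, and camera paths for as much data as you have compute for.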
It's something I've been interested in too. I do a bunch of CNC woodworking, so I'd love the ability to at least generate close-enough 3D models I can then refine.
I don't understand, from the article, how Sora works when handling a rotation of an object on another object (the leaves on the leaf-covered elephant, for example). The explanation goes only as far as the diffusion model, but not how, from that model, a correct geometry deformation is derived at each step.
I don't get how transformers can replace convolutional networks. My understanding is that patches get fed in, and the transformer does the same thing that a convolution layer does. But transformers deal with sequential data, and I don't see any of that here?
Transformers are not limited to sequential data. They can process any form of data you can tokenize, as long as they have enough of it to learn the patterns and structures it contains.
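As a rough illustration (a generic ViT-style patchify, not OpenAI's code; the patch sizes are arbitrary), turning a video into a flat sequence of "spacetime patch" tokens is basically a couple of reshapes:

    # Illustrative only: video (T, H, W, C) -> sequence of patch tokens.
    import numpy as np

    def patchify(video, pt=2, ph=16, pw=16):
        """video: (T, H, W, C) -> tokens: (num_patches, pt*ph*pw*C)."""
        T, H, W, C = video.shape
        assert T % pt == 0 and H % ph == 0 and W % pw == 0
        v = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
        # Bring the patch-grid axes together, then flatten each patch into a vector.
        v = v.transpose(0, 2, 4, 1, 3, 5, 6)
        return v.reshape(-1, pt * ph * pw * C)

    video = np.random.rand(8, 64, 64, 3)   # 8 frames of 64x64 RGB
    tokens = patchify(video)                # (4*4*4, 2*16*16*3) = (64, 1536)
    print(tokens.shape)                     # each row is one token the transformer attends over

Once the video is a sequence of tokens like this (plus positional information), the transformer treats it no differently than it would treat a sequence of text tokens.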
A transformer model was probably chosen because of its scaling properties and because it's easy to mask the attention layer so that multiple videos and images of different lengths and dimensions can share the same batch. This matters because every example in a mini-batch has to be padded or masked to the same shape, both to form the batch at all and for performance reasons.
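To make the variable-length point concrete, here's a toy sketch of the general technique (my assumption of the idea, not Sora's implementation): pad every token sequence to the longest one in the batch, and mask attention so the padding is ignored.

    # Illustrative only: padding + attention masking for mixed-length sequences.
    import numpy as np

    def pad_and_mask(seqs):
        """seqs: list of (len_i, d) token arrays -> (batch, max_len, d) and a boolean mask."""
        d = seqs[0].shape[1]
        max_len = max(s.shape[0] for s in seqs)
        batch = np.zeros((len(seqs), max_len, d))
        mask = np.zeros((len(seqs), max_len), dtype=bool)   # True = real token
        for i, s in enumerate(seqs):
            batch[i, : s.shape[0]] = s
            mask[i, : s.shape[0]] = True
        return batch, mask

    def masked_attention(q, k, mask):
        """Attention weights for one example; padded keys get -inf, so softmax gives them zero."""
        scores = q @ k.T / np.sqrt(q.shape[-1])
        scores[:, ~mask] = -np.inf
        return np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

    # A short clip (20 tokens) and a longer one (64 tokens) in the same batch.
    seqs = [np.random.rand(20, 8), np.random.rand(64, 8)]
    batch, mask = pad_and_mask(seqs)
    attn = masked_attention(batch[0], batch[0], mask[0])
    print(batch.shape, attn.shape)   # (2, 64, 8) (64, 64)

The padded positions cost some wasted compute, but the mask guarantees they never influence the real tokens.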
> Now for the second point, both DiT and Sora replace the commonly-used U-Net architecture with a vanilla Transformer architecture. This matters because the authors of the DiT paper observe that using Transformers leads to predictable scaling: As you apply more training compute (either by training the model for longer or making the model larger, or both), you obtain better performance. The Sora technical report notes the same but for videos and includes a useful illustration.
> This scaling behavior, which can be quantified by so-called scaling laws, is an important property and it has been studied before in the context of Large Language Models (LLMs) and for autoregressive models on other modalities. The ability to apply scale to obtain better models was one of the key drivers behind the rapid progress on LLMs. Since the same property exists for image and video generation, we should expect the same scaling recipe to work here, too.
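For a concrete sense of what "scaling law" means here: these are usually reported as a power law in training compute, roughly loss ≈ a * C^(-b). The constants below are made up, just to show the shape of the relationship (not numbers from the DiT or Sora papers):

    # Illustrative only: a hypothetical power-law scaling curve.
    import numpy as np

    a, b = 10.0, 0.1                                 # made-up fitted constants
    compute = np.array([1e18, 1e19, 1e20, 1e21])     # training compute in FLOPs
    loss = a * compute ** (-b)
    for c, l in zip(compute, loss):
        print(f"compute={c:.0e}  predicted loss={l:.3f}")
    # Each 10x increase in compute lowers the predicted loss by a constant factor (10**-b ~ 0.79).

The practical upshot is that if the power law holds, you can forecast how much better the model gets before you spend the compute.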
I think it just treats the patches as if they were laid out sequentially in memory or on disk, but each patch also carries coordinates. And they have overlapping patches at an offset to catch features that would span a patch boundary and be missed at that level. Roughly what I mean, as a toy sketch, is below.
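This is a guess at the general idea, not Sora's actual approach: with a stride smaller than the patch size, a feature that straddles one patch boundary is fully contained in a neighboring, offset patch, and each patch keeps its (row, col) coordinate so position isn't lost when the patches are flattened into a sequence.

    # Illustrative only: overlapping patch extraction with an offset (stride < patch size).
    import numpy as np

    def overlapping_patches(image, size=16, stride=8):
        """image: (H, W, C) -> list of ((row, col), patch) with patches overlapping by size - stride."""
        H, W, _ = image.shape
        out = []
        for y in range(0, H - size + 1, stride):
            for x in range(0, W - size + 1, stride):
                out.append(((y, x), image[y : y + size, x : x + size]))
        return out

    image = np.random.rand(64, 64, 3)
    patches = overlapping_patches(image)
    print(len(patches))                        # 7 * 7 = 49 overlapping patches vs 16 non-overlapping
    print(patches[0][0], patches[0][1].shape)  # (0, 0) (16, 16, 3)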
The elephant in the room, of course, is "where did Sora's dataset come from?"