This is what a truly revolutionary idea looks like. There are so many details in the paper. Also, we know that transformers can scale. Pretty sure this idea will be used by a lot of companies to train the general 3D asset creation pipeline. This is just too great.
"We first learn a vocabulary of latent quantized embeddings, using graph convolutions, which inform these embeddings of the local mesh geometry and topology. These embeddings are sequenced and decoded into triangles by a decoder, ensuring that they can effectively reconstruct the mesh."
This idea is simply beautiful and so obvious in hindsight.
"To define the tokens to generate, we consider a practical approach to represent a mesh M for autoregressive generation: a sequence of triangles."
It's cool, but it's also par for the course in 3D reconstruction today. I wouldn't describe this paper as particularly innovative or exceptional.
What do I think is really compelling in this field (given that it's my profession)?
This has me star-struck lately -- 3D meshing from a single image, a very large 3D reconstruction model trained on millions of all kinds of 3D models... https://yiconghong.me/LRM/
Another thing to note here is this looks to be around seven total days of training on at most 4 A100s. Not all cutting-edge work requires a data-center-sized cluster.
NNs are typically continuous/differentiable so you can do gradient-based learning on them. We often want to use some of the structure the NN has learned to represent data efficiently. E.g., we might take a pre-trained GPT-type model, and put a passage of text through it, and instead of getting the next-token prediction probability (which GPT was trained on), we just get a snapshot of some of the activations at some intermediate layer of the network. The idea is that these activations will encode semantically useful information about the input text. Then we might e.g. store a bunch of these activations and use them to do semantic search/lookup to find similar passages of text, or whatever.
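To make that concrete, here's a minimal sketch using Hugging Face's transformers and GPT-2. The layer index and the mean-pooling are arbitrary choices for illustration, not anything canonical:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

def embed(text, layer=6):
    # Tokenize and run a forward pass, keeping all intermediate hidden states.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states[layer] has shape (1, seq_len, hidden_dim);
    # mean-pool over tokens to get one vector for the whole passage.
    return out.hidden_states[layer].mean(dim=1).squeeze(0)

a = embed("The cat sat on the mat.")
b = embed("A kitten rested on the rug.")
similarity = torch.cosine_similarity(a, b, dim=0)  # crude semantic similarity score
```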
Quantized embeddings are just that, but you introduce some discrete structure into the NN, such that the representations there are not continuous. A typical way to do this these days is to learn a codebook VQ-VAE style. Basically, we take some intermediate continuous representation learned in the normal way, and replace it in the forward pass with the nearest "quantized" code from our codebook. It biases the learning since we can't differentiate through it, and we just pretend like we didn't take the quantization step, but it seems to work well. There's a lot more that can be said about why one might want to do this, the value of discrete vs continuous representations, efficiency, modularity, etc...
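In code, the quantization step plus the "pretend we didn't quantize" trick (the straight-through estimator) looks roughly like this. This is a generic VQ-VAE-style sketch with made-up sizes, not the exact scheme from the MeshGPT paper:

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)  # the learned discrete vocabulary

    def forward(self, z):                              # z: (batch, dim) continuous encoder output
        # Replace each vector with its nearest codebook entry.
        dists = torch.cdist(z, self.codebook.weight)   # (batch, num_codes)
        idx = dists.argmin(dim=-1)                     # discrete token ids
        z_q = self.codebook(idx)                       # quantized vectors

        # Straight-through estimator: use z_q in the forward pass, but let
        # gradients flow to z as if the quantization step never happened.
        z_q = z + (z_q - z).detach()

        # Auxiliary losses pull the codebook entries and encoder outputs together.
        codebook_loss = ((z.detach() - self.codebook(idx)) ** 2).mean()
        commit_loss = ((z - z_q.detach()) ** 2).mean()
        return z_q, idx, codebook_loss + 0.25 * commit_loss
```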
If you’re willing, I’d love your insight on the “why one might want to do this”.
Conceptually I understand embedding quantization, and I have some hint of why it works for things like wav2vec - human phonemes are (somewhat) finite, so forcing the representation to be finite makes sense - but I feel like there's a level of detail that I'm missing regarding what's really going on and when quantisation helps/harms that I haven't been able to glean from papers.
Quantization also works as regularization; it stops the neural network from being able to use arbitrarily complex internal rules.
But really it's only useful if you absolutely need a discrete embedding space for some sort of downstream usage. VQ-VAEs can be difficult to get to converge, and they have problems stemming from the gradient approximation, like codebook collapse.
Maybe it helps to point out that the first version of Dall-E (of 'baby daikon radish in a tutu walking a dog' fame) used the same trick, but they quantized the image patches.
I mean, I don't see a strong reason to turn away from attention either, but I also don't think anyone's thrown a billion-parameter MLP or conv model at a problem. We've put a lot of work into attention, transformers, and scaling these. Thousands of papers each year! We definitely don't see that for other architectures. The ResNet Strikes Back paper is great, for one reason, because it should remind us all not to get lost in the hype and that our advancements are coupled. We've learned a lot of training techniques since the original ResNet days, and pushing those onto ResNets also makes them a lot better and really closes the gap. At least in vision (where I research). It is easy to get railroaded in research where we have publish-or-perish and hype-driven reviewing.
No... a graph convolution is just a convolution (over a graph, like all convolutions).
The difference from a "normal" convolution is that you can consider arbitrary connectivity of the graph (rather than the usual connectivity induced by a regular Euclidian grid), but the underlying idea is the same: to calculate the result of the operation at any single place (i.e., node), you need to perform a linear operation over that place (i.e., node) and its neighbourhood (i.e., connected nodes), the same way that (e.g.) in a convolutional neural network, you calculate the value of a pixel by considering its value and that of its neighbours, when performing a convolution.
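A minimal sketch of that idea - a basic GCN-style layer with mean aggregation over neighbours. The paper uses more sophisticated graph convolutions than this, so treat it purely as illustration:

```python
import torch
import torch.nn as nn

class SimpleGraphConv(nn.Module):
    """One graph-convolution layer: aggregate each node's neighbourhood, then apply a linear map."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x:   (num_nodes, in_dim)       node features (e.g. per-vertex or per-face features)
        # adj: (num_nodes, num_nodes)    0/1 adjacency matrix of the mesh graph
        a_hat = adj + torch.eye(adj.size(0))   # include each node itself (self-loops)
        deg = a_hat.sum(dim=-1, keepdim=True)  # neighbourhood sizes
        agg = (a_hat / deg) @ x                # average over each node's neighbourhood
        return torch.relu(self.linear(agg))    # linear map + nonlinearity, like a conv layer
```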
As a machine learning engineer who dabbles with Blender and hobby gamedev, this is pretty impressive, but not quite to the point of being useful in any practical manner (as far as the limited furniture examples are concerned).
A competent modeler can make these types of meshes in under 5 minutes, and you still need to seed the generation with polys.
I imagine the next step will be to have the seed generation controlled by an LLM, and to start adding image models to the autoregressive parts of the architecture.
> A competent modeler can make these types of meshes in under 5 minutes.
I don't think this general complaint about AI workflows is that useful. Most people are not a competent <insert job here>. Most people don't know a competent <insert job here> or can't afford to hire one. Even something that takes longer than a professional would, at worse quality, is for many things better than _nothing_, which is the realistic alternative for most people who would use something like this.
> I don't think this general complaint about AI workflows is that useful
Maybe not to you, but it's useful if you're in these fields professionally, though. The difference between a neat hobbyist toolkit and a professional toolkit has gigantic financial implications, even if the difference is minimal to "most people."
First, we're talking about the state of the technology and what it can produce, not the fundamental worthiness of the approach. Right now, it's not up to the task. In the earliest phases of those technologies, they also weren't good enough for professional use cases.
Secondly, the number of hobbyists only matters if you're talking about hobbyists that develop the technology-- not hobbyists that use the technology. Until those tools are good enough, you could have every hobbyist on the planet collectively attempting to make a Disney-quality character model with tools that aren't capable of doing so and it wouldn't get much closer to the requisite result than a single hobbyist doing the same.
Is the target market really "most people," though? I would say not. The general goal of all of this economic investment is to improve the productivity of labor--that means first and foremost that things need to be useful and practical for those trained to make determinations such as "useful" and "practical."
Millions of people generating millions of images (some of them even useful!) using Dall-E and Stable Diffusion would say otherwise. A skilled digital artist could create most of these images in an hour or two, I’d guess… but ‘most people’ certainly could not, and it turns out that these people really want to.
>Most people don't know a competent <insert job here> or can't afford to hire one
May be relevant in the long run, but it'll probably be 5+ years before this is commercially available. And it won't be cheap either, so it'll be out of the range of said people who can't hire a competent <insert job here>.
That's why a lot of this stuff is pitched to companies with competent people instead of offered as a general product to download.
Is there a reason to expect it'd be significantly more expensive than current-gen LLM? Reading the "Implementation Details" section, this was done with GPT2-medium, and assuming running it is about as intensive as the original GPT2, it can be run (slowly) on a regular computer, without a graphics card. Seems reasonable to assume future versions will be around GPT-3/4's price.
Perhaps not, but it raises the question of whether GPT is affordable for a dev to begin with. I don't know how they would monetize this sort of work, so it's hard to say. But making game models probably requires a lot more processing power than generating text or static images.
I have no doubt that 3d modeling will become commodified in the same way that art has with the dawn of AI art generation over the past year.
I honestly think we'll get there within 18 months.
My skepticism is whether the technique described here will be the basis of what people will be using in ~2 years to replace their low level static 3d asset generation.
There are several techniques out there, leveraging different sources of data right now. This looks like a step in the right direction, but who knows.
People still do wood block printing - even though printing is commodified to the nines.
At the moment, making 3d models is a lot of skilled, monotonous work, especially for stuff like scene furniture. I guess I'd be pretty happy if some of that work could be automated away, and I'm pretty confident that there's no point automating away the remainder, for the same reason you don't want ChatGPT writing your screenplay.
Availability =/= viability. I'm sure as we speak some large studios are already leveraging this work or are close to leveraging it.
But this stuff trickles down to the public very slowly. Because indies aren't a good audience to sell what is likely an expensive tech that is focused on mid-large scale production.
>Most people are not a competent <insert job here>. Most people don't know a competent <insert job here> or can't afford to hire one.
emphasis mine. Affordability doesn't have much to do with capabilities, but it is a strong factor to consider for an indie dev. Devs in fields (games, VFX) that don't traditionally pay well to begin with.
Yes, but you were also saying that it would be that way in 5 years, and if you look at what these tools could practically do at the start of 2023 compared to now, not even a full year of progress later, the relevance of your argument to the reality of the scene is not clear.
Yes, and I do still think it won't be commercially viable for indie devs in 5 years.
>the relevance of your argument to the reality of the scene is not clear.
I feel it's clear if you're following the conversation chain. Is there something you'd like me to clarify?
Practical applications for a medium-large studio is very different from the practical applications of a solo/small dev team. That's all I'm really getting at. There is all kinds of cool tech from 2010 that still isn't viable for indies but is probably used at every AAA studio, so there's precedent I'm basing this on.
> A competent modeler can make these types of meshes in under 5 minutes
Sweet. Can you point me to these modelers who work on-demand and bill for their time in 5 minute increments? I’d love to be able to just pay $1-2 per model and get custom <whatever> dropped into my game when I need it.
> A competent modeler can make these types of meshes in under 5 minutes
It's not about competent modellers, any more than SD is for expert artists.
It's about giving tools to the non-experts. And also about freeing up those competent modellers to work on more interesting things than the 10,000 chair variants needed for future AAA games. They can work on making unique and interesting characters instead, or novel futuristic models that aren't in the training set and require real imagination combined with their expertise.
Like most of the generative AI space, it'll eliminate something like the bottom half of modelers, and turn them into lower paid prompt wizards. The top half will become combo modelers / prompt wizards, using both skillsets as needed.
Prompt wizard hands work off to the finisher/detailer.
It'll boost productivity and lead to higher quality finished content. And you'll be able to spot when a production - whether video game or movie - lacks a finisher (relying just on generation by prompt). The objects won't have that higher tier level of realism or originality.
>freeing up those competent modellers to work on more interesting things than the 10,000 chair variants needed for future AAA games. They can work on making unique and interesting characters instead, or novel futuristic models that aren't in the training set and require real imagination combined with their expertise.
Or flipping burgers at McDonald's!
There are only so many games that the market can support, and in those, only so many unique characters[0] that are required. We're pretty much at saturation already.
[0]Not to mention that if AI can generate chairs, from what we have seen from Dall-E & SDXL, it can generate characters too. Less great than human-generated ones? Sure, but it's clear that big boys like Bethesda and Activision do not care.
The mesh topology here would see these rejected as assets in basically any professional context. A competent modeler could make much higher quality models, more suited to texturing and deformation, in under five minutes. A speed modeler could make the same in under a minute. And a procedural system in something like Blender geonodes can already spit out an endless variety of such models. But the pace of progress is staggering.
I see it as a black triangle[0] more than anything else. Sounds like a really good first step that will scale to stuff that would take even a good modeler days to produce. That's where the real value will start to be seen.
Just like a competent developer can use LLMs to bootstrap workflows, a competent modeler will soon have tools like this as part of their normal workflow. A casual user would be able to do things that they otherwise wouldn't have been able to. But an expert in the ML model's knowledge domain can really make it shine.
I really believe that the more experienced you are in a particular use case, the more use you can get out of an ML model.
Unfortunately, it's those very same people that seem to be the most resistant to adopting this without really giving it the practice required to get somewhere useful with it. I suppose part of the problem is we expect it to be a magic wand. But it's really just the new PhotoShop, or Blender, or Microsoft Word, or PowerPoint ...
Most people open those apps, click mindlessly for a bit, and promptly leave, never to return. And so it is with "AI".
I think eventually it may settle into what you describe. I don't think it's guaranteed, and I fear that there will be a pretty huge amount of damage done before that by the hype freaks whose real interest isn't in making artists more productive, but in rendering them (and other members of the actually-can-do-a-thing creative class) unemployed.
The pipeline problem also exists: if you need to still have the skillsets you build up through learning the craft, you still need to have avenues to learn the craft--and the people who already have will get old eventually.
There's a golden path towards a better future for everybody out of this, but a lot of swamps to drive into instead without careful forethought.
I can imagine one use case in typical architectural design, where the architect creates a design and always faces the same stumbling block when trying to make it look as lively as possible: sprinkling a lot of convincing assets everywhere.
As they are generated, variations are much easier to come by than buying a couple of asset packs.
This is a very underrated comment... As with any tech demo, if they don't show it, it can't do it. It is very, very easy to imagine a generalization of these things to other purposes, which, if it could do it, would be a different presentation.
Perhaps one way to look at this could be auto-scaffolding. The typical modelling and CAD tools might include this feature to get you up and running faster.
Another massive benefit is composability. If the model can generate a cup and a table, it also knows how to generate a cup on a table.
Think of all the complex gears and machine parts this could generate in the blink of an eye, while being relevant to the project - rotated and positioned exactly where you want them. Very similar to how GitHub Copilot works.
I don't see that LLMs have come much further in 3D animation than in programming in this regard: they can spit out bits and pieces that look okay in isolation, but a human needs to solve the puzzle. And often solving the puzzle means rewriting/redoing most of the pieces.
We're safe for now but we should learn how to leverage the new tech.
Probably, but isn't that how most of the technical fields go? Software in particular moves blazingly fast, and you need to adapt to the market quickly to stay marketable.
Some are safe for several years (3-5), that's it. During that time it's going to wreck the bottom tiers of employees and progressively move up the ladder.
GPT and the equivalent will be extraordinary at programming five years out. It will end up being a trivially easy task for AI in hindsight (15-20 years out), not a difficult task.
Have you seen how far things like MidJourney, Dalle, Stable Diffusion have come in just a year or two? It's moving extremely fast. They've gone from generating stick figures to realistic photographs in two years.
The reason AI generative tools are faster to become useful in artistic areas is that in the arts you can take “errors” as style.
Doesn’t apply too much to mesh generation but was certainly the case in image gen. Mistakes that wouldn’t fly for a human artist (hands) were just accepted as part of AIgen.
So these areas are much less strict about precision than coding, making these tools much more capable of replacing artists in some tasks than Copilot is for coders at the moment.
So you're probably familiar with the role of a Bidding Producer; imagine the difficulty they are facing: on one side they have filmmakers saying they just read that such-and-such is now created by AI, when that is news to the bidding producer, while their VFX/animation studio clients scramble because everything they do is new again.
I don't know, 3D CGI has already been moving at the breakneck speed for the last three decades without any AI. Today's tools are qualitatively different (sculpting, simulation, auto-rigging etc etc etc).
3D CGI has gotten faster, but I haven’t seen any qualitative jump for quite some time.
IMO the last time a major tech advance was visible was Davy Jones on the Pirates films. That was a fully photorealistic animated character that was plausible as a hero character in a major feature. That was a breakthrough. After that a lot of refinement and speeding up.
This is different. I have some positivity about it, but it’s getting hard to keep track of everything that’s going on tbh. Every week it’s a new application and every few months it’s some quantum leap.
Like others said, Midjourney and DallE are essentially photorealistic.
It seems to me that the next step is generative AI creating better and better assets.
And then of course you have video generation which is happening as well…
Both DE3 and MJ are essentially toys for single random pictures, unusable in a professional setting. DALL-E in particular has really bad issues with quality, and while it follows the prompt well it also rewrites it so it's barely controllable. Midjourney is RLHF'd to death.
What you want for asset creation is not photorealism, but style and concept transfer, multimodal controllability (text alone is terrible at expressing artistic intent), and tooling. And tooling isn't something that is developed quickly (although there were several rapid breakthroughs in the past, for example ZBrush).
Most of the fancy demos you hear about sound good on paper, but don't really go anywhere. Academia is throwing shit at the wall to see what sticks, this is its purpose, especially when practice is running ahead of theory. It's similar to building airplanes before figuring out aerodynamics (which happened long ago): watching a heavier-than-air thing fly is amazing, until you realize it's not very practical in the current form, or might even kill its brave inventor who tried to fly it.
If you look at the field closely, most of the progress in visual generative tooling happens in the open source community; people are trying to figure out what works in real use and what doesn't. Little is being done in big houses, at least publicly and for now, as they're more interested in a DC-3 than a Caproni Ca.60. The change is really incremental and gradual, similarly to the current mature state of 3D. Paradigms are different but they are both highly technical and depend on academic progress. Once it matures, it's going to become another skill-demanding field.
With respect, I disagree with almost everything you said.
The idea that somehow “AI isn’t art directable” is one I keep hearing, but I remain unconvinced this is somehow an unsolvable problem.
The idea that AIgen is unusable at the moment for professional work doesn’t hold up to my experience since I now regularly use Photoshop’s gen feature.
Photoshop combined with Firefly is exactly the rare kind of good tooling I'm talking about. In/outpainting was found to be working for creatives in practice, and got added to Photoshop.
>The idea that somehow “AI isn’t art directable” is one I keep hearing, but I remain unconvinced this is somehow an unsolvable problem.
That's not my point. AI can be perfectly directable and usable, just not in the specific form DE3/MJ do it. Text prompts alone don't have enough semantic capacity to guide it for useful purposes, and the tools they have (img2img, basic in/outpainting) aren't enough for production.
In contrast, Stable Diffusion has a myriad of non-textual tools around it right now - style/concept/object transfer of all sorts, live painting, skeleton-based character posing, neural rendering, conceptual sliders that can be created at will, lighting control, video rotoscoping, etc. And plugins for existing digital painting and 3D software leveraging all this witchcraft.
All this is extremely experimental and janky right now. It will be figured out in the upcoming years, though. (if only community's brains weren't deep fried by porn...) This is exactly the sort of tooling the industry needs to get shit done.
Ah ok yes I agree. How many years is really the million dollar question. I’ve begun to act as if it’s around 5 years and sometimes I think I’m being too conservative.
You can remain unconvinced but it's somewhat true.
I can keep writing prompts for DE3 or similar until it gives me something like what I want, but the problem is, there are often subtle but important mistakes in many images that are generated.
I think it's really good at portraits of people, but for anything requiring complex lighting, representation of real world situations or events, I don't think it's ready yet, unless we're ready to just write prompts, click buttons and just accept what we receive in return.
Midjourney already has tools that allow you to select parts of the image to regenerate with new prompts, Photoshop-style. The tools are being built, even if a bit slowly, to make these things useful.
I could totally see creating Matte paintings through Midjourney for indie filmmaking soon, and for tiny budget films using a video generative tool to make let’s say zombies in the distance seems within reach now or very soon. Slowly for some kind of VFX I think AI will start being able to replace the human element.
I'm not a professional in VFX, but I work in television and do a lot of VFX/3D work on the side. The quality isn't amazing, but it looks like this could be the start of a Midjourney-tier VFX/3D LLM, which would be awesome. For me, this would help bridge the gap between having to use/find premade assets and building what I want.
For context, building from scratch in a 3D pipeline requires you to wear a lot of different hats (modeling, materials, lighting, framing, animating, etc.). It costs a lot of time not only to learn these hats but also to use them together. The individual complexity of those skill sets makes it difficult to experiment and play around, which is how people learn with software.
The shortcut is using premade assets or addons. For instance, being able to use the Source game assets in Source Filmmaker, combined with SFM using a familiar game engine, makes it easy to build an intuition with the workflow. This makes Source Filmmaker accessible, and it's why there's so much content out there made with it. So if you have gaps in your skillset or need to save time, you'll buy/use premade assets. This comes at a cost of control, but that's always been the tradeoff between building what you want and building with what you have.
Just like GPT and DALL-E built a bridge between building what you want and building with what you have, a high fidelity GPT for the 3D pipeline would make that world so much more accessible and would bring the kind of attention NLE video editing got in the post-YouTube world. If I could describe in text and/or generate an image of a scene I want and have a GPT create the objects, model them, generate textures, and place them in the scene, I could suddenly just open Blender, describe a scene, and just experiment with shooting in it, as if I was playing in a sandbox FPS game.
I'm not sure if MeshGPT is the ChatGPT of the 3D pipeline, but I do think this kind of content generation is the conduit for the DALL-E of video that so many people are terrified and/or excited for.
I think producer roles are a little bit less ultra-competitive / scarce, as they are actual jobs where you have to use Excel and do planning and budgeting.
Being a producer means being on the phone all the time, negotiating, haggling, finding solutions where they don’t seem to exist.
Be it in TV, advertising or somewhere in the media space, the common rule is that producers are mostly actually terrible at their jobs, that’s my experience in London. So if she’s really good and really dedicated and learns the job of everyone on set, I’d say she has a shot.
The real secret to being good in filmmaking is learning everyone else’s job. Toyota Production System says if you want to run a production line you have to know how it works.
If she wants to do VFX production she could start doing her own test scenes, learning basics in nuke and Blender, even understanding the role of Houdini and how that works.
If she does that - any company will be lucky to have her.
It looks like the input is itself a 3D mesh? So the model is doing "shape completion" (e.g. they show generating a chair from just some legs)... or possibly generating "variations" when the input shape is more complete?
But I guess it's a starting point... maybe you could use another model that does worse quality text-to-mesh as the input and get something more crisp and coherent from this one.
That's what it seems like. Although this is not an LLM.
> Inspired by recent advances in powerful large language models, we adopt a sequence-based approach to autoregressively generate triangle meshes as sequences of triangles.
This is sort of a distinction without a difference. It's an autoregressive sequence model; the distinction is how you're encoding data into (and out of) a sequence of tokens.
LLMs are autoregressive sequence models where the "role" of the graph convolutional encoder here is filled by a BPE tokenizer (also a learned model, just a much simpler one than the model used here). That this works implies that you can probably port this idea to other domains by designing clever codecs which map their feature space into discrete token sequences, similarly.
(Everything is feature engineering if you squint hard enough.)
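To illustrate the "clever codec" point: a deliberately crude codec for meshes could just quantize vertex coordinates into a fixed number of bins and flatten each triangle into nine integer tokens, which any off-the-shelf autoregressive sequence model could then be trained on. The function names and bin count below are made up for illustration; the paper instead learns its tokens with the graph-convolution encoder and codebook described above.

```python
import numpy as np

NUM_BINS = 128  # coordinate resolution; arbitrary choice for this sketch

def triangles_to_tokens(triangles):
    """triangles: (num_tris, 3, 3) float array of xyz vertices, normalized to [0, 1].
    Returns a flat sequence of integer tokens, 9 per triangle."""
    binned = np.clip((triangles * NUM_BINS).astype(int), 0, NUM_BINS - 1)
    return binned.reshape(-1)  # feed this token sequence to any autoregressive model

def tokens_to_triangles(tokens):
    """Inverse codec: map tokens back to approximate xyz coordinates (bin centers)."""
    coords = (tokens.astype(float) + 0.5) / NUM_BINS
    return coords.reshape(-1, 3, 3)
```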
The only difference is the label, really. The underlying transformer architecture and the approach of using a codebook is identical to a large language model. The same approach was also used originally for image generation in DALL-E 1.
Really, the hardest thing with art is the details, and that's usually what separates good from bad. So if you can sketch what you want roughly, without skill, and have the details generated, that's extremely useful. And image-to-image with the existing diffusion models is useful and popular.
I have no idea about your background when I am commenting here. But these are my two cents.
NO. Details are mostly like icing on top of the cake. Sure, good details can make good art, but that is not always the case. True and beautiful art requires form + shape. What you are describing is merely something visually appealing. That's why diffusion models feel so bland: they are good with details but do not have precise forms and shapes. Nowadays they are getting better; however, it still remains an issue.
Form + shape > details is something they teach in Art 101.
It sure feels like every remaining hard problem (i.e., the ones where we haven't made much progress since the 90s) is in line to be solved by transformers in some fashion. What a time to be alive.
The next breakthrough will be the UX to create 3d scenes in front of a model like this, in VR. This would basically let you _generate_ a permanent, arbitrary 3D environment, for any environment for which we have training data.
Diffusion models could be used to generate textures.
edit edit: Maybe credit Lecun or something? Mark going all in on the metaverse was definitely not because he somehow predicted deep learning would take off. Even the people who trained the earliest models weren't sure how well it would work.
Even if this is “only” mesh autocomplete, it is still massively useful for 3D artists. There’s a disconnect right now between how characters are sculpted and how characters are animated. You’d typically need a time consuming step to retopologize your model. Transformer based retopology that takes a rough mesh and gives you clean topology would be a big time saver.
Another application: take the output of your Gaussian splatting or diffusion model and run it through MeshGPT. Instant usable assets with clean topology from text.
Lol, for 3D artists? This will be used 99% by people who have never created a mesh by hand in their lives, to replace their need to hire a 3D artist: programmers who don't want to (or can't) pay a designer, architects who never learned anything other than CAD, Fiverr "jobs", et al.
I don't think people here realize how we are inching toward automating the automation itself, and the programmers who will be able to make a living out of this will be a tiny fraction of those who can make a living out of it today.
What you have to understand is that these methods are very sensitive to what is in distribution and out of distribution. If you just plug in user data, it will likely not work.
There’s no shortage of 3D mesh data to train on. Who’s to say scaling up the parameter count won’t allow for increasingly intricate topology, the same way scaling language models improved reading comprehension?
“Make your own game” games will never replace regular games. They target totally different interests. People who play games (vast majority) just want to play an experience created by someone else. People who like “make your own game” games are creative types who just use that as a jumping off point to becoming a game designer.
It’s no different than saying “these home kitchen appliances are really gonna kill off the restaurant industry.”
Hmm I think it will destroy the market in a couple ways.
AI creating video games would drastically increase the volume of games available in the market. This surge in supply could make it harder for indie games to stand out, especially if AI-generated games are of high quality or novelty. It could also lead to even more indie saturation (the average indie makes less than $1,000).
As the market expectations shift, I think most indie development dies unless you are already rich or basically have patronage from rich clients.
The likes of itch.io, Roblox, and the App Store already exist, each with more games than anyone can reasonably curate.
The games market has been in the same place as the rest of the arts for some time now: if you want to be noticed, you have to mount a bit of a production around it, add layers of design effort, and find a marketing funnel for that particular audience. The days of just making a Pong clone passed in the 1970's.
What technology has done to the arts, historically, is add either more precision or more repeatability. The relationship to production and arts as a business maps to what kinds of capital-and-labor-intensive endeavors leverage the tech.
Photographs didn't end painting, they ended painting as the ideal of precisely representational art. In the classical era, just before the tech was good enough to switch, painting was a process of carefully staging a scene with actors and sketching it using a camera obscura to trace details, then transferring the result to your canvas. Afterwards, the exact scene could be generated precisely in a photo, and so a more candid, informal method became possible both through using photographs directly and using them as reference. As well, exact copies of photographs could be manufactured. What changed was that you had a repeatable way of getting a precise result, and so getting the precision or the product itself became uninteresting. But what happened next was that movies and comics were invented, and they brought us back to a place of needing production: staged scenes, large quantities of film or illustration, etc.
With generative AI, you are getting a clip art tool - a highly repeatable way of getting a generic result. If you want the design to be specific, you still have to stage it with a photograph, model it as a scene, or draw it yourself using illustration techniques.
And so the next step in the marketplace is simply in finding the approach to a production that will be differentiating with AI - the equivalent of movies to photography. This collapses not the indie space - because they never could afford productions to begin with - but existing modes of mobile gaming, because they were leveraging the old production framework. Nobody has need of microtransaction cosmetics if they can generate the look they want.
Maybe if you were talking about the generative AI from 1 year ago.
The incredibly fast evolution makes most of your points irrelevant.
For example, AI art doesn't need prompt engineers as jobs anymore, because a lot of the prompt engineering is already being absorbed by other AIs.
The chaining of various AIs and the feedback loops between them are accelerating far beyond what people think.
Just yesterday major breakthroughs were released on stable diffusion video.
It's the pace and categorical type of these breakthroughs that represent a paradigm shift, never seen before in the creative fields.
I have yet to see any evidence that would convince me that generative AIs can produce compelling gameplay. Furthermore, even the image generation stuff has a lot of issues, such as making all the people in an image into weird amalgamations of each other.
I couldn't disagree more. RPGMaker didn't kill RPGs, Unity/Godot/Unreal didn't kill games, Minecraft didn't kill games, and Renpy didn't kill VNs.
Far more people prefer playing games than making them.
We'll probably see a new boom of indie games instead. Don't forget, a large part of what makes the gaming experience unique is the narrative elements, gameplay, and aesthetics - none of which are easily replaceable.
This empowers indie studios to hit a faster pace on one of the most painful areas of indie game dev: asset generation (or at least for me as a solo dev hobbyist).
Sorry I guess I wasn't clear. None of those things made games automatically.
The future is buying a game-making game and saying, "I want a Zelda clone but funnier."
The ai game framework handles the full game creation pipeline.
The issue with that is that it probably produces generic-looking games, since the AI can't read your mind. See ChatGPT or SD for example, if you just say "write me a story about Zelda but funnier" it will do it, but it's the blandest possible story. To truly make it good requires a lot of human intention and direction (i.e. soul), typically drawn from our own human experiences and emotions.
People who use "make your own game" games aren't good at making games. They might enjoy a simplified process to feel the accomplishment of seeing quick results, but I find it unlikely they'll be competing with indie developers.
Careful with that generalization. Game-changing FPS mods like Counterstrike were basically "make your own game" projects, built with the highest-level toolkits imaginable (editors for existing commercial games.)
Yeah, and if there was going to be such a tool, people who invest more time in it would be better than those casually using it. In other words, professionals.
Not really, "I" can make 2D pictures that look like masterpieces using stable diffusion and didn't invest more than 6 hours playing with it, the learning curve is not that high, and people already have a hard time telling apart AI art than those from real 2D masters who have a lifetime learning it, the same thing will happen with making videogames and 3D art.(Yeah nothing of this looks exiting to me, actually it looks completely bleak)
I didn't mean comparing it to human-created art, I meant comparing it to other AI generated or assisted artworks. Currently the hard parts of that would probably be consistency, fidelity (e.g. multiple characters) and control, which definitely stands out when compared against the casual raw gens.
The platform layer of the "make your own game" game is always too heavy and too limited to compete with a dedicated engine in the long run. Also the monetization strategy is bad for professionals.
There are more amazing, innovative and interesting indie games being created now than ever before. There's just also way more indie games that aren't those things.
Dang, this is getting so good! Still got a ways to go, with the weird edges, but at this point, that feels like 'iteration details' rather than an algorithmic or otherwise complex problem.
It's really going to speed up my pipeline to not have to pipe all of my meshes into a procgen library with a million little mesh modifiers hooked up to drivers. Instead, I can just pop all of my meshes into a folder, train the network on them, and then start asking it for other stuff in that style, knowing that I won't have to re-topo or otherwise screw with the stuff it makes, unless I'm looking for more creative influence.
Of course, until it's all the way to that point, I'm still better served by the procgen; but I'm very excited by how quickly this is coming together! Hopefully by next year's Unreal showcase, they'll be talking about their new "Asset Generator" feature.
Oh man, sorry, I wish! I've been using cobbled together bits of python plugins that handle Blender's geometry nodes, and the geometry scripts tools in Unreal. I haven't even ported over to their new proc-gen tools, which I suspect can be pretty useful.
Games and pretty much any other experience being generated by AI is obvious to anyone paying attention at this point. But how would it work? Are current AI-generated images and videos using rasterisation? Will they use rasterisation, path tracing, or some other traditional rendering technique, or will it be an entirely different thing?
I'm not a 3D artist, but why are we still, for lack of a better word, "stuck" with having / wanting to use simple meshes? I appreciate the simplicity, but isn't this an unnecessary limitation of mesh generation? It feels like an approach that imitates the constraints of having both limited hardware and artist resources. Shouldn't AI models help us break these boundaries?
My understanding is that it's quite hard to make convex objects with radiance fields, right? For example the furniture in OP would be quite problematic.
We can create radiance fields with photogrammetry, but IMO we need much better algorithms for transforming these into high quality triangle meshes that are usable in lower triangle budget media like games.
"Lower triangle budget media" is what I wonder if its still a valid problem. Modern game engines coupled with modern hardware can already render insane number of triangles. It feels like the problem is rather in engines not handling LOD correctly (see city skylines 2), although stuff like UE5 nanite seems to have taken the right path here.
I suppose, though, there is a case for AI models doing, for example, what Nanite does entirely algorithmically, and research like this paper may come in handy there.
I was referring to being stuck with having to create simple / low tri polygonal meshes as opposed to using complex poly meshes such as photogrammetry would provide. The paper specifically addresses clean low poly meshes as opposed to what they call complex iso surfaces created by photogrammetry and other methods
Lots of polys is bad for performance. For a flat object like a table you want that to be low poly. Parallax can also help to give a 3D look without increasing poly count.
Fantastic, but still useless from a professional perspective. I.e., a mesh that represents a cube as 12 triangles is a better representation of the form than previous efforts, but barely more usable.
Whilst it might not be the solution I'm waiting for, I can now see it as possible. If an AI model can handle triangles, it might handle edge loops and NURBS curves.
This is fantastic! You can sketch the broad strokes of the shape you want, and this will generate some "best" matches around that.
What I really appreciate about this is that they took the concept (transformers) and applied it in a quite different-from-usual domain. Thinking outside of the (triangulated) box!
So you train it with vector sequences that represent furniture and it predicts the next token (triangles). How is this different from if ChatGPT were trained with the same sequences and could output all the 3D locations and triangle sizes/lengths in sequence, and have a 3D program piece it together?
Great work. But I don't get from the demo how it knows what object to autocomplete the mesh with - if you give it four posts as an input, how does it know to autocomplete as a table and not a dog?
So maybe the next step is something like CLIP, but for meshes? CLuMP?
It would be nice to see, and be part of, a field doing work that humans could not do, instead of creating work that just replaces what humans already know how to do.
Their comparison against PolyGen looks like it's a big improvement. What are the limitations that this has in common with PolyGen that make it still not useful?
I don’t think it’s as widely applicable as they try to make it seem. I have worked specifically with PolyGen, and the main problem is “out of distribution” data. Basically anything you want to do will likely be outside the training distribution. This surfaces as sequencing. How do you determine which triangle or vertex to place first? Why would a user do it that way? What if I want to draw a table with the legs last? Cannot be done. The model is autoregressive.
First, you use the word "transformers" to mean "autoregressive models", they are not synonymous, second, this model beats Polygen on every metric, it's not even close.
"We first learn a vocabulary of latent quantized embeddings, using graph convolutions, which inform these embeddings of the local mesh geometry and topology. These embeddings are sequenced and decoded into triangles by a decoder, ensuring that they can effectively reconstruct the mesh."
This idea is simply beautiful and so obvious in hindsight.
"To define the tokens to generate, we consider a practical approach to represent a mesh M for autoregressive generation: a sequence of triangles."
More from paper. Just so cool!