Swallowing the elephant into Blender (aras-p.info)
302 points by ibobev on July 22, 2022 | 47 comments



What a great example of how a series of reasonable decisions can add up to something unreasonable when you scale up the inputs. This line is my favorite: “there was an open task for a few years to address it (T73412), and so I did it.”

This situation reminds me a little bit of trying to work with very large images in ImageMagick, like 60k x 60k resolution. A simple resize was taking 6 hours on my Mac 10 years ago, due to IM trying to allocate the whole image at once and then swapping non-stop. And then I discovered that the GraphicsMagick fork did streamed resizing, and did the task in a couple of minutes. It’s a small and relatively easy change, but someone had to prioritize handling large inputs first. This is one reason software is hard for me; every time I need to process large inputs I wish I had written a streaming mechanism, but every time I start a project I decide to do the easy thing first and wait until I critically need streaming. Maybe it’s the right thing to do to avoid over-engineering, but this comes up often enough that I’m usually in a mild state of frustration about something being under-developed.


> every time I start a project I decide to do the easy thing first and wait until I critically need streaming.

Absolutely reasonable decision making. Do what you need at the time you write the code, or at most be slightly ahead of the curve. Don't forecast features you do not need. Humans are unreasonable about the success of their own projects.

>Maybe it’s the right thing to do to avoid over-engineering

It is the only thing to do. You do not know the future use cases.

> [...] but this comes up often enough that I’m usually in a mild state of frustration about something being under-developed.

If your software is subpar with regards to your needs, congrats, it is successful software!

Building the perfect software from the beginning is unreasonable because you might just never release or even use it.

Build imperfect things, and be reasonably frustrated by them so that it gives you the energy to improve them!


>Building the perfect software from the beginning is unreasonable because you might just never release or even use it.

I built a software library for an external device from its specification. We eventually got the device to test and it worked when plugged in, but not everything was right. I was early in my career and disappointed. My boss said "If it works perfectly the first time, you spent too much time on it."


That was a good boss


Oh yes. I worked for years in medical imaging and we suffered greatly when dealing with scanned pathology slides, which easily have a long side of 30k to 200k pixels, and we had to deal with big volumes of them, go figure.

But then we found libvips (it's not only a library but a CLI tool too) and it was like night and day with big images. Like 10x faster and 1/10 the memory use. I really loved it!

https://www.libvips.org/
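
In Python terms (a minimal sketch, assuming the pyvips binding is installed; file names are placeholders), the streaming behaviour that makes much of the difference looks like this:

  import pyvips

  # 'sequential' access lets libvips stream the image top to bottom instead of
  # decoding the whole file into memory before resizing.
  image = pyvips.Image.new_from_file("original.jpg", access="sequential")

  # Halve the dimensions with a Lanczos kernel and save at JPEG quality 75.
  image.resize(0.5, kernel="lanczos3").write_to_file("half.jpg", Q=75)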


I can second libvips. I had to work on a project that needed to resize GIFs, and vips was probably 10 times faster than ImageMagick or Pillow. I think the only thing that compared to libvips was ffmpeg.


In a quick test I did, resizing a 30k x 26k pixel painting scan [1] to half size, vips takes half the time and almost a tenth of the memory.

Of course, if you're using that much memory (with ImageMagick), once you begin to parallelize and do batch conversion, you fill your memory super quickly, you begin to swap, and then the conversion times go through the roof.

ImageMagick

   time convert original.jpg -resize 50% im.jpg
  convert original.jpg -resize 50% im.jpg   15.99s  user 4.78s system 131% cpu 15.770 total
  avg shared (code):         0 KB
  avg unshared (data/stack): 0 KB
  total (sum):               0 KB
  max memory:                1909 MB
  page faults from disk:     485098
  other page faults:         7059
vips

   time vips resize original.jpg vips.jpg 0.5
  vips resize original.jpg vips.jpg 0.5   13.38s  user 0.15s system 195% cpu 6.910 total
  avg shared (code):         0 KB
  avg unshared (data/stack): 0 KB
  total (sum):               0 KB
  max memory:                255 MB
  page faults from disk:     1
  other page faults:         60319

1: https://commons.wikimedia.org/wiki/File:Hans_Holbein_der_J%C...


Seems like ImageMagick was core-dumping and my measurement wasn't correct. So I raised ImageMagick's memory limits, made sure both were using the same JPEG quality and downsampling filter, and:

ImageMagick

   time convert original.jpg -filter Lanczos -set option:filter:lobes 3 -resize 50% -quality 75% im.jpg
  91.43s  user 3.65s system 424% cpu 22.396 total
  avg shared (code):         0 KB
  avg unshared (data/stack): 0 KB
  total (sum):               0 KB
  max memory:                10553 MB
  page faults from disk:     0
  other page faults:         2993653
vips

   time vips resize original.jpg "vips.jpg[Q=75]" 0.5 --kernel lanczos3
  13.00s  user 0.15s system 194% cpu 6.771 total
  avg shared (code):         0 KB
  avg unshared (data/stack): 0 KB
  total (sum):               0 KB
  max memory:                255 MB
  page faults from disk:     11
  other page faults:         60951
So ImageMagick uses about 40x the memory, over 3x the wall-clock time, and roughly 7x the CPU time.

Versions and computer:

  Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz (8 cores)
  16GB RAM
  Debian 11
  Kernel 5.18.8-xanmod1-x64v2 (software not rebuilt for x64v2 instructions)
  ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
  vips-8.12.2-Mon Feb 28 21:28:00 UTC 2022
Options for zsh time:

  TIMEFMT='%U  user %S system %P cpu %*E total'$'\n'\
  'avg shared (code):         %X KB'$'\n'\
  'avg unshared (data/stack): %D KB'$'\n'\
  'total (sum):               %K KB'$'\n'\
  'max memory:                %M 'MB''$'\n'\
  'page faults from disk:     %F'$'\n'\
  'other page faults:         %R'


Very happy that Aras is fighting the good fight, doing some great OSS work after his departure from Unity. The speed up numbers are quite impressive!

I remember when I was first getting into shader programming 10+ years ago, I asked a question on the Unity forums and Aras was one of the first to respond. Even then he was a titanic figure behind Unity’s graphics stack. I never forgot how kind he was to help a kid like me out, I’m sure he was quite busy!


He writes really clearly as well. In depth, but not so much that you can't follow it.


> The speedup factor is order-dependent.

This is why it is very difficult to justify optimization work to management. If there are 20 things to optimize that take 10 seconds each, the change isn't really noticeable until you're getting past half-way. And once your processing already takes a few minutes, what's the harm in adding another 10 seconds?
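
A toy calculation makes the order-dependence concrete (numbers invented to match the 20 x 10-second example above):

  total = 20 * 10  # 20 hotspots of 10 s each: 200 s baseline

  for fixed in (5, 10, 15, 19):
      remaining = total - fixed * 10
      print(f"{fixed:2d} fixed -> {remaining:3d} s left, {total / remaining:.1f}x overall")

  # 5 fixed -> 1.3x, 10 -> 2.0x, 15 -> 4.0x, 19 -> 20.0x:
  # the impressive speedups only appear once most of the work is already done.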


I once shipped a CRUD app a few weeks earlier than I wanted to because of pressure from managers. I warned them that I hadn't done any optimizations and that it would quickly become unusable with real-world use. So naturally, as soon as it was deployed I was reassigned to another project. Three weeks later the users were complaining it was slow; at six weeks it was taking 20 minutes to load. They finally agreed to let me optimize it: in an hour I had it down to 5 minutes, and in a week I had it loading in 10 seconds. I could get it faster, but I was reassigned again, and at this point it is plenty fast for the users in question.

If I had done the easy change to get it to 5 minutes before deploying, it would likely still be that way today, annoying my users, but not enough to justify changing.


As a former colleague of mine was fond of saying, broken gets fixed but crappy lasts forever.


There is a similar effect, but in reverse, when adding interdependent features. Early ones don't have a big impact, but once you get to a certain point the inefficiencies add up and the program becomes bogged down.

Cache invalidations and memory swapping as you approach the limits are other examples.


Well, it depends on what you are doing. If the thing you are optimizing routinely blocks everything else, spending a lot of resources on optimization is a no-brainer. If the thing you are optimizing takes a long time, but that does not matter so much, it is not worth optimizing.

When you are creating VFX scenes for a still image you don't really care a lot about the render times, as long as it is done the next day or so. If you are doing it for animation, you care a lot about render times, because anything times a thousand frames adds up to quite a duration.

The work of a programmer is always a factor in a multiplication. If the small change you make is used by a thousand people a day, each using it 100 times, every second you shave off saves a collective ~28 hours per day. And that is only time. You could also think about electricity, about frustration, etc.
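
A quick back-of-the-envelope check of that figure (using the comment's hypothetical numbers):

  users, uses_per_day, seconds_saved = 1000, 100, 1
  hours_saved_per_day = users * uses_per_day * seconds_saved / 3600
  print(round(hours_saved_per_day, 1))  # 27.8, i.e. roughly 28 collective hours per day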

A programmer's work is always multiplied into the world. It is a huge responsibility - and I think we all should act more like it.


This is of course very good advice and I follow it myself quite often, but there is something to be said about quantitative changes leading to qualitative outcomes, especially with performance changes of a factor of 150 and the like.

Some workflows that were considered “too long to be worth it” suddenly become easy and routine. It’s kind of like “disrupting the market” in a sense.

And more times than I can count, small performance changes that were considered irrelevant have led to very big cultural changes in a company.

For example, we had some e2e tests running in our CI/CD pipeline that were taking ~15 minutes. People were stressed (or not very productive), as the usual excuse was “I’m waiting for the tests”: not enough time to do something else productive, but too long to wait patiently. I spent like a day optimizing them and brought them down to 4 minutes, and suddenly people began to accomplish more; they started writing a lot more tests.

So a day of investment led to happier devs (less turnover), more tests (stability improvement) and faster feature turnaround. And I had to fight tooth and nail to spare the time to actually do this.


> A programmer's work is always multiplied into the world. It is a huge responsibility - and I think we all should act more like it.

Unfortunately this same fact also means that various kinds of important software are hard to get developed.

When someone writes code to blink a mildly annoying advert on YouTube, it affects a billion people and can make millions or hundreds of millions in revenue. But code needed to make a local car wash's robots more efficient? Unlikely to get written unless someone thinks they can sell it to a chain: a large share of programmers get snatched up by places like Google that deploy to billions of people. In the past, when there were far fewer qualified programmers, it might actually have been easier to get the car-wash software developed, because there simply was nowhere that could deploy software to a billion people instead (much less highly profitably).

The enormous leverage of software is an undeniable force for good. But it also changes the incentive structures of the world in ways that have negative effects too. :(

This isn't limited to software either: improvements to mass production have made mass-produced goods extremely inexpensive, but by that same token custom work has become much more expensive. And the world around us has become much more homogenized and cookie-cutter as a result. But the leverage that software potentially has is vastly greater than other things because of its zero marginal cost of production.


I'd also be interested in how long it took to reach such a massive improvement. I guess the OP spent/waited (in a CI job?) countless hours testing the import of large files until they finalized it.


I think the author was smarter than that. They used profiling tools to identify the bottlenecks, then used much smaller/faster test cases to fix the bottlenecks.


>The moral this time might be, try to not have functions that work on one object, and are O(N_objects) complexity. They absolutely might be fine when called on one object. But sooner or later, someone will need to do that operation on N objects at once, and whoops suddenly you are in quadratic complexity land.

I usually put it this way to my colleagues: on a long enough timescale, every function will be called from a `for` loop.
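
In Python terms, the anti-pattern looks something like this (a hypothetical sketch, not Blender's actual code): a helper that scans all existing objects is fine when called once, but calling it from the batch loop makes the whole import quadratic.

  from dataclasses import dataclass, field

  @dataclass
  class Obj:
      name: str

  @dataclass
  class Scene:
      objects: list = field(default_factory=list)

  def ensure_unique_name(obj, scene):
      taken = {o.name for o in scene.objects}   # O(N) scan for a single object
      while obj.name in taken:
          obj.name += ".001"

  def import_batch_slow(objs, scene):
      for obj in objs:                          # called N times -> O(N^2)
          ensure_unique_name(obj, scene)
          scene.objects.append(obj)

  # A batch-aware version builds the name set once and keeps it updated,
  # bringing the import back to roughly O(N).
  def import_batch_fast(objs, scene):
      taken = {o.name for o in scene.objects}
      for obj in objs:
          while obj.name in taken:
              obj.name += ".001"
          taken.add(obj.name)
          scene.objects.append(obj)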


Excellent writeup and great to see the fixes upstreamed, all driven by a new use case that perhaps was beyond the magnitude of previous import efforts.


While impressive, this misses the point. The Moana scene is a test scene for /renderers/.

It is not meant to be used inside a DCC app, as-is.

If you import data of this complexity into a DCC app your workflow is broken/you are doing something wrong.

That said – I think it's very cool if you can import such heavy geometry and your DCC app doesn't crash. But in 15 years of working in VFX I never dealt with heavy data inside a DCC directly. Indirectly yes, through proxies.

The three most interesting metrics for this scene are, if you're a renderer author (sorted by importance during lookdev/lighting):

1. Time to first pixel (is it seconds or hours?).

2. Time to completion (does it take days/hours/minutes?).

3. Memory footprint (does it 'fit' or does it go into swap?).

Two more are: subdivs and PTex.

I.e. can the renderer do true subdivision surfaces (vs just subdividing the geometry n-times leading to silhouette artifacts under certain viewing conditions)?

And: can the renderer ingest PTex textures or do you have to create UVs somehow and convert all the textures into UV-based ones before you can use the original Disney dataset?


> It is not meant to be used inside a DCC app, as-is.

> If you import data of this complexity into a DCC app your workflow is broken/you are doing something wrong.

Are you sure you're not just saying this because years of software being slow made people cargo-cult workarounds for the issue, which are now assumed to be "the way" in the field?


> are you sure you're not just saying this because years of software being slow [...]

Yes. And I don't know where to even start explaining this. Don't get me wrong.

"Simply put": there is no use case for this. Moana e.g. has gazillions of vegetation geometry instances. Why would you have those inside a DCC? Pebbles on the shore, etc. etc.

VFX pipelines are specialized. An animator does not care if there are 10 million hairs on their furry character while they do the animation. A lighter only cares to see the fur when it's rendering, not while operating the light rig in the DCC. Modern pipelines have instant feedback/viewport rendering where proxy geometry is swapped out for "the real thing" by the renderer (or a custom plugin inside the renderer) on the fly.

The result, inside the renderer, is something like the Moana dataset. Hence its publication. It's a stress test for renderers, not DCCs.


> "Simply put": there is no use case for this.

there can't be for now, because artists couldn't realistically do this. I've been working with artists for, what, close to ten years now, and I can guarantee you that their use of the software permanently fills up whatever performance increase the software gives. That's like asking why a music sequencer would need to support more than, say, 100,000 tracks, which sounds ridiculous until you see someone actually coming up with a fun music score that leverages generative scripting to create very cool pieces.


The performance increase is eaten by scene complexity.

This is and has been the norm since the advent of CGI in blockbuster movies in the 1970s.

For every order of magnitude that hardware gets faster (and/or software gets 'better', i.e. using better parallelization etc.), image complexity in your average shot increases by an order of magnitude.

So yes, what artists are doing is more complex but relatively so. I.e. it doesn't invalidate anything I and others wrote about the case in point – the Moana asset, how such assets are built and what part of a VFX pipeline reasonably ingests them.


The software is fast where it's needed - it makes the right trade-offs. There's little point in building IDEs capable of handling terabyte-sized source files, because just having a terabyte source file is ridiculous.

Scenes like Moana are not intended to be handled as a whole; they are assembled from little pieces - semantically independent and assignable to different people with different skill sets. It's not a matter of software performance, it's a matter of process organization.


At some point, the little pieces need to be assembled, reviewed and finally rendered. Why wouldn't you do that with the DCC application rather than with specialized, limited tools?


Indeed, you often do that "somehow" inside the DCC.

But again: the DCC doesn't see the highly detailed version of that geometry in a proper pipeline. It only sees a proxy. See my detailed reply above.

Only the renderer sees the final geometry, and even that may be generated procedurally, on demand. I.e. stuff like fur, particles or the like is often generated on the fly, since caching (and then reading) this data is not only impractical but may also be slower.

Think a terabyte of fur data for a horde of fluffy monsters on screen (per frame, of which there are at least 24 per second of film) that needs to be distributed to each blade on a render farm or, worse, machines in the cloud (again and again, since each frame goes through many revisions in lighting/rendering before it's final).

Computing the fur from a few hundred thousand 'guide hairs' on each character, on the fly, is a better approach.


The Blender OBJ loader was incredibly inefficient. Blender used something like two orders of magnitude more RAM than necessary, so even a small model would take gigabytes. It was basically useless for any model of even modest complexity. Working around its limitations took dozens of hours of effort when I was importing and rendering my own procedurally generated models.

I was seriously considering rewriting the Blender OBJ importer myself. Ultimately, I ended up writing an OBJ to Alembic converter, since Blender handled those more efficiently.


"The point" was to find a large project to test import performance, and it sounds like they hit that point dead center.


What is the point of importing a dataset of this complexity if you can't also work with the data inside the DCC?

Try the thousands of tools and plugins available for Blender with such data and see what happens. Good luck.

I guess my point is: you need to also "fix" any of those that crash/hang/are too slow to make this worthwhile.

Just to be clear again: it's awesome they fixed the bugs that led to crashes with this data. Not sure if that large a dataset was required at all to do that, though.

But the speed improvements? Do they matter for your average OBJ? If no one ever imports data of the complexity of the Moana dataset, because they can't work with it afterwards anyway, then any speed improvement that is not felt in the average use case is first and foremost a nice engineering exercise.

There is a reason why most commercial DCCs, too, struggle with data sets of this size and no one has ever "fixed" this. ;)


> What is the point of importing a dataset of this complexity if you can't also work with the data inside the DCC?

My understanding is that you need to reproduce the rendering bug which crashes Blender to be able to fix it, and reproducing it needs to be fast. Even if you have a smaller scene which would trigger this bug, without the optimisations the feedback loop would waste a lot of valuable time. Now you have a workflow which reaches the rendering crash in less than 2 minutes.


See this reply from someone else: https://news.ycombinator.com/item?id=32190577


I really don't understand your point at all. You might be right, the use case doesn't exist yet, maybe never. But this never was the point of the blog post.

Making the application hang for an extremely long time, or even crash, by importing _something_ is a bug in my understanding. Why shouldn't it be fixed? Why shouldn't developers improve performance and blog about it, so other devs can learn from it?

It's not about the Moana scene; that's just the test case, so OP has a valid benchmark with human-comprehensible durations. The scene could be anything smaller, and it will be imported faster now.


See the reply to that:

https://news.ycombinator.com/item?id=32191173

>At some point, the little pieces need to be assembled, reviewed and finally rendered. Why wouldn't you do that with the DCC application rather than with specialized, limited tools?


See the explanations in my other replies on this topic.

It is impossible to hold a typical VFX scene in RAM to start with.

Even freelancers doing sim/FX work now have at least 128GB RAM and this is often just enough for proxy work that still gets expanded at render time.

I.e. consider the possibility that your worst estimate of how complex this data could be is off by 1-3 orders of magnitude.

And for your example: the person who signs this off is the VFX supervisor. They don't sign off anything but final frames.


It’s a stress test. It doesn’t have to be reasonable. This is what you do for performance improvement, you push as hard as you can to find the worst actors and fix them.

Performance improvements like this expand the horizons of what is reasonable to do.

There’s a circular-logic problem where you argue that performance problems shouldn’t be fixed because nobody actually does this, but nobody does it because of performance problems. No, this effort didn’t solve every performance problem required to do a thing, but it did solve some of it.

“We didn’t do everything so we should have done nothing” is not how you build a performant product.


He sped up parts of the Blender pipeline; someone will no doubt benefit.

He does address that the crashing also needs to be fixed but that he doesn't want to be the guy doing that.


I disagree; the point was to show how object importing scales, and he used an extreme example to make it. He also gives an example of duplicating 10k cubes in Blender (something not out of the ordinary), which also sees a massive speedup.

This kind of thing can also help uncover other bottlenecks in software like Blender, as its users and developers get more ambitious and the software grows in capability.

Ideally, wouldn’t one like filmmakers to be able to work ‘in world’, decoupled from technical limitations?


I've seen much the same problems when trying to programmatically construct a large number of objects using the (low-level) Python API. If this kind of optimization improves the throughput of those APIs, that would be a major enabler for work that I want to do.

Otherwise I'm side-tracked looking for relatively awkward SIMD-style programming (e.g. Geometry Nodes) as a workaround for "CRUD" operations being too slow. Can't be just me...
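
For what it's worth, the workaround I usually end up with (a rough sketch, assuming Blender's bpy API; the cube-grid example is made up) is to accumulate geometry into one mesh via from_pydata instead of creating thousands of separate objects, since the per-object bookkeeping is what dominates:

  import bpy

  def cube(origin):
      x, y, z = origin
      verts = [(x + dx, y + dy, z + dz)
               for dx in (0, 1) for dy in (0, 1) for dz in (0, 1)]
      faces = [(0, 1, 3, 2), (4, 6, 7, 5), (0, 4, 5, 1),
               (2, 3, 7, 6), (0, 2, 6, 4), (1, 5, 7, 3)]
      return verts, faces

  # Accumulate everything into a single mesh instead of one object per cube.
  all_verts, all_faces = [], []
  for i in range(100):
      for j in range(100):
          verts, faces = cube((i * 2.0, j * 2.0, 0.0))
          offset = len(all_verts)
          all_verts.extend(verts)
          all_faces.extend(tuple(v + offset for v in f) for f in faces)

  mesh = bpy.data.meshes.new("cube_grid")
  mesh.from_pydata(all_verts, [], all_faces)
  mesh.update()
  bpy.context.collection.objects.link(bpy.data.objects.new("cube_grid", mesh))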


It doesn't miss the point at all. It's a large data set and they used it to improve a function of Blender. There's nothing that says it only has to be used to test renderers.


Is there a video of this scene?



I always try to ensure my code scales at worst linearly on 10x to 100x the expected data size, or at 1/10th the bandwidth, 10x the latency, or constrained CPU or memory (mobile).

Pays big dividends in the long run, every time.


I searched through all the example pictures for an elephant, but could not find one


It got swallowed.



