Hacker News
Fast Virtual Functions: Hacking the VTable for Fun and Profit (medium.com/calebleak)
78 points by danny00 on March 23, 2024 | hide | past | favorite | 52 comments



Having spent a good chunk of my career optimizing C++ game engines - don't do this.

If virtual vs non-virtual function calls are causing you performance problems, you're calling way too many functions per frame. Batch the objects by concrete type and process them all in a loop.
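A minimal sketch of that batching approach (type names like Bullet and Enemy are purely illustrative):

```cpp
#include <vector>

// Instead of one polymorphic list of Entity* with a virtual Update(),
// keep one vector per concrete type and run a tight, non-virtual loop
// over each. The calls are direct and trivially inlinable.
struct Bullet { float x = 0, vx = 1; void Update(float dt) { x += vx * dt; } };
struct Enemy  { float hp = 100;      void Update(float dt) { hp -= dt; } };

struct World {
    std::vector<Bullet> bullets;
    std::vector<Enemy>  enemies;

    void Update(float dt) {
        // One dispatch decision per *batch*, not per object.
        for (auto& b : bullets) b.Update(dt);
        for (auto& e : enemies) e.Update(dt);
    }
};
```

The data for each batch is also contiguous, which is where much of the real win tends to come from.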


GCC even has a "bound member functions" extension to help here

https://gcc.gnu.org/onlinedocs/gcc/Bound-member-functions.ht...


Author here. I'm a bit confused by this response. Sure, going full ECS and processing all objects in a single (or few) function calls is likely to be faster. But there's an obvious tradeoff in solution complexity.

Both Unreal and Unity make heavy use of per-instance per-frame virtual functions. It's a paradigm that has clear value. Why not make it cheaper at no cost to the dev? The option to take a systems approach to bottlenecks is the same afterwards.


> Both Unreal and Unity make heavy use of per-instance per-frame virtual functions. It's a paradigm that has clear value.

It's a paradigm that had (debatable) value in the late 1990's and early 2000's when Unreal Engine and Unity had been designed and OOP was all the rage ;)

In the meantime we realized that not everything should in fact be an (OOP) object and runtime polymorphism is usually not needed to build games, even if the whole "animal => mammal => dog" classification thing at first sounds like it would be a good match to model game objects.

With runtime polymorphism it's like with memory management. If memory management shows up in profiling at all, it's better to drastically reduce the frequency of allocations instead of looking for a faster general purpose allocator (and if it doesn't show up in profiling as it should be, integrating a faster allocator doesn't make much sense either because it won't make a difference).

Of course stuff like this is not easy to do late into a project because it would involve a complete redesign of the architecture and rewriting all code from scratch - and that's why dropping in a faster allocator sometimes makes sense as a compromise, but it's not a fix for the underlying problem, just a crutch.

Also, the more important problem with indirect function calls than the call overhead is usually that they present a hard optimization barrier for the compiler.


> It's a paradigm that had (debatable) value in the late 1990's and early 2000's when Unreal Engine and Unity had been designed and OOP was all the rage ;)

For inheritance, I 100% agree. Composition all the way. I think it has value as an interface though - at least for quick bring-up and fast iteration. It can of course bring scaling challenges - I recently worked on a project that had hundreds of devs and more than 50k game components. That brought all of the architectural and performance challenges you'd expect from this approach.

> Also, the more important problem with indirect function calls than the call overhead is usually that they present a hard optimization barrier for the compiler.

In the years I've had to think about this, I'd take a slightly different approach that should be more amenable to compiler optimization. I'd maintain separate lists for each concrete type and have a type aware process function (via templates) which requires all overrides to be marked final. That should allow the compiler to do inlining, avoid indirections, etc. The major downside here is handing over a footgun to the dev - forget that final keyword or pass the object in not as concrete type and performance will suffer. I'd probably still walk the vtable to see if a function has been overridden - it's unfortunate that there doesn't seem to be a way to do this without resorting to such tricks.
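A rough sketch of what this comment describes (names are hypothetical; the key pieces are the homogeneous per-type list, the templated process function, and the final override):

```cpp
#include <vector>

struct GameObject {
    virtual ~GameObject() = default;
    virtual void Update(float dt) = 0;
};

// Marking the override final lets the compiler devirtualize and inline
// the call inside ProcessAll, where T is known statically.
struct Mover final : GameObject {
    float x = 0;
    void Update(float dt) final { x += dt; }
};

// Type-aware process function: iterates a homogeneous list of T, so the
// compiler sees the concrete type and can skip the vtable entirely.
template <typename T>
void ProcessAll(std::vector<T>& objects, float dt) {
    for (T& obj : objects) obj.Update(dt);  // direct call when T::Update is final
}
```

As noted, the footgun is real: drop the `final`, or pass the objects as `GameObject*`, and the indirect calls quietly come back.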


Hi Author. I did not say that, but I did think it. I understand the confusion: when someone gives the admonishment "don't do this", you have no idea whether they're just sniffing glue, so I will try to explain why I think most people probably shouldn't do this.

Firstly, I think the graph in your article shows no advantage except with a cold cache. I agree with this, and think we probably disagree on how likely a "cold cache" is, or how to model that when thinking about a program. I argue you should never bother.

In a game (or other realtime scenario that runs continuously) you understand that every "n" frames you are going to spill cache "m" times, and that because the game runs continuously "m/n" averages out on some value.

I have worked on code that aims for m=0, and it's a little bit like superconductivity in some respects: strange things happen with your code when it looks like that. But remember that L1 is only 64 KB, so if your program gets any bigger than that, or ever does a system call and jumps into the kernel (which is definitely bigger than that), then m is at least 1.

Microbenchmarks tend to be m=0, which makes working with them tricky. Real programs are usually at least m=1 and often much, much bigger: even m=1 is still something like 100 GB per second.

Now through all that, it is tempting to think of m as the sum of latencies, but it's important to remember that the size of the code matters too: those instructions have to be fetched from memory just like data does, so if your program gets too big (as can happen if you naively unroll every lookup table into its own function), m is going to get bigger faster than anything else. Look at objdump. Look at how big your program gets doing this. How many multiples of 64 KB have you eaten with this trick? Each one of those is another m.

That is why this trick is almost certainly never worth it, and if you ever find a big codebase that benefits from a few strategic hoists of this technique I bet putting a prefetch intrinsic in the right place would help better.


I recently watched a talk which covers some of the runtime dynamics you’re talking about and it’s very interesting! The mental model never adequately represents real machine architecture.

https://youtu.be/i5MAXAxp_Tw


You don't need to go full ECS. You can write a static member function that takes a set of objects of that class, and write a non-virtual inlineable member function, and still use your SceneManager approach to keep track of them if you like.
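A hedged sketch of this pattern (Particle and its members are made up for illustration):

```cpp
#include <vector>

// Keep the class, but add a static member function that updates a whole
// set of objects with direct, inlineable calls. A SceneManager-style
// owner can call StepAll once per frame.
class Particle {
public:
    float y = 0, vy = -9.8f;

    // Non-virtual, trivially inlineable per-object step.
    void Step(float dt) { y += vy * dt; }

    // Static batch entry point: no virtual dispatch anywhere.
    static void StepAll(std::vector<Particle>& particles, float dt) {
        for (auto& p : particles) p.Step(dt);
    }
};
```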

It doesn't feel like there's "no cost to the dev" here because developers will still sooner or later need to understand the relevant code.


Branch prediction for virtual calls has traditionally worked by looking at the address of the calling instruction and assuming the target will be the same as last time.

But the real penalty for using Virtual Calls is the potential loss of inlining. It depends on how smart your compiler is.


Modern CPUs support indirect branch prediction based on global branch history, they've advanced past simple predict-not-taken or predict-last for indirect calls. Intel has been doing it since Haswell.


It's kind of worse than this: a virtual function call is a load (from the object) to get the vtable pointer, followed by a load from an offset of that, followed by a subroutine call through that pointer. It's about the worst thing you can do in a high-performance microarchitecture, because you're going to predict that branch (call) and often get it wrong, but you won't know you got it wrong until you've done the two loads (which can't be parallelised, because the address of one depends on the other). By the time you've resolved this you may be throwing away tens or a hundred instructions from your pipe if the prediction was wrong.


Virtual function APIs have to be designed such that they don't suffer from lack of inlining.

E.g. if you're making a graphics display driver, don't have objref->putpixel(x, y) as your only virtual function, which has to be called thousands of times to draw a line or millions to fill an area.
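A small sketch of that granularity point (Surface and FillSpan are illustrative names, not any real driver API):

```cpp
#include <cstddef>
#include <cstdint>

// Make the virtual boundary a batch operation rather than per-pixel,
// so one indirect call amortizes over a whole run of pixels.
struct Surface {
    virtual ~Surface() = default;
    // Bad as the *only* interface: one virtual call per pixel.
    virtual void PutPixel(int x, int y, std::uint32_t color) = 0;
    // Better: one virtual call fills a whole span. Default falls back
    // to per-pixel; real implementations override with a tight loop.
    virtual void FillSpan(int x, int y, int width, std::uint32_t color) {
        for (int i = 0; i < width; ++i) PutPixel(x + i, y, color);
    }
};

// A concrete surface that overrides the batch operation directly.
struct CountingSurface final : Surface {
    std::size_t pixelsWritten = 0;
    void PutPixel(int, int, std::uint32_t) final { ++pixelsWritten; }
    void FillSpan(int, int, int width, std::uint32_t) final {
        pixelsWritten += static_cast<std::size_t>(width);  // no per-pixel calls
    }
};
```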


Cache locality and better prediction are the main reason I went down this path in the first place. Sorting by the function to be called, rather than a grab bag of arbitrarily ordered `.Update()` calls (which is what happens in Unity, my main reference point for this), is going to give some speed gains even with the indirection still there. Of course eliminating some indirection is usually a win.


They are also useful for hacking in the reverse engineering sense: an object's vtable pointer usually leads to very strong hints about what the object is, which is very important information in the very common situation where the data and code don't make it super clear :)


I'm amazed C++ defines a user-visible struct of virtual methods but doesn't define a user-visible mechanism to retrieve it from a pointer. This hack of converting the pointer into the VTable seems... concerning.

Why does C++ hate its users so much?


As far as I remember, C++ doesn't define any struct of virtual methods at all. Individual implementations do. Hence he points out that, by convention, the vtable pointer comes before all other members, and lists some of the implementations that follow this convention.


Yeesh, is he both defining the VTable _and_ dereferencing the pointer into his VTable? That's... terrifying


It's not particularly terrifying - it's just an array of pointers, and it's well known how it's implemented for the main compilers. As I've pointed out elsewhere I don't think it's necessary though, and it definitely creates a dependency on the compiler that may or may not be acceptable.


I don't believe the VTable can be so simple when you have, e.g., multiple classes implementing different mixtures of common interfaces and inherited methods. All lookups have to work the same across all the different interface implementers.


C++ doesn't have interfaces. In at least the common implementations, the vtable for any given class just adds function pointers for every new virtual member function each subclass introduces.

Since you (in C++) can't cast between "sibling" types and expect it to work, that's enough - each new subclass has a vtable that shares the base with all classes inheriting from the same parent, but adds different sets of new virtual member functions.
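A hand-rolled model of that layout (purely illustrative; real compilers also emit RTTI slots, offset-to-top entries, and so on):

```cpp
// A derived class's vtable begins with the base's slots, so a base
// pointer still indexes correctly, and appends slots for newly
// introduced virtuals. All names here are made up for the sketch.
struct Animal;

struct AnimalVTable {
    int (*legs)(const Animal*);   // slot 0, introduced by Animal
};

struct Animal {
    const AnimalVTable* vt;
    int callLegs() const { return vt->legs(this); }
};

struct DogVTable {
    AnimalVTable base;            // Animal's slots come first
    int (*bark)(const Animal*);   // slot 1, new virtual added by Dog
};

static int dogLegs(const Animal*) { return 4; }
static int dogBark(const Animal*) { return 1; }
static const DogVTable kDogVT{ { dogLegs }, dogBark };

// A "Dog" is an Animal whose vt happens to point at the extended table.
inline Animal makeDog() { return Animal{ &kDogVT.base }; }
```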

In languages with interfaces, or that are more dynamic, you typically have three common choices:

* Add a dispatch method and call that (really expensive, but extremely flexible)

* Add an indirection and put a pointer to a vtable for each interface in the main vtable (first place I encountered that was in a paper that named it "Protocol Extension" by Michael Franz, who implemented it for Oberon)

* Make the vtable include the superset of all possible methods. My (experimental, incomplete) Ruby compiler does that. It makes the vtables larger, and at some point you'll want to assign some methods to a slower but more compact path, but it works surprisingly well.

But none of those are needed for C++.


I'm surprised C++ doesn't support it. I wonder how Rust does it (with traits).


I'm wondering why the "final" keyword is not mentioned in the article.


This is a good point. Using final and a static cast to the derived type should eliminate the indirect call.


What does "final" mean in C++? Is that new? Maybe you're confusing with java?


> Is that new?

No. In fact in a few short months post C++11 will have overtaken pre-C++11 as the majority of the 26 year history of standardized C++ (and similar to prior standards, compilers for the fortunate implemented much of the behavior prior to the official publication).


> post C++11 will have overtaken pre-C++11 as the majority of the 26 year history of standardized C++

That's because they switched to a different 'release cycle' for C++ standards; also, they now have minor releases like C++14.


The release cycle does not change the number of years that have elapsed before and after 2011.



C++11 adds support for specifying "final" on derived virtual functions to prevent further overriding of the function by any derived classes of the current "final" class.


Yep. Imported more-or-less from Java OOP.


https://en.cppreference.com/w/cpp/language/final

Final in C++ indicates that a function can no longer be overridden (or a class no longer derived from). This enables devirtualization, as the compiler then knows the function pointer cannot change, if it can infer the type at compile time.
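A minimal example of the devirtualization opportunity (hypothetical types):

```cpp
struct Drawable {
    virtual ~Drawable() = default;
    virtual int draw() { return 0; }
};

// Sprite is final: no further overrides of draw() can exist.
struct Sprite final : Drawable {
    int draw() override { return 42; }
};

int drawKnownSprite(Sprite& s) {
    // The compiler knows this resolves to Sprite::draw and may emit a
    // direct, inlinable call.
    return s.draw();
}

int drawViaBase(Drawable& d) {
    // Here the compiler must, in general, keep the indirect call.
    return d.draw();
}
```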


What does this do for you that the NVI idiom (http://www.gotw.ca/publications/mill18.htm) doesn't?


on this topic, it's interesting how my mindset changes when I'm writing Rust vs Python, with regards to vtables, dispatching, and allocation.

Writing Rust for a toy project: "I MUST avoid allocation, and the dyn keyword, at all cost!!"

Writing the same toy project in Python: "lol just copy the string who cares"


I complain about Python slowness all the time, but in a way it's liberating. Since you have no chance of making the code faster, you just stop worrying about it.


I've been learning Python recently. I found that one can make it much faster with a little work. Coming from other languages, it was also easy to write it in a way that its many accelerators, like Cython or mycpp, could use. What's the catch?

For me, there were two. First, I'm having to learn to code in the "Pythonic" style so my code will fit into existing projects. An example in a class I took was moving from while loops to for i in range(len(list)). While you won't forget to increment the counter, it felt like being so clever would reduce optimizations. Another project needed whole-string copies for conciseness vs a char-by-char algorithm that was highly efficient. While it may be my inexperience, it seems like Pythonic style inherently favors conciseness and readability over performance.

Second, the extensions seem tightly coupled with the CPython interpreter. I've read enough articles to know that's causing problems for people wanting to add capabilities. It may not matter, though, since most of us use Python for its ecosystem benefits. We can just switch tools when needed.


Perspective really changes when simple integer addition can take more time than allocation in a compiled language.


Hmm, assuming you’re talking about heap allocation, could you elaborate? I’m curious.

[Edit: You may have meant “when simple integer addition [in a scripting language] can take more time than allocation in a compiled language”, in which case sure, I agree. I had originally interpreted your statement as saying compiled languages can do allocation faster than they can do addition…]

Allocation in most environments I’ve seen involves keeping track of a “free list”, which does some amount of pointer chasing to find a contiguous section of memory that can fulfill the request. Deallocation drops the item from the list, etc. This can result in some fragmentation, and the segments of memory managed by the free list are less likely to be in cache because they’re somewhat dispersed compared to stack memory. Some environments have multiple free lists depending on the expected size of the allocated objects, to avoid fragmentation.

How could all this be faster than a single integer addition?


When it's a bigint addition, as all integer additions in Python are.

Also, even when both integers are small, Python usually has to do an allocation anyway to store the result.


I've actually shifted most of my personal dev to Rust now, and so this vtable hack has become less relevant. Rust makes it very easy to avoid virtual functions. If I had to redo this bit of code in Rust, a trait would boil any sort of update function down to its concrete type and give (I'd expect) great performance with all the convenience of virtual functions.


Just go all-out C-style: forgo virtual functions and use a struct of function pointers instead of this weird construction.


You... haven't worked on a large codebase, have you?

The part that baffles me about this entire post is that it's trivial to obtain pointers to member functions legally, without the fragility associated with guessing VTable offsets.


If you're referring to C++ pointer-to-member types, the semantics of those still requires them to perform virtual dispatch at point of use. That is, given:

    struct Base { virtual void foo(); };
    struct Derived : Base { void foo() override; };
    auto p = &Base::foo;
    Derived d;
    (d.*p)();
The last line must invoke Derived::foo, not Base::foo. Which in turn means that the representation of a pointer-to-member-function cannot be a simple function pointer in the most general case. The representation must store enough information to know whether it's pointing to virtual or non-virtual member, and then store either direct pointer to code or vtable offset depending on that.

Consequently, an indirect call via pointer-to-member has to do the virtual/non-virtual check first, and then look up vtable by index if virtual, before performing the actual call. Which is in fact more work than doing the virtual call directly (since the first step is unnecessary in that case), and thus pointers-to-members cannot be used to optimize away the overhead.
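A runnable version of the snippet above, showing that a pointer-to-member taken as &Base::foo still dispatches on the dynamic type at the call site:

```cpp
struct Base {
    virtual ~Base() = default;
    virtual int foo() { return 1; }
};
struct Derived : Base {
    int foo() override { return 2; }
};

int callThroughPMF(Base& obj) {
    int (Base::*p)() = &Base::foo;  // not a plain function pointer internally
    return (obj.*p)();              // virtual dispatch happens here
}
```

Even though `p` was formed from `&Base::foo`, calling it on a `Derived` reaches `Derived::foo`.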


> the representation of a pointer-to-member-function cannot be a simple function pointer

Sure it can, you just need to have emitted separate functions for "concrete call" (the traditional one stored in the vtable) and "virtual call" (a bit of code that calls through the vtable).

If this is not done it's probably due to the fact that actually emitting the virtual calls involves more code than just storing the vtable index. But if anyone is implementing a new language I strongly suggest paying the cost anyway, for sanity.


There are other complications at play for your proposed solution.

If you emit a stub for each virtual member function that performs the vtable dispatch, at which point do you emit it? If you do it if and when the address is actually taken, then pointers to members in different compilation units will have different stubs (COMDAT folding can take care of this for static linking, but not for dynamic) - which means that &Foo::bar taken in one unit will not compare equal to the same taken in another, which is counter to the C++ specification.

Alternatively, you could emit such a stub proactively for each vtable entry on the basis that its address might be taken in other compilation units - but that means a lot of unused stubs.

The other problem is multiple inheritance combined with upcasting. Consider:

    struct Base1 { ... };
    struct Base2 { virtual void foo(); };
    struct Derived : Base1, Base2 { ... };
    void (Derived::*pf)() = &Base2::foo;
    Derived d;
    (d.*pf)();
The problem here is that your stub for Base2::foo expects `this` to point at the beginning of Base2. But then when it is invoked on an instance of Derived, what's readily available at call site is a pointer to the beginning of Derived. Which is not the same, because Base2 is the second subobject of the Derived instance - so you need to adjust that pointer to Derived by sizeof(Base1) chars to upcast it to a pointer to Base2. However, you don't know at call site that your Derived::* actually points to a member of Base2 in the first place - so the information about this adjustment has to be stored in the pointer, as well, and has to be dynamically loaded and applied at call site (which is yet another step making it slower than a regular virtual call, by the way). And this cannot be done with pre-generated stubs since you don't know in advance which classes in other compilation units might inherit from Base2 later on, and what layout they are going to have.
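A runnable version of that scenario, confirming that the implicit contravariant conversion and the this-adjustment both happen:

```cpp
// Base1 gets a vtable of its own so Base2 sits at a nonzero offset
// inside Derived, forcing a real this-adjustment at the call site.
struct Base1 {
    virtual ~Base1() = default;
    int pad = 0;
};
struct Base2 {
    virtual ~Base2() = default;
    virtual int foo() { return 7; }
};
struct Derived : Base1, Base2 {};

int callAdjusted() {
    // Implicit conversion from int (Base2::*)() to int (Derived::*)();
    // the pointer-to-member must carry the subobject offset.
    int (Derived::*pf)() = &Base2::foo;
    Derived d;
    return (d.*pf)();  // `this` is adjusted to the Base2 subobject here
}
```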

If anyone is designing a new language, I strongly suggest not doing anything like C++ unbound member function pointers in the first place; the practical need for such a thing is very rare, since in most cases you want the pointer to be bound to a particular object (and then vtable resolution can be done at the point where the pointer is produced, rather than at the point where it's used). And, on the other hand, in those rare cases where such a thing is needed, it can be trivially done with closures if the language has those (and it should, given their much broader utility). If C++ had lambdas from the get go, I very much doubt that it would also have pointers to member functions.


Ah right, that.

To be clear to others: the problem appears due to the fact that `R (Base::*)(A...)` can be cast to `R (Derived::*)(A...)` (note the contravariance), not just at construction time but arbitrarily later. This is unfortunately special to the `this` argument; a sane language should allow contravariance of any function pointer argument. (This implies that, if you're emitting C code, all arguments should unfortunately be `Object*` to avoid illegal casts.)

I actually disagree that closures fix this though - you can easily end up with deeply nested closures which are opaque so you can't optimize them out. Rather, just make unbound functions identical to free functions.

And yes, I agree that single inheritance + traits (~late interfaces, ~external vtables, which turn out to be easier to reason about for inlining; you can't use instanceof, though) is much saner than multiple inheritance + ADL. Honestly, I'd say that ADL is C++'s worst mistake, since it intrudes upon *every* function call you make, unlike most of the crazy stuff we're doing here.

Doing without single-inheritance is in fact viable but annoying, especially when porting code or programmers. Emulating single-inheritance in terms of traits can be done by splitting each class into a concrete struct + a trait (and aggregating the structs but inheriting the traits), but this requires manually forwarding all the unchanged virtual functions, and also trait pointers are fat which is often undesirable (except note that for small types you often don't want to pay for an embedded vtable pointer). Moving that back to a class inheritance system, however ... I've often thought it is quite useful to support both "pointer to class, virtually" and "pointer to class, final".


> I actually disagree that closures fix this though - you can easily end up with deeply nested closures which are opaque so you can't optimize them out. Rather, just make unbound functions identical to free functions.

It would be trivial to optimize - as that would be a lambda of the form `[](Foo& foo, ...) { foo.bar(...); }` it can just be emitted as a free function with no closure, no matter how deeply nested it is. In fact, looking at it in Godbolt, it seems that compilers do this even when optimizations aren't enabled at all, probably because it's just the most obvious way to codegen.

Effectively, it just becomes a way to define a regular function in-place at the point where its address gets taken. In C++, such a lambda is even implicitly convertible to the corresponding plain function pointer type, so it's already unified with regular free functions on type system level.

Of course, you could then say that &Foo::bar is simply syntactic sugar for such a lambda. I think we agree that the important part here is to avoid having the special notion of pointers-to-member-functions in the type system; the rest is just a question of how to dress it up when you have a use case where you need one.


Exactly. Don't do this. UB is fragile.

When downshifting to C, use opaque structs to hide private data and use regular function pointers. There are numerous examples of C VTables, but these tend to be slow. When possible, don't use OOP or VTables, because they're slower and more difficult to debug. Plain functional and procedural architectures are much more robust, deterministic, testable, and easier to reason about.

https://godbolt.org/z/feajn1W4j
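For comparison, a hand-rolled C-style vtable of the kind described above might look like this (Shape/Circle names are illustrative):

```cpp
// An explicit struct of function pointers, written out by hand instead
// of relying on the compiler's hidden vtable. Every "object" embeds its
// ops pointer as the first member.
struct Shape;

struct ShapeOps {
    double (*area)(const Shape* self);
};

struct Shape { const ShapeOps* ops; };

struct Circle {
    Shape base;   // ops pointer first, like a hand-rolled vtable pointer
    double r;
};

static double circleArea(const Shape* s) {
    // Valid because Shape is the first member of Circle.
    const Circle* c = reinterpret_cast<const Circle*>(s);
    return 3.141592653589793 * c->r * c->r;
}

static const ShapeOps kCircleOps = { circleArea };

inline Circle makeCircle(double r) { return Circle{ { &kCircleOps }, r }; }

// Generic dispatch: one indirect call through the explicit ops table.
inline double shapeArea(const Shape* s) { return s->ops->area(s); }
```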


I meant instead of casting the vtable pointer and guessing the layout (imagine doing that in a large codebase...), just define the struct yourself and disallow the virtual functions in this case specifically if you're chasing the performance that badly.


There is one quite well known and very large codebase that does this extensively: the Linux kernel.

I have mixed opinions about it.


80% of the code is drivers, which, having written many myself, don't always need the latest OOP features. Abstractions suck in C for sure, but it's OK.

Meanwhile, the Windows XP codebase had 45M LOC in 2011.

Truly big codebases need to be written in sane languages, mostly OOP ones. That excludes C, Zig and Rust.


Virtual functions in C++ are effectively implemented as struct-of-pointers. Demonstrating how that works is a large proportion of the article.


Came to see some clever Rust dyn use. Saw C++. Close tab



