I have an Actor framework which uses a vanilla std::deque for method pointers. To add messages to the queue, the locking technique is a Benaphore (the original futex idea: an atomic plus a locking primitive, with the twist that my locking primitive is a spinlock/mutex combo chosen by retry count). Nothing special.
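Concretely, a minimal Benaphore looks something like this (a C++20 sketch rather than my actual code; the spinlock/retry-count twist is omitted, and std::binary_semaphore stands in for the locking primitive):

    #include <atomic>
    #include <semaphore>

    class Benaphore {
        std::atomic<int>      count_{0};
        std::binary_semaphore sem_{0};  // the backing locking primitive
    public:
        void lock() {
            // Uncontended path: one atomic RMW, no kernel involvement.
            if (count_.fetch_add(1, std::memory_order_acquire) > 0)
                sem_.acquire();  // contended: sleep on the semaphore
        }
        void unlock() {
            // If someone queued up behind us, wake exactly one waiter.
            if (count_.fetch_sub(1, std::memory_order_release) > 1)
                sem_.release();
        }
    };

The whole point is that the uncontended path never touches the kernel; only an actual conflict pays for the semaphore.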
Benchmarks show that very rarely does the message push function block, and the chance of an OS context switch is also very low, so even though occasionally we get a locked thread swapping out, it is so infrequent that it doesn't justify the lock-free algorithm cost.
In a nutshell, non-lock-free queues are much faster than lock-free queues; however, you must be prepared to accept the very infrequent longer delay due to a context switch (during which the lock is not available to anyone). My benchmarks show that the price is acceptable, since so much more performance is available with non-lock-free queues: 10 million messages per second per worker thread can be queued on modern hardware.
The "Benaphore" was a pretty old idea, but the Futex isn't just "Oh it's a Benaphore but Linux" the essential trick is that you don't actually need a kernel object - your "locking primitive" at all, and that idea is where this goes from "Yeah, everybody knows that" to OK, our OS should add this feature ASAP.
Instead of an OS synchronisation object which is used to handle conflicts, with the futex design the OS carries a list of address -> thread mappings. If a thread T is asleep on a futex at address X, the address X goes in the list pointing to thread T. When the OS is asked to wake the X futex, it walks the list and wakes T.
The giveaway is the limits. Something like a Benaphore is constrained because its backing semaphore is a costly OS-wide resource; I think BeOS only allowed 65536 per machine or something. But a futex is just memory, so there's no reason to have any limit at all.
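To make the mechanism concrete, a toy futex-backed lock on Linux looks roughly like this (deliberately simplified, not production code; a real implementation tracks contention so the uncontended unlock can skip the wake syscall):

    #include <atomic>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static std::atomic<int> word{0};  // 0 = unlocked, 1 = locked; just memory

    static long futex(std::atomic<int>* addr, int op, int val) {
        return syscall(SYS_futex, addr, op, val, nullptr, nullptr, 0);
    }

    void lock() {
        int expected = 0;
        while (!word.compare_exchange_strong(expected, 1,
                                             std::memory_order_acquire)) {
            // Sleep only while the word still reads 1; the kernel records
            // the address -> thread mapping so a later wake can find us.
            futex(&word, FUTEX_WAIT, 1);
            expected = 0;
        }
    }

    void unlock() {
        word.store(0, std::memory_order_release);
        futex(&word, FUTEX_WAKE, 1);  // wake one sleeper on this address
    }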
With a post like this, however interesting, you really should link the code. Not only will it prove your claim, but others like me will find the idea intriguing and immediately want to see how it's done.
It really depends on the specifics. If there's a lot of contention then performance will drop off a cliff. Even atomic instructions can become a bottleneck ( https://stackoverflow.com/q/2538070 ).
I think what you're observing, i.e. that in many cases you should just use a lock and not worry about it (or your variation), is true. But there are certain applications/situations where you can do better. Having a consumer pull everything out of the queue with one locking operation, and being careful about how you signal the consumer from the producer(s) (assuming there is a signal), can also make the queue more efficient and give it higher throughput: e.g. you shouldn't signal for every item you put in the queue, only when it becomes non-empty.
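A sketch of that last point (illustrative names, assuming a single condition-variable consumer): producers notify only on the empty-to-non-empty transition, so a busy queue generates almost no wakeups.

    #include <condition_variable>
    #include <deque>
    #include <mutex>

    template <typename T>
    class SignalOnceQueue {
        std::deque<T>           q_;
        std::mutex              mtx_;
        std::condition_variable cv_;
    public:
        void push(T item) {
            bool was_empty;
            {
                std::lock_guard<std::mutex> lk(mtx_);
                was_empty = q_.empty();
                q_.push_back(std::move(item));
            }
            if (was_empty)        // only the empty -> non-empty transition
                cv_.notify_one(); // needs a wakeup
        }
        std::deque<T> wait_drain() {
            // Single consumer takes everything with one locking operation.
            std::unique_lock<std::mutex> lk(mtx_);
            cv_.wait(lk, [&] { return !q_.empty(); });
            std::deque<T> batch;
            batch.swap(q_);
            return batch;
        }
    };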
Under a lock free algorithm, we promise that across our entire system, eventually some progress is made. I can't necessarily point at any specific thread of execution and say "This thread makes progress" but I can say that forward progress is made for the system as a whole.
Under a wait free algorithm, we promise that every individual thread eventually makes progress.
Suppose I have one really dumb thread which just goes to sleep for 10 seconds in a loop, as well as other threads which do some actual work. If we're a lock free algorithm, it is OK if the 10-second-sleep thread is always chosen to make progress, its "progress" consists of sleeping for 10 seconds, too bad, try again next time.
With a wait free algorithm, the threads which actually do useful work will also, eventually, make at least some progress, despite the 10-second sleeper.
Generally we can't say for sure that we e.g. "reduce the impact of contention", only that some forward progress is definitely made, on at least one thread, eventually.
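A counter increment makes the distinction concrete (my own toy example, not from the article): the CAS loop is lock-free, because some thread's CAS always succeeds even though a particular thread may retry forever; fetch_add is wait-free, because every thread finishes in a bounded number of steps regardless of the others.

    #include <atomic>

    std::atomic<int> counter{0};

    void increment_lock_free() {
        int old = counter.load(std::memory_order_relaxed);
        // On failure 'old' is refreshed with the current value and we
        // retry; somebody always wins, but maybe never us.
        while (!counter.compare_exchange_weak(old, old + 1,
                                              std::memory_order_relaxed)) {
        }
    }

    void increment_wait_free() {
        counter.fetch_add(1, std::memory_order_relaxed);  // bounded steps
    }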
Lock free is kind of an overloaded term that can mean a variety of things depending on what the user is thinking. Usually the goal is being able to make forward progress even if one thread is context switched out by the OS scheduler, which might be called wait-free or non-blocking. Using mutexes (unless you can prevent OS scheduling, interrupts, etc) makes this property impossible to achieve. In general, MPSC queues are super fast and there's no real reason to prefer a locked queue.
> In general, MPSC queues are super fast and there's no real reason to prefer a locked queue.
There's one significant advantage that a locked vector or deque has over MPSC/MPMC queues: the consumers can dequeue all messages in a single operation by locking the vector, swapping it with an empty vector (typically that's just 3 words), and unlocking it again. That's such a simple operation that it will typically be as fast as or even faster than a single pop-one-message operation on an MPSC/MPMC queue. Similarly, if the vector is empty, a producer can push any number of messages in a single, constant-time operation.
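A sketch of the swap trick (illustrative names): the consumer holds the lock only for the duration of the swap, then processes the whole batch with no lock held at all.

    #include <mutex>
    #include <vector>

    template <typename T>
    class LockedBatchQueue {
        std::vector<T> items_;
        std::mutex     mtx_;
    public:
        void push(T item) {
            std::lock_guard<std::mutex> lk(mtx_);
            items_.push_back(std::move(item));
        }
        std::vector<T> drain() {
            std::vector<T> batch;
            {
                std::lock_guard<std::mutex> lk(mtx_);
                batch.swap(items_);  // O(1): three words change hands
            }
            return batch;            // processed outside the lock
        }
    };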
Depends what you mean by “the impact of contention”. It is possible, if you build a lock-free system, to end up with threads that keep churning and never make forward progress.
It's probably fair to claim that zero* game developers have this hard requirement. At best they have a strong desire which might be partially backed up by benchmarks and squinting hard enough.
*: ignoring deliberately esoteric cases, like Doom on a hard real-time system.
This is usually called "soft realtime" -- you want it to be fast but it's not wrong if it misses deadlines. It doesn't violate correctness of the system to miss a deadline. In a "hard realtime" system, it is a fatal error if the system misses a deadline because correctness is violated. I presume this difference is what OP is talking about. Games are never hard realtime; missing frame deadlines reduces user experience but it doesn't break the game. Hard realtime is things like industrial motion control, automotive and aviation electronics, pacemakers, etc.
I would say that an exception could be made for VR games where any frame rate 'jank' causes an uncomfortable experience and can lead to increased simulator sickness.
Depends how high your standards are! One of the things that makes playing old games on the original hardware really satisfying is how consistent they are.
Those old games could be so consistent because the software and hardware were both so incredibly simple compared to today.
Today, a PC game has to work on a large range of hardware that has come out over the past 5+ years. And there are GPU features that are only available on certain cards, like hardware ray tracing and things like DLSS and FSR for upscaling.
And the game engines are incredibly more complex today to handle modern expectations, with dynamic lighting and shadows, huge maps, etc.
It doesn’t matter what your standards are. Hard realtime just isn’t realistic or even possible any more, except maybe in a game that would be considered truly primitive by today’s standards.
I certainly have memories of how hard it was to support a range of hardware when I had to write specifically for each piece, writing separately for a Soundblaster card and an Adlib and a Disney SoundSource and a Roland and a Gravis.
And that was just the sound cards. Writing for different hardware became so much easier when OpenGL and DirectX came into being. Suddenly I just had to write to these APIs.
I think I'm disagreeing with you. Supporting multiple hardware configurations way back when was so much harder than doing it today.
Yeah, and Tandy graphics were different from EGA or VGA, just enough that you needed to write different display code for them. And you had to get the user to tell you which ports/IRQ numbers half of their hardware was configured at.
We have way better hardware abstractions in the OS today, which I would agree makes modern development easier overall.
Oh yes, that brings it all back. Having the user tell me the IRQ and also DMAs and... I'm sure there was more. Sometimes the user would end up having to open up the PC and reseat physical jumpers on hardware cards to have the cards use values that were both offered by the software and not clashing with anything else.
> Games are never hard realtime; missing frame deadlines reduces user experience but it doesn't break the game.
It depends on whether the game is multi-player and, if so, how it keeps the different players in sync with each other.
Games that rely on deterministic gameplay can desync (two players don't see the same world state) and abort if one player's simulation drops a frame while the other doesn't.
If you have multiplayer like this, you absolutely do not have a hard requirement on frame latency.
You don't control network latency spikes. Latency tolerance is a hard requirement, or you will have constant problems and likely be unplayable. Or custom networking and hardware stacks, which is deep into esoteric territory.
That's still just a degradation of user experience and not a fatal fault. Indeed, support for running in a desynced mode is written in because they know deadlines can be missed. In hard realtime, deadlines can't be missed. There's no recovery; it's a critical fault and you have to halt or fail over to a backup.
> That's still just a degradation of user experience and not a fatal fault.
I don't know how you'd describe a game spontaneously aborting as anything but a "fatal fault". Yes, it's not turning off someone's pacemaker, but within the scope of what a game is able to do, kicking the player back to the matchmaking screen in the middle of a game is about as fatal as it gets.
That's only going to happen if the client has such a massive network latency spike that the server thinks it's disconnected. At least half a second, possibly more. You're never going to get that kind of delay from synchronizing threads, unless the process totally deadlocks.
EDIT: Well, there is one situation where you might get delays like that: if the computer is so woefully inadequate to run the game that it consistently misses frame deadlines by several hundred milliseconds. Of course, in such a situation a different concurrent algorithm wouldn't have solved anything anyway.
Literally any multiplayer game will have a "fatal fault" if its connection to the other players and/or server is interrupted for long enough, but it seems disingenuous to describe a system tolerant of delays variable across several orders of magnitude, up to hundreds of milliseconds or even full seconds, as "hard real time" in the sense in which the term is generally understood.
What I don’t understand here is how missing a frame deadline in this situation (<1/60 of a second or 16ms) is intolerable when typical network latency varies constantly by much more than this. It seems to me that if the latency requirements were this strict you could only support LAN multiplayer.
The only scenario I can imagine is some interactive game installation at Disney/etc. But there'd still be no actual benefit over an architecture more lenient to timing deviations.
> Games that rely on deterministic gameplay can desync (two players don't see the same world state) and abort if one player's simulation drops a frame while the other doesn't.
Right. This is why lockstep deterministic (LSD) games are bound to the SLOWEST player's machine.
No LSD game in existence crashes the game if one player's machine falls behind. Instead you either pause the simulation for all players until they catch up or you slow down time itself.
Source: shipped RTS games that were lockstep deterministic.
If you want to support players with ping over 16ms, there's no way you're gonna be synchronizing input from one frame to the output of the same frame for another player; there'll necessarily be some latency hiding, at which point any freeze should be coverable via just treating it as a temporarily-high ping and thus temporarily more latency hiding.
Lockstep simulation can't be hard tied to frame rate. Two computers will never have precisely the same frame timing -- even if they are running identical video modes, clocks are only accurate to within a tolerance and will diverge. The simulation has to allow for simulation ticks running independently from rendering frames.
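The usual shape is a fixed-timestep accumulator loop (a generic sketch, not from this engine; running, simulate() and render() are placeholder names): simulation ticks consume fixed slices of real time, so two machines with different frame timing still execute identical tick sequences.

    #include <chrono>

    bool running = true;
    void simulate(std::chrono::steady_clock::duration) { /* one tick */ }
    void render() { /* one frame */ }

    void game_loop() {
        using clock = std::chrono::steady_clock;
        // One simulation tick = 1/60 s, independent of render rate.
        constexpr auto dt = std::chrono::duration_cast<clock::duration>(
            std::chrono::microseconds(16667));
        auto previous = clock::now();
        clock::duration accumulator{};
        while (running) {
            auto now = clock::now();
            accumulator += now - previous;
            previous = now;
            // Zero or more fixed ticks per rendered frame: a slow frame
            // catches up with extra ticks, a fast frame may tick zero
            // times, and every machine runs the same tick sequence.
            while (accumulator >= dt) {
                simulate(dt);
                accumulator -= dt;
            }
            render();
        }
    }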
Since you can't control the latency of individual packets, let alone guarantee identical hardware timing, this isn't practical as a requirement.
Deterministic gameplay means that after the same x frames with the same external inputs you end up with the same state y on different hardware. Therefore we only need to make sure the clients stay within a reasonable margin of each other. We can do this by waiting for all players' inputs, i.e. moving at the slowest possible speed, which is the lock-step method. Or we can change our measured frame time so that each tick (which still advances the same amount of sim time) is treated as faster or slower; that way we can track, from the inputs they send, how many frames the other clients have completed, and slow ourselves down or speed up to maintain a decent margin. This is the scheme used in rollback networking, since the naïve implementation where this isn't done forces all rollbacks onto the faster machine.
If your program's state tracking/management is so tightly coupled to the renderer that missing a frame can desync, then you have much bigger architectural problems anyway.
Network latency is a thing and your state management subsystem absolutely has to be fault tolerant.
Well, do we have a benchmark where the price of locking is unacceptable? I think the linked project is a fun theoretical exercise, but I personally don't accept the assertion that locking is unacceptable without seeing evidence.
The problem with vanilla std::deque is that the only acceptable implementation is in clang libc++. On MSVC it's basically an overcomplicated linked list.
When you protect an std::deque with a mutex you would need at least two atomic operations: to lock the queue before pushing, and to unlock the queue after pushing. Because you're using an std::deque it may need to allocate memory during a push, which would happen under the lock, which makes it more likely for a thread to suspend with the lock taken. While the queue is locked other threads will have to wait, possibly even suspend on a futex, and then the unlocking thread would have to wake another thread up.
The most expensive part of any mutex/futex is not locking, it's waking other threads up when the lock is contended. I'm actually surprised you only get 10 million messages per second; is that for a contended or an uncontended case? I would expect more, but it probably depends on the hardware a lot, and these numbers are hard to compare.
My actor framework currently uses a lockfree intrusive mailbox [1], which consists of exactly two atomic exchange operations, so pushing a node is probably cheaper than with a mutex. But the nicest part about it is how I found a way to make it "edge triggered". A currently unowned (empty) queue is locked by the first push (almost for free, compared to a classic intrusive MPSC queue [2] the second part of push uses an exchange instead of a store), which may start dequeueing nodes or schedule it to an executor. The mailbox will stay locked until it is drained completely, after which it is guaranteed that a concurrent (or some future) push will lock it. This enables very efficient wakeups (or even eliding them completely when performing symmetric transfer between actors).
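For reference, the producer side of the classic intrusive MPSC queue is just this (a sketch with my own names, not the framework's code; in the edge-triggered variant described above, the prev->next store below becomes an exchange, which is what lets a concurrent drain detect the last fully-linked node):

    #include <atomic>

    struct Node {
        std::atomic<Node*> next{nullptr};
    };

    // 'head' is the producer end; the single consumer walks the next
    // pointers from the other end. Returns true when this push made the
    // queue non-empty, i.e. the pusher just "locked" the mailbox and
    // should schedule the consumer.
    bool push(std::atomic<Node*>& head, Node* node) {
        node->next.store(nullptr, std::memory_order_relaxed);
        // One atomic exchange makes us the new head, no matter what the
        // other producers are doing.
        Node* prev = head.exchange(node, std::memory_order_acq_rel);
        if (prev == nullptr)
            return true;  // empty -> non-empty: wake/lock the mailbox
        // Link the old head to us so the consumer can traverse.
        prev->next.store(node, std::memory_order_release);
        return false;
    }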
I actually get ~10 million requests/s in a single-threaded uncontended case (that's at least one allocation per request and two actor context switches: a push into the target mailbox, and a push into the requester mailbox on the way back, plus a couple of steady_clock::now() calls when measuring latency of each request and checking for soft preemption during context switches). Even when heavily contended (thousands of actors call the same actor from multiple threads) I still get ~3 million requests/s. These numbers may vary depending on hardware though, so like I said it's hard to compare.
In conclusion, it very much depends on how lockfree queues are actually used and how they are implemented; they can be faster and more scalable than a mutex (a mutex is a lockfree data structure underneath anyway).
I'd agree with you that mutexes are better when protecting complex logic or data structures, however, because using lockfree interactions to make something "scalable" often makes the base performance so low that you'd need maybe thousands of cores to justify the resulting overhead.
The memory chunks it allocates to store entries are so small that it ends up allocating one chunk for every element for almost every type. The result is a lot of allocation overhead, pointer indirection and various other things that are bad for performance.
The lockfree scheduler certainly looks interesting (especially the linearizability of event broadcasts), but I was surprised to see benchmark results in the paper with a peak of 43500 messages/s for 12 pairs of actors (and 12 cores?), with a graph showing ~5000 messages/s for a single core, which is surprisingly low for that kind of benchmark. Unfortunately the engine requires Linux and, more importantly, x86 (due to asm instructions), so I wasn't able to replicate them yet, but I would expect at least ~1 million requests/s per pair of actors (e.g. with Erlang); otherwise the overhead is prohibitive.
The engine also focuses on message passing, but from experience that is very difficult to work with (state machines are hard, especially when working with multiple downstream actors), and at their core actors are more about isolating state without locks than about message passing. Swift actors did it right in my opinion: method calls instead of messages are not only easier to reason about, they also give additional hints to the runtime about when context may switch without involving a scheduler at all (any shared state is slow and inhibits scalability).
I actually wrote a header-only library recently (search for "coroactors" if you're interested) that implements something similar to Swift actors with C++20 coroutines, and I thought ~10 million requests/s (when uncontended) or ~1-3 million/s (when contended, depending on the scheduler) was still way too much overhead, especially when compared to normal method calls on shared state protected with mutexes. Coroutines tend to go viral (with more and more functions becoming "async" coroutines), and any non-trivial code base will have a lot of coroutine calls (or messages passed), so that overhead needs to be as low as possible; otherwise you'd spend more time task switching than doing useful work.
It says that it's actor-based, and sending messages to an actor is equivalent to running the actor function under a mutex, which reduces parallelism (in other words, you have N threads sending messages but only 1 thread running the actor code, so it's just as serialized as a mutex). So while it may technically be "fully lock-free", using actors means there is no parallelization improvement.
Not quite; with mutex-based actors, an interruption of the actor thread would cause that mutex (and thus that actor's code) to remain locked until the original thread resumes. No additional parallelism will allow that actor code to be "restarted" or "resumed", as the mutex owned by the interrupted thread remains locked.
This implementation relies heavily on restartable functions in order to allow another parallel thread to pick up and continue the work of an already in progress but otherwise interrupted actor. See page three of the (excellent) design document: https://github.com/eduard-permyakov/peredvizhnikov-engine/bl...
Thus while it might not strictly be "more parallel" (same number of actors), it does seem to be able to make better use of more parallelism for completing the same set of work.
> sending messages to an actor is equivalent to running the actor function under a mutex
Where does it say that? In my understanding actor model means message-passing with asynchronous execution. So quite the contrary, actor model allows N threads executing in parallel given N actors.
Whenever I read any press about actors or goroutines, they say the same thing about preventing races (and the need for explicit locking) through share-nothing concurrency.
It's easy to scatter the computations, but they never go on to explain how to gather them back up.
You're going to render one frame to the screen. Did multiple actors have a write-handle into the frame buffer? What about collisions, does each entity collide with itself, without sharing position information with other actors?
You're right that there is no parallelization improvement, but it does not reduce parallelism either; it's just a different (imo easier) way to think about concurrency.
Because it is easier for me to think about, it is easier for me to see where things will contend for the same resource, which actually helps me improve potential parallelism. Once you recognize a particular opportunity where SMP can speed things up, you can stray away from the actor model a bit and have multiple threads receiving on your message queue, or, if that isn't possible, you can just add more actors and split up the data better.
Does anyone have experience debugging/profiling highly contended critical sections under STM vs a more traditional mutex implementation? At the end of the day something has to mediate concurrent access to shared memory, there's no free lunch, and mutexes are so well optimized, profiled, and understood. I'm unclear whether the same applies to STM, where a transaction may need to be retried an unbounded(?!) number of times.
Yes. The mediator in this case is the scheduler, the one that actually calls the asynchronous block, potentially retrying it if it fails. In the OP's code there are atomic blocks, sequential execution of blocks, stateful blocks, etc. for ensuring singular access at a time.
The meat here is scheduler.cpp. It uses C++20 coroutines.
This is like async/await in other languages. The scheduler has a queue of work(coroutines) and a pool of threads(N>0) to execute those on.
In this case, messages containing the data are passed between units of work. No locks are required, at the expense of memory footprint.
Actually, for starvation-free STMs, transactions will be retried a _bounded_ number of times. One example is 2PLSF, but there are several others: https://zenodo.org/record/7886718
I don't have time to read through the impl, but the readme makes it sound like a classic distributed system between game threads, where patterns like retry-backoff will be common.
As cool as lock free sounds, in my opinion any amount of non-trivial atomics usage should be accompanied by a formal, ideally machine-checked, proof. Atomic orderings weaker than sequentially consistent are so difficult to get right. I've repeatedly seen code that gets it wrong, and the bugs that creates are the worst.
Where is the game demo? Also, to be considered a game engine nowadays, you need actual tools, exporters (Maya, 3DSMax, etc.), plus who knows what else (collaboration tooling, metrics, alerts).
Peredvizhniki [0] were a group of Russian realist painters from the 19th century. Among other things, they organized "traveling" art exhibitions to promote Russian art in the provinces. The name roughly translates as "The travelers".
The only thing that seems strange is “such commands as” used to describe a single command. I think a more common way of writing it would be “a command like.” But, I don’t think it is at all definitive, it could just be a native English speaker who went with a slightly odd phrasing.
I guess that my pun using "unreal" in a post about a game engine went under the radar and now I look like I can only speak English which is kinda flattering and worrying.
The Russian word can be broken into three sections. The first part is a prefix associated with a transition. The middle means movement and the bit at the end means a person that does that. So even though the group of artists with that name were translated as "the wanderers" or "the itinerants", the word literally means people who move around.
Can someone please expand on the significance of this achievement to someone used to shooting their foot off in C++ in a predominantly single threaded manner?
AFAIU an артель artelʼ usually did not engage in collective bargaining as such, so calling it a union is not really accurate; you might compare it to a guild, but I think there's no implication of a monopoly on a particular trade either. The most accurate description I can think of is perhaps a cooperative combined with a mutual insurance fund.
Fun fact - this was painted about 8 to 10 years after the abolition of serfdom in the Russian Empire. Coincidentally, it happened two years prior to the US Emancipation Proclamation. I've wondered from time to time how linked the two events were, if at all.
As an addendum - a fair portion of the people here should be familiar, in one way or another, with Ilya Repin's Reply of the Zaporozhian Cossacks (aka Zaporozhian Cossacks are Writing a Letter to the Turkish Sultan)
https://en.m.wikipedia.org/wiki/Reply_of_the_Zaporozhian_Cos...
This. There’s no demonstration of any game made with this “game engine”. The benchmarks are some matrix multiplication and unimpressive message passing.
There might be some cool data container ideas or primitives. Those could have useful applications. But it isn’t really a “game engine”. Nor does it seem like an interesting way to build one.
I wouldn't even call it clever. The Actor model trivializes avoiding locks, since you're letting everything act on stale information (if the actors process concurrently) and then just iterating through their intended actions against a singleton state and resolving those conflicts with sequential code.
If what you are building isn't an actual house, or an actual game engine, then how could anybody claim they had built that particular house, or that particular game engine?
If what you were working on, for the entire time you were working on it, was a completely different thing from a house or from a game engine, how could you claim to have built a house or a game engine? Putatively, what you were working on was an entirely different thing from the result.
Can you point at a house and say “I designed this house?” If the house didn’t have some kind of existence while you were even designing it, how could it be that you designed that, particular house?
I feel like it's a mistake to consider mobile games as being part of the same market as PC/console games.
Sure, they're both "games", but I don't know that they're competing for the same set of users: either people play one or the other, or the people that do cross over between both markets are probably playing PC/console games at home and mobile games on the bus or train.
And what about actual game sales and revenue through things like GamePass? It doesn't help your game if a million people bought a Nintendo Switch when 80% of them only buy Pokémon, Mario Kart and Zelda.
Pretty sure they already succeeded. Wasting effort supporting an advertising platform like windows is irrelevant to whether or not they successfully made a lock-free game engine.
It’s licensed under GPL3, where the developer suggests individually negotiating a license for commercial projects. I am not sure this is a good approach given the early stage of this project. Video game development is already commercially and technically risky enough.
I cannot see game developers lining up to even try out this unproven technology when they have no sense of what the eventual fees will be.
The source code of Peredvizhnikov Engine is freely available under the GPLv3 license. However, I may grant permission to use parts or all of the code under a different license on a case-by-case basis. Please inquire by e-mail.
Actors is one of the worst programming models possible. It's hard to observe actor-based systems and debug them. The protocol complexity (and complete protocol informality) usually brings much more trouble than any kind of locking/synchronization.
I’ve had good success with actor based models. When you say “protocol complexity”, I don’t understand what you mean.
Observation requires good logging, but this isn’t out of line for any complex system. Debugging (as in actual breakpoints in an IDE or post-mortem analysis) can be facilitated with stack tracking (this same problem occurs with async await patterns and is solved in a similar way).
The advantages and disadvantages exist, but I think it’s an extremely effective programming model for many use cases. Formal state-machine programming, that’s the worst model, unless you need it.
Actors tend to lead to bloated incomprehensible code. Even when you do want game objects to look like independent message passing actors. Faking it is always easier.
Actors are self defeating because they turn everything into a distributed system.
That's what I was thinking. I'm trying not to write Actors off because I'm not an expert on game dev and can imagine the possibility of an unusually large-scaled game. But yeah, if you're writing a typical game, I imagine you're best off doing it the typical way that's been optimized for.
Forgive me if it’s common knowledge, but what’s wrong with SDL2 in this case? Broadly speaking, I feel as though there is largely positive sentiment around the SDL project, no?
I have nothing against the SDL project. I have everything against the DirectMedia-style API of 1997. I get that people are actively developing with it and that the roadmap towards 3.0 is getting close, but my experience is that it's designed for legacy games, legacy rendering styles, legacy ABIs. No one is shipping games for the Dreamcast or PS2.
2/3rds of the library is dead code when working with Vulkan.
For what? Working with Vulkan? Or a windowing library? Personally I use a mix of glfw and native windowing on mobile, which is surprisingly simple for a gfx context window and this gets me to triangle.
SDL2 is about the same amount of boilerplate code only in SDL_Thing form.
Now, if I was writing a game that needed to be shipped on everything possible and I can’t afford Unity or Unreal, SDL is a viable choice. MonoGame, mine, and others have used it, it’s battle tested.
I just can’t look at SDL code anymore and say “this is the cleanest api” for anything outside of C99.
As with most SDL usage, it merely provides a rendering context and an input abstraction.
That's almost nothing; it's boilerplate you don't want to write. Which particular lib you grab that boilerplate from (SDL, GLFW, fucking GLEW, whatever) doesn't matter. SDL is a widely accepted library for doing so, and it's fine. If you want more complete input handling you'll be using the platform APIs directly.
Many, arguably most, custom game engines never internalize that boilerplate. Most engine codebases are much smaller and more specialized than the AAA players, and gain nothing from splitting their build into platform-specific layers when they could just link SDL.