macOS has done this in recent versions. Similarly, it will do all the virtual background and bokeh stuff for webcams itself, outside of the (typically horrific) implementations in video conferencing apps.
As others have noted, this is trivial for most macOS and iOS apps to opt in to.
Frankly, I imagine it's also available at the system level on Windows (and maybe Android and Linux), but probably only for applications that happen to be using certain audio frameworks/engines.
It doesn't seem to me that module-echo-cancel in PulseAudio completely meets the requirements here (it only takes one source), but it looks close, and in general it seems like the place where you would implement something like this.
I think module-null-sink and module-loopback could be used to create a virtual source that combines multiple sources, though the source/sink distinction makes my head spin. Or, more simply, would the loopback of whatever audio output device you use do the combination (and the same mixing) for you, assuming all audio plays through one output device (which is most likely)?
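For what it's worth, here is a rough sketch of the null-sink plus loopback idea, written as a small Python helper that only builds the `pactl load-module` commands and prints them rather than running anything. The source names `mic_a` and `mic_b` and the sink name `mix` are made up for illustration; real source names would come from `pactl list short sources`. I haven't checked whether the result is good enough as input for module-echo-cancel, so treat it as a starting point only.

```python
import shlex


def loopback_mix_commands(sources, mix_sink="mix"):
    """Build pactl commands that route several sources into one null sink.

    `sources` and `mix_sink` are hypothetical names chosen by the caller.
    Nothing is executed here; the commands are only returned.
    """
    # One null sink acts as the mixing point.
    cmds = [["pactl", "load-module", "module-null-sink", f"sink_name={mix_sink}"]]
    # One loopback per source copies that source's audio into the null sink.
    for src in sources:
        cmds.append(
            ["pactl", "load-module", "module-loopback",
             f"source={src}", f"sink={mix_sink}"]
        )
    return cmds


if __name__ == "__main__":
    # Print the commands so they can be reviewed before running them by hand.
    for cmd in loopback_mix_commands(["mic_a", "mic_b"]):
        print(shlex.join(cmd))
```

PulseAudio exposes a monitor source for every sink, so the mixed signal should show up as `mix.monitor`, which is the "virtual source" part of the idea.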
Which operating systems do this?