Sure, and that's why there's more and more "trusted" hardware trying to get computers to a place where their users can't read or write their own memory.
Those kinds of things tend to be their own undoing.
You added a security processor to your hardware at ring -2, but hardware vendors are notoriously bad at software, so it has an exploit that the device owner can use to get code running at ring -2. Congrats, your ring 0 anti-cheat kernel module has just been defeated by the attacker's code running on your "trusted" hardware.
But in the meantime you've now exposed the normal user who isn't trying to cheat to the possibility of ring -2 malware, which is why all of that nonsense needs to be destroyed with fire.
The IOMMU gives a PCIe device access to whatever range of memory it's assigned. That doesn't prevent it from being assigned memory within the address space of a process, which can even be the common case, because that's what allows for zero-copy I/O. Both network cards and GPUs do that.
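To make that concrete, here's a minimal userspace sketch using Linux's VFIO type1 IOMMU interface to hand a device DMA access to an ordinary user buffer. This is only an illustration: the IOMMU group number "26" is made up, the buffer size is arbitrary, and all the device setup and error handling a real program needs is omitted.

```c
/* Sketch only: map a 1 MiB buffer from this process's address space into a
 * device's IOMMU view via VFIO type1. Group "26" is hypothetical; real code
 * needs to check API versions, group status, and every return value. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

int main(void)
{
    int container = open("/dev/vfio/vfio", O_RDWR);
    int group = open("/dev/vfio/26", O_RDWR);   /* hypothetical IOMMU group */

    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

    /* Ordinary anonymous user memory -- part of this process's address space. */
    void *buf = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (uint64_t)(uintptr_t)buf,
        .iova  = 0,            /* address the device will use for this buffer */
        .size  = 1 << 20,
    };
    /* After this succeeds, the device can DMA straight into buf -- zero-copy. */
    return ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
}
```

The point being: from the IOMMU's perspective there's nothing special about "process memory" -- it's just pages the OS chose to expose to the device.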
An even better example might be virtual memory. Some memory page gets swapped out or back in, so the storage controller is going to do DMA to that page. This could be basically any memory page on the machine. And that's just the super common one.
We already have enterprise GPUs with CPU cores attached to them. This is currently using custom interconnects, but as that comes down to consumer systems it's plausibly going to be something like a PCIe GPU with a medium core count CPU on it with unified access to the GPU's VRAM. Meanwhile the system still has the normal CPU with its normal memory, so you now have a NUMA system where one of the nodes goes over the PCIe bus and they both need full access to the other's memory because any given process could be scheduled on either processor.
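If that sounds abstract: existing NUMA systems already behave this way, just with a faster interconnect. Here's a rough sketch using libnuma (link with -lnuma); the node numbers are purely illustrative, with node 1 standing in for the hypothetical GPU-side memory.

```c
/* Sketch, assuming libnuma is installed. Node 0 plays the host CPU's DRAM,
 * node 1 plays the memory attached to the other processor. */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0 || numa_max_node() < 1)
        return 1;   /* need at least two NUMA nodes for the demo */

    /* Allocate a buffer backed by node 1's memory... */
    size_t len = 1 << 20;
    char *buf = numa_alloc_onnode(len, 1);

    /* ...then run on node 0's CPUs. The access below crosses the interconnect
     * (PCIe in the scenario above), but it's still just an ordinary store into
     * this process's address space -- no driver involvement per access. */
    numa_run_on_node(0);
    memset(buf, 0xAB, len);
    printf("touched %zu bytes of remote-node memory\n", len);

    numa_free(buf, len);
    return 0;
}
```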
We haven't even gotten into exotic hardware that wants to do some kind of shared memory clustering between machines, or cache cards (something like Optane) which are PCIe cards that can be used as system memory via DMA, or dedicated security processors intended to scan memory for malware etc.
There are lots of reasons for PCIe devices to have arbitrary physical memory access.
I feel like in pretty much every case here they still do not need arbitrary access. The point of DMA cheating is to make zero modifications to the target computer. The moment a driver needs to be used to, say, allow an IOMMU range for a given device, the target computer has been tainted and you lose much of the benefit of DMA in the first place.
Does a GPU need access to the memory of a usermode application for some reason? Okay, then the GPU driver should orchestrate that.
> We haven't even gotten into exotic hardware that wants to do some kind of shared memory clustering between machines, or cache cards (something like Optane) which are PCIe cards that can be used as system memory via DMA, or dedicated security processors intended to scan memory for malware etc.
Again, opt-in. The driver should specify explicit ranges when initializing the device.
> I feel like in pretty much every case here they still do not need arbitrary access.
Several of those cases do indeed need arbitrary access.
> The moment a driver needs to be used to, say, allow an IOMMU range for a given device, the target computer has been tainted and you lose much of the benefit of DMA in the first place.
The premise there being that the device is doing something suspicious rather than the same thing that device would ordinarily do if it was present in the machine for innocuous reasons.
> Does a GPU need access to the memory of a usermode application for some reason? Okay, then the GPU driver should orchestrate that.
Okay, so the GPU has some CPU cores on it and if the usermode application is scheduled on any of those cores -- or could be scheduled on any of them -- then it will need access to that application's entire address space. Which is what happens by default, since they're ordinary CPU cores that just happen to be on the other side of a PCIe bus.
> Again, opt-in. The driver should specify explicit ranges when initializing the device.
What ranges? The security processor is intended to scan every last memory page. The cache card is storing arbitrary memory pages on itself and would need access to arbitrary others because any given page could be transferred to or from the cache at any time. The cluster card is presenting the entire cluster's combined memory as a single address space to every node and managing which pages are stored on which node.
And just to reiterate, it doesn't have to be anything exotic. The storage controller in a common machine is going to do DMA to arbitrary memory pages for swap.
Re everything above the quote below: you're naming esoteric reasons for allowing unfettered access to physical memory. That's fine, but what percent of players of X game are going to have such a setup in their computer? Not enough that detecting such a setup and refusing to let you on a server would be a problem.
> And just to reiterate, it doesn't have to be anything exotic. The storage controller in a common machine is going to do DMA to arbitrary memory pages for swap.
I'd like a source for that if you have one. I'd be very surprised if modern IOMMU implementations with paging need arbitrary access. The CPU / OS could presumably modify the IOMMU entries prior to the DMA swap. The OS is still the one initiating a DMA transaction.
> That's fine, but what percent of players of X game are going to have such a setup in their computer?
If the "put some CPU cores on the GPU" thing becomes popular, probably a lot.
> I'd like a source for that if you have one. I'd be very surprised if modern IOMMU implementations with paging need arbitrary access. The CPU / OS could presumably modify the IOMMU entries prior to the DMA swap. The OS is still the one initiating a DMA transaction.
Traditional paging implementations didn't use the IOMMU at all -- a lot of machines don't even physically have one, and even on the ones that do, that doesn't mean the OS is using it for that. It might end up going through the IOMMU if, say, the storage controller is passed through to a VM guest and the host uses the IOMMU to map the controller's DMA to the memory pages corresponding to what the guest perceives as its physical memory, or things along those lines.
But remapping the pages for each access, even if theoretically possible, would be pretty expensive. Page table operations aren't cheap and have significant synchronization overhead, and swapping a page that way would require you to map the page and then almost immediately do another operation to unmap it again -- for each 4 kB page, since they're unlikely to be contiguous. You can do the math on how many page table operations that would add if you were swapping in, say, 500 MiB, which a modern SSD could otherwise do in tens of milliseconds. Notice in particular that an OS doing this would score lower in benchmarks. And this applies not just to swapping as a result of being out of memory, but to ordinary file accesses, which are really swaps to and from the page cache.
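For a sense of what "remap around every transfer" would even look like, here's a rough kernel-side sketch using the Linux DMA mapping API. The function name, 'dev', and 'pages' are made up for illustration, and real drivers batch work into scatter-gather lists precisely because per-page map/unmap is expensive.

```c
/* Kernel-side sketch only (not a working module): mapping and unmapping each
 * 4 KiB page around a single swap-in, one page at a time. 'dev' and 'pages'
 * are assumed to come from the (hypothetical) driver. */
#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/mm.h>

static int swap_in_pages(struct device *dev, struct page **pages, int nr)
{
    int i;

    for (i = 0; i < nr; i++) {
        /* One IOMMU/page-table update to expose the page to the device... */
        dma_addr_t addr = dma_map_page(dev, pages[i], 0, PAGE_SIZE,
                                       DMA_FROM_DEVICE);
        if (dma_mapping_error(dev, addr))
            return -EIO;

        /* ...tell the controller to DMA this one page and wait for it... */

        /* ...and another update to take it away again. */
        dma_unmap_page(dev, addr, PAGE_SIZE, DMA_FROM_DEVICE);
    }
    return 0;
}
```

Rough math, assuming 4 KiB pages: 500 MiB is 128,000 pages, so mapping plus unmapping each one is on the order of 256,000 IOMMU page-table updates for a transfer the SSD itself could finish in tens of milliseconds.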
You could also run into trouble if you tried to do that because the IOMMU may only support a finite number of mappings, or have performance issues if you create too many. Then you get a slow device with too many pending I/O operations and the whole system locks up.
And even if you paid the cost, what have you bought? The OS could still give a device access to any given memory page for legitimate reasons and you have no way to know if the reason was the legitimate one or the user arranged for those circumstances to exist so they could access the page.