x86 is really the oddity here -- it has its no-explicit-cache-maintenance design because of wanting to maintain backwards-compatibility with self-modifying code that was written for x86 cores that had no caches at all. Almost all other architectures have explicit cache maintenance because it's more efficient (and requires less hardware), at the minor cost of requiring the very few bits of software which do odd things like JITting to explicitly tell the CPU what they're doing.
An instruction for flushing an entire region would potentially have a very long execution time, which is awkward because you would want to be able to interrupt and resume it. So it would need "how far have I got" state stored somewhere. The obvious observation from a RISC-architecture point of view is that you can get the equivalent effect without the pain of making a long-running interruptible instruction, by having an "invalidate one cache line" instruction plus an explicit loop in the code, and that's what most architectures do.
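To make the "one cache line plus an explicit loop" idea concrete, here is a minimal sketch in C. The per-line operation is passed in as a callback (`line_op_fn`, a hypothetical name) because the real instruction differs per architecture; `for_each_cache_line` and `count_line` are likewise made up for illustration, and the line size is assumed to be a power of two. An interrupt can be taken between any two iterations, and the loop variable is all the "how far have I got" state that's needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-line maintenance hook. On a real target this would wrap
   the architecture's single-line clean/invalidate instruction. */
typedef void (*line_op_fn)(uintptr_t line_addr, void *ctx);

/* Apply `op` once to every cache line overlapping [start, start + len).
   `line_size` is assumed to be a power of two.
   Returns the number of lines visited. */
static size_t for_each_cache_line(uintptr_t start, size_t len,
                                  size_t line_size, line_op_fn op, void *ctx)
{
    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
    if (len == 0)
        return 0;

    /* Round the first and last byte addresses down to their line start. */
    uintptr_t addr = start & ~(uintptr_t)(line_size - 1);
    uintptr_t last = (start + len - 1) & ~(uintptr_t)(line_size - 1);
    size_t visited = 0;

    for (;;) {
        op(addr, ctx);          /* one short, non-interruptible operation */
        visited++;
        if (addr == last)
            break;
        addr += line_size;      /* progress is just an ordinary register */
    }
    return visited;
}

/* Stand-in operation for experimenting on a host machine: counts calls. */
static void count_line(uintptr_t line_addr, void *ctx)
{
    (void)line_addr;
    (*(size_t *)ctx)++;
}
```

In practice you'd rarely write this by hand in user code; GCC and Clang provide `__builtin___clear_cache(begin, end)`, which does the appropriate thing for the target.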
Actually, Intel explicitly broke backwards-compatibility starting with the Pentium, by adding the hardware to make SMC work without additional effort. The 486 and below needed an explicit branch to flush the prefetch queue, and this effect has been exploited for various anti-debugging tricks and even this amazing 8088-only optimisation:
An instruction for flushing an entire region would potentially have a very long execution time, which is awkward because you would want to be able to interrupt and resume it. So it would need "how far have I got" state stored somewhere.
x86 has the REP prefix for this purpose. Used with certain string instructions, it checks a count register (CX/ECX/RCX); if the count is nonzero, it executes the instruction once and decrements the count. The earlier implementations simply didn't update the instruction pointer while iterations remained, so the CPU would repeatedly fetch and execute the same instruction, and it's interruptible between each step. The register always holds how many iterations remain, so that is the "how far have I got" state. Once the count reaches zero, the instruction pointer moves on to the next instruction. Modern x86 handles this by generating uops in the decoder instead, but the basic functionality is the same.
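Here is a toy model in C of why that makes the instruction resumable. It is only a sketch of the semantics, not how any CPU implements it: `rep_state` and `rep_movsb_run` are made-up names, the `budget` parameter stands in for "an interrupt arrives after this many iterations", and a forward copy (direction flag clear) is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the architectural state a REP MOVSB-style copy keeps in
   registers: a remaining count plus source and destination pointers. */
struct rep_state {
    size_t count;               /* plays the role of CX/ECX/RCX */
    const unsigned char *src;   /* plays the role of SI/ESI/RSI */
    unsigned char *dst;         /* plays the role of DI/EDI/RDI */
};

/* Run at most `budget` iterations, then return as if an interrupt arrived.
   Returns 1 if the copy has finished, 0 if it needs to be resumed. */
static int rep_movsb_run(struct rep_state *s, size_t budget)
{
    assert(s != NULL);
    while (s->count != 0) {
        if (budget == 0)
            return 0;           /* "interrupted": all progress lives in *s */
        *s->dst++ = *s->src++;  /* one iteration of the string instruction */
        s->count--;
        budget--;
    }
    return 1;                   /* count hit zero: fall through to next insn */
}
```

Calling `rep_movsb_run` again with the same `rep_state` picks up exactly where it stopped, which is the same thing that happens when the interrupt handler returns to the address of the REP-prefixed instruction.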