The 6502 derivative in the PC Engine/TurboGrafx-16 games console of the late '80s was enhanced with block move instructions (MVI, MVN?) that worked similarly, I think. (The CPU was the HuC6280.)
Transfer Alternate Increment (TAI), Transfer Increment Alternate (TIA), Transfer Decrement Decrement (TDD), Transfer Increment Increment (TII)
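To make the four names concrete, here is a rough Python sketch of the address patterns they imply, as I understand them: "increment" and "decrement" step the address on every byte, and "alternate" flips between an address and the one after it. The function name `block_transfer` and the flat `bytearray` memory are made up for illustration; this is not cycle-accurate and ignores the HuC6280's banking.

```python
def block_transfer(mem, mode, src, dst, length):
    """Illustrative model of the HuC6280 block-transfer address patterns.

    mem is a bytearray standing in for the 16-bit address space; the
    caller is assumed to pass addresses that fit inside it.
    """
    for i in range(length):
        alt = i & 1  # 0, 1, 0, 1, ... for the alternating side
        if mode == "TII":    # source increments, destination increments
            s, d = src + i, dst + i
        elif mode == "TDD":  # source decrements, destination decrements
            s, d = src - i, dst - i
        elif mode == "TIA":  # source increments, destination alternates
            s, d = src + i, dst + alt
        elif mode == "TAI":  # source alternates, destination increments
            s, d = src + alt, dst + i
        else:
            raise ValueError(f"unknown mode: {mode}")
        # Wrap to 16 bits, as a 16-bit address bus would.
        mem[d & 0xFFFF] = mem[s & 0xFFFF]
    return mem
```

The alternating modes make sense when one side is a 16-bit hardware port: TIA streams a buffer into a pair of adjacent registers, and TAI reads such a pair out into a buffer.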
For contrast, the 80286, five years older, already did 'rep movsw' at, afaik, 2 cycles per byte. Six years later the Pentium did 'rep movsd' at 4 bytes per cycle. Nowadays Cannon Lake can do 'rep movsb' a full cache line at a time, at full cache/memory controller speed.
Well, the Z80 was worse: 21 cycles per byte for its block copy (LDIR)! The reason is that instead of running a loop in microcode, it decremented PC by 2 and then fetched the instruction again for every byte.
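A small Python model of that behaviour, as a sketch rather than an emulator: the function name `ldir` and the tuple it returns are my own choices, and flags, interrupts and memory refresh are ignored. The timings are the documented ones: 21 T-states for each pass that repeats, 16 for the final pass.

```python
def ldir(mem, hl, de, bc, pc):
    """Toy model of Z80 LDIR. Returns (hl, de, bc, pc, t_states)."""
    t_states = 0
    while True:
        # The two-byte opcode (ED B0) is fetched from pc on every pass.
        pc = (pc + 2) & 0xFFFF
        mem[de] = mem[hl]          # copy one byte from (HL) to (DE)
        hl = (hl + 1) & 0xFFFF
        de = (de + 1) & 0xFFFF
        bc = (bc - 1) & 0xFFFF
        if bc != 0:
            # Not done: rewind PC so the same instruction is fetched again.
            pc = (pc - 2) & 0xFFFF
            t_states += 21
        else:
            # Last byte: PC stays past the instruction.
            t_states += 16
            return hl, de, bc, pc, t_states
```

Copying n bytes therefore costs 21 * (n - 1) + 16 T-states, and because BC is decremented before it is tested, starting with BC = 0 copies 65536 bytes.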