Minor note: RISC-V's Zba extension (included in RVA22, which afaik is set to be the baseline requirement for Android) has sh2add, which computes x+y*4 in a single instruction and gets that store down to two instructions (testable with -march=rv64gczba). That doesn't affect the overall point about the instruction fusion approach needlessly clobbering registers, though. (And shouldn't indexed loads be even worse, with three potential output registers in baseline RV64GC, and still two with Zba?)
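For anyone who wants to check this themselves, here's a minimal sketch of the two cases. The function names are made up for illustration, and the assembly in the comments is what I'd expect from an optimizing compiler, not captured output; exact register choices will vary.

```c
#include <stddef.h>

/* Indexed store of a 4-byte element: base[idx] = val.
 * Baseline RV64GC typically needs three instructions:
 *     slli t0, a1, 2      # idx * 4
 *     add  t0, t0, a0     # base + idx * 4
 *     sw   a2, 0(t0)
 * With Zba the shift and add collapse into one:
 *     sh2add t0, a1, a0   # base + (idx << 2)
 *     sw     a2, 0(t0)
 */
void store_indexed(int *base, size_t idx, int val)
{
    base[idx] = val;
}

/* Indexed load of a 4-byte element: return base[idx].
 * Every instruction here has a destination register, so a fused
 * sequence has three potential outputs in baseline RV64GC
 * (slli, add, lw) and two with Zba (sh2add, lw), unless the
 * compiler reuses the same register for all of them.
 */
int load_indexed(const int *base, size_t idx)
{
    return base[idx];
}
```

Compiling with something like -O2 -S for a riscv64 target, once with -march=rv64gc and once with -march=rv64gczba, should show the difference.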
That's the same approach that gets you from 2 clobbered registers to 1 for stores, no? I.e. usually possible, but it needs compiler participation; if we assume the compiler will optimize load register usage properly, it shouldn't have any problem reducing the store clobbers from 2 to 1 either, which I understood to be the difficult part. (Of course 1 clobber is still worse than 0, but at least it means no fused instruction ends up with 2 writes, at least here.)