This has nothing to do with custom hardware; it's one of the few differences that's actually due to ARM vs. x86, though part of it comes from being a unified-memory SoC.
I heard in an interview with Craig Federighi that, in designing the Apple Silicon chips, they looked at specific operations that were frequently used by their OS and software and optimized those on their chips. Reference counting was one of those operations, and that is why it is so much faster on Apple Silicon. Sure, unified memory and the like speed up all kinds of operations, but they really focused on a few specific ones so that their hardware and software work well together.
Choosing to use ARM and unified memory is part of designing the SoC, though. There is no magic in the refcount machinery; you can disassemble it and look at it.
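For anyone wondering what the refcount machinery boils down to, here is a minimal sketch in C11 of the usual atomic retain/release pattern. This is an illustration under stated assumptions, not Apple's runtime code: `object_t`, `object_init`, `object_retain`, and `object_release` are made-up names, and a real runtime does more bookkeeping than a single counter. The point is only that the hot path is an atomic increment or decrement on one word.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical object header: a single atomic counter (illustration only). */
typedef struct {
    atomic_size_t refcount;
} object_t;

/* A freshly created object starts with one owner. */
void object_init(object_t *obj) {
    assert(obj != NULL);
    atomic_init(&obj->refcount, 1);
}

/* Retain: an atomic increment. Relaxed ordering is enough here because
   taking another reference does not publish any data to other threads. */
void object_retain(object_t *obj) {
    assert(obj != NULL);
    atomic_fetch_add_explicit(&obj->refcount, 1, memory_order_relaxed);
}

/* Release: an atomic decrement. Returns true if the caller dropped the
   last reference and should now destroy the object. */
bool object_release(object_t *obj) {
    assert(obj != NULL);
    /* fetch_sub returns the value held *before* the decrement. */
    if (atomic_fetch_sub_explicit(&obj->refcount, 1,
                                  memory_order_release) == 1) {
        /* Make earlier writes from other owners visible before teardown. */
        atomic_thread_fence(memory_order_acquire);
        return true;
    }
    return false;
}
```

Compiling this with something like `cc -O2 -S` for each target and reading the output is an easy way to compare what the two architectures actually emit for the same retain and release.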