Seems like there's a big I/O performance difference. Maybe it's something simple, like buffering being set up differently (or not at all), or sync vs. async I/O. It'd be interesting to dig in more.
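To make the buffering guess concrete, here's a rough Python sketch (not taken from the code under discussion). `CountingSink` and `count_raw_writes` are made-up names; the sink just stands in for a file descriptor and counts how many low-level writes reach it, with and without an `io.BufferedWriter` in front.

```python
import io


class CountingSink(io.RawIOBase):
    """Hypothetical stand-in for a raw file: counts low-level write calls."""

    def __init__(self):
        super().__init__()
        self.calls = 0
        self.nbytes = 0

    def writable(self):
        return True

    def write(self, b):
        # Each call here corresponds to what would be one write syscall.
        n = memoryview(b).nbytes
        self.calls += 1
        self.nbytes += n
        return n


def count_raw_writes(chunks, buffer_size=None):
    """Write chunks to the sink, optionally through a buffer.

    Returns (number of low-level writes, total bytes written).
    """
    sink = CountingSink()
    if buffer_size is None:
        out = sink  # unbuffered: every chunk hits the sink directly
    else:
        out = io.BufferedWriter(sink, buffer_size=buffer_size)
    for chunk in chunks:
        out.write(chunk)
    out.flush()
    return sink.calls, sink.nbytes


if __name__ == "__main__":
    chunks = [b"x" * 10] * 1000
    print("unbuffered:", count_raw_writes(chunks))
    print("buffered:  ", count_raw_writes(chunks, buffer_size=4096))
```

Same bytes either way, but the unbuffered path makes one low-level write per chunk while the buffered path makes only a handful. If two implementations differ like this, that alone could explain a large gap.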
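Even something like mmap can sometimes improve performance a lot, since your program just touches memory and the kernel writes dirty pages back in the background, instead of your code making a blocking read/write call for every chunk (so it doesn't block as much or as easily on I/O, though a page fault can still stall you).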
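For anyone who hasn't used it, a minimal sketch of writing through a mapping in Python looks roughly like this. `write_via_mmap` is a hypothetical helper, and this only shows the mechanism, not whether it's faster for the workload in question.

```python
import mmap


def write_via_mmap(path, payload):
    """Write payload to path through a memory mapping (hypothetical helper)."""
    if not payload:
        raise ValueError("payload must be non-empty; an empty file cannot be mapped")

    # The file must already have its final size before it is mapped.
    with open(path, "wb") as f:
        f.truncate(len(payload))

    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), len(payload)) as m:
            # This is a memory copy; the kernel writes the dirty pages
            # back to the file on its own schedule.
            m[:] = payload
            # Explicitly ask for write-back so the data is on its way
            # to disk before the mapping is closed.
            m.flush()
```

Whether this beats plain buffered writes depends on access pattern and platform, so it would need measuring rather than assuming.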
We don't know that it's really I/O, as in "pushing some bytes to the system". All we know is that the author saw a hot method called "write" and stopped the analysis there. It might well be something like messing around with character encodings to get those bytes in the first place.
Thinking about this more, we know that the author saw a hot "write" method in the profile for the fast run and doesn't have a profile for the slow run. The slow versions could be spending most of their time in a completely different place.
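One cheap way to check would be to time the "get the bytes" step separately from the "push the bytes" step. A rough Python sketch, assuming the work can be split that way; `time_encode_and_write` is a made-up name, and the sink is whatever binary file object you pass in:

```python
import io
import time


def time_encode_and_write(lines, sink, encoding="utf-8"):
    """Time encoding and writing separately (hypothetical helper).

    `sink` is any binary file-like object with a write() method.
    """
    t0 = time.perf_counter()
    encoded = [line.encode(encoding) for line in lines]  # text -> bytes
    t1 = time.perf_counter()

    total = 0
    for chunk in encoded:
        sink.write(chunk)  # bytes -> sink
        total += len(chunk)
    t2 = time.perf_counter()

    return {"encode_s": t1 - t0, "write_s": t2 - t1, "bytes": total}


if __name__ == "__main__":
    lines = ["naïve café\n"] * 100_000
    # BytesIO keeps this self-contained; swap in a real file opened
    # with "wb" to measure actual I/O.
    print(time_encode_and_write(lines, io.BytesIO()))
```

If `encode_s` dominates on the slow version, the hot "write" in the fast profile isn't telling us much about where the slow one spends its time.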