
"I've seen this repeated a lot, but have not had quite the same experience with "permanent" performance degradation. Especially if I eventually expand the pool with another vdev. Not sure about ZFSonLinux"

Look, I'll admit that we haven't done a lot of scientific comparisons between healthy pools and presumed-wrecked-but-back-below-80-percent pools ... but I know what I saw.

I think if you break the 90% barrier and either: a) get back below it quickly, or b) don't do much on the filesystem while it's above 90%, you'll probably be just fine once you get back below 90%. However, if you've got a busy, busy, churning filesystem, and you grow above 90% and you keep on churning it while above 90%, your performance problems will continue once you go back below, presuming the workload is constant.

Which makes sense ... and, anecdotally, is the same behavior we saw with UFS2 when we tunefs'd minfree down to 0% and ran on that for a while ... freeing up space and setting minfree back to 5-6% didn't make things go back to normal ...

I am receptive to the idea that a ZIL solves this. I don't know if it does or not.




The magic threshold is 96% per metaslab. LBA weighting (which can be disabled with a kernel module parameter or its equivalent on your platform) causes metaslabs toward the front of the disk to hit this earlier. LBA weighting is great for getting maximum bandwidth out of spinning disks. It is not so great once the pool is near full. I wrote a patch, now in ZoL, that disables it by default on solid-state vdevs, where it has no benefit.

That being said, since rsync.net makes heavy use of snapshots, the snapshots would naturally keep allocations pinned in the metaslabs toward the front of the disks. That would make it a pain to get those metaslabs back below the 96% threshold. If you are okay with diminished bandwidth when the pool is empty (assuming spinning disks are used), turn off LBA weighting and the problem should become more manageable.
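
For the curious, a minimal sketch of what turning that off looks like on ZFS on Linux, assuming the toggle is exposed as the metaslab_lba_weighting_enabled module parameter (the name and path may differ on other platforms; check your module-parameter documentation):

    # Check the current setting (1 = LBA weighting enabled)
    cat /sys/module/zfs/parameters/metaslab_lba_weighting_enabled

    # Disable it at runtime
    echo 0 > /sys/module/zfs/parameters/metaslab_lba_weighting_enabled

    # Make the change persistent across reboots
    echo "options zfs metaslab_lba_weighting_enabled=0" >> /etc/modprobe.d/zfs.conf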

In any case, getting data on the metaslabs from `zdb -mmm tank` would be helpful in diagnosing this.


You really shouldn't run non-CoW file systems above 90% either, including UFS and ext.


Agreed. I don't think anyone is arguing that you should do it.

What I believe, and what I think others have also concluded, is that it shouldn't be fatal. That is, when the dust has settled and you trim down usage and have a decent maintenance outage, you should be able to defrag the filesystem and get back to normal.

That's not possible with ZFS because there is no defrag utility ... and I have had it explained to me in other HN threads (although not convincingly) that it might not be possible to build a proper defrag utility.


My understanding is that the way to defrag ZFS is to do a send and receive. Combined with incremental snapshotting, this should actually be realistic with almost no downtime for most environments.
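
Roughly, and with purely illustrative names (tank/data is the fragmented dataset, tank/data_new the freshly written copy; the destination could just as well live in a different pool):

    # Initial full copy while tank/data stays in service
    zfs snapshot tank/data@defrag-1
    zfs send tank/data@defrag-1 | zfs recv -u tank/data_new

    # Later: send only what changed since the first snapshot
    zfs snapshot tank/data@defrag-2
    zfs send -i @defrag-1 tank/data@defrag-2 | zfs recv tank/data_new

    # Repeat the incremental step until the remaining delta is small
    # enough that the final catch-up fits inside a short write pause.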

Doing so requires that you have enough zfs filesystems in your pool (or enough independent pools) that you have the free space to temporarily have two copies of the filesystem.


"Doing so requires that you have enough zfs filesystems in your pool (or enough independent pools) that you have the free space to temporarily have two copies of the filesystem."

Yes, and that is why I did not mention recreating the pool as a solution. If your pool is big enough or expensive enough, that's still "fatal".


You ought to define what is fatal here. The worst that I have seen reported at 90% full is a factor-of-2 slowdown on sequential reads off mechanical disks, which is acceptable to most people. Around that point, sequential writes should also suffer similarly, since writes start going to the innermost tracks.


(1) I'm not proposing recreating the pool - I'm proposing an approach to incrementally fixing the pool in an entirely online manner.

(2) If your pool is big enough/expensive enough, surely you've also budgeted for backups.


(1) Regardless of what you call it, it means having enough pool capacity somewhere else to zfs send the entire (90% full) affected zpool off to ... that might be impossible or prohibitively expensive depending on the size of the zpool.

(2) This has nothing to do with backups or data security in any way - it's about data availability (given a specific performance requirement).

You're not going to restore your backups to an unusable pool - you're going to build or buy a new pool and that's not something people expect to have to do just because they hit 90% and churned on it for a while.


You can send/receive to the same zpool and still defrag. With careful thought, this can be done incrementally and with very minimal availability implications.
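
Sketching the same-pool variant (dataset names illustrative; the rewritten copy is assumed to have been built with an initial send plus incremental catch-ups, as described upthread):

    # During a short write pause: send the final delta, then swap names
    # (@last-sync is whatever snapshot the rewritten copy already has)
    zfs snapshot tank/data@final
    zfs send -i @last-sync tank/data@final | zfs recv tank/data_rewrite
    zfs rename tank/data tank/data_old
    zfs rename tank/data_rewrite tank/data

    # Reclaim the fragmented copy once you are happy with the result
    zfs destroy -r tank/data_old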

I agree it's not ideal to have filesystems do this, but it also simplifies a lot of engineering. And I think direct user exposure to a filesystem with a POSIX-like interface is a paradigm mostly on the way out anyway, meaning it's increasingly feasible to design systems to not exceed a safe utilization threshold.


This does work.


On UNIX, there are two defragmentation utilities:

`tar` and `zfs send | zfs recv`.
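
(The tar form being the classic copy-out-and-back idiom, sketched here with illustrative paths:)

    # Rewrite files into freshly allocated blocks by copying them out and back
    mkdir -p /tank/data_rewritten
    tar cf - -C /tank/data . | tar xpf - -C /tank/data_rewritten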



