
"I've seen this repeated a lot, but have not had quite the same experience with "permanent" performance degradation. Especially if I eventually expand the pool with another vdev. Not sure about ZFSonLinux"

Look, I'll admit that we haven't done a lot of scientific comparisons between healthy pools and presumed-wrecked-but-back-below-80-percent pools ... but I know what I saw.

I think if you break the 90% barrier and either: a) get back below it quickly, or b) don't do much on the filesystem while it's above 90%, you'll probably be just fine once you get back below 90%. However, if you've got a busy, busy, churning filesystem, and you grow above 90% and you keep on churning it while above 90%, your performance problems will continue once you go back below, presuming the workload is constant.

Which makes sense ... and, anecdotally, is the same behavior we saw with UFS2 when we tunefs'd minfree down to 0% and ran on that for a while ... freeing up space and setting minfree back to 5-6% didn't make things go back to normal ...

I am receptive to the idea that a ZIL solves this. I don't know if it does or not.




The magic threshold is 96% per metaslab. LBA weighting (which can be disabled with a kernel module parameter or its equivalent on your platform) causes metaslabs toward the front of the disk to hit this earlier. LBA weighting is great for getting maximum bandwidth out of spinning disks. It is not so great once the pool is near full. I wrote a patch, now in ZoL, that disables it by default on solid-state vdevs, where it has no benefit.

That being said, since rsync.net makes heavy use of snapshots, the snapshots would naturally keep allocations pinned in the metaslabs toward the front of the disks. That would make it a pain to get those metaslabs back below the 96% threshold. If you are okay with diminished bandwidth when the pool is empty (assuming spinning disks are used), turn off LBA weighting and the problem should become more manageable.
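
For the curious, a minimal sketch of what turning that off looks like on ZFS on Linux, assuming the toggle is exposed as the metaslab_lba_weighting_enabled module parameter (the name and path may differ on other platforms; check your module-parameter documentation):

    # Check the current setting (1 = LBA weighting enabled)
    cat /sys/module/zfs/parameters/metaslab_lba_weighting_enabled

    # Disable it at runtime
    echo 0 > /sys/module/zfs/parameters/metaslab_lba_weighting_enabled

    # Make the change persistent across reboots
    echo "options zfs metaslab_lba_weighting_enabled=0" >> /etc/modprobe.d/zfs.conf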

In any case, getting data on the metaslabs from `zdb -mmm tank` would be helpful in diagnosing this.


You really shouldn't run non-CoW file systems above 90% either, including UFS and ext.


Agreed. I don't think anyone is arguing that you should do it.

What I believe, and what I think others have also concluded, is that it shouldn't be fatal. That is, when the dust has settled and you trim down usage and have a decent maintenance outage, you should be able to defrag the filesystem and get back to normal.

That's not possible with ZFS because there is no defrag utility ... and I have had it explained to me in other HN threads (although not convincingly) that it might not be possible to build a proper defrag utility.


My understanding is that the way to defrag ZFS is to do a send and receive. Combined with incremental snapshotting, this should actually be realistic with almost no downtime for most environments.
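
Roughly, and with purely illustrative names (tank/data is the fragmented dataset, tank/data_new the freshly written copy; the destination could just as well live in a different pool):

    # Initial full copy while tank/data stays in service
    zfs snapshot tank/data@defrag-1
    zfs send tank/data@defrag-1 | zfs recv -u tank/data_new

    # Later: send only what changed since the first snapshot
    zfs snapshot tank/data@defrag-2
    zfs send -i @defrag-1 tank/data@defrag-2 | zfs recv tank/data_new

    # Repeat the incremental step until the remaining delta is small
    # enough that the final catch-up fits inside a short write pause.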

Doing so requires that you have enough zfs filesystems in your pool (or enough independent pools) that you have the free space to temporarily have two copies of the filesystem.


"Doing so requires that you have enough zfs filesystems in your pool (or enough independent pools) that you have the free space to temporarily have two copies of the filesystem."

Yes, and that is why I did not mention recreating the pool as a solution. If your pool is big enough or expensive enough, that's still "fatal".


You ought to define what is fatal here. The worst that I have seen reported at 90% full is a factor-of-2 slowdown on sequential reads off mechanical disks, which is acceptable to most people. Around that point, sequential writes should also suffer similarly, since writes start going to the innermost tracks.


(1) I'm not proposing recreating the pool - I'm proposing an approach to incrementally fixing the pool in an entirely online manner.

(2) If your pool is big enough/expensive enough, surely you've also budgeted for backups.


(1) Regardless of what you call it, it means having enough pool capacity somewhere else to zfs send the entire (90% full) affected zpool off to ... that might be impossible or prohibitively expensive depending on the size of the zpool.

(2) This has nothing to do with backups or data security in any way - it's about data availability (given a specific performance requirement).

You're not going to restore your backups to an unusable pool - you're going to build or buy a new pool and that's not something people expect to have to do just because they hit 90% and churned on it for a while.


You can send/receive to the same zpool and still defrag. With careful thought, this can be done incrementally and with very minimal availability implications.
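
Sketching the same-pool variant (dataset names illustrative; the rewritten copy is assumed to have been built with an initial send plus incremental catch-ups, as described upthread):

    # During a short write pause: send the final delta, then swap names
    # (@last-sync is whatever snapshot the rewritten copy already has)
    zfs snapshot tank/data@final
    zfs send -i @last-sync tank/data@final | zfs recv tank/data_rewrite
    zfs rename tank/data tank/data_old
    zfs rename tank/data_rewrite tank/data

    # Reclaim the fragmented copy once you are happy with the result
    zfs destroy -r tank/data_old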

I agree it's not ideal to have filesystems do this, but it also simplifies a lot of engineering. And I think direct user exposure to a filesystem with a POSIX-like interface is a paradigm mostly on the way out anyway, meaning it's increasingly feasible to design systems to not exceed a safe utilization threshold.


This does work.


On UNIX, there are two defragmentation utilities:

`tar` and `zfs send | zfs recv`.
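
(The tar form being the classic copy-out-and-back idiom, sketched here with illustrative paths:)

    # Rewrite files into freshly allocated blocks by copying them out and back
    mkdir -p /tank/data_rewritten
    tar cf - -C /tank/data . | tar xpf - -C /tank/data_rewritten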



