Coming from OpenStack land where Ceph is used heavily, it's well known that you shouldn't run production Ceph on VMs and that CephFS (which is like an NFS endpoint for Ceph itself) has never been as robust as the underlying RADOS Block Device (RBD) stuff.
They probably could have saved themselves a lot of pain by talking to some Ceph experts still working inside RedHat for architectural and other design decisions.
I agree with the other poster who asked why they even need a gigantic distributed FS, and that it seems like a design miss.
Also -- if you look at the infra update from the linked article, they mention something about 3M updates/hour to a pg table ([1], slide 9) triggering continuous vacuums. This feels like using a db table as a queue, which is not going to be fun at moderate to high loads.
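To make that concrete: each claimed job means another UPDATE, and in PostgreSQL every UPDATE leaves a dead row version behind for autovacuum to clean up, so a hot queue table gets vacuumed more or less continuously. A rough sketch of what that pattern tends to look like, assuming psycopg2 and a hypothetical `builds` table with a `status` column (not GitLab's actual schema):

```python
# Rough sketch of a job queue built on a PostgreSQL table.
# The `builds` table and its columns are hypothetical, not GitLab's schema.
import time

import psycopg2

conn = psycopg2.connect("dbname=ci")


def claim_next_build():
    """Claim one pending build, or return None if the queue is empty."""
    with conn, conn.cursor() as cur:  # `with conn` commits on success
        cur.execute(
            """
            UPDATE builds
               SET status = 'running'
             WHERE id = (SELECT id
                           FROM builds
                          WHERE status = 'pending'
                          ORDER BY id
                          LIMIT 1
                            FOR UPDATE SKIP LOCKED)
            RETURNING id
            """
        )
        row = cur.fetchone()
        return row[0] if row else None


while True:
    build_id = claim_next_build()
    if build_id is None:
        time.sleep(1)  # idle runners keep polling, which is pure query load
        continue
    # Each claim above rewrote a row; at millions of claims per hour the
    # dead row versions keep autovacuum running almost continuously.
    print("picked up build", build_id)
```

SKIP LOCKED (PostgreSQL 9.5+) keeps workers from blocking each other on the claim, but it doesn't change the write churn that drives the vacuuming.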
It's not updates, it's mostly just querying; updates are not that bad. The main issue there was that the CI runner was keeping a lock in the database while going to the filesystem to get the commit, which generated a lot of contention.
Still, this is something we need to fix in our CI implementation because, as you say, databases are not good queueing systems.
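For readers who want to picture the contention: the problem is a transaction that takes a row lock and then waits on (network) filesystem I/O before releasing it. The sketch below is purely illustrative, with an invented `builds` table and `read_commit` helper rather than GitLab's real code; it contrasts that shape with a variant that commits the claim before touching the filesystem:

```python
# Illustration only: how holding a row lock across filesystem I/O causes
# contention.  Table name, columns, and read_commit are invented.
import psycopg2

conn = psycopg2.connect("dbname=ci")


def read_commit(repo_path, sha):
    """Placeholder for the slow part: reading the commit off the
    (network) filesystem."""
    ...


# Anti-pattern: the SELECT ... FOR UPDATE lock stays held until the
# transaction commits, i.e. for the whole filesystem round trip, so
# everything else that needs this row queues up behind the I/O.
def claim_and_fetch_slow(build_id):
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT repo_path, sha FROM builds WHERE id = %s FOR UPDATE",
            (build_id,),
        )
        row = cur.fetchone()
        if row is None:
            return None
        return read_commit(*row)  # row lock still held during the I/O


# Less contended: record the claim, commit, and only then hit the
# filesystem, so no database lock outlives the quick UPDATE.
def claim_and_fetch(build_id):
    with conn, conn.cursor() as cur:
        cur.execute(
            "UPDATE builds SET status = 'running' "
            "WHERE id = %s AND status = 'pending' "
            "RETURNING repo_path, sha",
            (build_id,),
        )
        row = cur.fetchone()
    if row is None:
        return None
    return read_commit(*row)  # transaction already committed
```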
> They probably could have saved themselves a lot of pain by talking to some Ceph experts still working inside RedHat for architectural and other design decisions.
We have been in contact with RedHat and various other Ceph experts ever since we started using it.
> I agree with the other poster who asked why they even need a gigantic distributed FS, and that it seems like a design miss.
Users can self-host GitLab. Using some complex custom block storage system would complicate this too much, especially since the vast majority of users won't need it.
You're right. We talked to experts, they warned us about running Ceph on VMs, and we tried it anyway; shame on us.
You do need either a distributed FS (GitHub made their own with DGit, http://githubengineering.com/introducing-dgit/, but we want to try to reuse an existing technology) or to buy a big storage appliance.
Bingo! Seasoned developers and architects with 15-20+ years of experience would very likely question using software stacks like CephFS that carry warnings about production use on their own website!
You really want no exotic 3rd-party stuff in your design, just plain-Jane components like ext3 and Ethernet switches.
Choosing a newer exotic distributed filesystem may really come back to bite you in the future.