A big healthcare company I worked for did this. It worked extremely well, though it wasn't without its drawbacks. They adopted the database-per-tenant pattern in the early '00s, and I truly think it was one of the major things that allowed them to scale to a large number of (increasingly larger in data/access patterns) clients. I think it also made regulatory compliance a bit easier (everyone's data is pretty firewalled off at the database-access-credentials layer), but that wasn't really my department.
We ended up in the "thousands to tens of thousands" of clients range, with thousands of tables per client and a pretty hairy schema.
Each customer had their own schema on one of a few dozen giant database servers. The company pushed this idea out to other parts of their infrastructure: separate webserver/message broker/cache tiers existed for each underlying database server, so outages or brownouts in one component couldn't affect other customers' data that much.
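To make that shard-per-tenant layout concrete, here's a minimal sketch (in Python, with made-up names) of the kind of tenant directory this routing implies: every request resolves the tenant to its database server/schema plus the web/cache/broker tier co-located with that shard. This is an illustration of the idea, not the company's actual code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ShardInfo:
    db_host: str       # database server holding this tenant's schema
    schema: str        # the tenant's schema name on that server
    web_tier: str      # webserver pool co-located with the same shard
    cache_tier: str    # cache cluster dedicated to the same shard
    broker_tier: str   # message broker tier for the same shard

# In reality this directory would live in a small, heavily cached database.
TENANT_DIRECTORY: dict[str, ShardInfo] = {
    "tenant_a": ShardInfo("db07.internal", "tenant_a", "web07", "cache07", "mq07"),
    "tenant_b": ShardInfo("db12.internal", "tenant_b", "web12", "cache12", "mq12"),
}

def route(tenant_id: str) -> ShardInfo:
    """Resolve everything a request for this tenant is allowed to talk to."""
    try:
        return TENANT_DIRECTORY[tenant_id]
    except KeyError:
        raise LookupError(f"unknown tenant: {tenant_id}")
```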
Schema migrations, interestingly, weren't much of a problem. The practice of "everyone gets migrated during a release, no snowflakes" was nailed down early. That, plus some pretty paranoid tooling and an acceptable downtime window (well, on paper it wasn't acceptable, but everyone kinda understood that it was) of seconds to minutes during a release, made migrations roughly as traumatic as migrations anywhere I've worked (which is to say, "somewhat"), but not too much more. It did take a lot of work to get the tooling right across multiple schemas in the same database server, though. Investment in tooling--up to and including dedicated teams working on a single tool without a break-up date or firm mandate other than "make this not suck and keep the lights on"--is critical here, as in most areas.
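A rough sketch of what "no snowflakes" boils down to, assuming a Postgres-style server where each tenant is a schema and a DB-API driver like psycopg2; the version table and schema-name filter are my own illustrative assumptions, not the original tooling:

```python
import logging

log = logging.getLogger("release.migrate")

def list_tenant_schemas(conn) -> list[str]:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT schema_name FROM information_schema.schemata "
            "WHERE schema_name LIKE 'tenant_%'"
        )
        return [row[0] for row in cur.fetchall()]

def migrate_all_schemas(conn, migration_sql: str, version: int) -> None:
    failures = []
    for schema in list_tenant_schemas(conn):
        try:
            with conn.cursor() as cur:
                # Schema names come from the catalog above, not user input.
                cur.execute(f"SET search_path TO {schema}")
                cur.execute(migration_sql)
                cur.execute("INSERT INTO schema_version (version) VALUES (%s)", (version,))
            conn.commit()
            log.info("migrated %s to v%d", schema, version)
        except Exception:
            conn.rollback()
            log.exception("migration failed for %s", schema)
            failures.append(schema)
    if failures:
        # No snowflakes: the release isn't done until every schema is at the same version.
        raise RuntimeError(f"{len(failures)} schemas still behind, e.g. {failures[:5]}")
```

The real tooling was far more paranoid than this, but the invariant is the point: a release only finishes when every schema is on the same version.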
Things that were hard:
- Connection management. Others on this thread have pointed that out. Connection pooling and long-lived queue workers were essential, and the web/request tier couldn't "scale out" too far without hitting connection limits. Scheduled jobs (and this company loved cron jobs, thousands of distinct invocations per tenant) were a problem in the connection-management department. Carefully written tooling around webserver connection reuse, cron-job execution harnesses (they didn't really run as cron jobs; they got shipped to a worker already running the code with warm/spare database connections and run there--all highly custom), and asynchronous jobs was needed. That occupied a team or three for a while (a rough sketch of the warm-worker idea follows this list).
- The "whale" problem. When an individual tenant got big enough to start crowding out others on the same database server, it caused performance problems. We eventually worked on a migration tool that moved a client's entire footprint (and remember, this isn't just databases, but webs/caches/queue worker hosts/etc.) onto another shard. Building this tool was a lot of work, but when it was done it worked surprisingly well. My advice in this area: build a good manually-initiated/semi-supervised migration system. Leverage underlying database technology (binlog based replication). Don't hesitate to get very dirty and custom with e.g. replication logfile formats, and don't assume that $off_the_shelf_data_replicator isn't gonna collapse when you want to do online per-schema replication in massive parallel from the same database (not even if that tool cost you millions of dollars). Do NOT succumb to the allure of "we can automate the bin-packing and it'll constantly rearrange clients' datasets for optimal resource usage!" Manual is just fine for data migrations that big. Worst case, part of someone's job is to initiate/supervise them.
- SPOFs sucked. Some datasets weren't per-tenant at all; sometimes client companies merged together or split up; some data arrived intended for a tenant but wasn't tagged with that tenant's ID, so it would have to go into some separate database before it found a home. These systems were, bar none, the biggest liabilities, the biggest causes of production issues, and the hardest things to code around in the entire company. You'd think that having to write application code against thousands of logical databases across all the per-tenant schemas would suck, but in reality it wasn't too hard. It was making sure your code didn't accidentally talk to a SPOF that was the problem. My advice here: microservices do not help with this problem. HTTP, gRPC, or raw database wire protocol: if you have dependencies on a "tragedy of the commons"-type, used-by-everyone server sneaking into your nicely sliced-up per-tenant architecture, those callsites are going to be the cause of your sleepless nights. Get good visibility into where they occur. Favor "push" into per-tenant models over per-tenant code doing a blocking "pull", even if the push approach causes massive additional complexity and work; the costs of pull are too great (see the push-vs-pull sketch after this list).
- Some database-specific shit (even on polished hosted offerings from AWS, or big-budget Oracle installs) will start acting really squirrely when you're talking to thousands of identical schemas on the same database server (and thus tens or hundreds of thousands of identical tables with different data). If you double down on this route, be prepared to have a few really, really good database folks on staff. I don't mean "help me fix my slow giant reporting query" people, I mean "familiar with the internals" folks. Example: query plans can be cached based on query text, globally, across an entire database server. Different schemas belong to super different clients, and thus have super different data distributions across 100s of GBs of data. The plan that gets cached for query X against client A is the product of running heuristics/histograms/etc. across client A's data. That plan might perform pathologically when query X runs against client B's data (on the same database server), and finding out how/why is really annoying. Solution: bust the cache by a) happening to know that SQL comments aren't stripped from query texts before the text is used as a plan-cache key, and b) prepending each tenant's identifier to each query at the database-driver level to prevent cache pollution (sketched after this list). Result: you have traded a spooky query performance issue for a query-plan-cache-size issue; now your queries are all predictably slow because all your tenants' different queries are thrashing the plan cache. Tradeoffs abound.
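Here's a rough, hedged sketch of the warm-worker idea from the connection-management bullet: one long-lived process per shard keeps a small pool of open connections and runs "cron" jobs shipped to it, instead of each invocation connecting fresh. Pool sizes and the psycopg2 usage are my assumptions, not the original tooling.

```python
import logging
import queue
import psycopg2.pool

log = logging.getLogger("warm-worker")

class WarmWorker:
    """One long-lived worker per shard: jobs are shipped here and run against
    already-open connections instead of each cron invocation connecting fresh."""

    def __init__(self, dsn: str, pool_size: int = 5):
        # A handful of persistent connections, shared by every job on this shard.
        self.pool = psycopg2.pool.ThreadedConnectionPool(1, pool_size, dsn)
        self.jobs: queue.Queue = queue.Queue()

    def submit(self, job_fn, *args) -> None:
        """Scheduler side: enqueue a job instead of forking a fresh cron process."""
        self.jobs.put((job_fn, args))

    def run_forever(self) -> None:
        while True:
            job_fn, args = self.jobs.get()
            conn = self.pool.getconn()          # warm connection: no connect/auth cost
            try:
                job_fn(conn, *args)
                conn.commit()
            except Exception:
                conn.rollback()
                log.exception("job %s failed", getattr(job_fn, "__name__", job_fn))
            finally:
                self.pool.putconn(conn)
```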
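And a sketch of the shape of the whale-migration tool: manually initiated, semi-supervised, with a human gating each phase (especially the cutover). The phase list paraphrases the flow described above; `run_phase` is a placeholder for the real work, not anything from the original system.

```python
# Ordered phases for moving one tenant's footprint to another shard.
PHASES = [
    "bulk-copy the tenant's schema to the target database server",
    "start per-schema (binlog-based) replication to catch up",
    "stand up the tenant's web/cache/queue-worker footprint on the target shard",
    "freeze writes and drain replication lag to zero",
    "flip the tenant-directory entry to the new shard",
    "unfreeze, watch error rates, then retire the old copy",
]

def migrate_tenant(tenant_id: str, run_phase) -> None:
    """run_phase(tenant_id, phase) does the real work; this loop just gates it."""
    for phase in PHASES:
        answer = input(f"[{tenant_id}] next: {phase} -- proceed? [y/N] ")
        if answer.strip().lower() != "y":
            print("stopping; nothing past this point has run")
            return
        run_phase(tenant_id, phase)
```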
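For the push-over-pull point, the sketch below (reusing the ShardInfo-style directory from the routing sketch earlier, with an injected per-shard `connect` callable and an invented table name) shows the shared system fanning updates out into each tenant schema, so the request path never does a blocking pull against a shared SPOF. Postgres-style SQL assumed.

```python
def push_reference_update(directory, connect, record_id: str, payload: str) -> None:
    """Shared-system side: write one update into every tenant's local copy."""
    for tenant_id, shard in directory.items():
        conn = connect(shard)                     # caller supplies the per-shard connector
        try:
            with conn.cursor() as cur:
                cur.execute(f"SET search_path TO {shard.schema}")
                cur.execute(
                    "INSERT INTO shared_reference_copy (id, payload) VALUES (%s, %s) "
                    "ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload",
                    (record_id, payload),
                )
            conn.commit()
        finally:
            conn.close()

def read_reference(conn, record_id: str):
    """Tenant side: the request path never leaves the tenant's own schema."""
    with conn.cursor() as cur:
        cur.execute("SELECT payload FROM shared_reference_copy WHERE id = %s", (record_id,))
        row = cur.fetchone()
        return row[0] if row else None
```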
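Finally, the plan-cache trick at the driver level, as a hedged sketch: prepend a per-tenant SQL comment to every statement so identical query text from different tenants hashes to different plan-cache keys. Whether comments survive into the cache key is database- and version-specific, so treat this as the shape of the hack, not a portable guarantee; the wrapper assumes a DB-API cursor such as psycopg2's.

```python
class TenantTaggingCursor:
    """Thin cursor wrapper that tags every statement with the tenant's ID."""

    def __init__(self, cursor, tenant_id: str):
        self._cursor = cursor
        self._tag = f"/* tenant:{tenant_id} */ "

    def execute(self, sql: str, params=None):
        # Same SQL, different comment per tenant => different plan-cache entry.
        return self._cursor.execute(self._tag + sql, params)

    def __getattr__(self, name):
        # Delegate fetchone/fetchall/close/etc. to the real cursor.
        return getattr(self._cursor, name)

# Usage (assuming `conn` is an open connection routed to this tenant's shard):
#   cur = TenantTaggingCursor(conn.cursor(), tenant_id="tenant_a")
#   cur.execute("SELECT count(*) FROM encounters WHERE status = %s", ("open",))
```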