We scale production by just copying our stage environments, because our clients are trivially shardable and only use our service 9-5, M-F. But hoo boy, are there a lot of them.
Yeah, it's nice to be trivially shardable! Ours is not: we have stock market connections to 13 countries, with clients having global trading controls across those countries, so we can shard some parts and not others. And since each country has its specificities, guaranteeing that a cross-market feature works for a client, if it impacts controls, means testing it in prod: either after market close on mock exchanges provided by the various countries, or during trading hours on small orders.
No amount of testing in the bank has ever been able to spot the weirdest issues, so we continue, while trying really hard to make the prod "pilots" (we try not to call these tests) as routine as possible. But we still find crazy issues, notably those only the client notices (so, wrong specs that they couldn't prevalidate by plugging their monster system into our monster system in QA).
Let's imagine we build all the trading robots to have one server per group of clients. That would def work (we probably can't pay for one set of servers (for HA) per client, since we have hundreds), but then all these per-client servers have to queue up for the exchange line access, which has per-country controls (can't move the market more than x%, can't trade x or y stock at z time, whatever). So yes, in a way, at the end of the line, the clients must "know about each other" so we can respect the 13 different law codes.
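To make the "clients must know about each other" point concrete, here is a minimal sketch in Python. It assumes one shared gate per exchange that every per-client server submits through. The `ExchangeGate` class, the basis-point cap on our combined share of a stock's volume, and the blocked-symbol set are all hypothetical stand-ins for real per-country controls, not anyone's actual rules.

```python
from dataclasses import dataclass, field


@dataclass
class ExchangeGate:
    """Single choke point for one exchange; every client server submits through it."""
    country: str
    # Hypothetical rule: cap on OUR combined share of a stock's volume, in basis points.
    max_share_bps: int
    # Hypothetical rule: symbols that may not be traded right now.
    blocked_symbols: set = field(default_factory=set)
    # symbol -> quantity already sent, summed over all clients.
    sent: dict = field(default_factory=dict)

    def submit(self, client: str, symbol: str, qty: int, market_volume: int) -> bool:
        """Accept the order only if the aggregate across all clients stays within the cap."""
        if symbol in self.blocked_symbols:
            return False
        total = self.sent.get(symbol, 0) + qty
        # Integer comparison avoids float rounding at the boundary.
        if total * 10_000 > self.max_share_bps * market_volume:
            return False
        self.sent[symbol] = total
        return True


# Two clients on separate servers still share one budget at the gate.
gate = ExchangeGate(country="JP", max_share_bps=1_000)  # 10% cap (assumed value)
first = gate.submit("client_a", "XYZ", 600, market_volume=10_000)
second = gate.submit("client_b", "XYZ", 600, market_volume=10_000)
```

Here `client_b`'s order is rejected even though it is fine on its own, because `client_a` already used most of the shared allowance. That is the part that does not shard by client.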
We traditionally split by exchange because it made the most sense when we started 30 years ago in Asia, but... we're more and more splitting clients into groups where we can and allocating them CPU power, indeed.
I envy the stock market people because, while they must handle even higher volume, they can shard per stock itself and have just one jurisdiction.
Cool, thanks for satisfying my curiosity. This sounds like a very interesting problem! Coming from an Elixir/Erlang background, I would probably architect it with clusters of "rate-limiting" backend agents sharded by exchange or jurisdiction, and a cluster of client-sharded groups (in its own VPS, even) for client information. But yeah, it would be tough to migrate to such a setup from something more brownfield.
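A rough sketch of that two-tier layout, written in Python rather than Elixir, and only as an illustration under assumptions: `RateLimitAgent` is a plain token bucket standing in for whatever a jurisdiction actually requires, `client_shard` uses a CRC32 hash as one possible stable mapping, and the `Router` class, the shard count, and the exchange name are made up for the example.

```python
import zlib


class RateLimitAgent:
    """One agent per exchange or jurisdiction: a token bucket with an explicit clock."""

    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill according to elapsed time, capped at the bucket capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(float(self.capacity), self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def client_shard(client_id: str, n_shards: int) -> int:
    """Stable client -> shard mapping; CRC32 gives the same result in every process."""
    return zlib.crc32(client_id.encode("utf-8")) % n_shards


class Router:
    """Client state lives on a client shard; exchange access goes through a shared agent."""

    def __init__(self, agents: dict, n_shards: int):
        self.agents = agents      # exchange name -> RateLimitAgent
        self.n_shards = n_shards

    def route(self, client_id: str, exchange: str, now: float):
        shard = client_shard(client_id, self.n_shards)
        allowed = self.agents[exchange].allow(now)
        return shard, allowed


# Hypothetical setup: 4 client shards, one agent allowing bursts of 2, refilling 1 per second.
router = Router({"EXCH_A": RateLimitAgent(capacity=2, refill_per_sec=1.0)}, n_shards=4)
results = [router.route(c, "EXCH_A", now=0.0)[1] for c in ("a", "b", "c")]
```

The client side scales by adding shards, while the per-exchange agent stays a single shared resource, so the third request in the same instant is refused no matter which client shard it came from. In Erlang terms each agent would more naturally be its own process.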