Assuming serverless is out of the question for your use case, have you tried spending a couple of days investigating a managed Kubernetes cluster with node autoscaling enabled? EKS, AKS, GKE...
Honestly it sounds like you could be at the point where K8s is worthwhile.
I'm considering k8s, but that also means moving services from on-prem to AKS, getting INF to open up the necessary firewall rules to make the services reachable from on-prem, and so on. And as you said, it's definitely days of investigation. I'm not closed to the option.
How hyperbolic. I'm struggling to imagine what you may have experienced, exactly, to put discussing such issues in the UK alongside discussing Tiananmen Square in China.
I sometimes give feedback, but only if requested and, to be honest, only if the feedback request actually makes it through HR and to me.
What feels like a lifetime ago I went for an interview at Monzo; I think it'd recently been renamed from Mondo. I thoroughly enjoyed the process - a kind yet thorough and revealing interview format. I didn't get the job - but it's not an exaggeration to say their feedback and the process changed my career. If somebody from there spots this: thanks :) (and I'm still sad I didn't get to work with you!)
I love it when I'm in a city with Citymapper. I've used it to get around Paris, Hamburg and Berlin. It's slick, fun to use and has nailed multi-modal transport - especially since it considers walking a thing which humans are capable of.
I still resort to instructional YouTube videos + ticket machines for tickets, though.
So, in a monorepo world, isn't it often that you have to deploy components together, rather than "it's easy to"? How are services deployed only when there has been a change affecting said service? Presumably monorepo orgs aren't redeploying their entire infrastructure each time there's a commit? Are we talking writing scripts which trigger further pipelines if they detect change in a path or its dependencies? How about versioning - does monorepo work with semver? Does it break git tags given you have to tag everything?
So many questions, but they're all about identifying change and only deploying change...
Each service has its own code directory, and there's one big "shared code" directory. When you build one service, you copy the shared code directory and the service-specific directory, move to the service-specific folder, run your build process. The artifact that results is that one service. Tagging becomes "<service>-<semver>" instead of just "<semver>". You may start out with deploying all the services every time (actually hugely simplifies building, testing, and deploying), but then later you break out the infra into separate services the same way as the builds.
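For what it's worth, here is a minimal sketch of how "<service>-<semver>" tags could be parsed and compared. The exact tag shape (plain major.minor.patch, no pre-release suffix) and the names `parse_tag` and `latest_version` are my own assumptions for illustration, not something the setup above prescribes.

```python
import re
from typing import Iterable, NamedTuple, Optional, Tuple


class ServiceTag(NamedTuple):
    service: str
    version: Tuple[int, int, int]


# Assumed tag shape: "<service>-<major>.<minor>.<patch>".
# The service name itself may contain hyphens (e.g. "order-api-1.2.3").
_TAG_RE = re.compile(
    r"^(?P<service>.+)-(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)$"
)


def parse_tag(tag: str) -> Optional[ServiceTag]:
    """Split a tag into service name and numeric version, or None if it doesn't fit."""
    m = _TAG_RE.match(tag)
    if m is None:
        return None
    version = (int(m["major"]), int(m["minor"]), int(m["patch"]))
    return ServiceTag(m["service"], version)


def latest_version(tags: Iterable[str], service: str) -> Optional[Tuple[int, int, int]]:
    """Highest version tagged for one service, ignoring every other service's tags."""
    versions = [
        parsed.version
        for parsed in map(parse_tag, tags)
        if parsed is not None and parsed.service == service
    ]
    return max(versions, default=None)
```

Comparing versions as integer tuples rather than strings means 1.10.0 sorts after 1.2.0, which a plain string sort of the tags would get wrong.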
> Are we talking writing scripts which trigger further pipelines if they detect change in a path or its dependencies
Unless one enforces perfect one-to-one match between repo boundaries and deployments, this is also an issue with multirepos.
In practice, it's straightforward to write a short script that deploys a portion of a repo and have it trigger if its source subtree changes and then run it in your CI/CD environment.
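A rough sketch of what such a script might look like, assuming a layout of `services/<name>/...` plus one `shared/` directory (both directory names are hypothetical). The list of changed paths would come from your CI system or from something like `git diff --name-only <base> <head>`.

```python
from pathlib import PurePosixPath
from typing import Iterable, Set

# Hypothetical repo layout: services/<name>/... plus one shared code directory.
SERVICES_DIR = "services"
SHARED_DIR = "shared"


def services_to_deploy(changed_paths: Iterable[str], all_services: Set[str]) -> Set[str]:
    """Return the services whose source subtree (or shared code) changed."""
    affected: Set[str] = set()
    for raw in changed_paths:
        parts = PurePosixPath(raw).parts
        if not parts:
            continue
        if parts[0] == SHARED_DIR:
            # Every service depends on the shared code, so redeploy all of them.
            return set(all_services)
        if parts[0] == SERVICES_DIR and len(parts) > 1 and parts[1] in all_services:
            affected.add(parts[1])
    return affected
```

Treating any change under the shared directory as "deploy everything" is the blunt option. A finer-grained version would need a real dependency graph per service.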
It's still a "double write" but you may not need distributed transactions if you're happy with at-least-once writes to Kafka and deduplication on the consuming side.
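To make the consuming side concrete, here is a small sketch of deduplication by message ID. It deliberately avoids any real Kafka client API. `DedupingConsumer` and the in-memory set are assumptions for illustration, and in practice the set of seen IDs would have to live in a durable store.

```python
from typing import Callable, Set


class DedupingConsumer:
    """Wraps a handler so redelivered messages are processed at most once."""

    def __init__(self, handler: Callable[[str], None]) -> None:
        self._handler = handler
        # In-memory for the sketch only; a real consumer needs durable storage.
        self._seen: Set[str] = set()

    def consume(self, message_id: str, payload: str) -> bool:
        """Process the payload unless this ID was already handled.

        Returns True if the handler ran, False for a duplicate.
        """
        if message_id in self._seen:
            return False
        self._handler(payload)
        # Record the ID only after the handler succeeds, so a failed
        # attempt can still be retried on redelivery.
        self._seen.add(message_id)
        return True
```

Note that the handler call and the "seen" write are themselves two separate steps, so the handler still has to tolerate the rare crash in between.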
Microsoft have been pushing this kind of thinking for a while with Service Fabric. If you buy in completely and use both the framework and the infrastructure you get structures which are in-memory and replicated for you.
A couple of the .Net guys we hired preached that stateless architecture is a little old-fashioned - over time I've come to agree. A lot of things can be shoe-horned in to a stateless world but become much easier in a stateful one.
> Imagine sending out a command to the bus and not knowing when it'll get processed
I would love to hear how others are correlating output with commands in such architectures - especially if they can be displayed to users as a direct result of a command. Always felt like I'm missing a thing or two.
It seems the choices are:
* Manage work across domains (sagas, two phase commit, rpc)
* Loosen requirements (At some point in the future, stuff might happen. It may not be related to your command. Deal with it.)
* Correlation and SLAs (correlate outcomes with commands, have clients wait a fixed period while collecting correlating outcomes)
Is that a fair summary of where we can go? Any recommended reading?
My personal answer would be that commands (as defined as "things the issuer cares about the response to") don't belong on message busses, and there's probably an architectural mismatch somewhere. Message busses are at their best when everything is unidirectional and has no loops. If you need loops, and you often do, you're better off with something that is designed around that paradigm. To the extent that it scales poorly, well, yeah. You ask for more, it costs more. But probably less than trying to bang it on to an architecture where it doesn't belong.
You want something more like a service registry such as ZooKeeper for that, where you can obtain a destination for a service and speak to it directly. You'll need to wrap other error handling around it, of course, but that almost goes without saying.
I don't know about correlating output with commands, but if you're looking to correlate output with input, one option is to stick an ID on every message, and, for messages that are created in response to other messages, also list which one(s) it's responding to.
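As a sketch of that idea: each message gets its own ID and a list of the IDs it responds to, and you can then walk those links to collect everything that followed from a given message. The names `Message`, `reply` and `outcomes_for` are made up for this example.

```python
import uuid
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Message:
    body: str
    # IDs of the message(s) this one was created in response to.
    in_reply_to: Tuple[str, ...] = ()
    id: str = field(default_factory=lambda: str(uuid.uuid4()))


def reply(body: str, *parents: Message) -> Message:
    """Create a message that records which message(s) it responds to."""
    return Message(body=body, in_reply_to=tuple(p.id for p in parents))


def outcomes_for(origin: Message, messages: Iterable[Message]) -> List[Message]:
    """Collect every message caused, directly or transitively, by `origin`."""
    pool = list(messages)
    caused = {origin.id}
    found: List[Message] = []
    changed = True
    while changed:  # repeat until a full pass adds nothing new
        changed = False
        for m in pool:
            if m.id not in caused and any(p in caused for p in m.in_reply_to):
                caused.add(m.id)
                found.append(m)
                changed = True
    return found
```

The repeated passes mean it doesn't matter what order the messages arrive in, which is usually the case you have to plan for on a bus.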
I would say that loosening requirements is also a reasonable option. You can't assume that anything downstream will be up, or healthy, or whatever. On a system that's large enough to benefit from a message bus, you have to assume that failures are the norm and not the exception. And trying to get a system that acts as if failures were the exception is likely to be more expensive than it's worth. For a decent blog post that touches on the subject, see "Starbucks Does Not Use Two-Phase Commit"[1].
Nice blog post! Certainly puts things into perspective in terms of how one should deal with errors, including sometimes just not caring about them much.
Commands can go through message busses and be managed easily or it could just be a sequence of async requests, but regardless of what drives commands and events at that point you should have a very solid CQRS architecture in mind. What should be acknowledged to the client is that the command was published and that's it. The problem is of course eventual consistency but it's a trade-off for being able to handle a huge amount of load by scaling separately both COMMAND handlers which perform data modification, and EVENT handlers that allow the side effects that must occur.
In a typical web app setup I would define a request ID at time of client request. The request creates a COMMAND which carries with it a request ID as well as a command ID. This results in an action and then the EVENT is published with the request ID, command ID, and event ID.
To monitor you collect the data and then look at the timestamp differences to monitor lag and dropped messages. With the events, you get all the data necessary to audit what request, and subsequent command, created a system change. To audit the full data change however, and not just which request caused what change, you need to have a very well-structured event model designed for what you want to audit.
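A small sketch of that monitoring step, assuming each command and event record carries the IDs described above plus a publish timestamp. The field names, the `timeout` threshold and `lag_report` itself are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Command:
    request_id: str
    command_id: str
    published_at: float  # seconds since some epoch


@dataclass(frozen=True)
class Event:
    request_id: str
    command_id: str
    event_id: str
    published_at: float


def lag_report(
    commands: Iterable[Command],
    events: Iterable[Event],
    now: float,
    timeout: float,
) -> Tuple[Dict[str, float], List[str]]:
    """Return (lag per command ID, command IDs suspected dropped).

    Lag is the time from command publish to its first event. A command with
    no event after `timeout` seconds is only *suspected* dropped.
    """
    first_event_at: Dict[str, float] = {}
    for e in events:
        seen = first_event_at.get(e.command_id)
        if seen is None or e.published_at < seen:
            first_event_at[e.command_id] = e.published_at

    lags: Dict[str, float] = {}
    suspected_dropped: List[str] = []
    for c in commands:
        if c.command_id in first_event_at:
            lags[c.command_id] = first_event_at[c.command_id] - c.published_at
        elif now - c.published_at > timeout:
            suspected_dropped.append(c.command_id)
    return lags, suspected_dropped
```

Commands younger than the timeout show up in neither result, since they may simply not have been processed yet.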
You can't guarantee when a command or subsequent event will be processed, but that's fine. That's the whole point around eventual consistency. It's a bit uncomfortable at first, but use the lag monitoring and traceability as a debug tool when needed and really it's no problem. Also just shift the client over to reading from your read-specific projections on refresh or periodically and data will eventually appear for the user. It's the reason sometimes a new order might not appear right away on your order history for instance on Amazon, and in reality it's fine 99% of the time. Never have your client wait on a result. Instead think: how can I design my application to not need to block on anything? It's doable though it is quite hard and if you've only designed synchronous systems it will feel so uncomfortable.
And remember some things should not have CQRS design, backed by a message bus or not. These will be bottlenecks but they might be necessary. The whole system you design doesn't have to stick to a single paradigm. Need transaction IDs to be strictly consistent for a checkout flow to be secure and safe? Use good old synchronous methods to do it.
Core in all of this is data design. If you design your entities, commands, or events poorly, you will suffer the consequences. You will hear the word "idempotency" a lot in CQRS design. It's because idempotent commands are super important in preventing unintended side effects. Same with idempotent events, if possible. If you get duplicate events from a message bus, idempotency will save your arse if it's a critical part of the system. If it's something mild like an extra "addToCart" command or something, no big deal really, but imagine a duplicated "payOrder" command ;).
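To illustrate with the "payOrder" case, here is a minimal sketch of a handler that is idempotent because it checks the order's state before acting. The `Order` shape and the "OrderPaid" event name are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Order:
    order_id: str
    amount_due: int  # in minor currency units, e.g. pence
    paid: bool = False
    charges: List[int] = field(default_factory=list)


def pay_order(order: Order) -> List[Dict[str, object]]:
    """Handle a payOrder command; safe to apply more than once.

    Returns the events emitted: one OrderPaid the first time, none afterwards.
    """
    if order.paid:
        # Duplicate command: state already reflects it, so do nothing.
        return []
    order.charges.append(order.amount_due)
    order.paid = True
    return [{"type": "OrderPaid", "order_id": order.order_id, "amount": order.amount_due}]
```

Applying the command twice leaves the order in the same state as applying it once, which is the property you want when the bus may redeliver.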
To summarize, I correlate output to commands and requests by ensuring there are no unknown side-effects, keeping critical synchronous components synchronous, designing architecture that complements the data (not the other way around), and ensuring that the client is designed in such a way that eventual consistency doesn't matter from the user perspective when it comes into play.
> Imagine sending out a command to the bus and not knowing when it'll get processed.
In my systems I separate commands from event handlers based on asynchronicity.
Commands are real time processors and can respond with anything up to and including the full event stream they created.
Commands execute business logic in the front end and emit events.
Commands execute on the currently modeled state by whatever query model you have in place chosen depending on needs for consistency.
What I suspect @CorvusCrypto is talking about is event handlers, which are in essence commands but are usually asynchronous.
They are triggered when another event is seen but could theoretically happen whenever you like. It could be real time as events are stored or it could be a week later in some subscriber process that batches and runs on an interval.
I separate commands from event handlers like this because commands tend to be very easy to modify and change in the future, they're extremely decoupled in that they just emit events that can easily be tested without having to do a lot of replay or worrying about inter-event coupling.
Event handlers on the other hand depending on type tend to be very particular/temperamental about how and in what order they get triggered.
I also find that a system with a lot of fat event handler logic has a lot more unknown / hidden complexity. Keeping as much of the complexity and business logic as possible in the front end (RPC included) results in a much simpler distributed system.
All this hinges on the fact I'm sticking to strict event sourcing where events are after the fact and simply represent state changes which are then reduced and normalized per system needs.
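As a rough illustration of that shape: state is a reduction over past events, and a command validates against that state and returns new events. The cart domain, the event names and the `limit` rule are all made up for this sketch.

```python
from functools import reduce
from typing import Dict, List

Event = Dict[str, str]
State = Dict[str, List[str]]


def apply(state: State, event: Event) -> State:
    """Fold one after-the-fact event into the state, without mutating the input."""
    if event["type"] == "ItemAdded":
        return {**state, "items": state["items"] + [event["sku"]]}
    if event["type"] == "ItemRemoved":
        items = list(state["items"])
        if event["sku"] in items:
            items.remove(event["sku"])
        return {**state, "items": items}
    return state  # unknown event types are ignored


def current_state(events: List[Event]) -> State:
    """Reduce the full event stream to the current state."""
    return reduce(apply, events, {"items": []})


def add_item(events: List[Event], sku: str, limit: int = 3) -> List[Event]:
    """Command: run business logic on current state and emit new events."""
    state = current_state(events)
    if len(state["items"]) >= limit:
        raise ValueError("cart is full")
    return [{"type": "ItemAdded", "sku": sku}]
```

The command never touches state directly; it only returns events, which is what makes it easy to test without replaying anything beyond the input stream.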
I would also like to point out, I was careful here not to mention any kind of message bus or publishing because CQRS and event sourcing are stand alone architecture choices.
CQRS/ES does not require a message bus, in fact it specifically sucks with a message bus at the core of your design because it forces eventual consistency and it puts the source of truth on itself.
CQRS/ES systems should have multiple message buses and employ them to increase availability and throughput at a trade off with consistency. CQRS/ES should not force you to make this trade.
A message bus is a tool to distribute processing across computers. It is not and should not be the central philosophy of your architecture. You should be able to continuously pipe your event store through a persistent RabbitMQ for one system that is bottlenecked by some third party API with frequent downtime problems. And you should be able to continuously pipe your event store through some ZeroMQ setup for fast realtime responsiveness in another system. Whether or not you choose to introduce system wide inconsistency (or "eventual consistency") in order to pipe your events into your event store is up to you: you have to figure out if the increased availability is worth the trade off.