> Imagine sending out a command to the bus and not knowing when it'll get processed
I would love to hear how others are correlating output with commands in such architectures - especially when the output needs to be displayed to users as a direct result of a command. I've always felt like I'm missing a thing or two.
It seems the choices are:
* Manage work across domains (sagas, two-phase commit, RPC)
* Loosen requirements (At some point in the future, stuff might happen. It may not be related to your command. Deal with it.)
* Correlation and SLAs (correlate outcomes with commands; have clients wait a fixed period while collecting correlated outcomes; see the sketch below)
Is that a fair summary of where we can go? Any recommended reading?
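For the third option, here's roughly the shape I have in mind, as a minimal TypeScript sketch (the bus is abstracted behind publish/subscribe, and all names here are made up):

```typescript
import { randomUUID } from "node:crypto";

interface Outcome { correlationId: string; payload: unknown; }

// Publish a command tagged with a fresh correlation ID, then collect
// every correlated outcome that arrives within the SLA window.
function collectOutcomes(
  publish: (cmd: { correlationId: string; body: unknown }) => void,
  subscribe: (handler: (o: Outcome) => void) => () => void, // returns unsubscribe
  body: unknown,
  slaMs: number,
): Promise<Outcome[]> {
  const correlationId = randomUUID();
  const collected: Outcome[] = [];
  return new Promise((resolve) => {
    const unsubscribe = subscribe((o) => {
      if (o.correlationId === correlationId) collected.push(o);
    });
    publish({ correlationId, body });
    // After the fixed wait, hand back whatever arrived; it may be empty,
    // and more outcomes may still arrive later. That's the SLA trade-off.
    setTimeout(() => { unsubscribe(); resolve(collected); }, slaMs);
  });
}
```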
My personal answer would be that commands (defined as "things the issuer cares about the response to") don't belong on message buses, and there's probably an architectural mismatch somewhere. Message buses are at their best when everything is unidirectional and has no loops. If you need loops, and you often do, you're better off with something that is designed around that paradigm. To the extent that it scales poorly, well, yeah. You ask for more, it costs more. But probably less than trying to bang it onto an architecture where it doesn't belong.
You want something more like a service registry (e.g. ZooKeeper) for that, where you can obtain a destination for a service and speak to it directly. You'll need to wrap other error handling around it, of course, but that almost goes without saying.
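Roughly like this, as a sketch; the registry interface here is hypothetical (real ZooKeeper clients differ), but the shape is what matters: resolve a live address, then talk to it directly:

```typescript
// Hypothetical registry client; the essential operation is
// "give me a live address for this service right now".
interface Registry { lookup(service: string): Promise<string>; }

async function callService(registry: Registry, service: string, path: string): Promise<unknown> {
  const address = await registry.lookup(service); // e.g. "10.0.0.7:8443"
  const res = await fetch(`https://${address}${path}`); // direct request/response
  // The error handling you wrap around it, per the above:
  if (!res.ok) throw new Error(`${service} responded ${res.status}`);
  return res.json();
}
```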
I don't know about correlating output with commands, but if you're looking to correlate output with input, one option is to stick an ID on every message, and, for messages that are created in response to other messages, also list which one(s) it's responding to.
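As a sketch (the field names are just my convention), the envelope would look something like:

```typescript
import { randomUUID } from "node:crypto";

interface Envelope<T> {
  id: string;             // unique per message
  inResponseTo: string[]; // IDs of the message(s) this one was created from
  body: T;
}

// A message created in response to others records their IDs, so any
// output can be traced back to its inputs by walking inResponseTo.
function derive<T>(parents: Envelope<unknown>[], body: T): Envelope<T> {
  return { id: randomUUID(), inResponseTo: parents.map((p) => p.id), body };
}
```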
I would say that loosening requirements is also a reasonable option. You can't assume that anything downstream will be up, or healthy, or whatever. On a system that's large enough to benefit from a message bus, you have to assume that failures are the norm and not the exception. Trying to build a system that pretends otherwise is likely to cost more than it's worth. For a decent blog post that touches on the subject, see "Starbucks Does Not Use Two-Phase Commit"[1].
Nice blog post! Certainly puts things into perspective in terms of how one should deal with errors, including sometimes just not caring about them much.
Commands can go through message buses and be managed easily, or they can just be a sequence of async requests; but regardless of what drives commands and events, at that point you should have a very solid CQRS architecture in mind. All that should be acknowledged to the client is that the command was published, and that's it. The problem is of course eventual consistency, but it's a trade-off for being able to handle a huge amount of load by separately scaling both the COMMAND handlers, which perform data modification, and the EVENT handlers, which carry out the side effects that must occur.
In a typical web app setup I would define a request ID at the time of the client request. The request creates a COMMAND which carries with it the request ID as well as a command ID. This results in an action, and then the EVENT is published with the request ID, command ID, and event ID.
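In code the propagation looks something like this (a minimal sketch; the types and field names are mine):

```typescript
import { randomUUID } from "node:crypto";

interface Command { requestId: string; commandId: string; name: string; data: unknown; }
interface DomainEvent extends Command { eventId: string; at: number; }

// The request ID is minted once at the edge and carried all the way through.
function makeCommand(requestId: string, name: string, data: unknown): Command {
  return { requestId, commandId: randomUUID(), name, data };
}

// The event copies both upstream IDs and adds its own, plus a timestamp
// for the lag monitoring described below.
function makeEvent(cmd: Command, name: string, data: unknown): DomainEvent {
  return { ...cmd, name, data, eventId: randomUUID(), at: Date.now() };
}
```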
To monitor, you collect the data and look at the timestamp differences to track lag and dropped messages. With the events, you get all the data necessary to audit which request, and subsequent command, created a system change. To audit the full data change, however, and not just which request caused what change, you need a very well-structured event model designed around what you want to audit.
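Monitoring then reduces to timestamp arithmetic over what you collected; something like this (assuming issue and observation times are recorded per command ID):

```typescript
// Given when each command was issued and when its event was observed,
// report per-command lag plus commands whose events never showed up.
function lagReport(
  issued: Map<string, number>,   // commandId -> issued-at (ms)
  observed: Map<string, number>, // commandId -> event-at (ms)
): { lags: Map<string, number>; dropped: string[] } {
  const lags = new Map<string, number>();
  const dropped: string[] = [];
  for (const [commandId, t0] of issued) {
    const t1 = observed.get(commandId);
    if (t1 === undefined) dropped.push(commandId); // likely a dropped message
    else lags.set(commandId, t1 - t0);
  }
  return { lags, dropped };
}
```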
You can't guarantee when a command or its subsequent event will be processed, but that's fine; that's the whole point of eventual consistency. It's a bit uncomfortable at first, but use the lag monitoring and traceability as debug tools when needed and really it's no problem. Also, just shift the client over to reading from your read-specific projections on refresh or periodically, and the data will eventually appear for the user. It's the reason a new order sometimes doesn't appear right away in your order history on Amazon, and in reality it's fine 99% of the time. Never have your client wait on a result. Instead think: how can I design my application to not need to block on anything? It's doable, though it is quite hard, and if you've only ever designed synchronous systems it will feel very uncomfortable.
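On the client, "never wait on a result" can be as simple as polling the read projection for anything tagged with your request ID (a sketch; the endpoint is made up):

```typescript
// Poll the read model until something created by our request shows up.
// The write path is never blocked; the UI just renders what the
// projection currently has, like the Amazon order-history example.
async function pollProjection(requestId: string, attempts = 10, delayMs = 1000): Promise<unknown | null> {
  for (let i = 0; i < attempts; i++) {
    const res = await fetch(`/api/orders?requestId=${requestId}`); // hypothetical endpoint
    const rows: unknown[] = await res.json();
    if (rows.length > 0) return rows[0];
    await new Promise((r) => setTimeout(r, delayMs));
  }
  return null; // not projected yet; show "processing" and move on
}
```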
And remember, some things should not have a CQRS design, backed by a message bus or not. These will be bottlenecks, but they might be necessary. The whole system you design doesn't have to stick to a single paradigm. Need transaction IDs to be strictly consistent for a checkout flow to be secure and safe? Use good old synchronous methods to do it.
Core to all of this is data design. If you design your entities, commands, or events poorly, you will suffer the consequences. You will hear the word "idempotency" a lot in CQRS design. That's because idempotent commands are super important in preventing unintended side effects. Same with idempotent events, where possible. If you get duplicate events from a message bus, idempotency will save your arse in the critical parts of the system. If it's something mild like an extra "addToCart" command, no big deal really, but imagine a duplicated "payOrder" command ;).
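A minimal dedupe guard for the "payOrder" case might look like this (a sketch; in practice the processed set lives in durable storage, not memory):

```typescript
const processed = new Set<string>(); // a durable store in real life

// Execute a command at most once per command ID: a duplicate
// "payOrder" off the bus is recognized and skipped, not charged twice.
function handleOnce(commandId: string, handler: () => void): boolean {
  if (processed.has(commandId)) return false; // duplicate suppressed
  processed.add(commandId);
  handler();
  return true;
}
```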
To summarize: I correlate output to commands and requests by ensuring there are no unknown side effects, keeping critical synchronous components synchronous, designing architecture that complements the data (not the other way around), and ensuring that the client is designed in such a way that eventual consistency doesn't matter from the user's perspective when it comes into play.
> Imagine sending out a command to the bus and not knowing when it'll get processed.
In my systems I separate commands from event handlers based on asynchronicity.
Commands are real-time processors and can respond with anything up to and including the full event stream they created.
Commands execute business logic in the front end and emit events.
Commands execute on the currently modeled state via whatever query model you have in place, chosen depending on your consistency needs.
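Concretely, a command in this style is just a function from the currently modeled state plus input to the events it emits, and the caller can be handed that list directly (a toy sketch):

```typescript
interface AccountState { balance: number; }
type AccountEvent =
  | { type: "WithdrawalAccepted"; amount: number }
  | { type: "WithdrawalRejected"; amount: number };

// A command: business logic runs up front against the queried state,
// and the response can include the full event stream it created.
function withdraw(state: AccountState, amount: number): AccountEvent[] {
  if (amount <= 0 || amount > state.balance) {
    return [{ type: "WithdrawalRejected", amount }];
  }
  return [{ type: "WithdrawalAccepted", amount }];
}
```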
What I suspect @CorvusCrypto is talking about is event handlers, which are in essence commands but are usually asynchronous.
They are triggered when another event is seen, but could theoretically happen whenever you like. It could be in real time as events are stored, or it could be a week later in some subscriber process that batches and runs on an interval.
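The batched variant is just an interval loop over whatever accumulated since the last run (a sketch):

```typescript
// An event handler decoupled in time: each tick drains whatever
// events accumulated since the last run, be it a second or a week ago.
function startBatchedHandler<E>(
  drain: () => E[],             // pull pending events from storage
  handle: (batch: E[]) => void, // the actual handling logic
  intervalMs: number,
): ReturnType<typeof setInterval> {
  return setInterval(() => {
    const batch = drain();
    if (batch.length > 0) handle(batch);
  }, intervalMs);
}
```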
I separate commands from event handlers like this because commands tend to be very easy to modify and change in the future. They're extremely decoupled, in that they just emit events, which can easily be tested without a lot of replay or worry about inter-event coupling.
Event handlers, on the other hand, tend (depending on type) to be very particular/temperamental about how and in what order they get triggered.
I also find that systems with a lot of fat event handler logic have a lot more unknown/hidden complexity; keeping as much of the complexity and business logic as possible in the front end (RPC included) results in a much simpler distributed system.
All this hinges on the fact that I'm sticking to strict event sourcing, where events are recorded after the fact and simply represent state changes, which are then reduced and normalized per system needs.
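"Reduced and normalized" is meant literally: current state is a fold over the after-the-fact events (a toy sketch):

```typescript
type CartEvent =
  | { type: "ItemAdded"; sku: string }
  | { type: "ItemRemoved"; sku: string };

// Events only record what happened; state is rebuilt by reducing
// the history, normalized however this particular system needs it.
function reduceCart(events: CartEvent[]): Set<string> {
  const cart = new Set<string>();
  for (const e of events) {
    if (e.type === "ItemAdded") cart.add(e.sku);
    else cart.delete(e.sku);
  }
  return cart;
}
```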
I would also like to point out that I was careful here not to mention any kind of message bus or publishing, because CQRS and event sourcing are standalone architecture choices.
CQRS/ES does not require a message bus; in fact it specifically sucks with a message bus at the core of your design, because the bus forces eventual consistency and makes itself the source of truth.
CQRS/ES systems can have multiple message buses and employ them to increase availability and throughput, trading off consistency. But CQRS/ES itself should not force you to make this trade.
A message bus is a tool for distributing processing across computers. It is not, and should not be, the central philosophy of your architecture. You should be able to continuously pipe your event store through a persistent RabbitMQ for one system that is bottlenecked by some third-party API with frequent downtime problems, and you should be able to continuously pipe your event store through some ZeroMQ setup for fast real-time responsiveness in another system. Whether or not you choose to introduce system-wide inconsistency (or "eventual consistency") in order to pipe your events into your event store is up to you, depending on whether the increased availability is worth the trade-off.
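Sketched out, with the store staying authoritative and every interface here hypothetical: each pipe is just a cursor-driven loop publishing into whichever transport that subsystem needs, and you run one pipe per subsystem:

```typescript
// The event store is the source of truth; a pipe is a consumer
// with its own cursor into it.
interface EventStore { readFrom(cursor: number): Promise<{ events: unknown[]; next: number }>; }
interface Bus { publish(event: unknown): Promise<void>; } // RabbitMQ, ZeroMQ, whatever

// Continuously pipe the store into one bus. Run several of these,
// one per subsystem, each tuned to that subsystem's trade-offs.
async function pipe(store: EventStore, bus: Bus, cursor = 0): Promise<never> {
  while (true) {
    const { events, next } = await store.readFrom(cursor);
    for (const e of events) await bus.publish(e);
    if (events.length === 0) await new Promise((r) => setTimeout(r, 100)); // idle backoff
    cursor = next; // persist per pipe in real life so restarts resume
  }
}
```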