I've worked with several event sourcing systems and was even seduced into implementing one out of sheer hubris once. These problems are ever present in every ES project I've had the misfortune of coming into contact with. It doesn't even mention the worst part that comes afterwards, when you realize after all of that pain that it is only used by a single person in the company to generate a noncritical report comparing a meaningless KPI that could have been manually done in four hours by a different intern every quarter. By the time the tooling is up and running enough to make a stable system, no one will trust it enough to use ES's landmark features except for a few developers still coming off the koolaid.
99% of the time when ES sounds like a good idea, the answer is to just use Postgres. Use wal2json and subscribe to the WAL stream - if you really need time travel or audit logs or whatever, they'll be much cheaper to implement using WAL. If you need something more enterprisey to sell to the VP-suite, use Debezium.
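To give a rough idea of what consuming that stream looks like once the JSON arrives, here is a minimal sketch that flattens one wal2json transaction into audit-log rows. It assumes wal2json's default (format-version 1) message shape; the sample payload is hand-written, and the `orders` table and the `audit_rows` helper are hypothetical names for illustration. Connecting to the replication slot itself is left out.

```python
import json

# Hand-written sample, shaped like wal2json format-version 1 output (assumption).
SAMPLE = """
{"change": [
  {"kind": "insert", "schema": "public", "table": "orders",
   "columnnames": ["id", "status"], "columntypes": ["integer", "text"],
   "columnvalues": [1, "new"]},
  {"kind": "update", "schema": "public", "table": "orders",
   "columnnames": ["id", "status"], "columntypes": ["integer", "text"],
   "columnvalues": [1, "paid"],
   "oldkeys": {"keynames": ["id"], "keytypes": ["integer"], "keyvalues": [1]}},
  {"kind": "delete", "schema": "public", "table": "orders",
   "oldkeys": {"keynames": ["id"], "keytypes": ["integer"], "keyvalues": [1]}}
]}
"""

def audit_rows(payload: str) -> list:
    """Flatten one wal2json transaction into audit-log rows (hypothetical helper)."""
    rows = []
    for change in json.loads(payload).get("change", []):
        old = change.get("oldkeys", {})
        rows.append({
            "op": change["kind"],
            "table": f'{change["schema"]}.{change["table"]}',
            # Key of the affected row, when the WAL record carries one.
            "key": dict(zip(old.get("keynames", []), old.get("keyvalues", []))),
            # New column values (empty for deletes).
            "new": dict(zip(change.get("columnnames", []),
                            change.get("columnvalues", []))),
        })
    return rows

if __name__ == "__main__":
    for row in audit_rows(SAMPLE):
        print(row)
```

Rows like these can be appended to an ordinary table, which is most of what an audit log needs.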
Event sourcing sounds so awesome in theory and it is used to great effect in many demanding applications (like Postgres! WAL = event sourcing with fewer steps) but it's just too complex for non-infrastructure software without resources measured in the 10s or 100s of man years.
This is a common problem I see across many things.
We had a guy who spent two weeks writing a report script that collated data uploaded into S3 and wrote out another data lump into S3 after some post processing then sent this data lump to a guy via email. This entire thing had a bunch of lambda layers to pull in the pipeline, a build pipeline in jenkins, terraform to deploy it. The python script itself was about 200 lines of the equivalent of spreadsheet cell manipulation basically.
Only after it was implemented we found out it was only run every 3 months and took the original dude 5 mins to paste it into his existing excel workbook and the data popped out in another sheet instantly. This was of course a “massive business risk” and justified a project to replace it. What it should have been was a one page thing in confluence with the business process in it and the sheet attached. But no it now needs a GitHub repo full of shite and 2 separate people to understand every facet of it who require 3x the salary of the original dude each. And every 3 months, something is broken or it pukes an error which requires someone 3 hours of debugging and reverse engineering the stack to work out why.
Fundamentally, event sourcing, as the above, is the outcome of seeing something shiny and covered in marketing and wanting to play with it on any available business case rather than doing a rational evaluation of suitability. There are use cases for this stuff, but sometimes people don’t know when to apply the knowledge they have.
As for ES itself, I have resisted this one big time, settling on CQRS-class functionality, which is quite frankly no different to the query/mutate database APIs using stored procedures in PL/SQL I dealt with in the dark ages. On top of that, people generally don’t have the architectural skills or conceptual ability to reason about the effect of the change on data consistency, or about the recovery concerns. I would rather find another way to skin the scalability cat than use that architecture.
You don't give people enough credit. I know damn well that the "complicated" solution is at best a monumental waste of money but doing the same trivial CRUD shit for years on end is neither intellectually stimulating nor good for my career. Until "the business" finds a way to change that I will use every available opportunity for "resume-driven development". Really, I don't give a crap if the shareholders make money or not. I only care about my own paycheck.
I wouldn't let the overengineering of a single script influence your view of the technologies you named (S3 data laking, lambda pipelines, build in jenkins, terraform to deploy). Those are well understood concepts and technologies for reporting. The problem here is that these tools were not in use at the company. Once you have 50 scripts run by different people on different schedules in different Excel versions, having everything go through S3 (immutable, cheap), ingested by lambdas (scalable, well versioned, monitored), and built and deployed in jenkins using terraform (vs say, aws command line scripts) significantly reduces the complexity of the problem.
Event sourcing can be a very powerful pattern if used correctly. You don't need to combine ES with Eventual Consistency. ES can be implemented in a totally synchronous manner and it works very elegantly, nicely capturing all the business events you need. You get a very detailed audit for free and you don't lose important business data. You can travel back in time, construct different views of data (projections) etc. Most of the complications arise when you bring in Eventual Consistency, but you absolutely don't have to.
I agree. Audit and history functionality have been the motivating features for me in building systems that are "ES-lite," where the event stream is not a general-purpose API and is only consumed by code from the same project.
Sometimes the ability for an engineer to fetch the history of a business object out of a datastore and explain what it means checks an "audit trail" requirements box. Sometimes showing the history of a business process in a UI, with back-in-time functionality, is a game changing feature. If so, using event sourcing internally to the service is a great way to ensure that these features use the same source of truth as other functionality.
Where you get into trouble is when you realize that event sourcing will let you distribute the business logic related to a single domain object to a bunch of different codebases. A devil on your shoulder will use prima facie sound engineering logic to tell you it is the right thing to do. The functionality relates to different features, different contexts, different operational domains, so keeping it in the same service starts to feel a bit monolithic.
But in practice you probably don't have enough people and enough organizational complexity to justify separating it. What happens is, a product manager will design an enhancement to one feature, they'll work out the changes needed to keep the product experience consistent, and the work will get assigned to one engineer because the changes are reasonable to get done in one or two sprints. Then the engineer finds out they have to make changes in four different systems and test six other systems for forwards compatibility. Oops. Now you have nanoservices-level problems, but only microservices-level capabilities.
If you aren't tempted down that path, you'll be fine.
For an interactive application it's an absolute nightmare when you don't have read-your-own-writes support (including synchronization for "indexes"). Async background updates aren't always appropriate or easy to do.
few questions:
- what was your biggest ES system that you worked on?
- how many people worked on it?
- how did the ES system particularly solve your problem?
- what was the tech stack?
No, I'm not a consultant. Maybe lead developer is the most accurate title for me :) By what criteria do you mean the biggest system? They certainly weren't toy projects; these are real systems in active use solving real world problems for thousands of users. The ES part of these systems is mainly implemented in .NET using the excellent Marten DB as the ES store on top of PostgreSQL. I would say that ES drastically changes how you model things and see the business. It forces you to identify and define meaningful events. These are often actually something that non-programmers also understand, so as a by-product it also greatly improves communication with your clients (or in DDD terms, domain experts :)). Scaling also has not been a real issue; these systems can handle thousands of operations per second without exotic hardware and setups, all this mostly synchronously.
And I must add, use the right tool for the job, there are many cases where ES is not a good fit. Also, if you choose to use ES in your project, you don't have to use ES for everything in that project. Same thing actually applies to asynchronous processing. If something in the system doesn't scale synchronously, that doesn't mean you now should do everything asynchronously.
Not the OP, not a consultant, the biggest ES system I work on is an order management system in one of the biggest investment banks in APAC, which processes orders from clients to 13 stock exchanges.
40 people approx work on it in Asia, maybe around 100 globally at the raw dev level.
I feel it's a sort of false good idea for our particular problems. Clients trade either at the 10ms latency for high value orders or sub-ms for latency-sensitive ones. They conceptualise what they want to buy and a bunch of changes they want to apply on that bulk: for instance, change in quantity, price limits, time expiry and try to make money by matching a target price either by very closely following a prediction curve or spending as little time on doing so (giving us only a constraint and asking us to fit it by doing our own curve fitting algos).
The tech stack is pure java with a kernel of C++ for networking, with as little external dependency as possible. And no GC beyond the first few minutes after startup (to prealloc all the caches). Everything is self built.
How does it solve the problem: it simply is the only way we can think of to over optimize processing. If you want to fit a millisecond or sub millisecond target (we use fpga for that), you must cut the fat as much as possible. Our events are 1kb max, sent in raw tcp, the queue is the network switch send queue, the ordering has to be centrally managed per exchange since we have only one final output stream but we can sort of scale out the intermediary processing (basically for one input event how to slice in multiple output events in the right order).
I'd say it doesn't work: we run into terrifying issues the author of the main link pointed out so well (the UI, God, it's hard, the impossible replays nobody can do, the useless noise, solving event stream problems more than business problems etc), but I'd also say I can't imagine any heavier system fitting the constraint better. We need to count the cycles of processing - we can't have a vendor library we can't just change arbitrarily. We can't use a database, we can't use anything heavier than tcp or multicast. I'll def try another bank one day to see how others do because I'm so curious.
Are you really using event sourcing, or just message queues?
I worked for a large US bank, and had close contact with those who worked on the asset settlement system. I don't have as deep insight as you do, obviously. But the general architecture you describe sounds very similar. Except they clearly used message queues and treated them as message queues, and not event streams / sourcing / whatever. They used a specific commercial message broker that specializes in low latency.
My experience also. I've worked with 3 different ones and I think the domain fit was good for only one of them. In that one domain model, the fit was spot on because that domain was just a bunch of events and every question you asked of the domain model was really saying "create me an object with the events between time X and Y". That was it tho, all other domains I've worked on would not suit ES - even though they would suit CQRS without the ES piece.
Thanks a lot for mentioning Debezium. Agreed that a CDC-based approach will be simpler to implement and reason about than ES in many cases. Semantics are a bit different of course (e.g. ES events capturing "intent"); we had a post discussing the two things a while ago on the blog [1]. Speaking of capturing intent, we just added support for Postgres's pg_logical_emit_message() function in today's release of Debezium 1.8.0.Beta1 [2].
Am I crazy for wanting a standardized WAL format that can be treated as an event stream for anything from replicas to search services to OLAP? Why can't we drink straight from the spigot instead of adding abstractions on abstractions?
It's honestly not that hard to build a WAL-style solution exactly the way you want it for your own application. You just have to get away from the "you shouldn't write your own xyz" boogeyman experience long enough to figure it out.
Do you know how to model your events using a type system?
Can you be bothered to implement a Serialize and Deserialize method for each event type, or simply use a JSON serializer?
Do you know how to make your favorite language compress and uncompress things to disk on a streaming basis?
Do you know how to seek file streams and/or manage a collection of them all at once?
Can you manage small caches of things in memory using Dictionary<TK,TV> and friends?
Are you interested in the exotic performance possibilities that open up to those who leverage the ring buffer and serialized busy wait consumer?
If you responded yes to most or all of the above, you are now officially granted permission to implement your own WAL/event source/streaming magic unicorn solutions.
Seriously, this stuff is really easy to play with. If you follow the rules, you would actually have a really hard time fucking it up. Even if you do, no one gets hurt. The hard part happens after you get the events to disk. Recovery, snapshots, garbage collection - that's where the pain kicks in. But, none of these areas is impossible. Recovery/Snapshots can again be handily defeated by the mighty JSON serializer if one is lazy enough to succumb to its power. Garbage collection can be a game of segmenting log files and slowly rewriting old events to the front of the log on a background thread. The nuance is in tuning all of these things in a way that makes the business + computer happy at the same time.
Plan to do it wrong like 15 times. Don't invest a whole lot into each attempt and you can pick this up really fast. Try to just write it all yourself. Aside from the base language libraries (file IO, threading, et al.), JSON serializer and GZIP, you really should just do it by hand because anything more complex is almost certainly wrong.
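For anyone who wants a starting point for attempt number one, here is a minimal sketch of the "events to disk, then recover" part, using only the standard library. The record layout (a 4-byte length prefix followed by JSON) and the `set`/`delete` event types are assumptions picked for illustration; compression, snapshots and log-segment garbage collection are deliberately left out.

```python
import json
import os
import struct
import tempfile

HEADER = struct.Struct("<I")  # 4-byte little-endian length prefix (assumed layout)

def append(path, event):
    """Append one JSON-encoded event and force it to disk."""
    body = json.dumps(event).encode("utf-8")
    with open(path, "ab") as f:
        f.write(HEADER.pack(len(body)) + body)
        f.flush()
        os.fsync(f.fileno())

def replay(path):
    """Yield events in write order, stopping at a torn (partial) tail record."""
    if not os.path.exists(path):
        return
    with open(path, "rb") as f:
        while True:
            head = f.read(HEADER.size)
            if len(head) < HEADER.size:
                return
            (size,) = HEADER.unpack(head)
            body = f.read(size)
            if len(body) < size:
                return  # crash mid-write: ignore the incomplete record
            yield json.loads(body)

def recover(path):
    """Rebuild the in-memory state (a plain dict cache) by replaying the log."""
    state = {}
    for event in replay(path):
        if event["type"] == "set":
            state[event["key"]] = event["value"]
        elif event["type"] == "delete":
            state.pop(event["key"], None)
    return state

if __name__ == "__main__":
    log = os.path.join(tempfile.mkdtemp(), "events.log")
    append(log, {"type": "set", "key": "a", "value": 1})
    append(log, {"type": "set", "key": "b", "value": 2})
    append(log, {"type": "delete", "key": "a"})
    print(recover(log))
```

Opening the file per append and calling fsync every time is the slow, safe default; batching writes is one of the tuning knobs mentioned above.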
100% agree. I only used wal2json because my risk tolerance for the project was between "can't be bothered to pay the onboarding/maintenance cost of Kafka for Debezium" and "can't be bothered to implement a reader for the (well documented IIRC) stable binary WAL format" which is a weird spot to be in. This was a PoC meant to demonstrate how we could implement the features we needed from ES using no more than standard Postgres tooling and a tiny service that was good enough to roll right into production with minor changes. It took under a week, though ideally I would have taken the time to decode the binary WAL format directly after saving the raw stream for safety's sake. Rust wasn't even an option at the time; nowadays it'd be mostly a bunch of macros and annotations with a sprinkling of hand written FromBytes implementations and a tiny bit of IO+serde glue code.
IIRC it took another data scientist and engineer under a month to turn the raw WAL logs into an audit log interface with pretty SSO avatars and weekly reporting. Someone from the devops team with DBA experience implemented time traveling staging DBs with continuous archiving and point in time recovery from production in the same time. Someone else later improved it so PITR used full backups created from a filtered WAL log so devs could select which parts of the production DB they copied over instead of each babying their own staging cluster that took days to rebuild. The whole project ended up giving us all of the benefits of event sourcing using standard, well tested tooling for a fraction of the cost.
We've thrown around the phrase "not invented here syndrome" so much that we've over corrected - as humans are wont to do - and now the younger generation thinks architectures like event sourcing or infrastructure like Kafka are a better solution than to just consume replication logs over a TCP connection to one of the most popular open source databases on the planet (not directed at the GP but my former coworkers :)). I'm starting to wonder if I've reached the age where I sound like the adults in the Peanuts cartoons, except the sound vaguely resembles "Get your resume driven development off my lawn!"
> "can't be bothered to pay the onboarding/maintenance cost of Kafka for Debezium"
Debezium can also be used without Kafka; either via Debezium Engine [1], where you embed it as a library into your JVM-based application and it will invoke a callback method you registered for every change event it receives. That way, you can react to change events in any way you want within your application itself, no messaging infrastructure required. The other option is using Debezium Server [2], which takes the embedded engine to connect Debezium to all sorts of messaging/streaming systems, such as Apache Pulsar, Google Cloud Pub/Sub, Amazon Kinesis, Redis Streams, etc.
> 99% of the time when ES sounds like a good idea, the answer is to just use Postgres.
I like some of the things in Event Sourcing very much.
However, whenever I ask "Why not just use PostgreSQL as the event store until I'm 5 orders of magnitude bigger?" I never seem to get a really good answer.
From your experience, is there no framework or set of libraries, other than wal2json, that would bring the advantages at a lower cost? Is building from scratch still the way to go, with common and more generic tools as the shortcut?
As this article pops up again, I'd like to point out that although it may have some valid points, those points are not about Event Sourcing. What's expressed in the article is the Event Streaming or Event-Driven approach, that is, one where events are not the source of truth etc. All of the event stores that I know support strong consistency on appends and optimistic concurrency. Many guarantee global ordering. Some help with idempotency. All of that helps to reduce those issues.
In Event Sourcing, events are the state. So the flow looks like that:
1. You get the events from the stream (that represents all the facts that happened for the entity),
2. You apply it one by one in the order of appearance to build the current state,
3. You run the business logic and, as a result, create the event,
4. You append a new event.
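The four steps above can be sketched roughly as follows. This is a toy under stated assumptions: `InMemoryEventStore`, `Account`, `withdraw` and the event shapes are hypothetical names made up for illustration, and the version check stands in for the optimistic concurrency that real event stores provide on append.

```python
from dataclasses import dataclass

class ConcurrencyError(Exception):
    pass

class InMemoryEventStore:
    """Toy stand-in for an event store: one ordered list of events per stream."""
    def __init__(self):
        self._streams = {}

    def read(self, stream_id):
        return list(self._streams.get(stream_id, []))

    def append(self, stream_id, event, expected_version):
        stream = self._streams.setdefault(stream_id, [])
        # Optimistic concurrency: reject if someone appended since we read.
        if len(stream) != expected_version:
            raise ConcurrencyError(
                f"expected version {expected_version}, found {len(stream)}")
        stream.append(event)

@dataclass(frozen=True)
class Account:
    balance: int = 0

def apply(state, event):
    """Fold a single event into the current state."""
    if event["type"] == "Deposited":
        return Account(state.balance + event["amount"])
    if event["type"] == "Withdrawn":
        return Account(state.balance - event["amount"])
    return state

def withdraw(store, stream_id, amount):
    events = store.read(stream_id)          # 1. get the events from the stream
    state = Account()
    for event in events:                    # 2. apply them one by one, in order
        state = apply(state, event)
    if amount > state.balance:              # 3. run the business logic
        raise ValueError("insufficient funds")
    new_event = {"type": "Withdrawn", "amount": amount}
    store.append(stream_id, new_event,      # 4. append the new event
                 expected_version=len(events))
    return new_event
```

The decision in step 3 is made against state rebuilt from the stream itself, and the append fails if the stream moved in the meantime, so the write side never acts on a stale copy.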
You don't need to cache the write model state anywhere, as the state is in events. Event Sourcing has its issues and troubles, but the stuff described in the article applies to tools like Kafka, etc. They're not tools for Event Sourcing but Event Streaming. They're designed to move things from one place to another, not to be used as durable databases.
Howdy! Author here ^_^ I'll respond to a few items because, even though I haven't touched the system in a few years, I could still rant endlessly about the mistakes I made building it. Deep scars were acquired!
Firstly, to make sure we're talking about the same thing, where are you setting the bar for whether or not we can call something "Event Sourcing"? For instance, just to clarify, in our system events were indeed the source of truth. At the heart of everything sat a ledger, that ledger was stored in a database, and events were added to it as you describe in points 1-4 (hand waving away minor differences). You got a consistent, up-to-date state of an aggregate by replaying the ledger in order and applying all the events. So, to my understanding, I'd call that textbook. However, the problem may lie in our actual definition of the word.
>You don't need to cache the write model state anywhere, as the state is in events
I'd have to understand a bit more about where you're coming from with this one. While, yep, the state is in the events and you get the latest state by playing them back, that materialization is not free. If you're doing it for a single aggregate, then it's generally not a huge deal at small scales. However, once you need to make decisions against more than one thing, or compare things, or query things, that materialization cost becomes prohibitive, and you've got to set up materialized projections. With our tech stack, that had to be an eventually consistent process, which is what got us into event streaming, and which ultimately caused most of my ugly crying.
As someone who's evaluating event sourcing for a real business use case, I really appreciated your article. I wish more people published their event sourcing war stories, rather than most resources being exclusively from consultants selling snake oil.
I think the point OP is making is actually the same as what you conveyed in your article--several of the worst pain points you experienced aren't inherent to Event Sourcing, they're due to a particular implementation of Event Sourcing. The problem is that that implementation is pretty much the canonical one sold by Martin Fowler et al.
I can think of a few specific ways in which event sourcing could be different than what you experienced:
1) Distributed computing is always hard. Event Sourcing is not a magical cure-all to distributed computing woes, but it's sold as that, and the result is that people go into a distributed event-driven architecture without being prepared for the engineering challenges that await. Eliminate distributed computing, and many problems simply go away.
2) Eventual consistency is always hard, but Event Sourcing doesn't inherently require eventual consistency, even for cached projections. If your projections are pure functions folded over the event stream, then caching and updating the materialized projections with strict consistency is straightforward. In the same transaction as inserting a new event, you run your projection function once on each of the materialized projections and the new event (`state = projection(oldState, event)`). This is only possible when you're not trying to create your projections in a distributed fashion, so see #1.
3) Language choice makes a big difference for reducing the cognitive overhead of updating projections. Your compiler should be able to tell you every place where you need to update a projection to account for a new event. Preferably, you should be able to define a hierarchy of events so that it only yells at you about places where the event is actually relevant.
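To illustrate point 2, here is a small sketch of updating a materialized projection in the same transaction as the event insert, using SQLite from the standard library. The table names, the `balances` projection and the event shapes are all assumptions for the example; the only idea taken from the text is `state = projection(oldState, event)` running inside the transaction that stores the event.

```python
import json
import sqlite3

def open_db():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE events (seq INTEGER PRIMARY KEY AUTOINCREMENT, "
               "stream TEXT NOT NULL, body TEXT NOT NULL)")
    db.execute("CREATE TABLE balances (stream TEXT PRIMARY KEY, "
               "state TEXT NOT NULL)")
    return db

def projection(old_state, event):
    """Pure function: (old state, event) -> new state."""
    if event["type"] == "Deposited":
        delta = event["amount"]
    else:  # "Withdrawn" in this toy model
        delta = -event["amount"]
    return {"balance": old_state["balance"] + delta}

def append_event(db, stream, event):
    # `with db` commits on success and rolls back if anything raises, so the
    # event row and the projection row change together or not at all.
    with db:
        db.execute("INSERT INTO events (stream, body) VALUES (?, ?)",
                   (stream, json.dumps(event)))
        row = db.execute("SELECT state FROM balances WHERE stream = ?",
                         (stream,)).fetchone()
        old_state = json.loads(row[0]) if row else {"balance": 0}
        new_state = projection(old_state, event)
        db.execute("INSERT OR REPLACE INTO balances (stream, state) "
                   "VALUES (?, ?)", (stream, json.dumps(new_state)))
    return new_state

if __name__ == "__main__":
    db = open_db()
    append_event(db, "acc-1", {"type": "Deposited", "amount": 50})
    print(append_event(db, "acc-1", {"type": "Withdrawn", "amount": 20}))
```

Because the projection is a pure fold, the `balances` table can also be thrown away and rebuilt at any time by replaying `events` in `seq` order.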
Thanks for answering! The bar is as I described. The events are the source of truth only if you're using them in the write model as a basis for the state rehydration. If you're using a materialised view, even though it's built based on events, then you've outsourced the truth to other storage. If you're following a pattern where you just use events to build up the materialised view you use for the write model logic, that can lead to using Event Streaming tools like Kafka, Pulsar, etc. And, as you said, this is a dead-end. Read more on what I mean by getting the state in an Event Sourcing system: https://event-driven.io/en/how_to_get_the_current_entity_sta...
Of course, rebuilding a state from events each time you're processing a command may sound dubious. Still, event stores (even those with relational DBs as backing storage) are optimised to read events quickly. Typically, reading 100 events is not an issue for them, and that is also the case for a typical LoB application. The temptation to use Snapshots is strong, but they should be treated as an optimisation for when there are huge performance requirements. They may get out of sync with events. They may be stale, etc. I wrote about that in: https://www.eventstore.com/blog/snapshots-in-event-sourcing. The critical part is to keep streams short. That's again a difference between Event Streaming and Event Sourcing. In Kafka, you may not care how long your topic is, as it's just a pipe. Event stores are databases, so the more events you have, the worse. It impacts not only performance but also makes schema versioning harder (more on that: https://event-driven.io/en/how_to_do_event_versioning/).
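As a small illustration of snapshots being only an optimisation, the sketch below rebuilds state either from the full stream or from a snapshot plus the events after it. The `{"version": n, "state": s}` snapshot shape and the function names are hypothetical; the point is just that the events remain the source of truth and a snapshot can always be checked against them.

```python
def apply(balance, event):
    """Fold one event into the running balance."""
    return balance + event["amount"]

def rehydrate(events, snapshot=None):
    """Rebuild state from the stream, optionally starting from a snapshot.

    `snapshot` is a hypothetical {"version": n, "state": s} record meaning
    "state after the first n events".
    """
    version = snapshot["version"] if snapshot else 0
    state = snapshot["state"] if snapshot else 0
    for event in events[version:]:  # replay only what the snapshot has not seen
        state = apply(state, event)
    return state

def snapshot_is_valid(events, snapshot):
    """Compare a snapshot with a full replay of the events it claims to cover."""
    return rehydrate(events[: snapshot["version"]]) == snapshot["state"]
```

With short streams the `snapshot=None` path is usually cheap enough that the second code path, and the staleness checks that come with it, never need to exist.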
It's easy to fall into the trap, as using a streaming solution is tempting. They promise a lot, but eventually, you may end up having issues as you described.
Still, that can be said of any technology. Event Sourcing has some dark parts, and it's essential to highlight them, but it's dangerous to present the issues of other tools and patterns as related to ES.
Event Sourcing by itself doesn't directly relate to eventual consistency, type of storage, messaging, etc. Those are implementation details and tradeoffs we're choosing. Each storage solution has its patterns and anti-patterns. Relational databases have normalisation. Document databases are denormalised. Key-value stores have strategies for key definition. Event stores also have their specifics. The most important is (as I mentioned) to take into account the temporal aspect of streams, so keeping them short. I agree that there is a huge gap in knowledge sharing. I've been working on an article about that aspect recently. Here's a draft: https://github.com/EventStore/blog-articles/blob/closing-the....
I think that you highlighted well what can happen if you use an Event Streaming solution as a basis for an Event Sourcing solution. But the title and description state that it's about Event Sourcing, which is unfortunate, as it's highly misleading for people who are not aware of what Event Sourcing is and may suggest that it's much more complicated than it is in reality.
See also my other comments below in the thread, where I also commented in more detail on other parts.
You're right that there is a distinction between event sourcing within a service and events as coupling between services. But as someone who architected a service that used event sourcing internally and did not use event streaming, I feel the author's points still stand. I also currently work on a very complex application built on event sourcing principles, and again, the same challenges are present.
The reality is that event sourcing brings a lot of challenges that have to be grappled with. It's a great paradigm, but it's not easier than in-place mutation. At least, not without a bunch of tooling.
I re-read the article after this comment, and I would have to disagree. The article never attempts to explain "what" event sourcing means to them, so it's hard to know for sure. However, they do mention populating state from an event log which contains meaningless events, so I have to assume they _are_ talking about event sourcing.
What leads you to believe they are talking about just streaming, and not sourcing?
Could you provide an exact quote? I haven't been able to find any usage of "populate". I found:
`the raw event stream subscription setup kills the ability to locally reason about the boundaries of a service.`
`you have to talk to the people who will be consuming the events you produce to ensure that the events include enough data for the consuming system to make a decision`
`wire a fleet of services together via an event stream`
etc.
Plus the drawing that's at the beginning.
This clearly shows that the author doesn't understand what Event Sourcing is and is talking about Event Streaming. Event Streaming means:
1. Producer pushes/produces/publishes an event to the queue. In this sense, an event stream is a pipe where you put an event at the end and read it on the other side.
2. Then, you have a set of subscribers that can react to those events. The reaction may be triggering the next step of the workflow. It might also be updating the read model state. That happens with eventual consistency. You also cannot guarantee atomicity or optimistic concurrency, as tools like Kafka don't support them; that's not what they were originally built for. They let you effectively publish and subscribe to the messages.
3. Because of that, if you want to have the write model materialised from events (or even "populated"), then you don't have any guarantee whether it's stale or not. You need to fight with out of order messages and idempotency. Hence all of those issues that are described in the article.
I'm sorry to say, but the author used the wrong tool for the job. If this article were titled "Event Sourcing is hard if you're confusing it with Event Streaming" then it'd be an excellent article. But, in its current shape, it's just misleading and making the wrong point, repeating common misunderstandings.
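To make the "fight with out of order messages and idempotency" part concrete, here is a minimal sketch of a subscriber that tolerates duplicate and reordered deliveries. It assumes the producer stamps every event with a per-stream sequence number `seq` starting at 1; that field, the `ReadModel` class and the `amount` payload are hypothetical, not something a particular broker gives you.

```python
class ReadModel:
    """Consumer that tolerates duplicate and out-of-order delivery."""

    def __init__(self):
        self.total = 0
        self.next_seq = 1   # next sequence number we are allowed to apply
        self.pending = {}   # events that arrived early, keyed by seq

    def handle(self, event):
        seq = event["seq"]
        if seq < self.next_seq or seq in self.pending:
            return          # duplicate: already applied or already buffered
        self.pending[seq] = event
        # Drain every buffered event that is now contiguous with what we applied.
        while self.next_seq in self.pending:
            self.total += self.pending.pop(self.next_seq)["amount"]
            self.next_seq += 1
```

Even with this in place the model is only as fresh as the last contiguous event it has seen, which is why it suits read models better than decisions on the write side.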
You seem to be saying that event sourcing and event streaming are mutually exclusive.
In almost all cases of event sourcing I have seen, event sourcing is event streaming, but more. When you say event sourcing, where do other services source from, if there is no stream?
Yes, event sourcing typically uses a database instead of a real "queue", but it still functions like a queue. You still need consumers to subscribe to events, in order to populate state, as timely as possible.
They're not mutually exclusive, but they're orthogonal. They're different patterns that happen to integrate with each other. Event Sourcing is about durable state stored and read as events; Event Streaming is about moving events from one place to another.
Event stores can integrate with streaming platforms to publish events and move them forward. Most of them even have a built-in implementation of the Outbox Pattern that allows publishing events. These are usually called subscriptions. They can be used to build read models, trigger other flows, or route events to streaming platforms.
The essential point is that for read models, you don't expect to have the same guarantees as for write models. It's fine to have to deal with idempotency, or even out-of-order events, as read models shouldn't be used (in general) for business logic validation. In Event Streaming, your models are stale. In Event Sourcing, the write model is not stale; you can be sure that you're making a decision on the current state. Thus, most of the points mentioned in the article come just from struggles to use event streaming solutions as tools for storing events. That is not an issue of Event Sourcing or event stores per se, but of using the wrong tool for the job.
An app using event sourcing does not need to be consuming events from an external messaging service nor does it need to be publishing messages to one. It could very well be receiving requests over a synchronous REST API from an HTML/JS UI.
I believe the point OP is trying to make, is that integration with a messaging service/queue is a separate concern from what event sourcing solves but that the internal event pipeline often looks very similar to a networked one and so the two solutions are conflated.
They were probably also sourcing their state from the events. However most of their problems come from sharing the events between the modules/services which is not a part of Event Sourcing.
I think that they were just building the stale read models, and used them as the write model, which created the whole confusion.
Regarding sharing events between modules, it's one of the most common and most dangerous mistakes. It's a leaky abstraction that will eventually create a distributed monolith. It has only the downsides of a monolith and microservices, without the upsides.
I wrote longer on the topic of internal and external events, and how to model them: https://event-driven.io/en/events_should_be_as_small_as_poss...
What you are talking about is CQRS, which is a very valid pattern, and pairs well with event sourcing, but is not a necessary part of event sourcing. You don't have to split your read and write models for event sourcing.
I am confused when you say sharing events isn't part of event sourcing. How does a service populate its state from other services' event source if it can't access its events?
> How does a service populate its state from other services' event source if it can't access its events?
Because it's the source of the events and its own system of record.
If an event sourced app wants to share events, it should not be re-using internal events but creating new items intended for distribution just like you would with any other distributed system (thrift/protobuf over Kafka).
I think one major problem is that "Event Sourcing" can mean subtly different things to different people.
> The idea of a keeping a central log against which multiple services can subscribe and publish is insane.
This really doesn't mean "Event Sourcing" to me, it sounds like enterprises that have decided Event Sourcing == Kafka (or some cloud-hosted IOT-branded variant) and treat the central broker/coordinators/confluent-cloud crap as "the central log"
To me, the fundamental idea is that changes are recorded in a meaningful format, not the result of changes; What I mean by "meaningful" is important: I don't think SQL statement replication (as implemented by popular SQL databases) constitutes "event sourcing", because the only part of the application that typically consumes this is the SQL database, but if you're storing JSON in a log (or better: some useful binary format) then you're probably doing event sourcing.
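As a toy illustration of "record the change, not the result", here is a minimal Python sketch. The account/deposit domain, the `record` and `replay` helpers, and the in-memory list standing in for a log file are all assumptions for the example, not anything from the article.

```python
import json

log_lines = []  # stand-in for an append-only file or table

def record(event_type, **data):
    """Append one change to the log as a JSON line."""
    log_lines.append(json.dumps({"type": event_type, **data}))

def apply(state, event):
    """Fold a single event into the current balances."""
    balance = state.get(event["account"], 0)
    if event["type"] == "deposited":
        balance += event["amount"]
    elif event["type"] == "withdrawn":
        balance -= event["amount"]
    return {**state, event["account"]: balance}

def replay(lines):
    """Rebuild state from nothing but the recorded changes."""
    state = {}
    for line in lines:
        state = apply(state, json.loads(line))
    return state

record("deposited", account="a1", amount=100)
record("withdrawn", account="a1", amount=30)
record("deposited", account="a2", amount=5)

balances = replay(log_lines)
print(balances)
```

The balances are never stored anywhere; they are derived, and any other consumer can read the same JSON lines and derive something else.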
With this (perhaps broad) view of Event Sourcing, I would say I have been building all of my database applications for the last 30 years as "event sourcing", and I've never had any trouble building applications (even large ones!) using this model: It lends itself to an excellent user-experience simply by letting the UI "in on it", and the article seems to recognise this:
> Event sourcing needs the UI side to play along
If your user makes a change, and for whatever reason that change takes time to do, it is better to tell the user you're going to do it and then do it rather than wait for the HTTP request (or TCP message or whatever) to "complete" the transaction online because network always fails and bad error messages (the default!) cause users to do stupid things.
But when you use Google Cloud's UI, you can see how nice this can be: You make a change, you can look in the event log to see your changes, you can see if they've been processed or not, and you can examine and share the results simply by sharing the results pages (instead of having to make screenshots or other ghastly things).
I think for many applications this is worth it, but there aren't good tools for event sourcing (in my opinion) so it may be for a lot of applications (especially the ones I don't work on) the juice just isn't worth the squeeze -- but to me, this suggests value in developing those tools, not in ignoring this amazing thing.
Thank you for showing me this list. I had a peek at a few items and it seemed really comprehensive.
It is frustrating to argue whether "Event Sourcing" is good or bad if we have different definitions of what it is, but I don't know of a better name for the thing that I think is good, so it is helpful to point at a body like this to say this is what I mean.
For a long time, I thought about mailing lists: Back in the 1990s we had mailing list applications that you would send email to and they would do various things based on the subject line or the recipient (To) address or something like that, and so you would "join the stream" (subscribe to the list) with one email, and "leave the stream" (unsubscribe) with another; You could publish a new message (it just had to fit in an RFC821 datagram) and processing would both distribute those messages and build "views" of that data in archives accessible out-of-band (such as on the web or in gopherspace). Sequencing messages could be done by simply refusing (tempfail) messages received out of order, and persistent errors would be surfaced (eventually) in the form of an NDR which could itself have a process running on those messages.
I think it is a small thing to imagine a lower-latency email queue, changing the datagram or to manipulate the log (queue) independent of normal processing (such as removing PII -- or in our case, spam messages!) and to create other "processors" for messages for administrative tasks (like setting up customer accounts, or shutting them off for non-payment!) with a little bit of shell scripting, that if you had this kind of experience, most of what constitutes "Event Sourcing" probably doesn't seem very hard, and if you haven't had this kind of experience, that these things may dominate the design (e.g. Confluent+anything) and lead unfairly to a bad impression about "Event Sourcing"
That's not to say I don't think there are hard parts to these kinds of architectures, just that I think those hard parts are usually worth it.
Event Sourcing is absolutely brilliant if you want to build an offline first/distributed system which will become eventually consistent.
A good example would be a tree inspection application where jobs can be pushed out to mobile inspectors who might not have phone signal but can still collect the data. Once synchronised, views are simply additive. More data can be added to cases simply by adding events.
I would absolutely not use ES for a hotel booking system because that data needs to be as accurate as possible in the moment and we've had record locking techniques for over 40 years.
Seems a lot of people want to try ES without really considering whether it maps to the real-world domain of the problem they're trying to solve.
I've followed this rough pattern to build a rock-solid offline-first core for a personal application: https://flpvsk.com/blog/2019-07-20-offline-first-apps-event-... It's got some pain points; in particular it's quite difficult to evolve the schema once your app is out in the wild. But event sourcing is very amenable to TDD, and once all my tests were passing I had a nearly bulletproof solution for my current schema.
Strangely, offline-first functionality is usually not what event sourcing is pitched for. Maybe it's just a legacy of its enterprise origins, but I often see it sold as a solution for enterprise applications where it's liable to cause more trouble than it's worth.
Event Sourcing can be as consistent as any other data source. For a hotel booking company, I'd have each hotel be a unique aggregate stream, and with expected version assertion, a ReserveRoom Command for a room would only succeed if the latest state had an opening for that room. As soon as someone successfully reserves a room, all further reserve commands fail until another event frees up the room. The expected version assertion on writes guarantees this.
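Here is a rough in-memory sketch of that expected version check in Python. The `InMemoryEventStore` class, the event shapes, and `reserve_room` are hypothetical; a real store would do the version comparison atomically on the server side, which this toy does not attempt.

```python
class ConcurrencyError(Exception):
    """Raised when the stream moved on since the caller last read it."""

class InMemoryEventStore:
    def __init__(self):
        self._streams = {}

    def read(self, stream_id):
        return list(self._streams.get(stream_id, []))

    def append(self, stream_id, events, expected_version):
        stream = self._streams.setdefault(stream_id, [])
        if len(stream) != expected_version:
            raise ConcurrencyError(
                f"expected version {expected_version}, stream is at {len(stream)}"
            )
        stream.extend(events)

def reserved_rooms(events):
    """Current state of one hotel: the set of rooms that are taken."""
    rooms = set()
    for e in events:
        if e["type"] == "RoomReserved":
            rooms.add(e["room"])
        elif e["type"] == "RoomFreed":
            rooms.discard(e["room"])
    return rooms

def reserve_room(store, hotel_id, room, guest):
    """Handle a ReserveRoom command against the latest state of the hotel stream."""
    history = store.read(hotel_id)
    if room in reserved_rooms(history):
        return False
    event = {"type": "RoomReserved", "room": room, "guest": guest}
    store.append(hotel_id, [event], expected_version=len(history))
    return True

store = InMemoryEventStore()
first = reserve_room(store, "hotel-1", "101", "alice")
second = reserve_room(store, "hotel-1", "101", "bob")  # room already taken

# Simulate a race: a writer that read the stream at version 0 writes late.
try:
    stale = {"type": "RoomReserved", "room": "101", "guest": "carol"}
    store.append("hotel-1", [stale], expected_version=0)
    stale_write_rejected = False
except ConcurrencyError:
    stale_write_rejected = True

print(first, second, stale_write_rejected)
```

The losing writer gets an error instead of a silent double booking, and can re-read the stream and retry or give up.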
> A good example would be a tree inspection application where jobs can be pushed out to mobile inspectors who might not have phone signal but can still collect the data. Once synchronised, views are simply additive. More data can be added to cases simply by adding events.
This isn't a great use case for Event Sourcing either. An inspector must be in front of the tree to inspect it, right? And they can see each other inspecting the tree.... Why would several inspectors inspect the same tree?
Assuming they inspect the same tree at different times, you can sync offline data at the row level, and let the last update win, provided that they're entering the same data.
Presumably the application shows both the inspection of a single tree (by a single inspector), and handles missing events gracefully, as well as an overview of the state of a whole forest as different inspection events are coming in. I am in fact building such a tree inspection application, and all the client has to care about is the delivery of the events the UI has created, which is handled by a library. I don't need to concern myself with retrying or persisting my inspection events, and can collate the events on the local device in the same manner as the backend ultimately will. In this case the event sourcing model really matches the domain: inspection events that will ultimately be delivered.
You nailed it! Those missing events will slot into place as offline devices come online.
It works for any model where the use case doesn't really matter about up-to-the-minute accuracy, but for case generation and collation over time, it really excels.
We found the ES events way too chatty to push back out to mobile (in 2012) so we push the aggregates back out to the mobiles.
Plus, if you're capturing GPS coordinates, running it through Kibana for real time map reporting is really exciting to watch!
Where did I say multiple inspectors are inspecting a tree? There might indeed be multiple people dealing with a tree. The tree owner, an inspector, a tree surgeon, the local authorities (if the trees are protected).
Synchronising data at row level when offline becomes online is extremely hard to get right and reliable.
With the Event Sourcing model, every new piece of data gets thrown into the bucket, and with each new piece of data the view becomes bigger.
> Synchronising data at row level when offline becomes online is extremely hard to get right and reliable.
Why? Are the users in different time zones? If the time on their devices is synchronized, why is it hard to get right? I'm not trying to pick an argument with you, but trying to figure out if you've considered all the possible alternatives.
For a start, you're going to have to define your schema. Does one tree have one row or is it one case has one row?
By updating at row level, you've instantly lost the ability to tell the user what time this data changed, unless you're throwing everything in an audit table and at this point you're already halfway to Event Sourcing.
ES gives us a convenient way to throw a pile of information over time and not have to worry about defining the final schema up front, we can do that when constructing the views of aggregates.
> For a start, you're going to have to define your schema. Does one tree have one row or is it one case has one row?
It seems to me you're defining a schema regardless, but with event sourcing your schema is embedded in your events rather than your database. And you're putting off the reconciliation to projection phase. I get it. But you still need to worry about writing transformations for your data to bring it up to date with your latest schema... That is often the most frustrating part.
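For anyone who hasn't hit this yet, the "transformations" usually end up as a chain of upcasters applied when events are read. A minimal Python sketch, where the `TreeInspected` event, its three schema versions, and the field names are invented for illustration:

```python
def upcast_v1_to_v2(event):
    # Hypothetical change: v2 split the "lat,lon" string into numeric fields.
    lat, lon = event["location"].split(",")
    return {
        "type": event["type"],
        "version": 2,
        "tree_id": event["tree_id"],
        "lat": float(lat),
        "lon": float(lon),
    }

def upcast_v2_to_v3(event):
    # Hypothetical change: v3 added "inspector"; old events get a placeholder.
    return {**event, "version": 3, "inspector": "unknown"}

UPCASTERS = {1: upcast_v1_to_v2, 2: upcast_v2_to_v3}
LATEST_VERSION = 3

def upcast(event):
    """Bring a stored event up to the latest schema, one version step at a time."""
    while event["version"] < LATEST_VERSION:
        event = UPCASTERS[event["version"]](event)
    return event

stored = [
    {"type": "TreeInspected", "version": 1, "tree_id": "t1", "location": "51.5,-0.12"},
    {"type": "TreeInspected", "version": 3, "tree_id": "t2",
     "lat": 48.85, "lon": 2.35, "inspector": "sam"},
]

current = [upcast(e) for e in stored]
print(current[0])
```

The stored events are never rewritten; projections only ever see the latest shape. The frustrating part is that every schema change adds another link to this chain, and the chain lives forever.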
Yeah, exactly right, we found it was a few iterations before we had the Base Event Schema defined with enough data for every event.
Of course, with all these articles they miss out that you're not supposed to just do Event Sourcing, it's one part of the system, like we mix RDS and object stores depending on the right purpose.
The big issue, to me, with concepts such as event sourcing is that they are fundamentally presented as software engineering concepts. You get a nice picture showcasing the overall structure and a story or two of how successful this is, but we never discuss computer science matters. I want to know things such as "what safety properties does this system afford me?", "how is it abstractly modelled?", "what global invariants must I uphold for this system to work?". These are missing in regular conversation with other engineers, and I miss this kind of more formal reasoning.
Ultimately, this kind of discussion should form the basis for the answers to questions of value to the business.
Bugs arise and are solved in these informal systems, but I rarely hear people ask "how do we never get this particular class of bug again?". These questions are kept for researchers (either academic or corporation), never to reach the regular software developer.
Event sourcing is reifying a state machine transitions explicitly (in data). So the properties will depend on how the state machine is modeled.
If you don’t have strong ordering guarantees, it also means your state machine has to accept transitions from/to any state (e.g. commutative). Otherwise if ordering is important, your event log must have some ordering built-in (exactly once semantics) or you implement re-ordering and replay as past events arrive (a branching model).
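A small Python sketch of the commutative case, reusing the tree inspection example from elsewhere in the thread. The event fields and the choice of "set union plus max" as the update are assumptions for illustration; the point is only that the resulting view does not depend on arrival order.

```python
from itertools import permutations

def apply(view, event):
    """Commutative, idempotent update: set union for notes, max for severity."""
    tree = view.get(event["tree_id"], {"notes": frozenset(), "worst_severity": 0})
    updated = {
        "notes": tree["notes"] | {event["note"]},
        "worst_severity": max(tree["worst_severity"], event["severity"]),
    }
    return {**view, event["tree_id"]: updated}

def project(events):
    view = {}
    for event in events:
        view = apply(view, event)
    return view

events = [
    {"tree_id": "t1", "note": "dead branch", "severity": 2},
    {"tree_id": "t1", "note": "fungus at base", "severity": 3},
    {"tree_id": "t2", "note": "healthy", "severity": 0},
]

# Project every possible arrival order and compare the results.
views = [project(order) for order in permutations(events)]
order_independent = all(v == views[0] for v in views)
print(order_independent)
```

As soon as an update is not commutative (say, "set status to X"), this property is gone and you are back to needing ordering or re-ordering in the log.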
What is driving the adoption of an event sourced model other than a need for properties/guarantees that a reasonably normalized relational model can't provide? If the engineering team isn't talking about tradeoffs and the formal properties of the system, that seems like an indictment of the local engineering culture rather than anything to do with event sourcing.
The title here is being overlooked, it's not _terrible_, it's _hard_. Also this quote from the article:
> The bulk of these probably fall under "he obviously didn't understand X," or "you should never do Y!" in which case you would be absolutely right. The point of this is that I didn't understand the drawbacks or pain points until I'd gotten past the "toy" stage.
Most of what they discussed was indeed not great, but that was due to lack of understanding. I don't think anyone who does or promotes ES says it's _easy_. However, I would argue it is _conceptually_ simple to understand (herein lies the fallacy people make: simple != easy).
Like any pattern or tech that you are new to, it's going to be hard to build and operate in production (beyond the toy). Everyone takes for granted the decades we've had of great tooling and operational patterns for state-oriented databases (mostly RDBMSs).
And to be clear, event sourcing is a "persistence strategy". What is mostly shown/discussed in the article is "event streaming" along with CQRS. The concepts are largely conflated. Once you understand that, you won't be bamboozled.
Not to throw Confluent under the bus, but they have unfortunately written or promoted many articles for "event sourcing" use cases and this has confused everyone. You _can't_ do event sourcing with Kafka. It doesn't support optimistic concurrency control at the key level (whereas something like EventStoreDB does natively, or even NATS now with JetStream). Kafka is fantastic for what it is designed to do: event streaming.
I've never heard of the term Event Sourcing. I AM familiar with Event Streaming. Some of what this article talks about surprised me - and I think it is because I don't fully understand the difference between Event Sourcing and Event Streaming.
I have not watched that clip, but as I said above, Confluent isn't a good resource for defining this term since Kafka cannot be used to do ES. I would suggest an article like: https://domaincentric.net/blog/eventstoredb-vs-kafka
that first article finds only one issue with Kafka-for-ES: limited number of partitions. Use pulsar, problem solved? Doesn't have the support or ecosystem of Kafka, but surely more than eventstoreDB.
There's two main reasons why Kafka as of today isn't a good fit for implementing event sourcing:
- Lacking support for key-based look-ups: when trying to materialize the current state of an entity, you cannot query Kafka efficiently for just those events (a partition-per-entity strategy doesn't scale due to the partition number limitation, which may change with KRaft in the future)
- Lacking support for (optimistic) locking: you cannot make sure to reject events based on outdated state, unless you apply work-arounds like a single writer architecture
They aren't solving the same problem. An event sourced entity (should) have a very small scope. It acts as a consistency boundary around some bit of state. Each event in "the stream" for that entity represents a state change. This is analogous to a database row which represents the _current_ state of the entity being modeled for that table.
Like a table, there can be an arbitrary number of rows (>millions), thus you could have millions of event streams. You can't have millions of Kafka partitions. The scope/scale of a Kafka partition is not designed for that small granularity.
Again, I have no issue with Kafka and I think Pulsar is superior in the event streaming/platform space. I also think the Kafka ecosystem is impressive and I love listening to the Confluent Cloud podcast! But it is not designed for this use case.
Most common issue I've ever encountered with event sourcing is that devs submit events that when replayed, lead to invalid states.
And if you're using an immutable log (Kafka/Pulsar/Liftbridge etc.) fixing those mistakes is hard.
And given you're basically recording interim states of a state machine, how do you verify that event Z being submitted after event Y is a valid state transition?
Validating a record against a schema is trivial. But validating three consecutive events against the state machine? I'm not aware of a good way to do that cheaply.
I've done it before using apps that sat in the data pipeline, and maintained state to reject invalid transitions, but that can end up being GiB of state, then you have to persist that across restarts or horizontal scaling.
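Roughly, such a gatekeeper looks like the Python sketch below. The order lifecycle, the `ALLOWED` table, and the `TransitionGate` class are made up for illustration. Note the `last_seen` dict: that is the per-entity state that grows with the number of entities and has to survive restarts.

```python
# Hypothetical order lifecycle: previous event type -> event types allowed next.
ALLOWED = {
    None: {"OrderPlaced"},
    "OrderPlaced": {"OrderPaid", "OrderCancelled"},
    "OrderPaid": {"OrderShipped", "OrderCancelled"},
    "OrderShipped": set(),
    "OrderCancelled": set(),
}

class TransitionGate:
    """Sits in the pipeline and rejects events that are not a valid next step."""

    def __init__(self):
        # entity id -> last accepted event type; one entry per entity ever seen
        self.last_seen = {}

    def accept(self, entity_id, event_type):
        previous = self.last_seen.get(entity_id)
        if event_type not in ALLOWED[previous]:
            return False
        self.last_seen[entity_id] = event_type
        return True

gate = TransitionGate()
submitted = ["OrderPlaced", "OrderPaid", "OrderShipped", "OrderPaid"]
results = [gate.accept("o1", event_type) for event_type in submitted]
skipped_start = gate.accept("o2", "OrderShipped")  # never placed

print(results, skipped_start)
```

The check itself is cheap. The cost is keeping `last_seen` consistent across restarts and across however many instances of the gate you end up running.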
It's all doable, but it costs. Is the event streaming paradigm actually worth the cost? I haven't been convinced yet.
But, I'm a big fan of streaming DB change events for auditing purposes - but those are coming from software that already enforces valid state changes.
So maybe the answer is - every event has to be submitted via Postgres.
Seems like a fundamental misunderstanding to me. Commands get validated via your aggregate, not events. Events are not allowed to be rejected. Ever. Because they already happened.
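A minimal Python sketch of that split, assuming a made-up `Account` aggregate: the command handler is where validation lives and where things can be refused, while `apply` takes whatever is already in the history without complaint.

```python
class CommandRejected(Exception):
    """A command was refused; nothing is written to the log."""

class Account:
    def __init__(self):
        self.balance = 0

    def apply(self, event):
        """Fold a past event into state. No validation: it already happened."""
        if event["type"] == "Deposited":
            self.balance += event["amount"]
        elif event["type"] == "Withdrawn":
            self.balance -= event["amount"]

    def handle_withdraw(self, amount):
        """Validate a Withdraw command against current state and emit events."""
        if amount <= 0:
            raise CommandRejected("amount must be positive")
        if amount > self.balance:
            raise CommandRejected("insufficient funds")
        return [{"type": "Withdrawn", "amount": amount}]

def load(history):
    account = Account()
    for event in history:
        account.apply(event)
    return account

history = [{"type": "Deposited", "amount": 50}]

# A valid command produces events, which are then appended.
history.extend(load(history).handle_withdraw(20))

# An invalid command is rejected before any event exists.
try:
    history.extend(load(history).handle_withdraw(100))
    rejected = False
except CommandRejected:
    rejected = True

print(load(history).balance, rejected)
```

This only holds if the command handler is the sole way to get events into the stream. If anything can append directly, the guarantee is gone, which is the objection raised in the reply.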
Thinking that event sourcing and CQRS are synonyms is a fundamental misunderstanding also.
Anyway, pedantry aside, my point remains thus - it's hard to prevent invalid state transitions being submitted to an event sourcing datastore, and when devs can submit events directly, then these invalid transitions will occur.
ES is really great, but just for a narrow set of service types and when it's applied locally in a closed business domain. But as a global system architecture, nope, I tried that once and it only brought more costs.
100% agree. I used it for position management software in equities and it fit like a glove. Effectively, we had a stream of transactions coming from upstream that we aggregated into a cache. Being able to replay events, or step through to figure out why something had gone wrong was terrific.
I then tried to apply this to a conventional request-response web service. It did not fit well. Everything was immediately significantly harder with only a marginal gain.
My rule of thumb now is to use event sourcing when doing otherwise would be difficult. A justification based on an improvement, rather than one solving an obvious problem, isn't worth the extra effort.
> The idea of a keeping a central log against which multiple services can subscribe and publish is insane.
Is event sourcing about storing 1 authoritative, immutable, serialized narrative of events, or is it about tying together a pile of microservices?
I think the part where it goes wrong is where we try to subscribe to these events (i.e. push). This adds delivery semantics to the mix and forces a lot more constraints at vendor selection time. Being notified that something occurred is not part of event sourcing in my mind (though it is clearly one logical extension).
I've built these systems. One major problem is that both scenarios are called event sourcing by different folks. The Confluent (Kafka) people will tell you that ES is a streaming approach while the EventStore folks will tell you that it's about replays, aggregates and immutability. Different blogs refer to wildly different use cases with the same name and almost no one states their assumptions up front.
Don't get me started on explaining snapshots, why you don't need snapshots until replays take 1000ms, why replays won't take 1000ms until the topics are thousands of events deep, why the team won't believe that this isn't a problem, and why you will definitely need snapshots if you keep scaling.
Lots of hype about event sourcing and CQRS, usually in combination. 99% won't need it. Has the potential to ruin your team and even company unless you have a well-above-average engineering team.
This. Why do people try unproven tech in a business? A well-above-average engineering team isn't going to go for this. I would leave before this shit show even got started. The problem should be begging for this pattern if it fits. Like those mentioning the use of ES for high speed stock trading, for example.
One of my unwritten requirements for any system is being able to drop in inexperienced devs for maintenance...after I leave. Most people are likely to shoot themselves in the foot by not understanding (or even being aware of) database isolation levels much less grokking an event pipeline.
This matches closely with my experience working at a company that was mostly built on event-sourcing. I would add a few things, some of which are touched upon in the article:
- The general problem with event-sourcing (at small scale) is that it forces you to have to think about and handle many things that can mostly be ignored with a more typical persistence approach. This overhead of complexity was one of the factors that (IMO) killed a company I worked for. Event-sourcing feels relatively "raw" in this way. However, this also provided a valuable learning experience as a developer.
- If you intend to embark on an event-sourcing journey, read Versioning in an Event Sourced System (https://leanpub.com/esversioning/read), even if only to better understand the complexity of this one aspect. I agree with the author of this submission that it is hard to fully grasp the complexities of event-sourcing without trying it, and hitting many road bumps along the way, since it's not an especially well-trodden path.
- Being able to redefine the interpretation of an event as the system evolves feels like a superpower at times. Similarly, being able to bring a new part of the system online, and have it based on a rich history of everything ever done in the system, is very cool. One of the major benefits of event-sourcing is how adaptable it is in systems that change over time.
- Beware CQRS and eventual consistency. This adds a lot of complexity and is not a prerequisite for event-sourcing. It's also telling that ordinarily-simple problems to solve, like preventing duplicate username registrations, are hand-waved away by CQRS proponents: https://web.archive.org/web/20191101010824/http://codebetter... . You will face this class of problems, and you may not be able to hand-wave it away so easily.
I work at Temporal[1], where we solve a bunch of these problems for you by cleanly abstracting the critical event sourcing bits under our SDKs. I think everyone who tries event sourcing eventually rolls their own framework or takes one off the shelf, and Temporal is the most battle tested one I've ever found.
> The idea of a keeping a central log against which multiple services can subscribe and publish is insane
> In effect, the raw event stream subscription setup kills the ability to locally reason about the boundaries of a service. Under "normal" development flows, you operate within the safe, cozy little walls which make up your service.
I wonder if this isn't more a case of the "cozy little walls" turning out to be a lie.
If your business requirements are that this user action has to trigger those half-dozen barely related background tasks, then your design will always be messy and complex, no matter if you use service calls, event queues, polling or whatever else.
In such a situation, I think a better question would be which design lets you most easily reason about the entire system - e.g., what is the dependency structure, how is the causal chain organized, etc.
Of course there are also "secondary" concerns, such as: Which design lets you make changes to the system easily, can you still easily develop and test things in isolation that can be isolated, etc. I think services and events actually have some upsides here (but I still may be in the "seduced" phase...)
I think this calls more for a better API/dependency management (and treating events as full-fledged APIs!) and a good tracing solution which lets you follow an action through different services (and event queues).
> You're free to make choices about implementation and storage and then, when you're ready, deal with how those things get exposed to the outside world. It's one of the core benefits of "services". However, when people are reaching into your data store and reading your events directly, that 'black box' property goes out the window
Again, I would absolutely treat events just as much as a public API as REST. Just because the causal relationship is inverted doesn't mean you can throw away schemas, versioning, dependency tracking etc.
I've worked on a few event sourcing systems, and the only ones that have been truly successful are the ones where the system was written from scratch. Trying to retrofit systems never seemed to quite work as expected.
Also, the "from scratch" systems, seemed conceptually easier to not only understand, but extend, because you were forced to write code in an idiomatic and consistent way. Everything was an event (albeit, a "command" and an "event" confirming 'execution' of the "command").
I am currently looking at a new iteration of our product that uses an event log as the principal data store. I agree that if we start from "zero" and don't even provide the option to build things the wrong way it would go smoothly. "everything is an event" is very powerful if it's not just some add-on policy.
For us, the entire working set would be able to live in memory. Larger objects would be left as pointers to the actual data in the log. Snapshots of working set would be taken on a daily basis in hopes of keeping recovery under ~5 minutes during service restarts or updates.
> Snapshots of working set would be taken on a daily basis in hopes of keeping recovery under ~5 minutes during service restarts or updates.
Exactly this. We took hourly snapshots (for a trading system), and could restart the system in a second or so.
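The recovery path being described is roughly "load the latest snapshot, then replay only the tail of the log". A toy Python sketch, where the counter-style events, the snapshot layout, and the `recover` helper are assumptions for illustration:

```python
def apply(state, event):
    """Fold one event into the working set (a dict of counters here)."""
    return {**state, event["key"]: state.get(event["key"], 0) + event["delta"]}

def replay(events, state=None):
    state = dict(state or {})
    for event in events:
        state = apply(state, event)
    return state

# Ten events in the log; a snapshot was taken after the first seven.
log = [{"key": f"k{i % 3}", "delta": 1} for i in range(10)]
snapshot = {"version": 7, "state": replay(log[:7])}

def recover(snapshot, log):
    """Start from the snapshot and replay only events recorded after it."""
    return replay(log[snapshot["version"]:], snapshot["state"])

recovered = recover(snapshot, log)
full = replay(log)
print(recovered == full, recovered)
```

The snapshot is purely an optimisation: it must always equal what a full replay up to that version would produce, so it can be thrown away and rebuilt at any time. Snapshot frequency then just trades snapshot cost against how long the tail replay takes on restart.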
Another nice side effect of an event driven system is that you start to think about what other data you can put in there. And then you realise you don't necessarily need a database any more, which makes the architecture of the system simpler.
There are so many benefits working with a data-driven system. Once data is in the log/ledger/whatever you call it, it's immutable, which leads to parallelism if you want it. Ensuring the system is idempotent also means you get (easier) failover as well.
> And then you realise you don't necessarily need a database any more
This is precisely where we want to be.
We are already exceptionally minimal when it comes to database usage - SQLite is all we use today. If we are in a place where 100% of the business facts go to an immutable/append-only log, we can handle the entire persistence mechanism with a simple filestream write command. Not having to worry about mutating prior data solves a large portion of database engine problems automagically.
I worked on an event sourced system once, many years ago.
I'm not saying I hate the separation of write and read models, and all the challenges that brings (heck, the platform I'm working on now does that, and it's not a big deal), but it was massively bloated and overly complex for a system that would collect some data from users, submit it to a panel of third party services, and display results back to those users.
The reasons for going this route were complex, and not entirely clear to me, although I can pick on two: firstly, they'd run into performance and scalability issues with their existing CRUD-ish SQL Server-backed implementation and, secondly, Thoughtworks told them it would be a good idea.
I don't want to handwave this away but I've encountered enough supposedly insoluble SQL Server performance issues over the years to be confident those issues could have been solved much more simply (and the same is likely true for Oracle, Postgres, et al). I'm not suggesting that intractable performance issues don't exist on relational platforms, just that most times people think they've encountered one, they're mistaken.
Would I use event sourcing again? Maybe. But I'd need a really strong use case for it, and I'd certainly kick the tyres on any technical organisation that was using it before I joined.
I think there's a middle option between (1) building your application up from a low-level stream-of-events data model and (2) having your application full of logic to snap from one high-level state to another without a good data-layer audit trail:
I'm talking about indexes.
Yes, the index part of your database is amazing, because it constantly snaps itself into various states, and you know exactly why: because it needs to maintain itself in the same deterministic relationship with your models.
I think that if databases would support more complex and composable indexes, like providing the ability to index join queries, then application programmers would be more nimble and bug-free working with their data.
Silly question: isn't the way you're supposed to handle the observer pattern thing to be that you can follow processing chains using correlation IDs? Ie. the initial event of a dataflow gets tagged with a unique ID and then there's some database magic that lets you track that ID through the system? Like, I can see where that would go wrong once you merge multiple events together, but that seems a different failure point than described here.
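For reference, the correlation ID idea looks something like this Python sketch. The `emit` and `trace` helpers and the event envelope are hypothetical, and the `causation_id` field is an extra I added for the example (it records the direct parent, which the question above does not mention). The "database magic" here is just a filter over an in-memory list.

```python
import uuid

log = []

def emit(event_type, payload, caused_by=None):
    """Record an event; follow-up events inherit the correlation id of their cause."""
    event = {
        "id": str(uuid.uuid4()),
        "type": event_type,
        "payload": payload,
        # The first event of a flow mints the id; everything downstream copies it.
        "correlation_id": caused_by["correlation_id"] if caused_by else str(uuid.uuid4()),
        # Assumption: also remember the direct parent event.
        "causation_id": caused_by["id"] if caused_by else None,
    }
    log.append(event)
    return event

def trace(correlation_id):
    """Everything that happened as part of one flow, in log order."""
    return [e["type"] for e in log if e["correlation_id"] == correlation_id]

placed = emit("OrderPlaced", {"order": 1})
charged = emit("PaymentCharged", {"order": 1}, caused_by=placed)
emit("OrderShipped", {"order": 1}, caused_by=charged)
emit("OrderPlaced", {"order": 2})  # unrelated flow with its own correlation id

chain = trace(placed["correlation_id"])
print(chain)
```

As noted, this breaks down once a handler merges several events from different flows, because the resulting event can only carry one correlation id unless you change the field to a list.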
And while we're at it, the whole point of eventsourcing is that they're not supposed to model "internal" process state? They're supposed to be the ground source of truth; whatever the process has to do to mangle that into a model it can work with should be no skin off the events' back.
The rest seems very, very correct to my experience. Especially not knowing the pain points until you're past the toy level.
Eh, you could do it manually by hanging a projection on $all and republishing to streams per correlation id. But it's cheaper to let the db do it internally.
So it's not a strict dependency, more in the spirit probably of an optimisation: "here's how you own it, but most people let the substrate do it for you..." type thing.
A whole new bunch of people for whom the article hits home maybe. In any case - it's a new one to me, and thanks for the link to earlier discussion, v interesting!
Adding something like change-data-capture to a system and using that for some of the purposes that event sourcing touts seems like it goes a lot further and doesn't require making a complex, hard-to-understand system. Having a CDC system to publish events and record those can give the biggest benefits and sit alongside a much simpler system. A downside is that the real-time feature goes away.
This is a balanced article and I appreciate the author’s treatment of the challenges. Part of my work is helping enterprise shops with n-tier architectures decouple services to allow for more resilience end to end. Everything the author describes has cropped up; some of the issues I consider the “nominal cost of resilience and independence”. Other issues are easily mitigated, such as the yak shaving required to get off the ground. Serverless solves the infra/config challenges. If an enterprise shop isn’t open to serverless I have no interest in discussing event sourcing or CQRS with them. It’s too much work otherwise.
Shops that are being crushed by fluctuations in concurrency, or by challenges with the uptime and p50 of partner teams, don’t have much of a choice but to go event sourced, optimistic, and eventually consistent.
I’m certain a great enterprise IT shop with great leadership could mitigate the challenges of service growth and instability with some clever optimizations of their current stacks, but unfortunately I have yet to meet a shop that is uniformly competent.
Event sourcing helps with the competency gap problem. If every team’s data is on the outside my team can work around your lead time for changes or general ops incompetence. If inventory is down, fuck you I’m still taking orders. The extra work of maintaining a projection of your inventory events makes good business sense because time kills all deals.
What the hell is event sourcing? I read the first half of the article and gave up. Sounded like one of those distributed log things like kafka or maybe a message broker like rabbit?
Yes, could google, but now am discouraged. If you're going to write one of these, at least link to the wikipedia page:
The issue with GDPR is right to erasure, which is tricky in conventional systems. But in ES systems, naively implemented, it is not possible. If you are writing PII in plain text, an immutable log won't let you remove it. OTOH I have two options: store PII in a separate database, and just delete or nullify when requested; or use per-user encryption, and delete the key.
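A rough Python sketch of the first option (PII kept outside the log), with made-up names throughout. The per-user encryption variant has the same shape: the log keeps ciphertext and the mutable side table keeps the key, so deleting the key plays the role of deleting the row here.

```python
event_log = []   # append-only; entries are never edited or removed
pii_store = {}   # mutable side table: user_id -> personal data

def register_user(user_id, name, email):
    # Personal data goes only into the mutable store; the event holds a reference.
    pii_store[user_id] = {"name": name, "email": email}
    event_log.append({"type": "UserRegistered", "user_id": user_id})

def erase_user(user_id):
    # Erasure deletes from the mutable store and records that it happened.
    pii_store.pop(user_id, None)
    event_log.append({"type": "UserErased", "user_id": user_id})

def describe(event):
    """Render an event for display, tolerating erased users."""
    person = pii_store.get(event["user_id"])
    name = person["name"] if person else "[erased]"
    return f'{event["type"]}: {name}'

register_user("u1", "Ada", "ada@example.com")
register_user("u2", "Bob", "bob@example.com")
erase_user("u1")

lines = [describe(e) for e in event_log]
print(lines)
```

The log stays immutable and replayable, but every projection now has to cope with the referenced data being gone, which is the price of this approach.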
Kafka and friends can be used in the reactive pattern paradigm (understood as here: https://www.youtube.com/watch?v=eRxLfUIMJwk&t ), only partially overlapping the ES idea, which in general makes much more sense to me.
Strange that you blame this on OOP. I see event sourcing being used particularly often in FP environments. Because of course the model of streaming immutable data fits very well into the FP paradigm.
I wrote the blog post you cited (thanks!) but I disagree with both statements: that is not what is meant in the article.
1. I don't think Event Sourcing sucks - I think we are lacking accessible technology for supporting it.
2. For most difficulties encountered in Event Sourcing, I would rather blame distribution taken to the extreme.