What they don’t tell you about event sourcing (medium.com/hugo.oliveira.rocha)
234 points by jackthebadcat on Aug 22, 2018 | 77 comments



The last section on Operational Flexibility and the inability to change the event history raises a very good point.

Like most of these issues, the solution requires experience to know when you are at the Goldilocks point (Just Right). This specific issue has a lot in common with managing database migrations in Django or any other migration system.

The ideal situation is to create migrations that can always be rolled back, but sometimes this is not possible to do operationally. For example, a schema change that restricts a field from NVARCHAR to INTEGER can only be generically rolled back if all of the unconvertible data is persisted. This can be mitigated by structuring the database to avoid these dead-ends, and that is really only gained by hard-won experience.

The problem with undoing operations via new events is the same thing -- unless you have foreknowledge of this kind of problem, it is very easy to accidentally create events that perform un-undoable actions. A very simple example of a problematic event would be something that modifies a foreign-key relationship -- let's say it's a digital asset in a game and you want the ability to transfer ownership of the item from one player to another.

The simple solution is an event like SetAssetOwner(asset_pk, player_pk). This would set the item's player foreign key field to the player's primary key. Easy. However, you have lost knowledge of where the asset came from and cannot undo this operation. A better solution would be to make an event SwapAssetOwner(asset_pk, owner_pk, recipient_pk). Yes, the owner_pk is technically redundant, but it provides a check against someone trying to steal an item with a maliciously crafted SetAssetOwner event that performs no checking. Even better, this operation can be reverted by sending the same event with the owner and recipient arguments reversed. Since these properties are part of the event message, they will be persisted in the event history and all of the information to undo the event is self-contained.
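A minimal sketch of that shape in Python (the event name, the `invert` helper and the ownership check are all illustrative, not from any particular framework):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class SwapAssetOwner:
        asset_pk: int
        owner_pk: int      # current owner; doubles as a guard against forged events
        recipient_pk: int  # new owner

        def invert(self):
            # The undo is just the same event with the parties reversed.
            return SwapAssetOwner(self.asset_pk, self.recipient_pk, self.owner_pk)

    def apply(assets: dict, event: SwapAssetOwner) -> None:
        # Reject the event unless the recorded owner matches the claimed owner.
        if assets[event.asset_pk] != event.owner_pk:
            raise ValueError("owner check failed: forged or stale event")
        assets[event.asset_pk] = event.recipient_pk

    assets = {42: 1}                  # asset 42 owned by player 1
    e = SwapAssetOwner(42, 1, 2)
    apply(assets, e)                  # transfer to player 2
    apply(assets, e.invert())         # undo: back to player 1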


Basically implement double entry accounting.

I think ideally you want something like what Rich Hickey describes when he talks about Datomic: an append-only database. You can see what the previous values for that row were, along with schema changes.


I worry about GDPR with respect to Kafka and other append-only databases.


You can encrypt events and throw away the keys, if data should be made inaccessible. Of course, it adds complexity. But it's already being done.
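For illustration, a rough sketch of that "crypto-shredding" approach in Python, assuming a key per data subject and using Fernet from the third-party `cryptography` package purely as an example cipher:

    from cryptography.fernet import Fernet

    keys = {}  # per-user keys; in practice an HSM or key-management service

    def encrypt_payload(user_id, payload: bytes) -> bytes:
        key = keys.setdefault(user_id, Fernet.generate_key())
        return Fernet(key).encrypt(payload)

    def forget_user(user_id) -> None:
        # "Deleting" the user: discard the key, leaving the stored ciphertext unreadable.
        keys.pop(user_id, None)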


This does not work for many data models, for both technical and economic (e.g. increasing costs by multiple orders of magnitude) reasons. Many real systems would require hundreds of millions of active encryption keys, encrypting data that is smaller than the encryption block size.

Every database architecture that exists today is designed with the deep assumption that scalable fine-grained deletion will never be required, largely because we don't have good computer science for how to do it. As experienced database operations people know, if you are required to do this kind of delete then the sane way is to completely rebuild your storage with the data to be deleted filtered out, if you have ample excess capacity -- it is often much faster than editing the existing storage. For some large-scale systems, there is no plausible solution.

This is an interesting computer science challenge -- a database kernel designed for efficient deletion -- that I've thought a lot about over the last year dealing with GDPR compliance. It definitely isn't a thing that exists today and encryption doesn't help.


> Every database architecture that exists today is designed with the deep assumption that scalable fine-grained deletion will never be required, largely because we don't have good computer science for how to do it

Can you explain this? Are you talking about something different from deleting individual rows?


A "row" is a logical abstraction. It doesn't exist as a physical thing in many database systems, especially modern ones.

Furthermore, "delete" is commonly defined as "will not be returned in a future selection operation" -- there is no implication that any data is physically deleted and permanently inaccessible. Avoiding physical deletion is done for very good technical reasons to support features and performance that everyone is accustomed to in a database.


Idk why people are saying that ES data is always immutable. It can be by default, sure, but if a facility to change the history is useful, why not?


If you need to change history you can just create new events that accomplish your mutation -- and even mark them as a type 'change history' or some such obvious identifier so when you inspect the event stream you know exactly what you are looking at.


I agree, this is 100% normal in accounting, as the earlier thread pointed out. If there is an error you add journal entries to the end to make the adjustment. It is funny that this sort of thing was invented in accounting in 1494.


> If you need to change history you can just create new events that accomplish your mutation -- and even mark them as a type 'change history' or some such obvious identifier so when you inspect the event stream you know exactly what you are looking at.

"Right to be forgotten" isn't the same as "right to be redacted".


To avoid having to implement the whole logic for any event you'd want to revert, you can have a "cancel" event that points to another event.

You'd read the cancel events first and simply skip the events they point to.
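Something like this, as a rough Python sketch (the event shapes are made up):

    def replay(events, apply):
        # 'apply(state, event)' is whatever your normal projection logic is.
        cancelled = {e["target_id"] for e in events if e["type"] == "Cancelled"}
        state = {}
        for e in events:
            if e["type"] == "Cancelled" or e["id"] in cancelled:
                continue  # skip the cancel markers and the events they point to
            apply(state, e)
        return state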


In many cases they are legal proof that something happened. If you can mess with history, proof can't be used in court.


Good article. I've spent the last year migrating to an event sourced system, so thought I'd share some thoughts.

On the eventual consistency point, I've found you can get quite far with having the read model managing the race condition. This probably doesn't work everywhere, but in our system, multiple users can accept an invitation, so we have something like `InvitationAccepted{invitation_id, user_id}`. It's possible that multiple users might accept the invitation at roughly the same time, but the command-side doesn't really have to be concerned with this - it can happily allow multiple users to accept an invitation. It's up to the read model to ask, 'has this invitation already been accepted?' - if not, the acceptance is successful and will be indicated when queried, otherwise the acceptance is unsuccessful (and as a bonus, we can separately record who unsuccessfully tried to accept it). From the user's point of view, when they accept the invitation, they see a spinner until we confirm with the read model one way or the other (this could be done by polling the read model, but in our case we have an event sent back to the client).
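A toy Python version of that read-model rule ("first acceptance wins"), with made-up event fields:

    def project_invitations(events):
        accepted_by = {}   # invitation_id -> user_id of the winner
        rejected = []      # (invitation_id, user_id) of losing attempts
        for e in events:   # events arrive in stream order
            if e["type"] != "InvitationAccepted":
                continue
            inv, user = e["invitation_id"], e["user_id"]
            if inv not in accepted_by:
                accepted_by[inv] = user
            else:
                rejected.append((inv, user))
        return accepted_by, rejected

The command side stays dumb; the projection is the only place that has to answer "has this invitation already been accepted?".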

Coming up with the event schema and versioning/granularity are hard. We have version numbers on all our events to make this a bit more manageable/explicit (`InvitationAccepted1`, for example). Storing events in a relational database does make it a bit easier to go back and edit/upgrade/delete them (sort-of cheating, but also relevant for GDPR). Also, I think we're going to end up suffering a bit from the 'whole system fallacy', but at the moment namespacing all the events (keeping in mind their expected volume) makes it a lot easier to manage.


> It's up to the read model to ask, 'has this invitation already been accepted?'

I feel like you've skipped over the interesting part of your strategy here. If it's an eventually consistent system, what keeps the read model from having the wrong answer to this question?


Not OP, but I am working with a CQRS system.

In CQRS, eventual consistency does not mean that we have multiple servers such that 2 servers give 2 different answers. It means that from the time a command is issued to the time an event is propagated to all read models, there is a delay.

You need to handle the race condition at some point or another. From a CQRS point of view, a user accepting an invitation is just an event like any other event. What happens based on that event must account for the possibility that multiple users have accepted, and it's a rather straightforward thing to solve with an ES. The "accept" event with the lowest sequence number is the first.

Having the read side handle it would probably mean that when you have a read model for accepted invitations, you ignore all but the first acceptance of an "invitationId".


I've been planning out an event sourcing like system for our healthtech startup, and you're definitely right re namespacing and versioning.

I explicitly keep a version in the event - which is a date. It's similar to how Stripe versions their API. We're also planning to handle the events in the same way as Stripe's API: each event has side effects; the side effects may change depending on the version and each version has its own application logic (cascading, so you can have 2018-08-01 run all of 2018-07-30 plus its own changes).

This lets us replay events as they happened, run an event using two different versions and perform only the diffs etc.
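Roughly, the cascading version handlers look something like this (Python sketch, all names hypothetical rather than our actual code):

    HANDLERS = {}

    def version(date):
        def register(fn):
            HANDLERS[date] = fn
            return fn
        return register

    @version("2018-07-30")
    def handle_2018_07_30(event, state):
        state["prescribed"] = event["drug"]
        return state

    @version("2018-08-01")
    def handle_2018_08_01(event, state):
        state = handle_2018_07_30(event, state)  # run everything the older version did
        state["lab_task_created"] = True         # ...plus this version's own change
        return state

    def handle(event, state):
        return HANDLERS[event["version"]](event, state)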

Our system is probably not a typical CQRS/event sourcing setup.

The event system itself is idempotent: you take an event with all input data necessary to run the event (form data, necessary current state), so the system can run independently. This means that every event is typed such that the input data dictates what is necessary.

The event handler validates the event, returning errors if necessary.

Then, the event handler runs all side effects and returns operations to perform: update model X attributes to Y, insert new record Z.

In effect, we go:

User -> API -> (generate event) -> Event Handler -> (error|response) -> save event and side effects in a transaction.

This means our DB is a cache of the event and all previous data, so we're not really event sourcing — we're audit trailing.
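In rough Python terms, a handler looks something like this (field and model names are invented for illustration):

    def handle_submit_prescription(event):
        # Validate first; return errors instead of raising so the caller decides.
        errors = []
        if not event.get("drug"):
            errors.append("drug is required")
        if errors:
            return {"errors": errors, "operations": []}
        # Return the operations to perform rather than performing them, so the
        # caller can save everything in one transaction, or skip saving in
        # "preview"/"test" mode.
        pid = event["patient_id"]
        return {
            "errors": [],
            "operations": [
                {"op": "insert", "model": "Prescription",
                 "attrs": {"patient_id": pid, "drug": event["drug"]}},
                {"op": "insert", "model": "LabTask", "attrs": {"patient_id": pid}},
                {"op": "notify", "channel": "push", "attrs": {"patient_id": pid}},
            ],
        }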

The main benefits are:

1. Medical records are complex and we always need audit trails.

2. If a doctor submits a prescription, we can show all side effects that happened for visibility (ie. this triggered a lab task, push notification, sent this message). We can verify this in the UI and see what happened for each patient without relying on assumptions.

3. Engineers know that the API produces events and can look up exactly the side effects that happen when an event occurs (we're using Rails for the API logic right now and this isn't always obvious).

4. We can ensure that we validate when an event happens based on input and current state without complex code, catching edge cases.

5. We can choose to save the event and side effects or not. This lets us "preview" actions or "replay" actions without actually changing any world state (you toggle a "test" flag in the event which also means the event handlers know not to trigger outside side effects).

6. The "side effects" response from the event handler can be sent to a websocket observable and consumed by frontends, ensuring that the doctor UI always has an up to date version of patient data.

Random thoughts:

- It's really just a framework for the logic of an application controller that's typed and ensures everything is consistent. Plus, similar to Stripe, it allows us to version events and write migrations/upgrade paths etc.

- What about conflicts? We have a plan to use hashes of the previous data to ensure consistency with medical records: if you're modifying fields A, B, C, you send over a hash of the previous data for A, B, C alongside the request. If the event handler can't verify the hash the data must've changed in the meantime.
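Roughly (Python sketch, hypothetical field names):

    import hashlib, json

    def fingerprint(record, fields):
        # Hash of the current values of the fields the client intends to modify.
        subset = {f: record[f] for f in sorted(fields)}
        return hashlib.sha256(
            json.dumps(subset, sort_keys=True).encode()).hexdigest()

    def check_no_conflict(current_record, event):
        # The client sends the hash it computed from the data it last saw.
        if fingerprint(current_record, event["fields"]) != event["previous_hash"]:
            raise RuntimeError("data changed since the client last read it")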

----

We're producing events now but the handlers aren't yet in place, so this is currently still being planned. Essentially, we're using the API as authentication, authorization, routing/HTTP management, transaction/database management while the event/controller logic is being placed into a structured framework to ingest form data, current state and produce output.


This is exactly what we did in our system and it worked wonderfully. It also has the side effect of avoiding locks or contention when doing such mutations.


Regarding eventual consistency, a CQRS/ES system can also be synchronous, or partially synchronous. You could have listeners for events that need to supply a strongly consistent model and other events that feed parts of the system that don't need strong consistency.

"However the events in a event store are immutable and can’t be deleted, to undo an action means sending the command with the opposite action"

Well they don't have to be immutable. I don't see why you can't update/migrate events.


The author has explained the problem with eventual consistency and the CAP theorem, but he is trying to blame the problem at least partly on CQRS/ES - the problem will exist independently of what pattern you use if your system is distributed.

CQRS/ES is a pattern and doesn't need to be shoe-horned into every solution.

I work at Transport for London and we have a CQRS/ES system for managing the data driven design data. It is synchronous and is incredibly useful for ensuring it has both strong business logic and a fast readable side, while providing auditing for free.

We still have problems with EC and CAP, and they are not due to CQRS/ES. Make the system distributed and you will have to handle all the new scenarios. Those are the trade-offs.


> can also be synchronous

Then you're giving up several of the benefits of CQRS, and might as well just not bother with the additional complexity.

> Well they don't have to be immutable. I don't see why you can't update/migrate events.

The events are your source of truth. When you "migrate" your source of truth, you're in somewhat dangerous waters (and if you're using CQRS for scale, a 1% failure to migrate data might be millions of records). You also now have the issue that your source of truth is being migrated, and probably can't accept writes (or you need to emit writes in both new format and old format for whenever your change over happens)


>> can also be synchronous

> Then you're giving up several of the benefits of CQRS, and might as well just not bother with the additional complexity.

Even without asynchronous read/write, there are still benefits worth the (arguably, small) additional complexity. For instance, the ability to add new functionality without having to migrate existing data is amazing.


You still have migrations, but they exist on the read store, which means they can be done semi-transparently to clients (pause writes and let them queue up, migrate existing read store to new instance, point read calls and indexers to new store, resume writes).

CQRS adds a lot of complexity. Sometimes it's absolutely worth it, especially if you've already invested in the expertise and tooling to support it. It drastically changes the scaling math, both on the low end (at the very least you need a write store, a read store, a queue, an indexer and an api) and on the high end (you can scale any part of the system as needed in relative isolation). You add in timing issues, rollbacks, asynchronous error handling, and delays between reads/writes.


I don't think you necessarily have to have all this in a useful CQRS system.

E.g., here's a real-world example of a general pattern in which CQRS is pretty simple and useful:

A walking/running app which tracks distance, time, and other relevant information over the course of a workout.

It collects a series of events like "location changed at time X", "user started workout", "user paused workout" etc., over the course of the workout period (actually starting before the user officially starts the workout), and converts these to time, distance and other stats.

This fits CQRS really well since the input is inherently a series of events and the output is information gleaned from processing those events.

You get a full CQRS system by simply fully logging the events.

The advantage is that you can go back and reprocess the sequence if you want to glean new/different information from the sequence. E.g., in the walking/running app you could, after the fact and at the user's discretion, detect and fix the case where the user forgets to start or stop a workout. Or recalc distance in the case where you detect a bug or misapplication of your smoothing algorithm, etc. Or draw a pace graph or whatever.

In all these cases you can process the events synchronously.

I put this all in terms of a workout app, but there is a general pattern of an event-driven, session-based activity or process, where you may want/need to derive new information from the events and the cost is to log the events (in a high-fidelity form so they could be fully reproduced).
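For instance, deriving distance is just a fold over the log; something like this Python sketch (event shapes invented for the example):

    import math

    def haversine_m(a, b):
        # Approximate great-circle distance in metres between (lat, lon) points.
        lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
        h = (math.sin((lat2 - lat1) / 2) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
        return 2 * 6371000 * math.asin(math.sqrt(h))

    def total_distance(events):
        # Only count movement while the workout is active; re-running this with a
        # different rule (smoothing, auto-detected start/stop) is the whole point.
        active, last, total = False, None, 0.0
        for e in events:
            if e["type"] == "WorkoutStarted":
                active, last = True, None
            elif e["type"] in ("WorkoutPaused", "WorkoutStopped"):
                active = False
            elif e["type"] == "LocationChanged":
                if active and last is not None:
                    total += haversine_m(last, e["pos"])
                last = e["pos"]
        return total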

Whether or not you need to use queues and distributed data stores is an independent decision.


That's not "Command Query Responsibility Segregation" (CQRS). That's modeling your data as a time series - which is a totally valid and perfectly useful model in many cases, but has nothing to do with the architectural pattern known as CQRS.

Martin Fowler gives the following simple definition of CQRS:

> At its heart is the notion that you can use a different model to update information than the model you use to read information.

CQRS takes data (events, time series, plain old records, whatever), stores it into a write store which acts as a singular source of truth, and queues that data to be stored in a (usually eventually consistent) read store as a projection - not simply duplicating the write store, but building a read record meant to be consumed directly by some client - with potentially several projections, one for each set of client needs.


But what is the distinction between what Fowler describes and what I describe? There's not really a contradiction with what you describe either.

I have a distinct write store and read store, with very different models. As you say, the write store is the source of truth. Since the read store is updated synchronously with the write store there's no need for a queue between them. Indeed there are also multiple projections for different client needs (e.g. the pace chart, vs. workout progress), though in this case I don't generally need them until after the sequence of events is complete, which simplifies things.

Maybe we're just having a pointless debate on semantics, in which case, never mind.

It's just that I see this as a quite valuable pattern without necessarily bringing distributed stores into it. Indeed, part of its value is that you can start simple and later extend it to a scaling distributed system without disrupting the whole pattern.


You're right. Nothing about CQRS demands any kind of asynchronization or distributed system. It is quite simply a system where you have different models for updating and querying the system. You don't even need to have an event store for it to be a CQRS system, but it makes so much sense that I have a hard time imagining using CQRS without an ES of some kind.


No need to pause writes or migrate stores. You can have the old and new versions of the read model co-exist and read from the same event stream without disabling writes. Once the new version is deployed and running, you shut down the old version (isn't this already your deployment model?)

It does change the scaling math, but don't automatically assume that every ES+CQRS system is intended for thousand-writes-per-second, terabytes-per-day kind of scales. If the system stays at ten-writes-per-second, megabytes-per-day, a read store (beyond in-memory projections), a queue or an indexer are not necessary.

As for asynchronous error handling and delays between reads/writes: why would that be a consequence of ES+CQRS? It would be a consequence of implementing ES+CQRS with asynchronous or eventually consistent behaviour...


I also share the same opinion based on my experience. Events can be modified and deleted, but that must be an exceptional situation (GDPR and other compliance requirements, etc.). But even if it's exceptional, you have to provide a clear and easy way to do so, and that increases the complexity of the solution by a lot.

Another thing is strongly consistent models; there may be valid requirements in some problem areas to have a strongly consistent and normalized model and use it for command validation. This helps especially well when all requirements are not known upfront and/or the business domain changes very frequently. A small change in the business may require completely redoing the aggregate roots and logic if you follow the standard approach, which is very expensive. A better decision could be to use a normalized SQL database instead of an aggregate root. Such an approach may be more flexible in certain cases and has its own benefits as well as costs and drawbacks.


Encryption + lose the key policies seem to satisfy the GDPR. So you don't have to actually delete an event, you can just give up your ability to read its payload.


If the decision to abandon strong consistency involved careful analysis of the performance/maintenance trade-offs, then by definition the lack of consistency is less expensive than keeping a consistent but low-performance model, and you're just paying the price of having to solve a Hard Problem.

But if strong consistency was abandoned because someone wrote general statements in favor of eventual consistency...


Is it just me, or did the article never present a solution to the posited problem: “Since each entity is represented by a stream of events there is no way to reliable query the data. I have yet to meet an application that doesn’t require some sort of querying.”

The solution is said to be CQRS, and nothing in the article shows how you solve that.

If you unwind the unnecessarily long-winded, florid poetic waxing, it all comes down to “dump it to a database, and query that”.


"dump it to a database, and query that"

Yes.

In some cases, "dump" is a fold/reduce, and your database is just an in memory data structure, and depending on how much latency is permitted by your service level objectives you might cache the data structure as opposed to regenerating it every time.

There's no magic.

The pattern is analogous to what you would do if your book of record were an RDBMS, and you had to run graph queries. "Dumping" the data into a graph database and running the query there would be a tempting solution, no?


In practice it is typical to maintain a snapshot of the current state in a more-query-able form. In some systems, event writes and their corresponding snapshot updates occur within a transaction.


Hopefully bi-temporal tables get added to Postgres soon.

I think it would cover a few of the use cases that people are turning to event sourcing for (excluding scale).


I don't see much discussion of event-sourcing simply using a SQL database (i.e. skipping the CQRS part). This would allow you to keep your CP (strongly-consistent) semantics.

While this clearly wouldn't work in high-volume cases (i.e. where you _actually_ need CQRS), it seems like this would be the simplest option for many systems. I see a lot of articles advocating for immediately jumping into CQRS, which seems like a big increase in architectural complexity.

Does anyone have opinions/experience on this approach?


I went with this approach on a recent project, for two reasons:

* Tracing/history/auditing

* Structuring the service around these events made it trivial to add new "event types", rather than expanding some big hairy PATCH endpoint or similar. In other words, when I needed a user entity to be able to belong to a group, I just added a new event type, "user/join-group".

Having a single endpoint for all events had two other nice benefits: batching became trivial, as well as doing several events transactionally (all events in one request are processed in one transaction).

It's been running in production for a couple of months now, and it's been working great! The main drawback I can think of was that it obviously didn't work well with Swagger out-of-the-box, had to do some custom handling for showing all the available event types.


Don't you still need a queue in your example? Do you lock the whole table when you insert a new record to the table? Can you elaborate on how this solves consistency?


I don't think you need a separate queue; if you have an "Events" table then you can just write everything there.

It solves the consistency problem because you can create your event inside a transaction, which will rollback if another event touching the same source is created simultaneously.

E.g. if you have these incompatible events in a ledger:

CreditAccount(account_id=123, amount=100)

DebitAccount(account_id=123, amount=60)

DebitAccount(account_id=123, amount=60)

You'd want one of the debit transactions to fail, assuming you want to preserve the invariant that the account's balance is always positive. You could put the `account_id` UUID as an `Event.source` field, which would allow you to lock the table for rows matching that UUID.
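A toy version of that in Python, with SQLite standing in for the RDBMS (the schema and the locking are illustrative; a real system would rely on proper isolation levels or SELECT ... FOR UPDATE):

    import sqlite3

    db = sqlite3.connect(":memory:", isolation_level=None)
    db.execute("CREATE TABLE events (source TEXT, type TEXT, amount INTEGER)")

    def append(account_id, event_type, amount):
        sign = 1 if event_type == "CreditAccount" else -1
        db.execute("BEGIN IMMEDIATE")  # take the write lock before reading the balance
        try:
            (balance,) = db.execute(
                "SELECT COALESCE(SUM(CASE type WHEN 'CreditAccount' THEN amount "
                "ELSE -amount END), 0) FROM events WHERE source = ?",
                (account_id,)).fetchone()
            if balance + sign * amount < 0:
                raise ValueError("insufficient funds")
            db.execute("INSERT INTO events VALUES (?, ?, ?)",
                       (account_id, event_type, amount))
            db.execute("COMMIT")
        except Exception:
            db.execute("ROLLBACK")
            raise

    append("123", "CreditAccount", 100)
    append("123", "DebitAccount", 60)
    try:
        append("123", "DebitAccount", 60)   # second debit is rejected
    except ValueError as err:
        print(err)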


If your idea in the example is that the second "debit" is created by another transaction while your transaction is in progress, then this will not work out. Firstly, it requires a dirty read, which is nothing I would rely on in a transaction. Secondly, even if the dirty read works, summing the outcome of several rows is just a read operation, which forces you to roll back on the client and still leaves a window for inconsistencies if you decide to commit. Maybe SELECT .. FOR UPDATE can do a trick here, but that is like giving up life.

To round this up: RDBMSes are bad for queue-like read semantics. All you can do is polling. Which is even worse if you end up being lock heavy.


No matter how you model things and no matter what technology you use, a race condition like this needs to be handled one way or another. You either handle it such that the data is always consistent, or you handle inconsistent data.

You can use an SQL database with transactions and locking to ensure that you will never debit money that isn't there. Or you can save commands in a queue that is consumed by only a single process at any given time (that incidentally includes the SQL scenario). Or you can use a distributed consensus algorithm with multiple-stage commits. There is no way around it.


I’ve implemented this.

TLDR: orm’s + large transactions resulted in a lot of unexpected complexity.

At some point you get really big transactions, because one write triggers five process managers which all trigger more writes, and so on and so on. Performance was not a problem, but I was surprised by the complexity of these big transactions in combination with an ORM.

I don't have a concrete example, but over the course of two years we encountered multiple bugs that took days to solve. There's one of these bugs that we fixed without identifying the root cause to this day.


Thanks for the perspective -- given what you know now, would you have built it differently if you could do it over again? Or was it the right call for the stage of your system, even if there were pain points?


Strong consistency is possible with this model, consider for example git.

One real database that works like this is Datomic, which is competitive with SQL for most kinds of read-heavy data modeling loads that SQL and CQRS is used for.


All the issues the author talks about are valid, however, the idea that no one warns you about them is wrong.

If you do even the most basic research into Event Sourcing you will find people warning of these trade-offs (foremost amongst them Greg Young, the person who originally set out the idea).


I've never quite been able to get my head all the way around "Eventual Consistency". I don't understand how actions that could conflict or resources that could be contended are supposed to work? At some point, something has to say, given two actions A and B that are in conflict, the later action B must fail.

So, where does that happen? When getting written into the read model? Then what, it emits an event saying clearly that the previous action failed? All while there's a client waiting for results?

I'd love to see a worked example based on discrete resources. Such as two people, a box, and a ball; where the ball can be held by either person or in the box, and a person can take the ball from the box but not from another person.

I find the concept intriguing but I don't really get it and I haven't been able to identify in any of the writing where "the buck stops".


Conflict resolution is a separate problem that isn't addressed by Eventual Consistency. If I make a write in an EC system, that write may eventually be accepted, it may be rejected, or it could be resolved through something smarter like a CRDT. EC just says that I won't know that immediately; that different parts of the system can have different views of the truth at the same time. And for most business systems that's usually okay, if you're talking about consistency that resolves on a good-enough timescale.


Thanks for answering. Is Eventual Consistency a necessary property of a CQRS/Event Sourcing architecture?


I don't think it is, but—as usual with EC—if you don't want it there's a performance cost. The way you'd avoid it, in a model where writes go into one queue/connection and reads come out somewhere else, is that when you do a write you have to block and wait on that write being acknowledged on the reader side (or you build in some sort of side channel to whatever your write handler is, so it can tell you when a write has been accepted, if you're doing conflict resolution on the write side; some systems don't, and push it off to read-time).

But at that point you're just simulating something you probably already had before going to CQRS/ES, so you would only want to do that in very select cases. Otherwise there's really no point to the architecture...


I've worked with event sourced systems quite a bit in production and at scale, and have written a book on Akka and another one on related topics. While I have been an advocate of the approach, my experiences are guiding me away from implementing Event Sourcing in many use cases (especially where the entities are long lived). While CQRS is more complex, I'm more likely to implement CQRS without event sourcing, where an entity is responsible for its own persistence using whatever mechanism we deem suitable, and emits events that can be used for the view (or the data can be viewed as read only). Ultimately a bounded context is there, and you send it commands and get out events. You have to evaluate the recovery and persistence mechanism based on your needs.

Exactly-once delivery is a pipedream, so it doesn't matter how you're persisting the event and delivering it: evaluating line by line, crashing somewhere, it's going to be possible to emit an event twice somewhere (e.g. if the outbound projection emits the event but doesn't persist its offset). You need to deduplicate somewhere to get exactly-once processing semantics. There _are_ simpler approaches to persistence that work just fine with the same delivery guarantees and work fine in the place of Event Sourcing. Nobody talks about any alternatives though - everyone defaults to event sourcing. My intuition guides me to look at other options, having been burned one too many times, watching organizations collapse around 100% event sourced applications, etc.

I'm still working on event/message driven systems (today with Elixir mostly) but I've started to make architectural compromises to move away from event sourcing especially. Event Sourcing + CQRS may be prescriptive but it's very hard for new developers to pick up and understand the layers of abstraction underneath, e.g. Akka + the Persistence Journal. And I'm not sure I can trust many of the open source libraries outside of Akka to be honest. I've had to dig into the depths of Postgres journal implementations and apply windowing to journal queries, for example, because they weren't burned in at the scale I was working with (partially because I inherited an application with a single entity in a context which had many-million-line-long journals - this highlights a design error, but hopefully you can see my point).

You don't _need_ to use these patterns but you can still apply DDD and event/message based abstractions, and publish events. An entity can write its state to a record and then apply the state in memory as well, without using a journal, given that you handle exactly-once processing semantics correctly. This means there are knobs and dials. The problem with event sourcing in the greater picture is that it's descriptive of an approach, and there aren't many clear alternatives that people are talking about that work in similar system designs. If you have very long lived entities, or only a few of them, it gets especially difficult to keep the system alive over time, but for those use cases it doesn't mean you should stop receiving commands and emitting events.

You always hear about the idea of the approach, never the reality of maintaining these systems, or the inappropriate uses. In one implementation, there is one entity that receives thousands of events a day, and lives forever. How do you maintain the journal while changing the code, and keep it alive over time? I've watched event sourcing and CQRS sink projects and teams. I've watched well paid contractors unable to figure out how to cluster and scale these systems. The barrier to entry for people to become effective can be high, and you should understand the long view in terms of people required and cost over time, and validate the approach for your use case very carefully. Again, the fact that everyone talks about event sourcing and no closely related alternatives makes it seem like the gold standard or the only option, but there are other (simpler) ways to deal with your persistence in an overall similar architectural approach.


I've re-read your post a couple of times, but still find it confusing. It reads like you're an early-adopter, and got burnt multiple times by design flaws, and people not being familiar with it.

Suppose ES/CQRS is designed well and is a well-understood mechanism: would you still move away from it? For which scenarios? I can imagine exactly-once delivery not being a problem in all scenarios.

And what is the problem with long-lived entities? Solely the fact it takes longer to reconstruct current state?


I'm only saying it isn't the only approach, even if you're keeping the source of truth in memory. It's the most prescriptive approach for sure so is the one people tend to go for first but there be dragons in managing especially long lived journals. Migrating the journal over time, figuring out how to truncate it (especially if there is an initial creation command/event that describes the rest of the life, or periodic updates to the structure - you have to keep all of that data forever potentially to know that it's going to wind up being correct.)

And yes long recovery times are an issue unless you want to keep every entity in memory forever and ever, or you can tolerate extremely slow turnaround times. There are knobs and dials here that are subtleties that people won't think about until after they launch.

If you have a hammer everything looks like a nail kind of thing. There are other ways to handle problems that are only marginally different in how they persist and recover state, yet are far more usable for certain use cases. I'm extremely skeptical when I see someone gung-ho for event sourcing if they've never used it though. I tend to look at the problem very hard to see what else we could do for persistence while still maintaining the "source of truth" in memory.


The GP sounds exactly like the "you don't need microservices for everything, you can have well built monoliths too", "not everything must be a single page app", or "not every datacenter should be replaced with a cloud one".

ES specifically has a large list of drawbacks, so not everybody should use it. It would be great if people could talk about how there are many shades of gray between "CRUD-only, we forget everything that is not current" and ES, but in a rush to make a point people act like those are the only known possibilities.


Elixir is fantastic, especially in these use cases because it allows for stateful in-memory representations of an entity with less development burdens than elsewhere.

I find it useful to have a GenServer for complex entities like state machines modeling business processes. I'm fine with the simple entities using the database schema as the state definition. However with the complex entities I still find I run into the classic ORM problem where the database structure doesn't perfectly fit the domain's pure business logic representation.


Yeah for sure. That's kind of the idea with event sourcing in general, but I think with elixir/otp, it's easy to see how event-sourcing is only a persistence mechanism. Usually how I think about it is Event Sourcing (or gen server using a db as a recovery mechanism) "moves" the source of truth from a db into the process that owns the data.


I admittedly don't have your level of experience, especially with such huge scale. However, I've worked with smaller systems and have found a "sweet spot" that worked well in my cases:

1. Keep a traditional RDBMS. Place in here everything that needs consistency (e.g., the bank's accounts, balances and transactions) or has to be stored indefinitely.

2. That part of the system generates events. Those events are used to maintain Queryable stores that are better structured for your required queries. For instance, you could store per-client transaction lists in here. Or you could fill in an OLAP database, or whatever.

3. For each query-able store, implement a process that can populate it from the RDBMS data. This serves several purposes:

a) You can rebuild any such stores at any time. This removes the durability requirement from these stores, so they may be simpler and more efficient.

b) You can compare a rebuilt store against the current one. If there are any differences, there's either a bug in your event tracking code or a bug in the populator.

c) This consistency-check procedure is really useful during development and testing. When you make a change, it is really hard to get both the event-sourced store and the populator wrong but consistent with each other. This only happened to me due to mis-specified or mis-understood specifications, completely fencing out pure programming mistakes.

Of course, this system only works so long as your write volume can be ingested by your RDBMS. Luckily, this is the case on all of my current projects (and I would argue most projects out there). Notice that scaling reads should be much easier!

Finally, this architecture just accepts that read-errors may happen (e.g., a client doesn't see a transaction in their transactions list because some event got lost). This hasn't been a huge problem for us, since the mistake will be repaired on the next "reconciliation" process (along with a warning to get devs to investigate what happened). These reconciliations can be run as frequently or as sparsely as desired, or even on some user-initiated action (e.g.: the support guy has a button to "force reconciliation now" and instructions to press it whenever a customer complains about missing transactions).


Yeah, I'm using Kafka right now with small queues between processes (generally implemented as Elixir/OTP GenServers), with a blend of RDBMS (Postgres, which is now a Swiss Army knife), Redis (also a Swiss Army knife), and ZooKeeper (mostly because I know it well and it's there with Kafka; it's very useful for coordination across processes).

Because the events coalesce into Kafka, you can use a mechanism like Logstash to spew this log data into e.g. Elasticsearch and then use that for your queries. Or write to the db here (in process) or there (CQRS style). There are different architectural approaches used, and they all yield good results, with very high reliability in processing, if slightly "eventual."

It works well. We do use rdbms too, but a lot of the time we avoid reading from it after initialization (or after encountering an exception causing an actor to restart). Depends on the data though. Some places I read on every command because I know the risk front is small and it's easier for a jr to get in there and understand what's happening. A good rule is that a single process should own that data so you don't have to worry much about consistency related concerns. In microservices good practices, each service should really own its own data completely but I believe that it's fair enough to say a bounded context owns its own data (even if in a shared rdbms) if building at a smaller scale.

In terms of code, I'm usually building with some variation of onion architecture these days, where all context is built on the outside (e.g. receiving a request, getting info from state or db), and pure domain logic, 100% effect free, exists on the inside; it receives all of the data from the outside and turns out some commands or events in response. This makes the important logic very testable, and easy to reason about. Core domain logic has no effects - not even logging - but only generates events/commands in response to commands and events. You never mock that stuff - it's the star of the show. Nor do you ever have to. The generated commands and events are later "applied" by the outer layer, logging, shipping them off to Kafka, and/or writing some stuff to a database.
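We're in Elixir, but language aside the shape is simple; a Python sketch of the same "pure core, effects at the edge" idea (all names invented):

    # Pure core: current state + a command in, events out. No I/O in here.
    def decide(state, command):
        if command["type"] == "AcceptInvitation":
            if state.get("accepted_by") is not None:
                return [{"type": "AcceptanceRejected", "user": command["user"]}]
            return [{"type": "InvitationAccepted", "user": command["user"]}]
        return []

    # Outer layer: gathers context, calls the core, then applies the effects.
    def handle(command, load_state, publish, log):
        state = load_state(command["invitation_id"])
        events = decide(state, command)
        for e in events:
            log(e)      # effects live only out here
            publish(e)  # e.g. ship to Kafka, update the DB, notify clients
        return events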

That's where I'm at today. I am working in realtime systems so it's a good fit for the approaches. I'm finding the software is turning out very well. We're working fairly quickly, the software is very encapsulated to a bounded context, the domain logic is clear in the core, easy to test and read, and change. Or even rewrite if needed.


It's really important to manage versioned schema for events and defining rules around evolution of schemas, hopefully with tooling support like via Avro's Schema Registry or Protobuf's forward and backwards compatibility guarantees.


Still like the idea but it is more complex to do it right and keep it right.

Would only do this with a good team and a real problem where this would be useful.

RDBMS goes very far.


CQRS: "Command Query Responsibility Segregation"

Fowler's post is predictably excellent: https://www.martinfowler.com/bliki/CQRS.html


This is already cited and linked in the article (along with Fowler's article on event sourcing).


CQRS & Event Sourcing are good ideas, but putting yourself ON PURPOSE with eventual consistency is nuts.

Until you are in the facebook-big-data scaling challenge, you don't need the madness of losing ACID. You can get terabytes of data easily on a single server, and even partition by company, customer or similar in a way that still allows you to keep domain-level consistency.

Now, I put the events after my normal CRUD operations (i.e. I emit the CRUD write + save the event in a single transaction). It is super easy to operate and keeps the coding familiar and predictable.
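As a sketch (Python with SQLite standing in for whatever RDBMS you use; the table names are invented):

    import json, sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    db.execute("CREATE TABLE events (entity_id INTEGER, type TEXT, payload TEXT)")

    def rename_customer(customer_id, new_name):
        # The state update and the event append commit or roll back together.
        with db:  # sqlite3 connection as context manager = one transaction
            db.execute("UPDATE customers SET name = ? WHERE id = ?",
                       (new_name, customer_id))
            db.execute("INSERT INTO events VALUES (?, ?, ?)",
                       (customer_id, "CustomerRenamed",
                        json.dumps({"name": new_name})))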

----

I think the ideal ES database engine must be like:

Have a log table, and an index/subtables for each validation (i.e. to check uniques, aggregates, counts, etc.).

So, if I have customer-related events, I have:

- Index on: code, name

- SubTable: code, name, isactive

All the other fields are not needed for validation, so they are recovered from the log.

In ACID:

- POST Command

- Validate data with the index,

- Save to log

- Emit blocking events (events that need to be at the same time after the save)

- Commit

Eventual:

- Emit lazy events (events that do not need ACID, like filling external sources)


I'm glad somebody else said it. It seems lately a new meme is coming around (by many people independently) all saying:

Wait, event sourcing / immutable data doesn't scale.

I was doing event sourcing in 2010, and loved it. It was incredible. But by the time I started hearing people call it "event sourcing" I had already moved on to:

State-based, graph CRDTs.

They combine the best of event sourcing with the best of distributed state replication, and are super scalable!

Now even the Internet Archive[1] is running it (in 2014 I implemented it into a library that they are now using - https://github.com/amark/gun )

[1] https://news.ycombinator.com/item?id=17685682


They didn't even tell me what event sourcing is.


As I understand it:

Consider you have a database where you store the account balance.

If you want to update the account balance you might update the row for that customer, e.g.

    tblAccounts
    -----------
    | AccountHolderId | AccountBalance |

    update tblAccounts set AccountBalance = @NewAccountBalance where AccountHolderId = @AccountHolderId;

In an EventSource database instead you wouldn't update the AccountBalance column. You would store something like:

    AccountEvents
    -------------
    | AccountHolderId | AccountBalance + 100.00 |
    | AccountHolderId | AccountBalance + 150.00 |
    | AccountHolderId | AccountBalance -  80.00 |

Then if you want to get the current balance you can just take the opening balance and add 100, add 150 and subtract 80.

Periodically you need to collapse these as querying for the balance could end up requiring going through a large log of events. So you snapshot at some point in time. So assuming an opening balance of zero, we could snapshot the above to 170.00.
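In code terms the whole thing is a fold plus an optional snapshot; a tiny Python sketch:

    def balance(events, snapshot=0):
        # Replay the log on top of the last snapshot to get the current state.
        return snapshot + sum(e["amount"] for e in events)

    events = [{"amount": +100}, {"amount": +150}, {"amount": -80}]
    print(balance(events))        # 170
    snapshot = balance(events)    # persist this, then only replay newer events
    print(balance([], snapshot))  # still 170, without rereading old events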

It feels like you get auditing / logging of all changes out of the box. Also, if you are working in functional programming or a system where you constrain side effects / mutability as much as possible, you sort of eliminate your db as a giant mutable object. But you also get the downsides this article talked about.


One would expect all DB-based live accounting systems to store balances as a series of transactions? Even if you have a view that sums over that or a regular process to collapse old transactions, there's really no other sensible way to do it. Concurrent updates to the same row will always cause problems.

...Having written the above, I just examined the ERD for the COTS customer system in the office I'm in now. It stores not just balances but aged balances directly in the main customer table. Good grief.


Account balance is a bad example, because it cannot be solved using event sourcing. A key to event sourcing is eventual consistency. The account balance cannot be eventually consistent, otherwise it will allow double-spending. It has to be immediately consistent.


Not if the event sourcing aggregate rehydrates on the request to spend, and sends failure or success back.


Only store the events, and derive the current state by chronologically reducing the events.

That's all Event Sourcing is.


It's in the second paragraph. But I agree that it's a terrible name.


Earlier names for the same include prevalence and intent-logging.


In a nutshell, instead of storing a current state of the data in a database, you store the "event", i.e. the information about the data being changed. You can also additionally save the current state, either at the time the event is saved or independently, but that simply acts like a cache when reading data; otherwise you would have to replay the whole events history to get to the latest state.


Just think of it as double-entry bookkeeping. It's been around for a while. :) The events that we record become our source of truth.


> think of it as double-entry bookkeeping

Just think of it as bookkeeping, period: a ledger of all events. Absolutely nothing to do with double entry (which is an accounting concept; from the wiki: "The double entry has two equal and corresponding sides known as debit and credit.")


Sounds like OLTP transactions and aggregated state only with hip names.


Guys, just need to say that this article is not mine!


> If we choose to build a business critical functionality around this eventual consistency can have dire ramifications. There are use cases that availability is the needed property of a system but there are also use cases where consistency also is, where is better to not make a decision rather than making it based on stale information.

I see this issue raised quite often. If consistency is paramount you can make commands on certain aggregates be synchronous all the way to updating the read model. There's nothing that says you MUST have a queue in-between the event log for an aggregate and the logic that updates the read model. Use common sense.

> Typically shines the most when pinpointing parts of the system that benefit from it, identifying a specific bounded context in DDD terms, but never on a whole system.

I feel like Greg Young has taken great pains to make this clear. This should be taken for granted when attempting CQRS.

> Also your events will be based on a SomethingCreated or SomethingUpdated which has no business value at all. If the events are being designing like this then it is clear you’re not using DDD at all and you’re better of without event sourcing. Finally, depending on the requirements on how the synchronous the UI and the flow of the task is the eventual consistency can, and most of the times will, have a klinky feel to it and deliver a poor user experience.

If the read and write model are being updated asynchronously from the UI you're gonna have to adopt an optimistic caching scheme on the client. This is why GraphQL subscriptions are pretty much boilerplate for any client I build against a CQRS service. The Apollo client seems to handle this rather well.

> Converting data between two different schemas while continuing the operation of the system is a challenge when that system is expected to be always available. Due to the very nature of software development new requirements are bound to appear that will affect the schema of your events that is inevitable.

I hereby give you permission to use the Strategy Pattern. Problem solved.

> The events can’t be too small, neither too large they have to be just right. Having the instinct to get it right requires an extensive knowledge of the system, business and consumer applications, it’s very easy to choose the wrong design.

Greg Young and others have talked quite a bit about how to bound aggregates.

> However the events in a event store are immutable and can’t be deleted, to undo an action means sending the command with the opposite action.

This is why bookkeeping systems have the idea of "journal entries". I haven't implemented one for an event sourced system but I can see how this might work.

Overall great post. Really enjoyed that the author took the time to walk us through all of these issues. Most are non-trivial.



