Saga Pattern Made Easy

kgeist · on May 30, 2023

On a previous project we tried to implement our own sagas, and I remember that compensating actions were the hardest part to get right. For example, we had a saga which added participants to a meeting as one of its steps. If the saga failed, we had to remove the participants from the meeting. The module/service which implemented meetings was independent/self-contained and had no knowledge of sagas. So while a participant could be added by that particular saga, they could also be added by other means, because there were other entrypoints to the meetings service. A failing distributed saga undoing its actions could therefore remove a participant who had been added to the meeting outside of the saga. I don't remember if we ever solved it, or what the proper way to deal with it is.

Basically, a distributed saga can introduce what look like strange side effects in other workflows running in parallel, especially if the service is designed to be self-contained without knowledge of sagas (so participation is not marked as "added via saga, can be deleted any time due to compensating actions"). Maybe add some sort of reference counting, but then again we add knowledge about sagas to the service, which is not always possible.

Another issue is notifications: you send your user an email saying "hey, you were added to this meeting", they follow the link, and it says "access denied", because a saga rolled it back.
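The reference-counting idea floated in this comment could look roughly like the sketch below. It is only an illustration: the `Meeting` class, its `add`/`release` methods, and the `source` labels are hypothetical and not the API of any real meetings service. Each participant keeps a set of claims, so a saga's compensation releases only the saga's own claim.

```python
from collections import defaultdict


class Meeting:
    """Hypothetical meeting that remembers who asked for each participant."""

    def __init__(self):
        # participant -> set of sources (saga ids, "manual", ...) holding a claim
        self._claims = defaultdict(set)

    def add(self, participant, source):
        self._claims[participant].add(source)

    def release(self, participant, source):
        """Drop one source's claim; the participant leaves only when no claims remain."""
        claims = self._claims.get(participant)
        if claims is None:
            return
        claims.discard(source)
        if not claims:
            del self._claims[participant]

    def participants(self):
        return set(self._claims)


meeting = Meeting()
meeting.add("alice", source="saga-42")  # added by the saga
meeting.add("alice", source="manual")   # also added through another entrypoint
meeting.add("bob", source="saga-42")    # added only by the saga

# The saga fails and compensates: it releases only its own claims.
for person in ("alice", "bob"):
    meeting.release(person, source="saga-42")

print(sorted(meeting.participants()))
```

As the comment points out, the cost is that the service now has to know who is calling it, which is not always possible for a self-contained service.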

azurelake · on May 31, 2023

The UX of the notification case could reasonably be solved either by sending the notification as the last step, or, if that’s not possible, by sending a second notification saying that they were removed and why.

What did the API of the meeting service look like specifically?

mrkeen · on May 30, 2023

I've never found myself coding undo actions, because it seems that if a forward step can fail, so can a backward step. And now you've got two things to debug.

anyfoo · on May 30, 2023

"I've never found myself packing a backup parachute, because if one parachute can fail, so can its backup one. And now you've got two things to worry about."

mrkeen · on June 4, 2023

Correct. I stay on the ground.

paulddraper · on May 30, 2023

You must for transactions (like financial transactions).

Let's say you're making a plane booking service (kiwi.com clone).

You charge the customer, then book the flight. But if booking the flight fails, you must refund the customer.

mrkeen · on June 4, 2023

I'm not saying it's not necessary; I'm saying it's not sufficient.

    pay();
    try {
       book();
    } catch {
       try {
          refund();   // the undo step can fail too
       } catch {
          // "must refund the customer" implies we can't reach here
       }
    }

lorendsr · on May 30, 2023

At some point, if you can't automatically fix something, you have to stop and report to a human for manual intervention/repair. While a saga doesn't guarantee that you avoid manual repair, it significantly reduces the need for it. If each of these has a 1% chance of non-retryable failure:

    Step1
    Step2
    Step1Undo

then this has a 1% chance of needing manual repair (it's okay if Step1 fails, but if Step1 succeeds and Step2 fails, we need to repair):

    do Step1
    do Step2

and this has a 0.01% chance (we only repair if both Step2 and Step1Undo fail, 1% * 1%):

    do Step1
    try {
      do Step2
    } catch {
      do Step1Undo
    }
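The arithmetic behind those two figures can be checked with a few lines. This is a sketch that assumes the failures are independent and that every step has the same 1% failure chance; the variable names are made up for illustration.

```python
p_fail = 0.01  # assumed chance that any single step fails non-retryably
p_ok = 1 - p_fail

# Without an undo: repair is needed when Step1 succeeds and Step2 fails.
without_undo = p_ok * p_fail

# With an undo: repair is needed only when Step1 succeeds, Step2 fails,
# and Step1Undo fails as well.
with_undo = p_ok * p_fail * p_fail

print(f"without undo: {without_undo:.4%}")
print(f"with undo:    {with_undo:.4%}")
```

Including the requirement that Step1 succeeds first, the exact values are slightly below the rounded 1% and 0.01%, but the hundredfold reduction is the same.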

nivertech · on May 30, 2023

There is also the case when Step1 was successful, but the Saga Orchestrator (or Saga participant in case of Choreography) for some reason (like a communication error) doesn't know about it.

In case Step1's service doesn't expose an API to poll its status, then the only recourse is to execute it again (with the same input key, assuming it's idempotent ;)
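A minimal sketch of that recourse, assuming the service deduplicates by an idempotency key. `Step1Service`, its `execute` method, and the key format are hypothetical stand-ins, not a real API.

```python
class Step1Service:
    """Hypothetical service that deduplicates requests by idempotency key."""

    def __init__(self):
        self._results = {}   # idempotency key -> stored result
        self.executions = 0  # how many times the side effect really ran

    def execute(self, key, payload):
        if key in self._results:
            return self._results[key]  # replay: return the earlier result
        self.executions += 1
        result = f"done:{payload}"
        self._results[key] = result
        return result


def call_with_lost_response(service, key, payload):
    """Simulate a call whose side effect happens but whose reply is lost."""
    service.execute(key, payload)
    raise ConnectionError("response lost on the way back")


service = Step1Service()
key = "saga-42/step1"  # the same key is used for every attempt of this step

try:
    call_with_lost_response(service, key, "reserve-seat")
except ConnectionError:
    # The orchestrator cannot tell whether Step1 ran, so it sends it again.
    result = service.execute(key, "reserve-seat")

print(result, service.executions)
```

The second call returns the stored result, so the side effect runs only once even though the request was sent twice.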

azurelake · on May 30, 2023

There's nothing to debug because a failure during a saga is a totally reasonable and expected thing to happen. Take the example in the article.

1. You book a flight. You successfully reserve a seat.

2. You book a car. You successfully reserve a sedan.

3. You try to book a hotel room. The room that you wanted was booked while you were booking your flight, and there aren't any more available.

You obviously don't want the car or flight anymore, and you want to cancel them without a human having to manually fix it.
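The flight/car/hotel scenario above can be sketched as a list of (action, compensation) pairs. This is a plain-Python illustration rather than Temporal code; `run_saga` and the step functions are hypothetical, and it deliberately ignores the case where a compensation itself fails.

```python
class NoRoomsAvailable(Exception):
    pass


def run_saga(steps):
    """Run (action, compensation) pairs; on failure, undo completed steps in reverse."""
    compensations = []
    try:
        for action, compensation in steps:
            action()
            compensations.append(compensation)
    except Exception as error:
        for compensation in reversed(compensations):
            compensation()
        return f"rolled back: {error}"
    return "completed"


log = []


def book_hotel():
    raise NoRoomsAvailable("no hotel rooms left")


steps = [
    (lambda: log.append("flight booked"), lambda: log.append("flight cancelled")),
    (lambda: log.append("car booked"), lambda: log.append("car cancelled")),
    (book_hotel, lambda: log.append("hotel cancelled")),
]

outcome = run_saga(steps)
print(outcome)
print(log)
```

Only the steps that actually completed are compensated, so the hotel cancellation never runs.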

richdougherty · on May 30, 2023

I think mrkeen is talking about a failure when handling a failure. E.g. when a cancellation step fails, what do you do?

The answer is, you model those as well and work out what to do. But it's more messy than you might think if you just model the first-order failure paths.

azurelake · on May 30, 2023

I disagree, and here's why. There are basically two reasons a cancellation step would fail.

1. A misunderstanding of the business rules. In the flight example, you thought that flights were cancellable, but actually the airline only offers nonrefundable seats.

2. System type errors, e.g. network outages.

If you get a type 1 failure, that's an error that gets ingested in your error monitoring service, and is a bug that needs to be fixed. If you get a type 2 failure, idempotent cancellation (which is necessary for this to work) will eventually get you to your desired state. Either way, you shouldn't need to model deeper into the state graph.
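The split between the two failure types might be handled along these lines. The exception classes and `cancel_with_retry` are hypothetical names; the sketch assumes the cancel call is idempotent, so repeating it is safe.

```python
import time


class TransientError(Exception):
    """Type 2: network outage, timeout, and similar."""


class BusinessRuleError(Exception):
    """Type 1: e.g. the seat turned out to be nonrefundable."""


def cancel_with_retry(cancel, max_attempts=5, base_delay=0.01):
    """Retry an idempotent cancel on transient errors; let business errors propagate."""
    for attempt in range(1, max_attempts + 1):
        try:
            return cancel()
        except TransientError:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff


attempts = {"count": 0}


def flaky_cancel():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientError("network outage")
    return "cancelled"


def nonrefundable_cancel():
    raise BusinessRuleError("seat is nonrefundable")


print(cancel_with_retry(flaky_cancel))

try:
    cancel_with_retry(nonrefundable_cancel)
except BusinessRuleError as error:
    print(f"report to error monitoring: {error}")
```

Transient failures are retried until the cancel goes through, while the business-rule failure is surfaced immediately as a bug to fix.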

mrkeen · on May 30, 2023

> If you get a type 2 failure, idempotent cancellation (which is necessary for this work)

That would have been a good article. The saga pattern could have just been a footnote to it.

fortunaTemporal · on May 30, 2023

Here's the mind-blowing thing: Temporal handles those type 2 failures for you. So they are the footnote, and then the saga pattern can take up the whole article

sublinear · on May 30, 2023

You should be able to abort everything at any time and still revert to the old state regardless of external service failures. Even if the database went down you have the initial state queued to be restored when it's back up.

Instead of untangling the mess, just cut the gordian knot and throw a nice error of what failed and what was aborted.

fortunaTemporal · on May 30, 2023

But in the scenario that the Saga pattern handles, you have at least TWO databases, and multiple processes can be modifying them in the meanwhile. It IS a gordian knot and you don't have a known clear place to restore from.

mrkeen · on May 30, 2023

Example?

segmondy · on May 30, 2023

Your undo operations will be very simple.

So instead of having complex logic, have a simple lambda function that talks to a queue. That's it. It accepts an undo command: you read a command and you stuff it in the queue, done. No DB, no servers. If you were running this yourself, you would have a simple (distributed) API that does the same to a distributed queue/cache. Done.

Your complex job can now pick up the undo commands from the queue and execute with logic to retry if for some reason it fails.
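A single-process sketch of that arrangement, using Python's standard `queue.Queue` in place of a distributed queue. The function names, command shape, and handlers are assumptions made for illustration.

```python
import queue

undo_queue = queue.Queue()


def accept_undo(command):
    """The 'simple lambda': accept an undo command and enqueue it, nothing else."""
    undo_queue.put(command)


def process_undo_queue(handlers, max_attempts=3):
    """The worker: execute each queued undo, retrying a few times on failure."""
    unresolved = []
    while not undo_queue.empty():
        command = undo_queue.get()
        for _ in range(max_attempts):
            try:
                handlers[command["type"]](command)
                break
            except Exception:
                continue
        else:
            unresolved.append(command)  # all attempts failed: hand over to a human
    return unresolved


calls = {"refund": 0}


def refund(command):
    calls["refund"] += 1
    if calls["refund"] < 2:
        raise RuntimeError("payment provider unavailable")


def cancel_flight(command):
    raise RuntimeError("airline API is down")


accept_undo({"type": "refund", "order": "A1"})
accept_undo({"type": "cancel_flight", "order": "A1"})

unresolved = process_undo_queue({"refund": refund, "cancel_flight": cancel_flight})
print(unresolved)
```

The refund succeeds on its second attempt; the flight cancellation exhausts its retries and ends up in the list for manual repair.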

joesb · on June 1, 2023

It's not completely about handling unplanned failure, but handling alternative path when the condition for one path is not met. For example, when you perform `withdrawMoney()` it can fail because there's not enough money in the account. This has nothing to do with your coding failure.

If you have if/else in your code, you don't think "If one path fails, so can the other path, so I never handle the other path".

sublinear · on May 30, 2023

Undo actions are not necessarily about mitigating failure, but just getting to another state in the state machine. Not all failed forward steps are bugs nor always need a retry before an undo.

Regardless of whatever happens, a failed transaction state should always be possible without affecting data integrity.

liampulles · on May 31, 2023

The undo will hopefully catch the slim case when a rollback is needed. If the undo fails (slim slim case) then you flag for a human.

It's just an act of trying to automate the resolution of error scenarios to reduce human effort.

koromak · on May 30, 2023

I think in some situations you have to try.

Billing is one, for example.

shoo · on May 30, 2023

It's good that this article discusses idempotency.

Suppose Temporal is correctly configured and operated, and gives you reliable and robust workflow execution. Your custom workflow executes code that attempts to perform operations on your other services over the network, via those services' APIs. Temporal itself doesn't know or model anything about your API calls, or your other services. Temporal uses the abstraction of an "activity" to wrap operations with side-effects, such as API calls to your services that your workflow might need to make. Temporal can guarantee that it will keep re-trying to execute an activity until it succeeds, but this means that the code in your activity may get executed multiple times.

If you want each state-changing operation in your workflow to execute exactly once, so your overall system has exactly-once execution semantics, it is your responsibility to ensure that the APIs exposed by your other services and used in your workflow are idempotent, and that they are called correctly from the workflow, passing an idempotency key that does not vary when a Temporal "activity" wrapping an API call is retried.
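The difference a stable key makes can be shown with a toy retry loop. Nothing here is Temporal's API: `PaymentService` and `run_activity` are hypothetical, and a UUID generated once stands in for whatever stable identifier the workflow would really supply.

```python
import uuid


class PaymentService:
    """Hypothetical service whose charge API is idempotent per key."""

    def __init__(self):
        self.charges = {}  # idempotency key -> amount

    def charge(self, key, amount):
        # A repeated key is ignored, so the customer is charged once per key.
        self.charges.setdefault(key, amount)


def run_activity(service, amount, attempts, key_per_attempt):
    """Simulate an activity body that gets executed `attempts` times due to retries."""
    stable_key = str(uuid.uuid4())  # chosen once, before any attempt
    for _ in range(attempts):
        key = str(uuid.uuid4()) if key_per_attempt else stable_key
        service.charge(key, amount)


wrong = PaymentService()
run_activity(wrong, 100, attempts=3, key_per_attempt=True)

right = PaymentService()
run_activity(right, 100, attempts=3, key_per_attempt=False)

print(len(wrong.charges), len(right.charges))
```

Generating a fresh key on every attempt defeats the service's deduplication and charges three times; the stable key charges once.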

For another interesting (temporal-agnostic) discussion of how to build a system achieving similar 'exactly once' behaviour, there's a good infoq podcast interview with Jason Maude: https://www.infoq.com/podcasts/cloud-based-banking-startup-j...

mkleczek · on May 31, 2023

A long, long time ago we used to have two-phase commit protocols and XA. I am afraid most sagas are 2PC in disguise - just requiring a programmer to explicitly implement rollbacks.

The situation is similar to the whole "database inside out" hype that drives application programmers to re-implement proper DBMS.

I am pretty sure the IT hype wheel will turn around and someone will sell 2PC as a new shiny thing.

lorendsr · on May 31, 2023

2PC tends to have limited throughput due to the participants needing to hold a lock between the voting and commit phase, and all the participants need to support the protocol. Sagas work across different services and data stores and can have high throughput.

However, if all of the data you need to update is in a single database that supports atomic commits, I'd go with that over sagas.

mkleczek · on May 31, 2023

> 2PC tends to have limited throughput due to the participants needing to hold a lock between the voting and commit phase

Scalability depends on lock granularity, what's more...

> Sagas [...] can have high throughput

There is no real difference as sagas in practice implement locking in disguise - take the scenario of flight/hotel/car booking:

Once you book a hotel - this particular resource (a room at a particular time) is effectively locked. Cancelling the booking (because other participants failed in saga) is effectively releasing the lock. The room (resource) is locked for the duration of the whole process anyway (as no other customer can book it during this time).

The downside of sagas is that a programmer is forced to explicitly handle all failure scenarios - which costs development time, is error-prone etc.

liampulles · on May 31, 2023

As a first stab, I think an alert/ticket is sensible for an activity undo operation. Deal with it manually a couple of times, discuss with the business, and then you can implement a decent undo operation (with an alert if that fails).

heywhatupboys · on May 30, 2023

Is this ACID? I've gotten so used to databases being the only true failsafe rollback systems that programming business-critical components in things like Java seems scary. Surely we need at least an event queue!

Anyone got some insight? Is this just a toy?

lorendsr · on May 30, 2023

Sagas are for when you can't do an update in an ACID transaction, for example when updating state across different types of data stores.

If you're asking whether the catch clause in a Temporal Workflow saga is guaranteed to execute, the answer is yes. The way it's able to guarantee this is by persisting each step the code takes and recovering program state if a process crashes, server loses power, etc. For an explanation of how this works, see: https://temporal.io/blog/building-reliable-distributed-syste...
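The general idea of persisting each step and replaying after a crash can be illustrated with a toy journal. This is not Temporal's implementation or API: the `Journal` class, the `Crash` exception, and the step functions are all hypothetical, and the journal lives in memory where a real system would persist it.

```python
class BookingFailed(Exception):
    pass


class Crash(Exception):
    """Stands in for the process dying (not a business failure)."""


class Journal:
    """Toy durable log: remembers the outcome of every completed step."""

    def __init__(self):
        self.outcomes = {}

    def step(self, name, fn):
        if name not in self.outcomes:
            try:
                self.outcomes[name] = ("ok", fn())
            except Exception as error:
                self.outcomes[name] = ("error", error)
        kind, value = self.outcomes[name]
        if kind == "error":
            raise value  # replay the recorded failure without re-running the step
        return value


executed = []


def charge():
    executed.append("charge")


def book():
    executed.append("book")
    raise BookingFailed("flight is full")


def refund():
    executed.append("refund")


def booking_workflow(journal, crash_before_refund=False):
    journal.step("charge", charge)
    try:
        journal.step("book", book)
    except BookingFailed:
        if crash_before_refund:
            raise Crash("power lost")
        journal.step("refund", refund)


journal = Journal()
try:
    booking_workflow(journal, crash_before_refund=True)
except Crash:
    pass  # the process is gone; only the journal survives

booking_workflow(journal)  # restart: recorded steps are replayed, not re-executed
print(executed)
```

On the second run the charge and the failed booking come from the journal, so execution lands in the `except` branch again and the refund runs exactly once.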

agumonkey · on May 30, 2023

Seems like a distributed stack unwind

fortunaTemporal · on May 30, 2023

Author of the article here. Yeah, @agumonkey I like that analogy a lot!

Generally, that's a good way of thinking about it. The one additional bit of nuance is it's like a "safe" stack unwind while other processes could be still modifying databases at the same time, so it's not a complete "rollback" of the whole world if that makes sense.

agumonkey · on May 30, 2023

Thanks, you're probably right, there's more to it. It's a thrilling topic. Are there any other patterns or abstractions for controlling "distributed state" (apologies if I'm twisting things too much again) between agents to keep things in a correct order?

fortunaTemporal · on May 31, 2023

One pattern is having a Workflow that runs as long as the lifetime of a domain object and holds a conceptual lock on that object—it receives requests to modify the object, and makes sure to only perform one operation at a time. (like the state pattern on a particular agent)

Also related: Signals are events that you can send to Workflows and between Workflows, and they’re always delivered in the order they’re received.

More generally, for a handy reference of Distributed Systems patterns, check out https://microservices.io/patterns/data/saga.html (though I personally find his diagrams a bit...overwhelming) and the Microsoft writeups: https://learn.microsoft.com/en-us/azure/architecture/pattern...

agumonkey · on May 31, 2023

oh that's cool, thanks a lot

lorendsr · on May 30, 2023

If I'm getting your point right, I agree! If you have the workflow / durable execution primitive to depend on (a durable function is guaranteed to complete executing), then there are a lot of pieces of distributed systems stacks that you no longer need to use. Your durable code is automatically backed by the event sourcing, timers, task queues, transfer queues, etc that Temporal internally uses to provide the guarantee, so that you don't need to build them yourself.

azurelake · on May 30, 2023

Nope, it's not a toy. It's a fork of Uber's workflow engine, Cadence.

You don't need to explicitly interact with an event queue because it's a higher level abstraction that sits on a queue. In fact, that’s the big value add.

ACID doesn't help you once you're trying to coordinate actions across systems, like the example in the article.

mrkeen · on May 30, 2023

> You don't need to explicitly interact with an event queue because it's a higher level abstraction that sits on a queue. In fact, that’s the big value add.

The event queue is an even bigger value add:

* It's the audit log

* A human reading the event queue has final say over the 'true state' of the system (insofar as such a thing exists in a distributed system)

fortunaTemporal · on May 31, 2023

Temporal has an event queue under the covers, so you don't have to implement (and debug) that yourself.

nivertech · on May 30, 2023

Not sure about Temporal's implementation, but classic Saga pattern is ACD (without Isolation), so overlapping or concurrent Sagas may overwrite each other.

There are several Saga-related application-specific patterns called countermeasures, which help to somewhat mitigate this problem.

Also unlike RDBMS with their ACID transactions, Saga design forces you to understand your business requirements better. Which steps are more likely to fail? Which steps are pivotal (i.e. points of no return)? Which steps are riskier or more valuable for business, etc.

fortunaTemporal · on May 30, 2023

Agree with what other commenters have said. Also wanted to say that with Temporal you can code Sagas in Go, Java, TypeScript, and PHP, and they're working on expanding to other languages as well.