We've moved down this path from a massively complicated distributed transaction environment on top of MSMQ, SQL Server etc. and you know what? With some careful design and thought about the ordering of operations and atomic service endpoints, we didn't need distributed transactions at all.
Transactions can be cleanly replaced with reservations in most cases, i.e. "I'll reserve this stock for 10 minutes", after which point the reservation is invalid. So a typical flow for an order pipeline payment failure would be:
1. Client places order to order service.
2. Order service calls ERP service and places reservation on stuff for 10 minutes.
3. Order service calls payment service (which is sloooow and takes 2-3 mins for a callback) and issues payment.
4. Payment service fails or payment fails.
5. Order service correlation times out.
6. Order service calls notification service and tells buyer that their transaction timed out and cancels the order.
7. ERP service doesn't hear back from the order service and kills reservation.
etc etc.
At step (4) you have an option to just chuck the message back on the bus to try again after say 2 minutes. If everything times out, meh.
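To make the failure flow above concrete, here is a minimal in-memory sketch. Everything in it is a hypothetical stand-in: `Erp`, `run_order`, the `pay` callback and the explicit `now` clock are my own names, not part of any real ERP or bus API. The retry loop just advances a fake clock instead of re-queuing a message.

```python
RESERVATION_TTL = 10 * 60      # "reserve this stock for 10 minutes"
RETRY_DELAY = 2 * 60           # "try again after say 2 minutes"


class Erp:
    """Hypothetical in-memory ERP: stock plus time-limited holds."""

    def __init__(self, stock_kg):
        self.stock_kg = stock_kg
        self.holds = {}            # order id -> (kg, expires_at)

    def expire(self, now):
        # Step 7: holds nobody confirmed are dropped and stock is returned.
        for order_id, (kg, expires_at) in list(self.holds.items()):
            if expires_at <= now:
                del self.holds[order_id]
                self.stock_kg += kg

    def reserve(self, order_id, kg, now):
        self.expire(now)
        if kg > self.stock_kg:
            return False
        self.stock_kg -= kg
        self.holds[order_id] = (kg, now + RESERVATION_TTL)
        return True


def run_order(erp, order_id, kg, pay, now=0):
    """Order service: reserve, then retry payment until the hold would lapse."""
    if not erp.reserve(order_id, kg, now):
        return "rejected: no stock"
    deadline = now + RESERVATION_TTL
    while now < deadline:
        if pay(order_id):
            return "paid"          # confirmation step omitted in this sketch
        now += RETRY_DELAY         # stands in for re-queuing the message
    # Steps 5-6: correlation timed out; notify the buyer and cancel.
    return "cancelled: timed out"


erp = Erp(stock_kg=10)
result = run_order(erp, "order-1", 2, pay=lambda order_id: False)
stock_while_held = erp.stock_kg
erp.expire(now=RESERVATION_TTL)    # ERP never heard back, so the hold dies
stock_after_expiry = erp.stock_kg
```

Note that the order service never sends a "release" call on failure. It simply stops, and the stock comes back when the hold expires.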
Thanks for this! It seems like a very interesting pattern and I was completely unfamiliar with it prior to reading your comment. Looks like a search for "reservation pattern" gives lots of good places to start digging, but I'm wondering if you have any favorite resources on the subject. Is there a good treatment of it in some particular book? Or maybe presentations you've found particularly enlightening?
The ERP call is atomic and happens first. All it says is "can I have 2kg of potatoes for 10 mins please?" If that returns a success code then you can process the payment.
Also if the ERP says "in ten days you can have that amount of potatoes" you can ask for a longer reservation and issue the payment later.
It's all about careful ordering and atomicity at the service level, and determining what must be done synchronously and what can be done asynchronously.
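A rough illustration of what "atomic at the service level" could mean for the reserve call: the stock check and the hold happen under one lock, so concurrent requests can't both grab the last 2kg. `PotatoStock` and its fields are made-up names for this sketch. A real ERP would more likely lean on a database constraint or a single-writer queue than an in-process lock.

```python
import threading


class PotatoStock:
    """Hypothetical ERP endpoint: check and hold happen under one lock."""

    def __init__(self, available_kg):
        self._lock = threading.Lock()
        self.available_kg = available_kg
        self.held_kg = 0

    def reserve(self, kg):
        # Atomic: no other caller can run between the check and the update.
        with self._lock:
            if kg > self.available_kg:
                return False
            self.available_kg -= kg
            self.held_kg += kg
            return True


stock = PotatoStock(available_kg=10)
results = []


def request_two_kg():
    # list.append is safe to call from several threads in CPython.
    results.append(stock.reserve(2))


threads = [threading.Thread(target=request_two_kg) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

granted = results.count(True)      # ten callers wanted 2kg each from 10kg
```

Whichever order the threads run in, only five of the ten requests can succeed, and the caller gets a plain yes/no before touching the payment service.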
I was interpreting the parent poster's question to mean:
1. reservation is placed.
2. Payment succeeds, but either success is not known, the process requesting payment crashes before the response, etc.
3. ???
Since we never got to telling ERP "hey, that reservation will be permanent because the payment succeeded", but the payment succeeded… what do you do? Does the reservation expire (but my potatoes!)? How do you even know that the payment succeeded, if perhaps a network connection goes dark and requires 2h to fix?
In this case it's not really any different from other distributed transaction systems... another process (potentially a manual one) has to review and correct things...
What happens when your payment processor succeeds in processing the transaction, but you don't get the success code? You either retry, confirm or correct. Likewise, if you don't get confirmation that your reservation was made permanent, one would assume you retry the commit; if it was already committed, the ERP service can return the appropriate response.
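Here's a small sketch of that retry idea, assuming the ERP's confirm call is idempotent. `Reservations`, `confirm_with_retry` and the reply strings are invented for illustration. The point is only that repeating a commit that already went through is harmless, and that a confirm arriving after the hold lapsed gets a distinct answer the caller can act on (e.g. by refunding).

```python
class Reservations:
    """Hypothetical ERP-side store with an idempotent confirm call."""

    def __init__(self):
        self.state = {}   # reservation id -> "held" | "confirmed" | "expired"

    def hold(self, rid):
        self.state[rid] = "held"

    def expire(self, rid):
        # Only unconfirmed holds can lapse.
        if self.state.get(rid) == "held":
            self.state[rid] = "expired"

    def confirm(self, rid):
        current = self.state.get(rid)
        if current == "held":
            self.state[rid] = "confirmed"
            return "confirmed"
        if current == "confirmed":
            return "already_confirmed"   # retry of a commit that went through
        return "expired"                 # too late or unknown id: refund


def confirm_with_retry(send, rid, attempts=3):
    """Retry confirm until a reply arrives; `send` may raise TimeoutError."""
    for _ in range(attempts):
        try:
            return send(rid)
        except TimeoutError:
            continue
    return "unknown"   # out of attempts: hand over to manual review


erp = Reservations()
erp.hold("r1")
calls = {"n": 0}


def lossy_send(rid):
    # Simulates a commit that succeeds on the ERP but whose first reply is lost.
    calls["n"] += 1
    reply = erp.confirm(rid)
    if calls["n"] == 1:
        raise TimeoutError("reply lost")
    return reply


outcome = confirm_with_retry(lossy_send, "r1")

erp.hold("r2")
erp.expire("r2")
late = erp.confirm("r2")   # payment came back after the hold lapsed
```

If every attempt times out the sketch returns "unknown", which is where the manual review process mentioned above would take over.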
I missed the confirmation step above. That would happen once the payment has been correlated.
At step 3 in your list above the payment would time out and a refund would be issued. Payments usually time out as well, so you can reserve cash (pre-auth in banking terms). So we end up with stacks of reservations.
If something breaks you can retry within a reasonable limit or wait for everything to drop all the reservations.