Differential: Type safe RPC that feels like local functions (differential.dev)
77 points by interpol_p 10 months ago | 138 comments



Remote procedure calls (or whatever it is called) simply can never really feel like local functions. It has been tried again and again for decades.

Maybe call it "removes as much friction as possible compared to local functions" but please don't call it "feels [exactly] like local functions". It's simply impossible and pretending so will just cause trouble.


And even then, I no longer believe that "removes as much friction as possible compared to local functions" is a desirable goal in the first place.

Over the course of my short 20-year career, I've encountered far too many applications that accumulated terrible performance or reliability problems that were directly attributable to their authors feeling a bit too comfortable about getting chatty with the slow, unreliable remote procedure calls.

By contrast, the most successful microservices shop I ever worked at was very conscientious about doing nothing to hide the messiness of RPC. (Ease the implementation, yes, but not hide the messiness.) And that seemed to foster a culture of being much more thoughtful about interaction patterns and data flows and how they might contribute to or detract from the overall performance and stability of the system.


One of the authors here. I definitely agree that you get markedly lower reliability and performance once you move a local function call to be remote. So, we're definitely not advocating that moving a local function call to be remote is inherently virtuous.

However, I've found the abstraction to be useful when someone _has_ already made the decision to refactor their service out to be a separate service.


I'll disagree with that. In fact, I'd argue the vast majority of "local" functions are no better than RPC calls and shouldn't be treated any different.

Sure, `sprintf` will run synchronously and finish in known time, but `read` may as well end up waiting for seconds for an HDD to spin up, or for a mounted NFS filesystem to respond. `read` is not any different from an RPC call. And, in fact, `read` can fail in just the same ways as an RPC call can. So can `write`, which is why `sync` exists.

Software ignoring these potential issues, and pretending they don't exist, is the source of many unnecessary software hangs and crashes, such as Windows Explorer crashing if your SMB share responds too slowly or Linux being unable to unmount an NFS share if the connection breaks badly.

The solution IMO is that we should finally stop hiding asynchronicity behind synchronous facades. If coroutines, promises, async programming, etc are first-class features, it becomes much easier to cleanly model these potential errors and in turn handle them correctly.

And once you have that, an rpc call

    async fn rpcMakePurchase(user: &User, order: &Order) -> Result<Purchase>
does feel just like a local function

    async fn read(path: impl AsRef<Path>) -> Result<Vec<u8>>


Actually, I think we are on the same side. :-)

There is still a small difference though: when I execute a write and sync, then I'm pretty sure it either succeeded or didn't. I'm not sure if there is any way for the disk to say "don't know" which the OS will then pass to my program. Even if there is, I guess such a case is so rare that it is probably excluded in any error model. Like cosmic rays. And I can always retry, since it's very, very unlikely that there is a connection issue only between my program and the disk but not between other programs and that disk - since they run on the same machine.

Over a network things are different, because the connection might be gone for a long time, and now I'm not aware of the state and I can't check it. I also can't tell if the connection might come back; the machine I was talking to might have burned down. If I, myself, burn down, then I don't need to worry anyway.

That's the difference. This is specific for write/sync though - as you explained it, just because things are running on the same machine does not mean things are necessarily more reliable than a "remote" call.


Why does there have to be a difference?

When modeling rpc calls as async calls returning a result, you can be just as sure that it has completed (or failed) as with a local call.

And considering write can use an underlying mounted network share, a spun-down HDD, or a thumb drive on a failing USB port, the same failure modes exist in either case.

In either case you're just submitting commands to another device's queue (and waiting for a result) over a hot-pluggable connection.


Because either the disk is down for everyone on that machine or for no one. That it is only down for my program is theoretically possible (e.g. a bug in the OS or driver) but it's so unlikely that it's usually moved into the "cosmic rays" category.

This is not true for networks (the internet).


I don't disagree with that, but does that matter for your program?

Your code still has to handle the same failure modes regardless.

I've had many situations where e.g. reading locally cached files on a smartphone eMMC had worse latency and throughput than doing the same read over the network on the server. The cache actually made performance worse in every way.


Yeah it matters. It's the reason why e.g. postgres offers data consistency and still operates in parallel on the disk.

If it were the same doing this over a network, then we could have a distributed, consistent postgres. What would I give for that. :-)


Well, I wouldn't use that as an example, as Postgres actually misimplemented write/sync in the past, causing data corruption ;)

Regarding having a distributed postgres, that's not an issue actually. As long as all writing workers have access to the same synchronization primitives you can easily use postgres with e.g. k8s volumes in single writer, multiple reader mode. And single master, multiple read replica postgres deployments are common too.

The guarantees you're talking about aren't given by the storage implementation, but by the fact that all write workers run on the same machine.


There is no distributed postgres that behaves like a single one (except for being a bit slower sometimes or so).


All I claimed was that, if you let postgres run against network mounted storage it works the same as if you used it with local storage.

Do you see what I meant?


Sure, but what's the point? There still is no distributed postgres. There are read replicas, but that's not the same thing.

So why do you think that is?


I dunno what the point is supposed to be, you brought it up


Yes. More on that from the creator of Erlang:

http://armstrongonsoftware.blogspot.com/2008/05/road-we-didn...

Bye, Joe.


One of the authors here.

We didn't know we were going to be posted on HN, and we're still iterating on copy that "lands". I don't think we've found the words yet.

I agree with the premise that RPC has markedly different characteristics. The actual implementation [1] makes it clear that you're connecting through a client interface.

Thanks for the feedback on calling it "removes as much friction as possible compared to local functions". I think we'll trend this way.

[1] https://docs.differential.dev/getting-started/quick-start/#3...


I once wrote a system for distributed data processing (8-10 years ago). There was a base class that, when inherited from, would via interception reroute a call to any virtual method to a server with the same interface loaded (plug-in architecture). The server would then delegate the method to one of its worker clients, which had the (same) required plugin(s) loaded. It worked over TCP on a local intranet. Once the admin-server-worker plugin system was figured out, projecting a virtual method was fairly transparent. There's always going to be some boilerplate!


I suspect parent was referring more to the runtime characteristics of latency, error handling, retries, load balancing, etc… more than the syntax.

Nonetheless, what you built sounds quite interesting: can you say anything more about the language, tech stack, etc…?


Not OP but I too have built this, basically a few hundred lines of python to implement something like json rpc (plus introspection to answer help requests / publish typed api info) on top of AWS lambda.


I had to replace a .NET Remoting-based system with thousands of individual calls, with all manner of in-process-like conventions, with an HTTP layer, without changing client code at all, in a few weeks.

The result was good but I still don't like transparent rpc (on large dev teams anyway).


There's an exception to this rule if the underlying network does not go wrong.

In particular, if the "network" is something local to one machine, pcie or some cross-chip fabric thing, where the failure modes of the "network" are the box falls over, then RPC works as one might hope. You don't have to mangle functions to take account of extra network errors because there are none.

The specific use case I'm interested in is heterogenous compute on shared memory machines. Code running on a GPU can call `int rc = fprintf(stderr, "oh dear");` and have no idea whether that runs code on a CPU or not. Code running on a CPU can call some linear algebra function and either run it locally or on a GPU, unobservable to the caller.

That is a good trick. It does make for an interesting debugging experience where you lose track of where code is executing.

----

There's a less esoteric application in the area of the OP for pure functions where you don't need the result immediately. That goes something like:

1. Call the function and get back a promise or similar. All runs locally, no failures

2. Something in the background maybe does the work for you

3. Ask for the result. If it's not available yet, do the work locally

That has the same zero failure modes property because it'll ultimately run all the work locally if the network offloading never works out. It only really makes sense if the functions can be called multiple times (e.g. locally and remote at the same time) without doing harm, i.e. pure functions. It costs latency and gains throughput. For some batch computation work that is great.
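
Roughly, a minimal TypeScript sketch of that pattern - `offloadToCluster` is hypothetical, standing in for whatever does step 2:

    // Minimal sketch, assuming a hypothetical offloadToCluster() that tries to
    // run the pure function elsewhere and may fail or never resolve at all.
    declare function offloadToCluster<T>(fn: () => T): Promise<T>;

    function submit<T>(fn: () => T): { get(): T } {
      let remoteResult: { value: T } | undefined;

      // Steps 1-2: fire and forget. A network failure or a lost machine just
      // means remoteResult stays undefined; nothing can fail locally here.
      offloadToCluster(fn).then((value) => { remoteResult = { value }; }, () => {});

      return {
        // Step 3: take the remote answer if it arrived, otherwise compute
        // locally. Because fn is pure, running it twice is harmless.
        get: () => (remoteResult ? remoteResult.value : fn()),
      };
    }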


Won't the proposed implementation risk doing duplicate work? e.g. another machine starts on your work, but then you ask for it before it's done.

So you'd probably need the "not available, but already 'stolen' case". For example, the current machine could look for work to steal itself. But this probably will increase latency a lot, so it would only make sense if the batches are huge.


The sketch above makes duplicated work and inefficiency very likely. It's explicitly choosing at-least-once execution.

The problem with the "already scheduled" pattern is when the machine holding the task falls over.

Not a good choice for bank transactions. A good option for something like a distributed build farm - it doesn't matter hugely if you compile some piece of program repeatedly, but it matters a lot if you send the only copy of some source to a machine that immediately goes offline.

Mainly bringing it up as a case in which RPC works beautifully - remote machine failures can be completely hidden by a willingness to do the work locally instead.


It's really hard to have exactly-once delivery[1].

But exactly once processing is viable.

With Differential, the implementation does allow for idempotency in cases where you need exactly-once processing. And it does handle the "not available, but already 'stolen' case".

Disclaimer: I'm an author.

[1] https://news.ycombinator.com/item?id=34986691


> There's an exception to this rule if the underlying network does not go wrong.

Fair enough. Maybe we shouldn't call it remote procedure calls then, because it's not remote anymore. The word remote exists to make the distinction compared to local. If there is no distinction in a given context, there is no need to use that word from the beginning.


Maybe? Say one processor is an aarch64 core and another is x64, but someone has put them on the same memory subsystem and the system as a whole tends to die if either falls over. And then you want to make a function call from one to the other - it isn't as remote as another machine, but you can't pass things in registers either. "Remote" call seems a reasonable description to use for that but as the semantics are indeed different perhaps a new term should be used.


>There's an exception to this rule if the underlying network does not go wrong.

There is another big restriction. The sender and receiver have to be in the same process. Else it's possible that the receiver does not exist and there is no way to start the receiver back up.


Agree. Hard to abstract away a 3-4 orders of magnitude latency difference. However, even the most obvious/absurd/ugly shell-out-to-curl-and-hand-parse-the-results code is going to get wrapped in a normal-looking function. Maybe the solution is naming or tooling where: 1) it is easy for a developer to see if a network call exists anywhere in the stack, and 2) it isn't possible to introduce a network call into an existing function without breaking its contract. This could be as simple as a viral naming convention (an "_rpc" suffix, for example).
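
For illustration, a tiny TypeScript sketch of the viral-suffix idea (the endpoint and UserProfile type are made up):

    interface UserProfile { id: string; name: string }

    // Convention: anything that may touch the network carries an _rpc suffix.
    // A grep for "_rpc" then finds every network call site, and sneaking a
    // network hop into an existing function forces a rename that breaks callers.
    async function getUserProfile_rpc(userId: string): Promise<UserProfile> {
      const res = await fetch(`https://api.example.com/users/${userId}`);
      if (!res.ok) throw new Error(`profile fetch failed: ${res.status}`);
      return (await res.json()) as UserProfile;
    }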


One of the authors here. Thanks for the insight.

I generally agree with the fact pretending it's like a local function will cause trouble down the track. After all, you're introducing a number of extra failure modes.

The client library however does give some protections to the developer through the type system - although it's not perfect.

Hence the wording "feel like". It's not actually local function calls. But it abstracts away a lot of the complexity.


Try DCOM


Yeah, exactly. DCOM was a failure for the same reason that every other RPC thing is a failure. It turns out that introducing cascading failure or partial failure opportunities due to synchronous remote calls was never a good idea. It was always, and still is, treating the fallacies of distributed computing as first principles.


"Predictive Retries: With the help of AI Differential detects transient errors (database deadlocks) and retries the operation before the client even notices. See Predictive Retries for more." [https://docs.differential.dev/]

"If it's predicted to be transient, Differential will retry the operation on a healthy worker before the client even notices." [https://docs.differential.dev/advanced/predictive-retries/]

I… er… wat? Does that mean this behavior is nondeterministic, driven by what some extra component cooks¹ up on a per-call basis?

Doing this without developer input is such a horribly bad idea, and then debug it when something (not even in the RPC layer) goes wrong?

[Edit:] "You can turn this feature on for your cluster using the Console. It's off by default." – a breath of relief.

--

¹ I really had to resist the temptation to say "hallucinates" there.


One of the authors here.

So, the implementation _is_ deterministic per observed error: it caches the previously cooked-up verdict of "is this retryable or not" and applies it to the next occurrence of that error, without reprocessing it for every function call.

Whether it does correctly infer retry-ability - well, that's another question.

It's an opt-in feature, and it's not default-on. So, it does do this _with_ developer input. If you prefer to not use this abstraction and have total control over error handling, you definitely can do that. (And it's the default)

If I reduce the tradeoffs here, it's between: 1. Explicitly handling all your errors for complete control of retry-ability. 2. Using predictive retries when you want some "good enough" error handling.

I agree that we can't really be 100% "correct" in determining whether something is retryable. But I think it's a bounded enough problem for a language processor (ala LLM) that can yield good enough results for vast majority of the use cases.

Here's a test case that asserts this behaviour (without any caching): https://github.com/differentialhq/differential/blob/236ffc53...

So in a way, it is bounded enough to yield this result deterministically in our test suite. But I agree there's a non-zero chance it fails, although we haven't observed it yet after thousands of test runs.
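
Not the actual implementation, but a rough sketch of the caching behaviour described above, with classifyWithLLM standing in for the model call:

    // Sketch only: each distinct error "shape" is classified at most once, so
    // subsequent occurrences get a deterministic, cached verdict.
    declare function classifyWithLLM(message: string): Promise<boolean>;

    const retryableCache = new Map<string, boolean>();

    async function isRetryable(err: Error): Promise<boolean> {
      const key = `${err.name}:${err.message}`;
      const cached = retryableCache.get(key);
      if (cached !== undefined) return cached;
      const verdict = await classifyWithLLM(err.message);
      retryableCache.set(key, verdict);
      return verdict;
    }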


> It's an opt-in feature, and it's not default-on.

The documentation sounds like it's a global switch though? This is really a per-function property…

> But I think it's a bounded enough problem for a language processor (ala LLM) that can yield good enough results for vast majority of the use cases.

If as a developer I want this feature I'm not sure why I wouldn't just stick a "retry" annotation somewhere in the function declaration (schema file, export, whatever.) I can see the utility of having the RPC layer handle this for me, particularly for pure compute functions. That's "good enough" error handling: the developer correctly identifying which functions are retryable, erring on the side of caution, and getting automatic retries for those (and just those.) Other RPC errors bubble up.

What benefit AI gives here — especially in this form, hidden in the RPC layer — I don't see at all. If I want AI to help me identify which functions are retryable, I'd rather have it do that in my editor by visibly sticking the "retry" attribute on the function. That way I can fix it if it mis-guesses.

Also: what do you do if with your RPC implementation the guess is wrong and a function is retried when it really shouldn't be? What's the fix?


> The documentation sounds like it's a global switch though? This is really a per-function property…

It is a global switch per cluster, yes. Agree that it should be a per-function property, and we're working towards that. It's currently in beta.

> What benefit AI gives here — especially in this form, hidden in the RPC layer — I don't see at all. If I want AI to help me identify which functions are retryable, I'd rather have it do that in my editor by visibly sticking the "retry" attribute on the function. That way I can fix it if it mis-guesses.

I think you're mistaking "identifying functions" with "identifying failures". The AI does not identify which functions are retryable. That's your job as a developer. The AI helps with figuring out which *failures* are retryable.

You can't possibly account for all retryable cases in your code. Some exceptions happen in underlying libraries, database drivers, and at the OS level, that you might not have seen before.

ECONNRESET? Yes

ECONNREFUSED? Yes

429? Yes

400? No

DatabaseDeadlockError? Yes

PGPoolConnExhausted? Yes

> Also: what do you do if with your RPC implementation the guess is wrong and a function is retried when it really shouldn't be? What's the fix?

Good question. If there's a function that you definitely can't afford to retry, there's an idempotence helper [1]

[1] https://docs.differential.dev/advanced/idempotency/

edit: typo


Woof, yea that sounds really, really dangerous.


Usually type safety isn't the issue, rather pretending the network doesn't exist.

RPC, even if running on the same machine, is distributed systems land.


This might be an interesting project. "Type safe RPC" is not a great description to lead with as remote procedure calls have a large amount of negative associations that have little to nothing to do with static type systems.

I think it's a distributed language runtime where you write functions in typescript and the runtime deals with passing data between machines.

The website is quite good at presenting the system as a nice thing that does nice things and you should like it and it's all very nice. It does a very poor job of convincing me that it works properly. I'd want to know how the scheduler works and how it handles mutable state in the presence of network partitions.

Also just found a caveat which amounts to they've got the implementation wrong:

> Arguments must be JSON serializable. This means that you can pass strings, numbers, booleans, arrays, objects, and null. You cannot pass functions, promises, or other non-serializable objects


Author here.

> It does a very poor job of convincing me that it works properly. I'd want to know how the scheduler works and how it handles mutable state in the presence of network partitions.

Thank you. This is valuable feedback, and we will definitely work this into the docs.


It's probably an audience thing.

The current pages are a strong sales pitch to semi-technical management. Look at all the things you don't have to implement!

I read this looking for indications that you've built the complicated thing correctly. Do you understand the maths involved, have you implemented it in a fashion that inspires confidence.

A distributed language runtime that works properly is a valuable thing. One that kind-of works when held carefully is a liability that people would regret trying to build on.


What do you mean by “they’ve got the implementation wrong”?


> You cannot pass functions, promises, or other non-serializable objects

Functions are serializable. And deserializable. You can absolutely take one defined on one machine, kick it over the wire, then execute it on another one. Likewise closures, you serialise the associated state as well.

Promises are the same thing - a distributed language runtime should be exactly the sort of graph execution system which lets you pass promises between machines and have things work out.

Further, they don't say what other "non-serializable" things are. Is a DAG serialisable without spuriously unrolling it into a tree and failing to fold back into a DAG at the other end? How about a graph? Given that all things are serializable it is difficult to guess what things they haven't implemented.

So by "they've got the implementation wrong", I'm saying they haven't implemented the tricky part of serialisation, only the copy ints over wires part. And by "wrong", I'm saying that a language runtime dedicated to transparently running typescript functions on other machines should be able to run those functions on other machines even if they have closed over state.


Author here. Absolutely agree with what you're saying.

Maybe the line could be written better. It is supposed to say: You can't pass functions and promises. And you can't pass in other data types that are not serialisable.

I definitely did not intend to say that functions or promises are not serialisable, and it is something we're looking into. The complexity is with serialising the closure state.

The other thing we're implementing is constraining the data types via the type system itself.

edit: punctuation


All data is serialisable. I'd say that's axiomatic. Data is information with a concrete representation.

At present the docs say it can serialise things that it can serialise with a couple of examples on each side.

My best guess is there's some limitations caused by the stack under your implementation that you haven't worked around, but I don't know what those limitations are or which ones you plan to resolve.

In addition to what can be sent there's a load of design choices about what is sent.

Is aliasing within the original preserved?

Is potential aliasing introduced?

Is it compressed in transit?

Is it delta encoded with respect to information that was sent previously?

Is there some initial blob of information every instance starts up with that messages are deduplicated against?

And that's before the behaviour on mutation - when one machine changes the data, how and when is that change reflected on other machines?


> All data is serialisable. I'd say that's axiomatic. Data is information with a concrete representation.

I do not disagree. I think the problem comes from us using the term serializable too liberally.

> My best guess is there's some limitations caused by the stack under your implementation that you haven't worked around, but I don't know what those limitations are or which ones you plan to resolve.

Agree. Feedback received, and thanks. This will be worked in.

But I'll attempt to answer your questions here:

> Is aliasing within the original preserved?

No. The data being sent over the network is copied by value, not reference - if that makes sense.

> Is it compressed in transit?

Yes. With Message Pack.

> Is it delta encoded with respect to information that was sent previously?

I'm not sure I follow, but all calls are independent of each other.

> Is there some initial blob of information every instance starts up with that messages are deduplicated against?

> And that's before the behaviour on mutation - when one machine changes the data, how and when is that change reflected on other machines?

Not sure if this helps, but this [1] talks about the architecture. There's a C&C. The control-plane is the orchestrator for all machines.

[1] https://docs.differential.dev/advanced/architecture/


I'm guessing, but it should be possible to pass anything remotely as a proxy with an ID: calls to a closure or the resolution of a promise cause further RPC, and sending it back recovers the original value.


Not sure, but functions are serializable.


It's technically correct, but really, we all know what was meant. They're serialisable but not really deserialisable; even if it's a pure function, it might not even be serialised in a version of JS that the receiver understands.


> serialised in a version of JS that the receiver understands

I don't think the use case for this is as general as most RPC systems.

The linked site envisions a situation where you had an app running on one server, now you want to break that workload up while making the absolute minimum number of code changes.

All the clients and code would still be controlled by the same team.


That's still not something I'd want to inflict on myself. It means you need to somehow ensure the functions you pass are pure, and that upgrading Node or changing the compilation target could break the app if the release rollout is progressive.

Using function serialisation for applicative purposes is always going to be a hack that will come back to bite the team at one time or another.


The site lists all those features but it's not clear what problem it's trying to solve. Only in the FAQ at the end of the page does it say that it's basically a service mesh with added bells & whistles, from what I understand.

But I'm still confused:

>Monolithic codebases don't have to result in monolithic services.

>Why would I want to replace straightforward function calls with network calls in my monolith?

>By using a centralised control-plane, Differential transparently handles network faults and machine restarts with retries, all without changing your existing programming paradigm. It is designed to be a drop-in replacement for any function call that you'd like to make distributed and reliable.

"Centralization" and "reliability" don't sound like words that often come together. What if the control plane goes down? Unless they have a cluster? It's all self-imposed problems anyway: if we don't have network calls in the monolith, we don't have all those problems in the first place (which need to be solved with retries etc.)


Thanks for the feedback. One of the authors here.

> Why would I want to replace straighforward function calls with network calls in my monolith?

I agree that there's nothing inherently virtuous with replacing a functioning local call with a remote procedure call. But if you decide to break your monolith up for any one of the reasons people tend to follow service-oriented architecture, then Differential reduces the friction.

Here are our thoughts on this: https://docs.differential.dev/advanced/soa/

> "Centralization" and "reliability" don't sound like words that often come together. What if the control plane goes down? Unless they have a cluster?

Yes, there's a cluster.


The problem though is that you're taking the known failure mode interpretation of microservices (what we call a distributed monolith) and making it "easier" by adding significantly more complexity. As many of the other comments here have said, this has been done before. You are repeating the mistakes of DCOM and gussying it up with a modern-looking marketing page.

You demonstrate a lack of understanding of idempotence on your marketing page, which is enough to dismiss your service outright. With all due respect, I'm sure you've put a ton of work into this, and you may even manage to sell it to some teams, but this has been tried and failed so many times in software development's history. It won't be any different now. This is very likely to cause long term harm to any team that adopts it.

For anyone considering it, please make sure you actually understand idempotence and autonomy. Look into event sourcing. Study the fallacies of distributed computing. Check out the work of Udi Dahan and Scott Bellware. Run, do not walk from anything purporting to do the things that this does.


Thanks for the time and effort you put into writing this comment.

> As many of the other comments here have said, this has been done before. You are repeating the mistakes of DCOM and gussying it up with a modern-looking marketing page.

Respectfully disagree, it's not really comparable to DCOM and CORBA. It contains a control plane which is stateful [1]. CORBA does have an ORB [2], which is close, but the specification does not allow for durability.

I do apologise for having a modern looking marketing page.

> You demonstrate a lack of understanding of idempotence on your marketing page, which is enough to dismiss your service outright.

I think we discussed this elsewhere in the thread where I gave you the implementation details that prove the claims of the marketing page.

> With all due respect, I'm sure you've put a ton of work into this, and you may even manage to sell it to some teams, but this has been tried and failed so many times in software development's history. It won't be any different now. This is very likely to cause long term harm to any team that adopts it.

It's actually open-source [4] and self-hostable, so teams don't have to "buy" it. I admire your confidence about this failing, and you might be right. But the fact that this has been tried so many times in software development's history tells me that the proper abstraction is somewhere out there. If everyone refused to take a new perspective on something that failed before, we probably wouldn't be very successful as a species. Everything is a remix, after all.

> For anyone considering it, please make sure you actually understand idempotence and autonomy. Look into event sourcing. Study the fallacies of distributed computing. Check out the work of Udi Dahan and Scott Bellware. Run, do not walk from anything purporting to do the things that this does.

Wise words, but I'd appreciate it if you could dive into the way things are [1] [3] a little bit, and let me know if you hold the same views.

[1] https://docs.differential.dev/advanced/architecture/

[2] https://www.sciencedirect.com/topics/computer-science/object...

[3] https://docs.differential.dev/getting-started/thinking/

[4] https://github.com/differentialhq/differential/


This would likely require a longer, higher bandwidth conversation than we could or should do in a comment thread. If you're interested, you can join me in the Eventide Slack [4], but I will warn you there's already some discussion about this offering (though, nothing I haven't said here). But, I'm here, so... :)

When I say it's like DCOM/CORBA, I'm referring to the notion of RPC. Specifically, the notion that it's a good idea to perform synchronous remote calls as a matter of course. It significantly reduces autonomy. In my opinion, we should not be encouraging people to build distributed monoliths, even if we can make it look like a regular monolith.

Monolithism is the problem [1] [2], and distributing that monolith only solves technical issues, but not the productivity ones that typically matter more. Distributing them also creates technical issues in terms of latency and cascading failures.

The existence of a control plane doesn't impact my concern about this, but it does add an additional concern. See smart pipes vs. dumb pipes. [3]

> I do apologise for having a modern looking marketing page.

That's not necessary, I wasn't saying that was bad. It looks nice, but if it's marketing something that can be harmful, that would be the problem. Its aesthetics are a differentiating factor. I wasn't trying to say anything more about it than that.

> But the reason why this has been tried before in the software development history tells me that the proper abstraction is somewhere out there.

Maybe, or maybe we don't pay enough attention to our past learnings and the body of knowledge that came before us.

> Wise words, but I'd appreciate it if you could dive into the way things are [1] [3] a little bit, and let me know if you hold the same views.

I read these, and there's nothing that I would consider new, although it is somewhat surprising to read what effectively amounts to the goal being to add network hops and complexity to monolithic code in order to get scaling and other technical benefits.

> Monolithic codebases provide a great developer experience, but resulting monolithic services often do not.

I disagree [5]. They only appear to provide that to teams that don't yet have the skills necessary to do proper partitioning. I believe that teams, and developers in general, will be better served by learning how to do this. They will unlock long term continuity [6] in productivity that they couldn't have imagined.

> Instead of n services talking to n services...

Here's a hint at the fundamental disconnect. Services shouldn't talk to other services. Period. Not if that can be avoided. See: autonomy. This is where pub/sub comes in. When services have autonomy, developers have productivity, and products are more resilient (at the very least).

[1] https://github.com/aaronjensen/software-development/blob/mas...

[2] https://github.com/aaronjensen/software-development/blob/mas...

[3] https://twitter.com/sbellware/status/1760037861347258416

[4] https://eventide-project.org/

[5] https://github.com/aaronjensen/software-development/blob/mas...

[6] https://github.com/aaronjensen/software-development/blob/mas...


Thanks for the invitation and your engagement.

I do disagree with some of your opinions, but I find them interesting and valuable. I've definitely changed my mind about things before, so you might be right on all of these counts - but our experiences might differ enough to have different opinions.

> Here's a hint at the fundamental disconnect. Services shouldn't talk to other services. Period. Not if that can be avoided.

I agree in principle, but I've not been able to see this followed in a large sprawling engineering organisation.

I will definitely join the slack once I have some distance from this. To be honest, I didn't expect this to be posted on HN at this point in the project - and HN can be somewhat merciless :)

Thanks again for engaging with me. You've inspired me to make some changes - especially in how I position the value prop - and I learned a lot from reading your references. Much appreciated :)


I too appreciate your engagement. I understand that HN can be merciless, and I do try to be direct, which can sometimes be felt as merciless or harsh. I'm trying to give my unadulterated perspective because I care about the outcomes for the industry as a whole.

And yes, I can imagine it was quite a shock to get this much unexpected feedback all at once. You handled it well, from what I've seen.

> I agree in principle, but I've not been able to see this followed in a large sprawling engineering organisation.

Ours isn't large and sprawling (only around 15 devs) but we managed to build and maintain ~100 back end services and ~20 front end web applications and we're only speeding up as we go. It's possible, but it takes effort and it requires doing things that go against "best practices". [1] It also requires significant cultural and managerial work (see Lean, Steven Spear's work, etc.).

Good luck in your product and learning journey. Hopefully we can connect again down the road. Cheers.

[1] https://github.com/aaronjensen/software-development/blob/mas...


> I understand that HN can be merciless, and I do try to be direct, which can sometimes be felt as merciless or harsh.

I definitely didn't see your engagement this way. I believe you entered the conversation and held your ground with good intentions.


How does the idempotency work? I don't understand how functions can be made idempotent without changing or instrumenting their implementation for end-to-end idempotency.


You could have some side channel with an identifier for the RPC. If the server receives duplicates of one RPC, it ignores the extras.

It would require some client support, though. Alternatively, if the RPCs have the same arguments, the framework could choose to ignore seemingly identical ones, though that has its own issues.


Neither method works.

If the framework records the occurrence of the call before the effect of the function, you achieve at-most-once semantics. If it records the call after the effect, you get at-least-once.

The framework might perform its idempotency bookkeeping within the same transaction boundary as the function's side-effects, but this means the function implementation is no longer a black-box. E.g. you can no longer perform arbitrary side-effects while preserving the idempotency guarantees of the framework.
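
A sketch of that "same transaction boundary" idea, assuming the side-effect lives in the same database as the bookkeeping (the Db/Tx types and table are hypothetical):

    interface Tx { query(sql: string, params: unknown[]): Promise<{ rowCount: number }> }
    interface Db { transaction(fn: (tx: Tx) => Promise<void>): Promise<void> }

    // The idempotency record and the side effect commit (or roll back) together,
    // so a retry either sees the record and skips, or sees neither and runs.
    async function runOnce(db: Db, key: string, effect: (tx: Tx) => Promise<void>) {
      await db.transaction(async (tx) => {
        const inserted = await tx.query(
          "INSERT INTO processed_calls (idempotency_key) VALUES ($1) ON CONFLICT DO NOTHING",
          [key]
        );
        if (inserted.rowCount === 0) return; // already handled
        await effect(tx);                    // side effect shares the transaction
      });
    }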


It’s at-least-once. From https://docs.differential.dev/advanced/compute-recovery/

> If a machine fails to send any heartbeats within an interval (default 90 seconds):

> It is marked as unhealthy, and Differential will not send any new requests to it.

> The functions in progress are marked as failed, and Differential will retry them on a healthy worker.

I would guess that “idempotent” functions in the system also take a lease out on the idempotency key. Perhaps they release the lease on failure since they can observe errors thrown, and they commit the key as consumed after success. ¯\_(ツ)_/¯ the docs are not clear on these semantics!


That's how we handle idempotency. It's basically a mutex on the idempotency key with a timeout (in our case using redis with redlock since it's a distributed system). Once the command finishes, the key is marked as handled and the lock is released. At that point any future or queued requests with the same key immediately return.
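
Something like this, with hypothetical helpers standing in for the redis/redlock pieces:

    declare function acquireLock(key: string, ttlMs: number): Promise<{ release(): Promise<void> } | null>;
    declare function isHandled(key: string): Promise<boolean>;
    declare function markHandled(key: string): Promise<void>;

    async function handleCommand(idempotencyKey: string, run: () => Promise<void>) {
      if (await isHandled(idempotencyKey)) return;        // already done: return immediately
      const lock = await acquireLock(idempotencyKey, 90_000);
      if (!lock) return;                                   // another worker holds the key
      try {
        if (await isHandled(idempotencyKey)) return;       // re-check under the lock
        await run();
        await markHandled(idempotencyKey);                 // commit the key as consumed
      } finally {
        await lock.release();
      }
    }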


It sounds to me like there are scenarios where calls to, and certainly effects of, "idempotent" functions can take place twice.

I haven't looked at Redlock for a while, but I'm assuming you're all across the historical objections to it from distributed systems researchers: https://martin.kleppmann.com/2016/02/08/how-to-do-distribute...


Is there really consensus that practical systems need to have lock lease timeouts?


It's a choice between at-least-once and at-most-once right? Without a timeout you get at-most-once, since the lease holder may die off and never complete the task, and so the task is completed 0 times.

I’m building a system right now and we’re going to use lease/heartbeats etc etc to elect leaders and whatnot, all this discussion seems very practical to me.


Exactly. It comes down to whether the consequences of these race conditions are acceptable or not. Sometimes (known and understood) race conditions are okay for performance reasons. Shoot, that's the whole philosophy behind eventual consistency. Sometimes they aren't, such as bank transactions. It's all situational.


Eventual consistency isn't a philosophy, and it's not the idea that race conditions are acceptable. It's a consistency model with specific properties and guarantees.

These races we're talking about don't produce eventual consistency, but inconsistency. Different systems require different levels of correctness, but if you're not familiar with the underlying theory then your trade-offs are going to be uninformed decisions rather than informed ones.


A race condition is the lack of determinism in a sequence of events that can lead to undesirable effects. Eventual consistency is an approach (and consistency model, design philosophy, whatever you want to call it) that accepts this lack of determinism as a compromise because the possible sequences are accounted for, since the system can still act on stale data until an update is fully pushed out (which is still undesirable, but an accepted compromise).


Ok, I think we could do a better job of explaining the semantics.

The system does not implement idempotency by default. If something that's not marked as "idempotent" fails, it does try again on a different machine.

If the function is marked as idempotent [1] then the system ensures at-most-once semantics by issuing the call once and only once to a worker who can process the call.

[1] https://docs.differential.dev/advanced/idempotency/


Yeah, this is what it looks like in the docs; you choose an "idempotency key": https://docs.differential.dev/advanced/idempotency/


"Just decorate your idempotent operations" – implies to me that the idempotency of your functions is your responsibility.


That's one way of reading it - although I find it's at odds with the heading "Idempotency in one line of code."

If it's the case that the idempotency wrapper is only for operations that are already idempotent, it makes me wonder whether the docs couldn't do a better job of communicating the purpose and function. Decorate your idempotent operations... to make them more idempotent?


Not quite. The decorator [1] requires you to provide an idempotency key, if idempotence is needed.

The control-plane intercepts that and implements idempotence for you.

[1] https://docs.differential.dev/advanced/idempotency/


Yes, quite. Unless my function has no side-effects, idempotency will always be my responsibility. This feature is a lie and an outright hazard. It should not be called idempotency and it should have a big bold warning stating that the function may be called more than once.


I'd appreciate it if you could point out how the function can be called more than once. Here's the source code for the orchestrator [1] and a demonstration of it [2].

If there's an error in the source code or the copy, I'll happily retract it.

[1] https://github.com/differentialhq/differential/blob/236ffc53...

[2] https://github.com/differentialhq/differential/blob/236ffc53...


Without looking at the code, I can tell you that this scenario will force one of two outcomes:

1. Invoke function that has two side effects

2. Function does first side effect

3. Pull the computer's plug out of the wall

A framework can only do one of two things:

A. Invoke the function again

B. Not invoke the function again

If you choose A, you have at least once processing

If you choose B, you have at most once processing

Neither of these are exactly once, therefore, neither one of these are idempotent unless the first side effect that the function performs is also idempotent. That is, you must be able to retry that one.

I believe you mention exactly once processing in another comment. It's this. The massive lie that many of these types of frameworks tell is that they can manage idempotence for you. Aside from functions that are side effect free (that is, pure, that is referentially transparent, that is can be made mathematically idempotent) you cannot possibly make them idempotent. It's the two generals problem.

So, I believe what you have implemented is idempotence at the sender with the reservation pattern. This is not the same as idempotence at the processor which is literally the only thing that matters. Idempotence at the sender is simply a performance (or storage, in the case of event sourcing) optimization.

Please correct me if I'm wrong on any of this, but if I am, then clearly my understanding of the two generals problem and distributed systems are wrong and I'm going to have to do some significant rethinking of our system's architecture.


I don't think you're wrong about any of this. I just feel like we're talking over each other. You clearly demonstrate an understanding of the domain, so I'm happy to engage. After all, if we fail, we want to fail fast.

Maybe we can focus the conversation on the scenario you mentioned above.

> Neither of these are exactly once, therefore, neither one of these are idempotent unless the first side effect that the function performs is also idempotent. That is, you must be able to retry that one.

So, what I'm saying is: yes - the first side-effect (or any side-effect) that you put through the system can be made idempotent through the same tooling that makes the caller idempotent.

  // networkCall stands in for any remote side-effect.
  declare function networkCall(n: number): Promise<number>;

  async function foo(n1: number) {
    return networkCall(n1);
  }

  async function bar(n2: number) {
    return networkCall(n2);
  }

  async function foobar(n1: number) {
    const n2 = await foo(n1);
    const n3 = await bar(n2);
    return n3;
  }

  async function main() {
    await foobar(42);
  }
Given the above, the framework allows you to wrap foo and bar independently in higher-order functions which will not issue duplicate function calls for the same idempotency key. Therefore, you can call foobar repeatedly, and get the same result.


Understood, and perhaps the difference here is more about what I said in my other comment. I don't see this as particularly useful, and I see it as deceptive. What matters is that when we intend to do something, it happens. We actually want retries. We want exactly-once processing with at-least-once handling. That's the only way that I know of to build resilient systems.

Our back end components literally crash when they fail. No other processing is done. They crash and if there is a bug they will keep crashing. Because our back end components have autonomy, this results in a very narrow service disruption. We fix the issue, identify the root cause and eliminate it. As a result, we have extremely resilient systems that, even in the face of 3rd party downtimes, we can know that every command that gets submitted will be effected. Dead letter queues or anything of the sort are anathema. We try until we succeed. Idempotence is considered in every single handler all the way from the inside to the outside as it must be, mathematically.


I see elsewhere you mention that you use at most once processing. So, I was possibly mistaken that the function can be called more than once assuming you did this right (cleared the job before even invoking the job). My primary point still stands: this is not idempotent, at least not in the way that matters. This is single-try processing. If all you are doing is returning a synchronous error to the user so that they can retry (and again, every other operation leading up to the one that failed to process is actually idempotent) then this would not be catastrophic. The issue is when people start to think that this actually is idempotence, i.e., they attempt to reuse an idempotence key across requests when the first one straight up failed (or not, who knows).

In other words, don't use the idempotence helper if you care about the thing happening. If it's optional, sure, go for it. You could rename it to "tryOnce", or "maybeRun" rather than "idempotence" and it'd be more accurate / less misleading.


> You could rename it to "tryOnce", or "maybeRun" rather than "idempotence" and it'd be more accurate / less misleading.

This is valuable feedback. Thank you.


Here are the docs on idempotency: https://docs.differential.dev/advanced/idempotency/


You mentioned elsewhere that you're still finding the right wording for the product and its benefits. Fair enough. Personally, I think you should probably stop referring to this as idempotency. Maybe describe it as a retry policy, or as a choice between at-least-once and at-most-once. I expect most people with knowledge of - and respect for - distributed systems theory will be completely put-off by this wording.

In distributed systems, the reason idempotency is such a Big Deal™ is that it can be combined with at-least-once delivery to achieve exactly-once semantics. This isn't that. What you're describing as idempotency has the potential to mislead users, especially since you use examples like credit card processing.


Thanks for the thoughtful reply.

You might be on to something here:

> Maybe describe it as a retry policy, or as a choice between at-least-once and at-most-once

and we will consider changing it because I agree that conflating the concepts (or the appearance thereof) is not worth diluting the other parts of the offering.

I still struggle to find the difference between this approach and the likes of Stripe [1] and Temporal [2] because in practice, it yields the same result.

[1] https://docs.stripe.com/api/idempotent_requests

[2] https://www.restack.io/docs/temporal-knowledge-temporal-io-i...


One difference is that you are specifically creating distributed systems middleware, where your audience is going to be using your product in the implementation of a system of their own, and is more likely to understand the terminology and wish to know the exact guarantees and trade-offs being provided.

I find both those comparisons are apples to oranges:

1. Temporal does not claim, on that page, to be able to automatically make your calls idempotent. Instead, it is explaining how you - as the user - can and should write your activities to be idempotent. Take the payment code snippet: you can't just wrap `ExternalPaymentAPI.Process(paymentId, amount)` in an idempotency provider. You have to implement specific idempotency logic inside the activity. This is in line with the distributed systems theory of idempotency.

2. In these docs, Stripe describes an interface for achieving idempotency as an external caller. Since Stripe control the implementation of their own system, it is certainly possible that they have implemented their operations in a way that is truly idempotent. What your docs describe - the ability to make arbitrary operations idempotent by wrapping them - is simply not possible.

I wonder whether you're focusing on the idempotency key as the thing that is wrong here. There is nothing wrong with idempotency keys themselves - they are a great way for APIs to provide idempotency. The problem is that this guarantee can't be provided by a wrapper layer, it requires the operation itself to be written in an idempotent way. This is an example of the age old "end-to-end principle" from networks class.


Ok, yes - I'm with you.

I now realise how this statement could be misleading.

> To mark a function as idempotent, simply wrap it with the idempotent function.

When initially writing it, I thought it was a "given" that the wrapped functions are the responsibility of the developer. Now I realise this can also be interpreted as "we make your functions idempotent".

Thanks for the detailed breakdown. We will work on the docs to make it clearer.


If that was the assumption, then this whole thing makes even less sense. If your operation is already idempotent in its implementation, then wrapping it with this function is the last thing you'd want to do. By introducing at-most-once semantics on top of the call, it would defeat the point of making the operation idempotent to begin with.

The scenario where you'd want to use this `tryOnce` policy is probably the following:

1. You have an operation that is not idempotent and can't be made idempotent, for whatever reason.

2. You would prefer the failure mode of that operation to be that the effect doesn't take place, rather than that the effect takes place more than once.

BTW, you don't need to have an answer for everything. I don't think there has been any miscommunication or misunderstanding of the docs. I think people in this thread, including myself, have accurately identified that you haven't fully thought some of this stuff through. That's fine and normal for a product in this stage, but not being willing to admit it is less of a good sign. Good luck with the startup.


> If your operation is already idempotent in its implementation

No, this is not at all what I meant. I don't think there's any value in introducing at-most once semantics to an already idempotent function.

I'm definitely not trying to have an answer for everything. But I guess we can leave it at that, at this point.


If it "feels like local functions", how do you handle errors from the RPC layer / network?


This really should be addressed in the docs and I can't find it either. There's some stuff about retrying functions that failed sometime later which seems very different to local functions and very like not-local functions.


It's Typescript so presumably it makes the functions async to handle network delays, and errors cause exceptions to be thrown.

Doesn't seem like a big issue to me.


That isn't even scratching the surface of what can go wrong on distributed systems.


I agree that OP only reduced his statement to a subset of the things that can go wrong. But the error handling mechanism is uniform across any errors that can be caused.


Therefore the function has its own failure modes and also a set of extra ones from the execution framework, folded into an exception hierarchy?


That is how exceptions always work in these sort of languages. I think it's one of the reasons exceptions suck as an error mechanism (checked exceptions are the "answer" but hardly anyone uses those - so few in C++ that they were removed from the language!), but it isn't a new problem introduced by this framework.

In any case Typescript doesn't have checked exceptions so you're pretty much stuck with "any error can happen anywhere" in Typescript whether you use this framework or not.


So how does that map to an operation that was done with success, but the reply was lost?


If the reply is lost, then the caller eventually times out, which causes an exception.

There's a bit of nuance here about where the reply got lost:

1. Between the control-plane and the callee

2. Between the control-plane and the caller

But broadly, unless idempotency[1] is requested, it can re-run the function.

[1] https://docs.differential.dev/advanced/idempotency/


Yeah, but that is the thing: in many cases re-running is not an option, as most services don't guarantee idempotency.


We might be talking over each other here. But your earlier point was about exceptions, and I've tried to answer it in that context.

Yes, most services do not guarantee idempotency. But your service could if you use the idempotency utility that Differential supplies.

If you don't care about idempotence in your service calls, you could let it re-run. I believe most service calls don't care about idempotence.

- Reading stuff from a database? It can safely be re-run.

- PUT stuff to an external source, all good.

- Creating a new resource? Well, the failure mode is no better than what your system would do with an equivalent service call via another mechanism. So you need to handle it / write it in a way that ensures idempotence / be ready not to care.

- Creating an order, or charging a credit card? Yes - that needs to be handled and can be handled with the idempotence helper.


It is at the function-call level, which is all this is really trying to solve.

You can use normal code on top of this to solve the bigger problems.


Was it called once or multiple times, did a network split happen, was it a timeout, is the server side idempotent, was the operation actually performed and only the reply lost, did it only take too long due to heavy traffic, ...?


All timeouts result in exceptions that the client can (or should) handle.

Server side can be made idempotent [1]. And there's a control-plane which acts as the orchestrator [2].

It's possible the operation was performed and the reply got lost. But if there's insufficient data on whether it was actually performed, it can be performed again, unless idempotency [1] was requested.

[1] https://docs.differential.dev/advanced/idempotency/

[2] https://docs.differential.dev/advanced/architecture/


Only works when there is control over the source code from both sides, and even so, not everything can be made idempotent.


There _is_ control over the source code from both sides, yes.

> Not everything can be made idempotent.

Can you expand on this, please? I'd appreciate it.

If you mean to say that true idempotence can never be achieved, then you're right [1].

If you mean to say that the framework doesn't allow for the same level of guarantees that other widely used distributed system tools (queues, pub-sub, idempotency key headers) allow, then I'd challenge that notion.

[1] https://bravenewgeek.com/you-cannot-have-exactly-once-delive...


Those are things that could go wrong, sure. I don't think this framework is claiming to make all of those impossible. What's your point?


Author here. This is correct.

Any exception eventually bubbles up to the caller, who can handle this in a try/catch.


Is this just like trpc? https://trpc.io/

Or like rspc (in Rust) https://www.rspc.dev/


Close. Think of it like trpc / rspc with a service mesh that orchestrates calls between services.

It _does_ put an extra hop in your network calls, but it allows some of the benefits of orchestration.

Here's an opinionated take: https://docs.differential.dev/getting-started/thinking/


I think they’re probably paying more attention to gRPC and GraphQL. They have a cloud offering whereas trpc seems to have a smaller scope and prioritize supporting every client and server.


If I understand the idea correctly, differential is for backend only (microservices architectures). There is nothing about browser support in the docs.


The downside of an approach like this is that it limits you to a single programming language on both the frontend and backend. This is exactly the reason RPC frameworks like gRPC use an IDL.


I agree, and disagree.

I agree that it limits the usage of the tool in a polyglot environment. I disagree that it's a downside.

The absence of an intermediary language does give some benefits to the first class citizen (in this case, Typescript).

However, there are some other developments [1] which attempt to make the Typescript type system an IDL to allow for better interop.

[1] https://github.com/Aleph-Alpha/ts-rs


Well, that's not fundamental. There are plenty of projects that support generating typed bindings for libraries in one language for use by programs in another language. You could do the same thing here.


Fundamentally you go against the grain by generating an IDL from source code. I've never seen this done properly such that it is an improvement over just authoring the IDL in the first place and generating from that.

If your IDL isn't a first-class citizen then what do you expect the derivatives to be?


I'm curious if people's qualms around abstracting the cloud/network also apply to the https://modal.com product. Differential seems like a similar project at its core, just focused on the Typescript + microservices ecosystem.


It's an interesting/useful concept and most big tech co's I've worked at have some form of this internally.

But I think it's a little dangerous to market this as "feels like local functions". It glosses over a lot of really important technical considerations (like, the network and all the dragons that come with it) which will eventually bite people who don't understand that, and will immediately turn people off who do understand that.

I'm not a marketing or copy expert by any means, but what might be a neat product is (rightfully) getting criticized in the comments here for the positioning.


One of the authors here.

Completely agree. To be honest, we didn't intend to have this many eyes on the product as we're still iterating (both the abstraction and the copy).

We've gone through a lot of iterations on the copy itself, and I agree that there might be a better framing here which works.


Good luck! I do think there’s real value here, and have often thought about building very similar stuff


Thank you for the words of encouragement.


How would this work in a serverless scenario where handler side runs in a Lambda for example?

It seems a bit of a chicken and egg in this case.

A service needs to register itself with a control plane. But for the service to start it needs to get a request (via lambda invocation).


Fair question.

> But for the service to start it needs to get a request (via lambda invocation).

A service can also start by manually `.invoke`-ing the lambda.

The control-plane will start the lambda function when there's work. Lambda "asks" for work to do. Once work has finished, lambda function exits.

When deploying, we start the lambda function once so it can come out for air and advertise itself to the control-plane.

This is an affordance we only do for lambda, and it's currently in development with our deployment offering here [1]

[1] https://github.com/differentialhq/differential/blob/236ffc53...
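
For a rough idea of the shape of this, a pull-based Lambda handler might look something like the following - advertise, pollForJob, reportResult, and runLocalFunction are hypothetical stand-ins for the control-plane protocol, not the real API:

    declare function advertise(serviceName: string): Promise<void>;
    declare function pollForJob(): Promise<{ id: string; fn: string; args: unknown[] } | null>;
    declare function reportResult(jobId: string, result: unknown): Promise<void>;
    declare function runLocalFunction(fn: string, args: unknown[]): Promise<unknown>;

    export async function handler() {
      await advertise("my-service");   // the worker "comes up for air" and registers
      let job = await pollForJob();    // Lambda asks the control-plane for work
      while (job) {
        const result = await runLocalFunction(job.fn, job.args);
        await reportResult(job.id, result);
        job = await pollForJob();      // drain remaining work, then exit
      }
    }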


One of the authors here.

Ok, this blew up, and I appreciate the engagement. We will attempt to get to all the questions.

We are still polishing the initial offering, so if the documentation seems lacking, it's because it very much is so. Honestly, we did not anticipate having this many eyes on the project this early.


Ok… but how does it work? Too much magic.


This (and the linked docs) might help understanding how it works: https://docs.differential.dev/getting-started/thinking/


The key word here is “feels.” I would not trust feelings when it comes to software systems’ reliability.


The "feels" part is around the ergonomics of function calls. It doesn't pertain to system reliability.


Huge congratulations on the launch!


Thanks Matt :)


Is it suitable to use in the browser-server architecture? I think the documentation lacks scenarios or use cases, which should also be included on the home page.


It is theoretically possible to be used in a browser-server setting, but it's not something we're optimising for at the moment.

Thanks for the feedback on the use-cases. Will work that into the website.

In the meantime, here's a write up on our value prop: https://docs.differential.dev/getting-started/thinking/


How is this a company?


It's really not a company (at least not yet).

We've been a bunch of devs hacking together on this for a few months (~6). And we definitely haven't had the blessing (or the curse) of ZIRP.


A ZIRP phenomenon.


The feature pitch is one long case of loltears. I needed the chuckle.

"Kids, get off my lawn."


phone rings

"Sir, its The Fallacies of Distributed Computing for you, on line 1!"


I'm more than happy to engage you on the fallacies [1] that you point to and how they are considered in the context of Differential. But as your earlier comment would show, I don't think you're engaging in a good-faith conversation.

So, I'm happy to leave it here and wish you a good day.

[1] https://en.wikipedia.org/wiki/Fallacies_of_distributed_compu...


REST is enough


Can definitely appreciate this sentiment, and it's one we talk about [1].

[1] https://docs.differential.dev/advanced/comparisons/#comparis...



