
Wouldn't there be latency overhead between the workers (edge) and PostgreSQL? I guess the post alludes to relying on caching on the edge to mitigate such round trips.

But I suppose that as long as you are dealing with data that is not easily cached, the bottleneck with serverless edge will still be between the worker and the database?


+1 for this. The great thing is you still have the option to incrementally opt for SSR and/or enriching statically generated pages with build-time data.


Is my understanding correct in that a "Pane" is essentially a UI component? And I am assuming there are tests to catch cases where a new pane is defined in the backend, but not in the SDK (or I suppose, how do you deal with any discrepancy)? Do you lazy-load the components?


> Is my understanding correct in that a "Pane" is essentially a UI component?

Yes — it is essentially a UI component (and the only node that represents UI).

> And I am assuming there are tests to catch cases where a new pane is defined in the backend, but not in the SDK (or I suppose, how do you deal with any discrepancy)?

Great question! Because we use proto, the SDK and the backend share the same pane definitions. So we can't have a case where they don't match in the repo. However, SDKs that have already shipped obviously can't be updated with new panes. To handle this, we have a piece of the backend service called the "dispatcher" whose job it is to look at the initial /start request and determine which workflow should be used for that specific client.

Let's say a very old Android SDK is issuing a /start request. The dispatcher knows the version of the SDK requesting a workflow, can programmatically determine which major version of workflow graphs it supports, and picks an appropriate (older) workflow to send back, one where we know all panes are supported by the SDK.

We also have an escape hatch where the backend can tell an older SDK "actually, you're so old you should fall back to a webview implementation to get the latest and greatest experience".
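To make the idea concrete, here's a minimal sketch of that version-based dispatch logic. Everything here is hypothetical: the version table, the function names, and the webview cutoff are illustrative stand-ins, not the actual service's API.

```python
# Hypothetical sketch of the "dispatcher" described above: pick a workflow
# graph version based on the SDK version in the /start request.

WEBVIEW_FALLBACK = "webview"

# Illustrative table: minimum SDK version required for each workflow major.
WORKFLOW_SUPPORT = {3: (2, 0, 0), 2: (1, 4, 0), 1: (1, 0, 0)}
OLDEST_SUPPORTED_SDK = (1, 0, 0)

def dispatch(sdk_version: tuple) -> str:
    """Return the newest workflow major the SDK supports, or the webview fallback."""
    if sdk_version < OLDEST_SUPPORTED_SDK:
        return WEBVIEW_FALLBACK  # "you're so old, use the webview"
    for major in sorted(WORKFLOW_SUPPORT, reverse=True):
        if sdk_version >= WORKFLOW_SUPPORT[major]:
            return f"workflow-v{major}"
    return WEBVIEW_FALLBACK
```

An SDK at version (1, 5, 0) would get `workflow-v2` back, while anything older than the cutoff gets pointed at the webview.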


I'm quite curious about your scale / size of your team.

Most of the moving parts for k8s are easily handled if you go for a managed service.

Using Nomad alone is fine, but with Consul in the mix, it requires quite a bit of setup (especially for some degree of HA), and from my experience, it is no easier than using something like k3s.

Overall, from my perspective, it just seems like Nomad + Consul only fits two kinds of org.

Either an org small enough that HA is not a concern, so setting it up and running it "on-premise" is trivial. Or an org large enough that you can have a team dedicated to setting it up and managing it to ensure it meets various SLAs.

Genuinely curious to know what your experience has been like, and if it matches up with this.


We have about 1.5 people doing ops work, but no one is on it full time. I'm the main person responsible, and I have someone who helps me when he's interested. We are 5 developers.

Our whole platform is between 10 and 50 EC2 machines running a Nomad cluster; Nomad manages our Docker containers, with services backed by RDS.

I think managed services were in their infancy when we did our initial research back in 2017/2018. Tectonic + Kubernetes with CoreOS looked promising, but they were bought by Red Hat and probably rebranded/merged/disappeared into OpenShift. EKS was in beta and only available in the US (we're in the EU).

We did try Rancher but we hit issues with it.

I don't know if K3S existed yet, but just looking at the diagram on their website it does look quite interesting.

We launched Consul first and started defining all of our services, and after that we started moving applications into Nomad.

HA has been quite easy with Terraform on EC2. We build "golden images" with Packer and launch them with Terraform. Upgrading Consul means adding 3 new servers, making sure things are stable, and then removing the 3 old ones.
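That add-then-remove rolling pattern can be sketched abstractly. This is just an illustration of the pattern, under the assumption of some `healthy()` check; in practice the launch/terminate steps are Terraform applies against EC2, not Python.

```python
def rolling_replace(old_servers, new_servers, healthy):
    """Add new servers alongside the old, verify stability, then retire the old.

    Mirrors the upgrade pattern described above (3 new in, verify, 3 old out).
    """
    cluster = list(old_servers) + list(new_servers)  # both generations running
    if not all(healthy(s) for s in cluster):
        raise RuntimeError("cluster unstable; keep the old servers")
    return [s for s in cluster if s not in set(old_servers)]  # retire the old
```

The point of the pattern is that the cluster never dips below quorum: you only remove the old generation once the enlarged cluster is confirmed stable.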


> Most of the moving parts for k8s are easily handled if you go for a managed service.

Which large parts of Europe currently cannot do, or are at least unwilling to risk doing, due to Schrems II. Amazon, Google, and Azure are the three providers with the best managed Kubernetes services, or at least those that require the least involvement from the user.

Otherwise I completely agree: Kubernetes is less of an overhead if you can rely on managed services. I do question the idea that Nomad can't easily do HA, though. If you can build an HA environment for Kubernetes, then you can just as easily do the same for Nomad.


Agree on all points (and definitely, a mostly happy customer too).

(3) was the cause of numerous production incidents for us. We had to contact support to have it scaled up, and sometimes they'd take up to 3 working days to get back to us. We happily paid AWS more to get better support and stability.


> The functions take about 10s-15s to execute on cold start

This may be a bit of an exaggeration, and may vary depending on your deployment (from experience, it takes up to 5s at most), but I agree. There is a very noticeable cold-start time, which makes it unsuitable for any business-critical services. It is probably only good for things like document conversion.


This is also something you can optimize down to a small number of seconds. Crazy things such as bundling + minifying Node.js apps can make a difference. However, that may make debugging difficult.

There is some latency in the infrastructure, but I have found that most of the delay is pulling the image and actually starting the app. So small images with few layers and a fast boot-up will help a ton.


I'm quite curious about the URL service when deploying the server onto Kubernetes. How is the public `waypoint.run` able to access deployments in a Kubernetes cluster? I know it uses Horizon. But, are the requests somehow proxied through the deployed Waypoint server?


We talk a bit about how this works here: https://www.waypointproject.io/docs/url

That's how it works: our entrypoint reaches back out to the URL service. This feature is optional. On our roadmap page, you can see that we are planning various improvements to continue to make this more secure as well.


Is the source for waypoint.run open source? Can we run our own?

I've not had a chance to dig for the source and it wasn't obvious when I skimmed the project.


Yes it is: https://github.com/hashicorp/horizon and https://github.com/hashicorp/waypoint-hzn

Not we haven’t tried to make that easier to self-run, but we didn’t purposely make it difficult either. It just wasn’t a priority for an initial launch. We’ll continue to improve this going forward.


What you have described is quite similar to what Lyft's Flyte is trying to accomplish: https://flyte.org/

A lot of TensorFlow-inspired DAG approaches handle the described node processing in the same way.


I've used Conductor, Zeebe, and Cadence all in close to production capacity. This is just my personal experience.

Conductor's JSON DSL was a bit of a nightmare to work with, in my opinion. But otherwise, it did the job OK-ish. It felt more akin to Step Functions.

Arguably, Zeebe was the easiest to get started with once you get past the initial hurdle of BPMN. Its model of job processing is very simple, and because of that, it is very easy to write an SDK for it in any language. The biggest downside is that it is far from production-ready, and there are ongoing complaints in their Slack about its lack of stability and relatively poor performance. Zeebe does not require external storage since workflows are transient, replicated using their own RocksDB and Raft setup. You need to export and index the workflows if you want to keep a history of them, or even if you want to manage them. It is very eventually consistent.
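The job-processing model that makes Zeebe SDKs easy to write boils down to a poll/handle/report loop. Here is an illustrative sketch of that shape; the broker here is a plain in-memory stand-in, not Zeebe's actual gRPC API, and all names are hypothetical.

```python
# Sketch of the activate/complete job-worker model: poll for jobs of a
# type, run a handler, then report completion or failure to the broker.
from collections import deque

class FakeBroker:
    """In-memory stand-in for a workflow broker (illustrative only)."""
    def __init__(self):
        self.jobs = deque()
        self.completed, self.failed = [], []

    def activate(self, job_type, max_jobs=1):
        batch = [j for j in list(self.jobs)[:max_jobs] if j["type"] == job_type]
        for j in batch:
            self.jobs.remove(j)  # activated jobs are locked to this worker
        return batch

    def complete(self, job, result):
        self.completed.append((job["key"], result))

    def fail(self, job, reason):
        self.failed.append((job["key"], reason))

def poll_once(broker, job_type, handler):
    """One iteration of a worker loop: activate, handle, complete or fail."""
    for job in broker.activate(job_type):
        try:
            broker.complete(job, handler(job["variables"]))
        except Exception as exc:
            broker.fail(job, str(exc))
```

Because the worker side is just this loop over three RPCs, porting it to a new language is mostly a transport exercise, which matches the point above about SDKs being easy to write.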

With both Conductor and Zeebe, however, if you have a complex enough online workflow, it starts getting very difficult to model in their respective DSLs, especially if you have a dynamic workflow. And that complexity can translate into bugs at the orchestration level which you do not catch unless you run through the different scenarios.

Cadence (Temporal) handles this very well. You essentially write the workflow in the programming language itself, with appropriate wrappers / decorators and helpers. There is no need to learn a new DSL per se. But, as a result, building an SDK for it in a specific programming language is a non-trivial exercise, and currently, the stable implementations are in Java and Go. Performance- and reliability-wise, it is great (it relies on Cassandra, though there are SQL adapters, which are not mature yet).
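The "workflow as plain code" idea can be sketched like this. Note this is a conceptual illustration, not the real Cadence/Temporal SDK: the `Ctx` activity runner and its `run` method are hypothetical stand-ins for what the SDK provides.

```python
# Sketch of workflow-as-code: the orchestration is an ordinary function,
# and branching is just the host language's control flow, not a DSL.
class Ctx:
    """Hypothetical activity runner; a real engine also records history
    so a crashed workflow can be replayed deterministically."""
    def __init__(self, activities):
        self.activities = activities
        self.history = []

    def run(self, name, *args):
        result = self.activities[name](*args)
        self.history.append((name, result))
        return result

def order_workflow(ctx, order):
    """No DSL: conditions, loops, and retries are plain Python."""
    payment = ctx.run("charge", order["amount"])
    if not payment["ok"]:
        return ctx.run("notify_failure", order["id"])
    ctx.run("ship", order["id"])
    return "shipped"
```

The trade-off mentioned above falls out of this design: the engine must intercept and replay arbitrary language constructs deterministically, which is why writing a new-language SDK is non-trivial compared to Zeebe's simple job protocol.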

We have somewhat settled on Temporal now, having worked with the other two for quite some time. We also explored Lyft's Flyte, but it seemed more appropriate for data engineering and offline processing.

As mentioned elsewhere here, we also use Argo, but I do not think it falls in the same space as the workflow engines I have mentioned (which handle the orchestration of complex business logic a lot better than simple pipelines for things like CI/CD or ETL).

Also worth mentioning is that we went with a workflow engine to reduce the boilerplate and the time / effort needed to write orchestration logic / glue code. You do this in lots of projects without knowing it. We definitely feel like we have succeeded in that goal. And I feel this is an exciting space.


Thanks for the thoughtful reply, this is very useful.

The concept of having business users able to review (or even, holy grail, edit/author) workflows was one of the potentially appealing aspects of the BPMN products; did you get a signal on whether there were any benefits? "the initial hurdle of BPMN" sounds like maybe this isn't as good as it seems on the face of it?

Also, how do you go about testing long-lived workflows? Do any of these orchestrators have tools / environments that help with system testing (or even just running isolated simulations of) your flows? I've not found anything off-the-shelf for this yet.


You raised a pretty good point about being able to review the BPMN. I did not immediately think of this, but now that you have mentioned it...

1. It was good for communicating what's happening in the engine room

I remember demoing the workflows within my team and to non-technical stakeholders. It was very easy to demonstrate what was happening, and to provide a live view into the state of things. From there, it was easy to get conversations going, e.g. about how certain business processes could be extended for more complex use-cases.

2. It empowered others to communicate their intent

Zeebe comes with a modeller which is simple enough even for non-technical users to stitch together a rough workflow. The problem is, the end result often requires a lot of changes to be production-ready. But I have found that this still helps communicate ideas and intent.

You do not really need BPMN for this, but if it becomes the standard practice, you now have a way of talking on the same wavelength. In my case, we were productionising ML pipelines, so data scientists who were not incredibly attuned to data engineering practices and limitations were slowly able to open up to them. And as a data engineer, it became clearer what the requirements were.

On the point about testing, the test framework in Zeebe is still a bit immature. There are quite a few tools / libraries in Java, but not really in other languages. The way we approached it was lots of semi-automated / manual QA, and fixing things live in production (Zeebe provides several mechanisms for essentially rescuing broken workflows).

The testing in Cadence / Temporal is definitely more mature, but you do not have the same level of simplicity as with Zeebe. That said, the way I like to compare them: you could build something like Zeebe or even Conductor on top of Cadence / Temporal, but not vice versa.


Temporal/Cadence provide a unit-testing framework that automatically skips time when a workflow is blocked. So you can unit test using standard language frameworks (like Mockito) to inject all sorts of failures, and the tests execute in milliseconds, even for very long-running workflows.
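The time-skipping trick can be illustrated with a tiny sketch: in tests, a virtual clock jumps forward whenever the workflow blocks on a timer, so a 30-day sleep costs no wall time. This only mimics the concept behind those test frameworks; the class and function names are hypothetical, not their API.

```python
# Sketch of time-skipping in workflow tests: sleeping advances a virtual
# clock instantly instead of actually waiting.
class SkippingClock:
    def __init__(self):
        self.now = 0.0  # virtual seconds since test start

    def sleep(self, seconds):
        self.now += seconds  # jump forward instead of blocking

def reminder_workflow(clock, days):
    """Would block for `days` in production; instant under the test clock."""
    clock.sleep(days * 86400)
    return "reminder sent"
```

A month-long workflow finishes instantly under the test clock, while the clock's virtual time still reflects the full duration, so timer-dependent logic can be asserted on.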


> after several attempts to cancel the contract, I cancelled the direct debit.

They did do that.

But the issue is that if it accumulates without a resolution, it will reflect badly on your credit score.

