Drasi: Microsoft's open source data processing platform for event-driven systems

CharlieDigital · 2024-10-20T16:49:58 1729442998

Very interesting choice of using Cypher[0]

In 2014, we built a similar type of event-driven system, specifically for document distribution (a document can be distributed to a target set of entities; if a new entity is added, we need to resolve which distributions match), and also ended up using Cypher via Neo4j (because of the complex taxonomical structure of how we mapped entities).

It is a super underrated query language and while most of the queries could also be translated to relational SQL, Cypher's linear construction using WITH clauses is far, far easier to reason about, IMO.

EDIT: feel like the devs went overboard with the mix of languages. Shoehorned in C# Blazor? Using JS and Jest for e2e testing?

[0] https://drasi.io/reference/query-language/

leeoniya · 2024-10-20T17:10:12 1729444212

> while most of the queries could also be translated to relational SQL, Cypher's linear construction using WITH clauses is far, far easier to reason about, IMO.

https://prql-lang.org/

CharlieDigital · 2024-10-20T17:15:13 1729444513

Didn't look too deeply, but one of the keys with Cypher (at least in the context of graph databases) is that it has a nice way of representing `JOIN` operations as graph traversals.

    MATCH (p:Person)-[r]-(c:Company) RETURN p.Name, c.Name

Where `r` can represent any relationship (AKA `JOIN`) between the two collections `Person` and `Company` such as `WORKS_AT`, `EMPLOYED_BY`, `CONTRACTOR_FOR`, etc.

So I'd say that linear queries are one of the things I like about Cypher, but the clean abstraction of complex `JOIN` operations is another huge one.
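To make the `-[r]-` idea concrete, here is a rough sketch in plain Python rather than Cypher. It treats relationships as stored edges and matches any relationship type in either direction between two labels. The `Node`, `Edge`, and `match_undirected` names and the sample data are made up for illustration; this is not how Neo4j or Drasi implement pattern matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    label: str
    name: str


@dataclass(frozen=True)
class Edge:
    start: Node
    rel_type: str
    end: Node


def match_undirected(edges, left_label, right_label):
    """Yield (left, rel_type, right) for every edge linking the two labels,
    regardless of relationship type or stored direction."""
    for e in edges:
        if e.start.label == left_label and e.end.label == right_label:
            yield e.start, e.rel_type, e.end
        elif e.start.label == right_label and e.end.label == left_label:
            yield e.end, e.rel_type, e.start


# Hypothetical sample data.
alice = Node("Person", "Alice")
bob = Node("Person", "Bob")
acme = Node("Company", "Acme")

edges = [
    Edge(alice, "WORKS_AT", acme),
    Edge(acme, "CONTRACTOR_FOR", bob),  # stored in the opposite direction
    Edge(alice, "KNOWS", bob),          # Person-Person, so it is not matched
]

# Roughly: MATCH (p:Person)-[r]-(c:Company) RETURN p.Name, type(r), c.Name
rows = [(p.name, r, c.name) for p, r, c in match_undirected(edges, "Person", "Company")]
print(rows)
```

In SQL each of those relationship types would typically be its own join table, so the same question would need one join (or one UNION branch) per table.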

UltraSane · 2024-10-20T17:27:21 1729445241

The neat thing about Neo4j is that the [r] isn't a join, it is an actual relationship stored on disk.

refset · 2024-10-21T18:30:26 1729535426

Like a many-to-many join table?

CharlieDigital · 2024-10-22T15:28:18 1729610898

Like a many-to-many set of join tables because the `[r]` can represent any relationship between any two collections.

inkyoto · 2024-10-21T02:53:44 1729479224

> […] Where `r` can represent any relationship […]

… and «-[r]-» can represent any relationship direction, which obviates the need for constructing separate queries for inverse traversing relationships. Kinda like running a compiler forward and backward.

robertlagrant · 2024-10-20T17:16:30 1729444590

We made a health backend partly using Cypher and the only thing I found was the simple queries looked amazing, but as soon as you need to join non-linearly it started looking a lot like SQL again. And when you're using an ORM it stops mattering. And when you need migrations it gets painful!

CharlieDigital · 2024-10-20T17:20:58 1729444858

> but as soon as you need to join non-linearly

At least in our use case, even with some very gnarly 20+ line Cypher queries, it never got to the point where it felt like SQL and certainly, those same queries would be even gnarlier as nested sub-selects, CTEs, or recursive selects, IMO.

Perhaps a characteristic of our model (a taxonomy of Region, Country, Sponsor, Program, Trial, Site, Staff for global clinical trials and documents required by Region/Country/Program/Trial).

UltraSane · 2024-10-20T17:25:28 1729445128

Cypher works really well with a well defined taxonomy.

UltraSane · 2024-10-20T17:24:48 1729445088

"you need to join non-linearly "

What does this mean?

FromOmelas · 2024-10-21T17:22:11 1729531331

presumably it has a semantic model of sorts, defining intrinsic relationships between entities (parent-child, composed-of, sibling-of, and so on)

A bit similar to how certain joins in SQL can be very straightforward with the "USING" clause, or when it can rely on extra information such as analytic views to derive materialized views (vendor specific)

JanSt · 2024-10-20T17:04:29 1729443869

I too have great memories of cypher. Such an elegant way to write queries.

CharlieDigital · 2024-10-20T17:11:45 1729444305

If you haven't been following it, I recently found out that it is now supported in a limited capacity by Google Spanner[0]. The openCypher initiative started a few years back and it looks like it's evolved into the (unfortunate moniker) GQL[1].

So it may be the case that we'll see more Cypher out in the wild.

[0] https://cloud.google.com/spanner/docs/graph/opencypher-refer...

[1] https://neo4j.com/blog/cypher-gql-world/

otterley · 2024-10-20T17:41:47 1729446107

Looks very Azure-centric. Both installation guides (https://drasi.io/how-to-guides/install-sample-applications/b... and https://drasi.io/how-to-guides/install-sample-applications/c...) require Azure to work.

And then there's this:

> Installing Drasi in an EKS cluster can be significantly more complex than a standard installation on other platforms. Instead of downloading a CLI binary using the provided installation scripts, this approach requires modifying the source code of the Drasi CLI and building a local version of the CLI.

Is this an actual requirement or just the current easy path?

stackskipton · 2024-10-20T19:21:41 1729452101

Azure SRE here, it doesn't appear to have any Azure dependencies. CLI rebuild seems to be that "drasi init" assumes Azure Kubernetes Service built in StorageClasses for Kubernetes PVC for Redis and Mongo and thus fails when running against EKS. I assume same thing would be required on GKE. Yes, it should be more modular but MVP.

As for other stuff, it's using Gremlin Query Language or Postgres which are both open. In fact, it's going out of its way not to use Azure authentication, as loading a connection string as a Kubernetes secret is 100% AGAINST Azure Kubernetes Best Practice. Best Practice would be Workload Identity.

bob1029 · 2024-10-21T00:29:03 1729470543

> CLI rebuild seems to be that "drasi init" assumes Azure Kubernetes Service built in StorageClasses for Kubernetes PVC for Redis and Mongo and thus fails when running against EKS. I assume same thing would be required on GKE. Yes, it should be more modular but MVP.

None of these words are in the Bible.

ryanwjwaite · 2024-10-22T18:27:08 1729621628

You're right, it should work better on AWS, GCP, and other clouds. We'll get to that in future builds of Drasi. We've submitted to CNCF and, just like with Radius and Dapr, we'll make sure it works well on more than just Azure.

devjab · 2024-10-21T06:27:33 1729492053

Every bit of Microsoft open source is created at least partly as a sales strategy for Azure. They usually start within the Azure infrastructure because, well, why wouldn’t they? Then eventually they tend to make it so you can use them outside of Azure, but they never quite leave the part where they are “better” if you’re an Azure customer.

Time will tell if Drasi is going to go the path where it becomes more easily useable outside of Azure (and in this case AWS) or it’ll go more of a Bicep route.

agentofreality · 2024-10-22T00:31:47 1729557107

As mentioned above, Kubernetes is (intended to be) our only platform dependency right now. Drasi is not yet ready for production use and can be explored using k3s, kind, AKS, and EKS, which we felt provided sufficient initial options for people to choose from.

In the coming weeks we will get more of our Sources and Reactions documented as well as docs on how to create custom Sources and Reactions. In the short term, if people have Sources and Reactions they want so they can integrate with a wider range of up and downstream systems, we would love to help support their efforts in developing these.

The Drasi Team is most active over on our discord channel (https://aka.ms/drasidiscord), where we are happy to answer questions and help people get started using Drasi.

dtquad · 2024-10-20T18:02:41 1729447361

That is usual for new Microsoft open source projects. It takes 1-2 months for the Azure dependencies to go away.

3abiton · 2024-10-21T02:07:58 1729476478

I'm curious about the other examples? I get it though, as many of these projects are built fulfilling a specific need within MS infrastructure.

agentofreality · 2024-10-21T23:17:52 1729552672

I am the Drasi engineering lead and can assure you that any Azure-centricity is purely one of historical convenience and a lag in getting more of our non-Azure-centric doc, samples, and components published.

The main current dependency is having a K8s cluster.

You can run Drasi for dev/test on k3s (https://drasi.io/how-to-guides/installation/install-on-k3s/) or kind (https://drasi.io/how-to-guides/installation/install-on-kind/), and Docker Desktop also works but is undocumented.

Cloud based options include AKS (we will release the instructions soon) and EKS as mentioned. When we tested on EKS, we hit some storage class issues and decided to publish this with some work-arounds instead of holding back until we do a proper fix, which we will prioritize if there is demand.

On prem K8s should also work, but we haven't put resources into testing those scenarios. We would love to engage with anybody that would be willing to try this out.

Also, in the future we are thinking about other delivery platforms, not just K8S. You will see in the code that our dependency on k8s is abstracted.

If you have any questions, the Drasi Team is most active over on our discord channel (https://aka.ms/drasidiscord) and we would love to answer your questions and help you get started using Drasi.

gtani · 2024-10-23T11:20:48 1729682448

is there any lineage between this project and ReactiveX family developed at endjin now?

jameslevy · 2024-10-20T18:42:36 1729449756

Does it require Azure to work? Or could the Azure steps be relatively easily swapped out for AWS/GCP/etc?

agentofreality · 2024-10-22T00:39:44 1729557584

Drasi does not require Azure to work. You install Drasi on a Kubernetes cluster, configure a Drasi Source to connect to a supported database, write Continuous Queries in Cypher Query Language, and configure a Drasi Reaction to do something with the output of the Continuous Query. For K8S, you can currently use k3s, kind, AKS (docs coming soon), or EKS (requires some workarounds).

Drasi docs are here - https://drasi.io/

The Drasi Team is most active over on discord, where we are happy to answer questions and help get people started using Drasi (https://aka.ms/drasidiscord)

pjmlp · 2024-10-20T18:10:56 1729447856

Azure is the new Windows, as timesharing OS, thus yeah that is to be expected.

dxxvi · 2024-10-21T03:38:03 1729481883

Is this what can be done with Apache Kafka Connect (to get data from another source to a Kafka cluster), Kafka (including Kafka Streams)? This image (https://github.com/drasi-project/community/raw/main/images/d...) is like Kafka Streams with a single topic. This image (https://github.com/drasi-project/community/raw/main/images/c...) is like joining 2 streams in Kafka Streams.

ultrafez · 2024-10-21T13:48:21 1729518501

It also seems reminiscent of KSQL - consuming multiple input topics, and producing output to a topic defined using a query written in a SQL-like language that defines how the inputs are combined and filtered.

gigatexal · 2024-10-20T16:36:49 1729442209

Oh this very much reminds me of [feldera](https://feldera.com) — they do incremental loads and computations using some novel approaches (most of which i am too dumb to follow). Really nice folks too.

woozyolliew · 2024-10-20T17:06:56 1729444016

Or the related Materialize stuff https://materialize.com/

hobofan · 2024-10-20T21:55:03 1729461303

I took a brief look into Drasi and it looks like it doesn't do any of the differential/timely dataflow stuff (like Materialize does), or any other sophisticated incremental view maintenance methods that are rooted in Microsoft Research.

fatliverfreddy · 2024-10-21T03:59:38 1729483178

I see more Cypher fans out here - check out https://cyphernet.es if you work with Kubernetes!

jeremycarter · 2024-10-21T10:17:42 1729505862

Brilliant

stefanos82 · 2024-10-20T18:18:17 1729448297

Drasi...React...well played Microsoft, well played :D

Assuming they chose this name from the Greek δράση which means action, React of course is the exact opposite to action, thus the React-ion; an action expects a reaction, somewhere somehow!

benbristow · 2024-10-20T18:52:43 1729450363

Not like Microsoft to name things well...

j-a-a-p · 2024-10-20T18:57:22 1729450642

VMS++ = Windows NT?

TeMPOraL · 2024-10-21T21:14:16 1729545256

Here I thought they were accidentally or intentionally referring to:

https://babylon5.fandom.com/wiki/Drazi

But now I noticed the spelling difference :/.

smarx007 · 2024-10-20T16:37:20 1729442240

https://azure.microsoft.com/en-us/blog/drasi-microsofts-newe...

mnsc · 2024-10-21T05:02:20 1729486940

I finished reading Kleppmann's Designing Data-Intensive Applications last night and this looks like it's straight out of the last chapter that talks about the future. They don't use the term "dataflow" though.

https://www.oreilly.com/library/view/designing-data-intensiv...

9dev · 2024-10-21T05:55:13 1729490113

That one’s also on my reading list. Was it worth the read?

yas_hmaheshwari · 2024-10-21T06:00:03 1729490403

This book is definitely worth the read. Or maybe worth 10 reads. It's really that awesome!

xnorswap · 2024-10-21T08:42:51 1729500171

I read it over the summer and I'd say it's essential reading for any developer who deals with data.

Perhaps most importantly, the book empowered me to talk confidently about the trade-offs involved with different choices of handling data, and gave me a language framework to talk accurately about those choices.

Previously even the parts I did understand were from experience, and not an academic background, so my explanations were hand-wavy or sloppy, but now I can state my case for different solutions much more clearly.

imvetri · 2024-10-20T17:14:02 1729444442

What does it process it from and what does it process it to?

Is it programmable, or do you have a concrete concept theorised?

What is it useful for? How does it help a business save costs or increase profit? Is it a hobby project?

agentofreality · 2024-10-22T19:42:01 1729626121

This post may help answer some of your questions: https://opensource.microsoft.com/blog/2024/10/22/detect-and-...

And the project docs may help answer others: https://drasi.io/

Also, the Drasi team are most active over on our discord channel (https://aka.ms/drasidiscord) where we would be happy to answer questions and help you get started using Drasi.

emmanueloga_ · 2024-10-21T14:38:46 1729521526

They don't mention "CDC" (Change Data Capture) directly anywhere, but I think that's what Drasi is? (they call it "Data Change Processing platform").

"Debezium", an alternative CDC system, is mentioned in the documentation and sources [1]. I'm not sure if Drasi uses Debezium, or aims to be compatible with it. Maybe someone here can shine more light on the relationship between these two?

--

1: https://github.com/drasi-project/drasi-platform/tree/main/re...

agentofreality · 2024-10-22T03:16:24 1729566984

Hi, I am the Drasi engineering manager, maybe I can clarify. Drasi is not intended to be another CDC alternative. It doesn't compete with Debezium, in fact we already have some integration with Debezium and hope to do more in the future.

People often use CDC to replicate, consolidate, filter, and transform data. And sometimes they use it as a source of change events to build components/services that look for specific data changes and do something when they detect those changes. These kinds of components/services tend to be more complex to build, operate, and maintain than expected. Especially if they bring together data from multiple sources, have complex criteria, need to react in near real-time, be secure, and resilient. Most people that have had to build these can probably agree that they would like it to be less complex.

Drasi was created so people don't have to build this kind of component/service. They just write a Continuous Query (in Cypher), and then configure it to connect to supported Data Sources and Reactions (which do something when changes are detected).

Drasi manages the connection to the Source systems to get the low-level changes when they occur (sometimes using the Debezium library), maintains a perpetually accurate result set for the Continuous Query, and every time a source change results in a change to the Continuous Query result, Drasi generates a diff and sends it to the set of subscribed Reactions. The Reactions do something with those diffs depending on their purpose e.g. update a database, post an event, send an email, send a text message. All of this with no code in a platform that can scale to support many such queries.
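As a rough mental model of that flow, here is a toy Python sketch: keep the last result of a query, recompute it when the source changes, and push a diff to subscribers only if the result actually changed. Everything here (`ContinuousQuery`, `diff_results`, `ready_orders`, the order data) is hypothetical and is not Drasi's API; in particular, this naive version re-runs the whole query on every change, which is an assumption made for brevity and not a claim about how Drasi maintains its results.

```python
def diff_results(previous, current):
    """Compare two result sets keyed by row id."""
    added = {k: v for k, v in current.items() if k not in previous}
    deleted = {k: v for k, v in previous.items() if k not in current}
    updated = {
        k: (previous[k], current[k])
        for k in current
        if k in previous and previous[k] != current[k]
    }
    return {"added": added, "updated": updated, "deleted": deleted}


class ContinuousQuery:
    """Toy stand-in: holds the latest result and notifies reactions with diffs."""

    def __init__(self, query, reactions):
        self.query = query          # callable: source data -> {key: row}
        self.reactions = reactions  # callables that receive each diff
        self.result = {}

    def on_source_change(self, data):
        new_result = self.query(data)
        diff = diff_results(self.result, new_result)
        self.result = new_result
        if any(diff.values()):      # stay silent if the result did not change
            for reaction in self.reactions:
                reaction(diff)


def ready_orders(orders):
    """Hypothetical query: orders whose status is 'ready'."""
    return {oid: o for oid, o in orders.items() if o["status"] == "ready"}


received = []  # a "reaction" that just records the diffs it is sent
cq = ContinuousQuery(ready_orders, [received.append])

cq.on_source_change({1: {"status": "pending"}})                            # no diff
cq.on_source_change({1: {"status": "ready"}})                              # added
cq.on_source_change({1: {"status": "ready"}, 2: {"status": "pending"}})    # no diff
cq.on_source_change({2: {"status": "pending"}})                            # deleted
print(received)
```

Note that two of the four source changes produce no output at all, because they do not alter the query result.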

There is more to it, but a good starting point is if you ever think to yourself that you want to query a database and then compare the results to a previous query result, and you want to do this periodically, you might consider Drasi as an alternative.

The Drasi Team is most active over on our discord channel (https://aka.ms/drasidiscord) and we would be happy to answer questions and help you evaluate whether Drasi is something that might be useful to you.

emmanueloga_ · 2024-10-22T03:39:17 1729568357

Thank you for this answer! It makes more sense now that you don’t call Drasi a CDC, I think it could be useful to clarify the distinction in the docs (perhaps also comparing and contrasting with other systems that appear to be closely related, like streaming databases).

robertpohl · 2024-10-22T05:55:49 1729576549

Put this in your website! I understand now :)

resters · 2024-10-20T23:18:53 1729466333

This is a very solid pattern. Many systems that are built using traditional relational database systems would lend themselves to far simpler designs using this paradigm. It is not necessarily immediately obvious but nonetheless quite true.

unit149 · 2024-10-21T01:31:18 1729474278

Beginning with Boolean operators: and / or - this relational service model can distribute queries. Curious why Cypher [0] abandons this syntax.

lasermike026 · 2024-10-21T13:58:05 1729519085

I suppose students need to prepare to defend what they are writing. Also, teachers may need a bit of a demotion when making accusations of plagiarism or generated papers. Teachers at the very least should be able to reasonably prove their accusations. There is a greater problem with tutors writing students' papers. If teachers and students worked more closely this wouldn't be an issue.

dijksterhuis · 2024-10-21T14:35:21 1729521321

i feel like this comment was maybe supposed to be posted under this article?

https://news.ycombinator.com/item?id=41896973

fatliverfreddy · 2024-10-20T16:52:29 1729443149

I wish I could use Cypher for everything

unixhero · 2024-10-21T07:52:57 1729497177

I would really enjoy using it. But as a novice data intensive application developer, why would I not query the table every 30 seconds and look for changes with a Python program (or another regular programming language)?

bobnamob · 2024-10-21T08:05:42 1729497942

One of the best resources to understand "why would I not ... ?" in a data intensive context is Kleppmann's Designing Data-Intensive Applications[1] (mentioned elsewhere in the comments). There's a lot of nuance to why event streaming wins out over periodically "polling" a database, mostly about maintaining correctness while being able to scale horizontally.
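One concrete correctness problem with polling, as my own small illustration (not taken from the book): a poller only sees the state at the moment it looks, so a value that changes and changes back between polls is never observed, whereas a consumer of the change stream sees every transition. The function names and the status values below are made up.

```python
def poll_snapshots(changes, poll_every):
    """Apply each change to a value, but only look at it every `poll_every`
    changes; record the value when it differs from the last one seen."""
    value = None
    seen = []
    for i, new_value in enumerate(changes, start=1):
        value = new_value
        if i % poll_every == 0 and (not seen or seen[-1] != value):
            seen.append(value)
    return seen


def consume_stream(changes):
    """An event consumer receives every change in order."""
    return list(changes)


# Hypothetical sensor status: two short alarms that clear before the next poll.
changes = ["alarm", "ok", "alarm", "ok"]

polled = poll_snapshots(changes, poll_every=2)
streamed = consume_stream(changes)

print(polled)    # the poller only ever sees "ok"
print(streamed)  # the stream consumer sees both alarms
```

Polling more often narrows the window but never closes it, and it adds load on the database for every poll that finds nothing.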

Taking a look at the Kafka docs [2] is also enlightening.

[1] https://www.amazon.com/Designing-Data-Intensive-Applications...

[2] https://kafka.apache.org/documentation/#gettingStarted

agentofreality · 2024-10-22T16:54:17 1729616057

Here is a new Drasi blog post that might help you understand why Drasi is preferable to writing code that periodically polls a database: https://opensource.microsoft.com/blog/2024/10/22/detect-and-...

purpleidea · 2024-10-21T17:16:07 1729530967

This feels like a specialized version of https://github.com/purpleidea/mgmt/ but Microsoft only.

SiddanthEmani · 2024-10-20T23:25:56 1729466756

Cypher is so cool. I included a graph database in my RAG patient chatbot

https://github.com/SiddanthEmani/patient_chatbot

sitkack · 2024-10-20T18:01:04 1729447264

But at what cost?

akmittal · 2024-10-21T01:47:16 1729475236

Go seems to be a good choice for data processing systems.

iamstan23 · 2024-10-21T07:45:34 1729496734

Weird thing about this project is that neither the website (https://drasi.io) nor the repo (https://github.com/drasi-project/drasi-platform) mentions that it's a Microsoft project.

Also the only cloud provider it has installation instructions for is AWS's EKS platform. Yet it has integration instructions for Azure CosmosDb Gremlin API.

That one customer out there using EKS and Gremlin on CosmosDb is probably over the moon right now.

agentofreality · 2024-10-22T03:33:52 1729568032

I am the Drasi engineering manager, and I can assure you that any weirdness you sense is purely accidental.

The project was announced through standard Microsoft channels by Mark Russinovich and is part of the Microsoft Collaboration on GitHub. But we are predominantly an open source project from the Azure Incubations team which has a history of releasing open source projects. So we don't feel the need to constantly remind everybody that we are a Microsoft project and team.

The documentation site is missing some content that wasn't ready in time for the release, including AKS install instructions as well as additional Source and Reaction docs. These will be out soon.

If you know that customer that uses EKS and Cosmos Gremlin, please let us know, we would also be over the moon.

In any case, the Drasi Team is most active over on our discord channel (https://aka.ms/drasidiscord) where we would love to answer any questions you have about Drasi and help you get up and running.

vladsanchez · 2024-10-21T12:11:30 1729512690

https://azure.microsoft.com/en-us/blog/drasi-microsofts-newe...

> "The Microsoft Azure Incubations team is excited to announce that Drasi is now available as an open-source project."

f4c39012 · 2024-10-20T16:54:07 1729443247

Purple!

computronus · 2024-10-20T16:59:54 1729443594

Green!

JaimeThompson · 2024-10-20T17:49:35 1729446575

Just my luck. I get stuck with a race that speaks only in macros.