Hacker News
Eva – A distributed entity-attribute-value database in Clojure (github.com/workiva)
255 points by CurrentB on June 28, 2019 | 62 comments



I'm excited that some of the ideas from Datomic are starting to make it into some open source projects. With Crux last month and now Eva, there are multiple options for modern Clojure databases.

While there are many excellent ideas embedded in Datomic and these projects, for me just being able to persist the same data structures you're using at a REPL and query for them with data is a huge win vs. having to start translating types, concepts, and query strings to and from SQL.


> I'm excited that some of the ideas from Datomic are starting to make it into some open source projects

Now if only they’d make it into open source projects in languages other than Clojure. I want my Datomic-for-Elixir, darn it! (And, if I wasn’t too busy, I’d be the first person to volunteer to build it!)


It's not precisely the same thing, but I wonder why the Elixir community doesn't seem to use Mnesia much. It ships right there with OTP and you can store any Erlang term.


Mnesia works great for what it’s designed for. What it was designed for, though, is “replicating connection state between a hot-master telecom switch and its warm-standby switch, so that the warm standby can be instantly promoted to master.” Which... isn’t a common use-case, really. I’d describe the result as “ETS synchronization”, rather than a true DBMS.

Mnesia has had features like disc copies bolted on after-the-fact, but you can tell from the way they’re implemented (e.g. manual node-crash recovery) that they’re not used by Ericsson in the form you find them in OTP, but rather that these are just framework hooks where Ericsson has built (and expects you to also build) specialized-to-your-use-case DBMS logic on top of.

And, likewise, Mnesia assumes OTP’s “distribution set” model (i.e. a static set of known operational relationships between nodes) rather than a clustering model; Mnesia has no clustering support per se—you need to take down the whole Mnesia application across the distribution set and start it back up if you want to change its node membership.

These things are fixable, but the result wouldn’t really be “Mnesia” itself any more, but rather a freestanding DBMS that uses Mnesia as its storage- and transaction-linearization engine, maybe Lasp for clustering, CRDT-annotated field-types in table schemas for resolving state after crashes, etc. Still, I’m surprised nobody has bothered to build such a thing and open-source it.


Just as a clarification: if you want a node to join a different cluster, you have to take down the Erlang VM and restart it (i.e., you have to change the cookie to match that of the cluster you want to join). If you want it to just rejoin the same cluster, you don't. Given a set cookie and service discovery, you can join or leave a cluster without issue.

That said, I generally agree with you. Mnesia is perfectly serviceable, but it is, as you say, really only a good fit for manual intervention in the event of service interruption. That manual intervention can be after the fact, but it can't automatically resolve a partition in the event of a netsplit. It's not too bad to write code to automatically resolve those partitions (I've POCed it; we ended up with a manual button to execute the code to do it just because of our own concerns, but it worked quite well), but you have to start making some decisions, and write some code, on how to do that.

Which is why people tend to use other things; there are other solutions that have "what do I do in the event of a netsplit" already built in. And while they may be faulty under certain circumstances (as Jepsen tests have shown is usually the case), most people will take mostly correct self healing over no/DIY self healing.

A few other things: Erlang's default distribution (which Mnesia leverages) was not built for remote distribution (i.e., nodes located in different data centers), so no clue what happens there. It also was built with a fully connected topology, which limits how many nodes it's reasonable to connect to the cluster.


Which databases are you currently using that are written in Elixir?


Of course, I meant Erlang; these things (Riak, CouchDB, RabbitMQ) always end up getting written in Erlang.

(Though I don’t think there’s been a new major infra component project started since Elixir became a viable contender, so maybe that could change.)


I didn't realize those were written in Erlang! Similarly, on the Clojure side, I think Datomic and Crux are both written in Java, despite generally being consumed by clojurey interfaces.


Crux is definitely written in Clojure and Hickey has said Datomic is written in Clojure in the past.


can't agree enough! was really interested in the now defunct Mentat project, but it was written in Rust - it's a game changer to have a JVM-powered datomic-ish project with time as a first class citizen.


> time as a first class citizen

It's worth noting that time now has two levels of meaning in this context - transaction time and valid time.

Disclosure: working on Crux


hey, thanks for the reply. could you highlight the differences between Eva and Crux as you see them, today?

i do like passing in a timestamp and getting the state of the world as Crux seems to implement, whereas Eva seems slightly less oriented on time based querying, more so for a sort of log setup (where they have EAV+T+added?)


Well, Eva overlaps very closely with Datomic so I would recommend taking a look at the FAQ entry I wrote on the Crux/Datomic comparison (with 8 sections covering the major differences!): https://juxt.pro/crux/docs/faq.html#_comparisons

`EAV+T+added?` is the canonical data representation used by Eva, whereas Crux relies on two discrete logs, a "document" log and a transaction log, as the canonical representation. There are many reasons for this (data eviction, "unbundled" scalability etc.) but bitemporality (i.e. the ability to transact into the past/future) is definitely the primary reason why Crux doesn't follow the `EAV+T+added?` pattern. You can get a feel for the indexes Crux uses here: https://github.com/juxt/crux/blob/master/src/crux/codec.clj#...
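
For readers who have not seen the `EAV+T+added?` shape before, here is a minimal Python sketch of the idea. It is not Eva's or Datomic's actual code: the `Datom` type, the `entity_as_of` helper, the attribute names, and the assumption that every attribute holds a single value are all made up for illustration.

```python
from typing import Any, NamedTuple


class Datom(NamedTuple):
    e: int       # entity id
    a: str       # attribute
    v: Any       # value
    tx: int      # transaction id (the "T")
    added: bool  # True = assertion, False = retraction


def entity_as_of(log, e, tx):
    """Fold the log up to and including `tx` into an attribute map for `e`.

    Assumes single-valued attributes (a hypothetical simplification).
    """
    state = {}
    for d in sorted(log, key=lambda d: d.tx):  # stable sort keeps in-tx order
        if d.e != e or d.tx > tx:
            continue
        if d.added:
            state[d.a] = d.v
        elif state.get(d.a) == d.v:
            del state[d.a]
    return state


# An "update" is recorded as a retraction plus an assertion in the same tx.
log = [
    Datom(1, "product/name", "Hoodie", 100, True),
    Datom(1, "product/price", 80, 100, True),
    Datom(1, "product/price", 80, 101, False),
    Datom(1, "product/price", 65, 101, True),
]

before = entity_as_of(log, 1, 100)
after = entity_as_of(log, 1, 101)
```

Because nothing is overwritten, the state at any earlier transaction can be recomputed from the same log.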


I suspect by "ideas from Datomic" you are referring to ideas predating Datomic (and Clojure) that are used and possibly refined in Datomic.


To be fair, a quick Google search for "open source datalog database" does not return much that is usable out-of-the-box, other than Datomic: mostly academic projects or adapters for other databases with some different querying API, like SPARQL.

Datomic seems to be one of the best ones in terms of execution of the ideas behind it and usage in real world projects, to say the least.


I think he meant to say they were not mainstream


In the Clojure community, no new technology officially exists until Rich Hickey invents it.


My impression is that the Clojure community (including Rich Hickey) is generally transparent and proud about digging up and reusing ideas from the 80s in a modern context.


Can you elaborate a bit more about their relationship?


> Workiva has decided to discontinue closed development on Eva, but sees a great deal of potential value in opening the code base to the OSS community.

I'm confused -- does this mean that Workiva themselves are not using Eva? Or are they still using it, but not officially developing it any more? If they were really invested in it, why would they only allow employees to work on it in their 10% time?


It was open sourced because it was a cool piece of technology and it would have been a shame to keep it back, but it is no longer in use at Workiva.

Source: I work there, although I have literally nothing to do with this project.


Interesting! Are you guys switching to another EAV store like Datomic or something else?


Question for Eva and Datomic users: How often does your org miss having a separate relational database around for data folks and product folks capable of (read only) SQL? Do you find yourself making a separate store/s to support them?


One thing that's nice with Datomic (don't know about Eva, but probably the same) is that syncing data to other stores is straightforward, thanks in particular to the easy change detection.


Oh sweet memories

I created an almost identical implementation of this as an embeddable library in the year 2000, along with a proprietary SQL language. The reason was that my client had an inventory of products with some crazy number of attributes; each product could have its own set, and the client kept changing, creating, and deleting them. It was in-memory but with persistence and atomic transactions. No history though. It was blindingly fast on complex queries. And the schema of the database was kept as a set of entities with some predefined names, values, and ranges of IDs.

For a while I was contemplating releasing it as a standalone product but as I had enough tasks on my plate decided not to do it. Kinda feel sorry now ;(

So, all in all, very close in concept.


Last time I saw the EAV pattern used was in Magento. It was an absolute nightmare to query against.


I loved using Datomic, but I got too used to its features. I'm now back in a PostgreSQL shop and it feels like we're back in 2002. So many times have folks asked for features that would have been available for free in Datomic or another datastore that are just way too complex to maintain on top of a traditional rdbms.


Could you elaborate on some of those features?


There always seems to be some demand to see how records change over time. If you plan ahead and know which tables the business wants historic values for, then you can build in that ability from the start. Unfortunately, these demands always arise at a later point in time or as one-off uses, and it is hard to justify the effort to implement them. With Datomic, looking across history is free, fast, and easy.

Additionally, the ability to pin a database at a specific point in time is useful for APIs. One use is pagination, so that you don't see pages change out from underneath you as records are updated. Another is in APIs that join configuration values with time-series data or aggregations. If our data decisions drive revenue, for example, looking at 2018-08-31 revenues with today's business data is a lot less insightful than being able to see them with other data pinned at that same date. Pointing the queried DB to a fixed timestamp in Datomic is free.
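
The pagination point can be illustrated with a small Python sketch. This is not the Datomic API: the `(tx, key, value)` fact format and the `snapshot` and `page` helpers are hypothetical stand-ins for "query the database as of a fixed basis".

```python
def snapshot(facts, basis_tx):
    """Latest value per key, ignoring anything written after basis_tx."""
    state = {}
    for tx, key, value in sorted(facts, key=lambda f: f[0]):
        if tx <= basis_tx:
            state[key] = value
    return state


def page(facts, basis_tx, offset, limit):
    """One page of (key, value) pairs from the snapshot pinned at basis_tx."""
    items = sorted(snapshot(facts, basis_tx).items())
    return items[offset:offset + limit]


# Append-only list of (tx, key, value) facts.
facts = [(1, "a", 10), (1, "b", 20), (2, "c", 30), (3, "d", 40)]

basis = 3                          # the client remembers this with its cursor
first = page(facts, basis, 0, 2)

# Writes arrive between the two page requests.
facts.append((4, "a", 99))
facts.append((5, "aa", 1))

second = page(facts, basis, 2, 2)  # pinned: continues where page one ended
unpinned = page(facts, 5, 2, 2)    # latest basis: the page boundary shifts
```

With the pinned basis, the second page picks up exactly after the first. Against the latest basis, the new key shifts the ordering and `b` shows up again on page two.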


Interesting, I was fairly confident you were going to write about Datalog and about the flexibility of having such a minimal and evolvable schema :)

One of the motivations for Crux was that `as-of` queries still weren't powerful enough for what the business really needed, which was something like `as-of` that could cope with retroactive corrections for out-of-order and delayed ingestion (of upstream transaction data).


PostgreSQL is behind on this particular SQL feature, but MariaDB has it: https://mariadb.com/kb/en/library/temporal-data-tables/


That’s a very cool addition! I wonder if this can be used in an event sourcing architecture with minimal overhead.


I think to be really useful an event sourcing data store needs `as at` support in addition to `as of` (i.e. bitemporality) to cope with retroactive and proactive events and corrections. There are a couple of references for this in the Crux docs: https://juxt.pro/crux/docs/bitemp.html#_known_uses
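
As a rough illustration of what bitemporality buys you, here is a Python sketch with two time axes per record. It is a simplified model, not how Crux stores or indexes data; the `Version` type, the `lookup` function, and the integer timestamps are assumptions made for the example.

```python
from typing import Any, NamedTuple


class Version(NamedTuple):
    key: str
    value: Any
    valid_time: int  # when the fact holds in the business domain
    tx_time: int     # when the store learned about it


def lookup(history, key, valid_at, tx_at):
    """Value of `key` at `valid_at`, using only what was recorded by `tx_at`."""
    candidates = [
        v for v in history
        if v.key == key and v.valid_time <= valid_at and v.tx_time <= tx_at
    ]
    if not candidates:
        return None
    # Latest valid time wins; ties go to the most recently recorded version.
    return max(candidates, key=lambda v: (v.valid_time, v.tx_time)).value


history = [
    Version("price", 80, valid_time=1, tx_time=1),
    Version("price", 65, valid_time=5, tx_time=5),
    # Late correction recorded at tx 9: the price was really 70 from day 3.
    Version("price", 70, valid_time=3, tx_time=9),
]

as_known_then = lookup(history, "price", valid_at=4, tx_at=6)
as_known_now = lookup(history, "price", valid_at=4, tx_at=10)
```

The same question ("what was the price on day 4?") has two answers depending on whether the retroactive correction had been recorded yet, and both remain queryable.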


I agree that the experience of querying EAV databases using SQL can feel cumbersome [0]. But Datomic's dialect of Datalog [1] is a great fit for EAV and can produce much more readable and simpler queries.

[0] https://gist.github.com/grafikchaos/1b305a4e0b86c0a356de

[1] https://docs.datomic.com/on-prem/query.html


EAV is not for everyone. You have to have a data set for which each entity can have values defined for some, but not all, of a large set of attributes. It’s a bit like sparse matrices, actually.

And, yes, you want your tools and application to abstract away some of the messiness involved in something as simple as “load/save this object from the database,” otherwise you’ll end up with a much larger amount of code for such a simple operation. I’ve seen it done in a Django app, and it’s not bad to work with once those abstractions are in place.


Why not just use a json store for those sparse attributes?

EAV performance at large scale is really dreadful. So many queries.


EAV with a query engine and syntax not designed for it is awful. EAV (and AEV and AVE and VAE) with datalog queries is performant, a joy to use, and way more flexible than json stores.


I don't think Magento used it right.

I used EAV in an application and it was fine (I used Datascript on the client-side, with server-side data being stored in a JSON document database, RethinkDB). I did run into some annoying issues with the lack of nil value handling. I'd say writing queries was difficult, but not significantly more difficult than in any other language, if you care about performance and want to know what the query actually does.

Overall, I felt there was a good mapping between my domain model and EAV.

I eventually dropped this solution in favor of Clojure data structures: if you have an in-memory (in-browser) database anyway, why keep the data in Datascript, if you can simply keep it as ClojureScript data structures?


As I mentioned here: https://news.ycombinator.com/item?id=20308175 , I think Magento did use it right, but EAV is impractical on a relational database not designed for EAV. This fact, however, doesn't show itself until you've reached a certain (arguably low by today's standards) scale.


I also started with Datascript both on frontend and backend, but ended up using instead its core data structure - persistent-sorted-set [1], which works really well if all you need is indexing.

[1] https://github.com/tonsky/persistent-sorted-set


I don't know for sure, but I suspect there is a significant usability difference between an "EAV pattern" and something engineered from the ground up to make EAV practical.


For those of us who have never used such a thing, what didn't you like about it?


The EAV model is great in theory and is probably great in practice _if_ the backing software has been _designed_ with EAV in mind.

In the context of Magento, it is a real _nightmare_ and is one of the major contributors to the slowness of the Magento platform (at least for Magento < 2.0).

The reason for this slowness is that in a relational database, the EAV model makes it so that e-v-e-r-y s-i-n-g-l-e SQL query is one gigantic query made of tons of JOINs.

To give you an example, querying a product in Magento may need to join no fewer than 11 tables!

catalog_product_entity, catalog_product_entity_datetime, catalog_product_entity_decimal, catalog_product_entity_int, catalog_product_entity_gallery, catalog_product_entity_group_price, catalog_product_entity_media_gallery, catalog_product_entity_text, etc, etc.

In order to fix this issue, the Magento team created what they call "flat tables", which are tables that are created by querying the database with an EAV query (i.e. the query with a million joins) and putting the results in a table with as many columns as there are attributes being returned by the original query.
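
To show the shape of the problem, here is a small Python sketch using the standard `sqlite3` module. The table and column names are simplified, hypothetical stand-ins loosely modeled on the list above, not Magento's real schema, and only three typed value tables are included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_entity (entity_id INTEGER PRIMARY KEY, sku TEXT);
CREATE TABLE product_entity_varchar (entity_id INTEGER, attribute TEXT, value TEXT);
CREATE TABLE product_entity_decimal (entity_id INTEGER, attribute TEXT, value REAL);
CREATE TABLE product_entity_int (entity_id INTEGER, attribute TEXT, value INTEGER);
INSERT INTO product_entity VALUES (1, 'HOODIE-1');
INSERT INTO product_entity_varchar VALUES (1, 'name', 'Hoodie');
INSERT INTO product_entity_decimal VALUES (1, 'price', 65.0);
INSERT INTO product_entity_int VALUES (1, 'status', 1);
""")

# Every attribute you want back costs one more join against a typed table.
EAV_QUERY = """
SELECT e.entity_id AS entity_id, e.sku AS sku,
       n.value AS name, p.value AS price, s.value AS status
FROM product_entity e
LEFT JOIN product_entity_varchar n
       ON n.entity_id = e.entity_id AND n.attribute = 'name'
LEFT JOIN product_entity_decimal p
       ON p.entity_id = e.entity_id AND p.attribute = 'price'
LEFT JOIN product_entity_int s
       ON s.entity_id = e.entity_id AND s.attribute = 'status'
"""
rows = conn.execute(EAV_QUERY).fetchall()

# The "flat table" idea: materialize the join result once, one column per
# attribute, and serve reads from that table instead.
conn.execute("CREATE TABLE product_flat AS " + EAV_QUERY)
flat_rows = conn.execute(
    "SELECT entity_id, sku, name, price, status FROM product_flat"
).fetchall()
```

Reads against `product_flat` need no joins, but the table has to be rebuilt or re-indexed whenever the underlying EAV rows change.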

In theory choosing to use EAV was an amazing idea. In practice, this idea did not scale for large Magento stores and it has made Magento hugely complex, slow and hard to use.

We use Magento at betabrand.com and I can confidently say 90% of the slowness of our website is due to Magento's EAV tables and we have spent a humongous number of engineering hours optimizing this.


Are you on Magento 1 or 2? We've been building a site in Magento 2 and while the number of joins, hassle with re-indexing performance, and lack of null vs zero vs empty string handling in the flat tables are all very annoying, the actual site is quite fast with all the caching turned on on a server with enough gerbils behind it.

My main complaint with Magento 2 is with the feature that is its biggest selling point: its flexibility. The fact that any public method on any class is wrappable/replaceable, and that any class in the system can be replaced wholesale, and any javascript or any template in the system can be wrapped or replaced, from anywhere, by any module, at a distance, just makes the whole thing a huge cluster to deal with once you get any number of 3rd party modules.


If someone wants to really make money, please clone the Versant Object Database.


Can someone explain how anybody is willing to even give a chance to a software product that is sold like this? [1]

* Generic/"bootstrappy" looking site, mostly marketing speak.

* Not even one code example of what it looks like to solve a problem with this DB.

* Gigantic "Get a Trial" button that links to a lengthy form that I'm sure has 99% abandonment rate.

There are a lot of other software products that are marketed like this, so I'm genuinely curious how this works.

1: https://www.actian.com/data-management/nosql-object-database...


> "the website uses bootstrap, the product must be crap"

congrats on speaking about a product you haven't even tried.


I used Versant back in 2001. It was terrific and very fast, but wasn't a good fit for reporting/analysis.

Versant was building systems that were updated on the fly without shutdown, which was quite an achievement back then (and probably still is: imagine updating a running instance of PostgreSQL).


The company I work for runs dozens of Versant instances and it generally is immensely fast and rock solid.

The Object-Oriented Database (OODBMS) space is basically just Versant. Actian, however, seems not to be developing it actively anymore, and it's generally not advertised much (they rebranded it as "Actian NoSQL" apparently).

I wonder if anybody else still uses it (besides us).


Versant, while an awesome ORM, taught me that I hated ORMs ;)


Versant is not an ORM


Shit, been so long I forgot. Versant was cool. Too many databases floating around in my head :/

TopLink was the ORM that made me hate ORMs.


The name of this is a bit confusing since there is another language by the same name that also is related to databases: https://link.springer.com/chapter/10.1007/978-3-7091-7557-6_...


E-A-V is basically a JSON/BSON document store, perhaps with typing. Anything that distinguishes it from those datastores?


It's the other covering indexes (i.e. AVE, AEV, VAE) that make it interesting because they allow the query engine to perform efficient ad-hoc graph traversals.
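
A toy Python sketch of why the extra sort orders matter: the same triples sorted four ways, each answering a different lookup with a prefix scan. The `scan` helper and the sample data are hypothetical, and real systems use persistent trees rather than sorted lists.

```python
from bisect import bisect_left

# (entity, attribute, value) triples; "person/friend" holds entity ids.
datoms = [
    (1, "person/name", "Ada"),
    (1, "person/friend", 2),
    (2, "person/name", "Grace"),
    (3, "person/name", "Ada"),
    (3, "person/friend", 2),
]
REF_ATTRS = {"person/friend"}

eav = sorted(datoms)                                  # all facts about an entity
aev = sorted((a, e, v) for e, a, v in datoms)         # all entities with an attribute
ave = sorted((a, v, e) for e, a, v in datoms)         # lookup by attribute + value
vae = sorted((v, a, e) for e, a, v in datoms          # reverse references only
             if a in REF_ATTRS)


def scan(index, prefix):
    """Return every tuple in a sorted index that starts with `prefix`."""
    i = bisect_left(index, prefix)
    out = []
    while i < len(index) and index[i][:len(prefix)] == prefix:
        out.append(index[i])
        i += 1
    return out


about_1 = scan(eav, (1,))                                      # entity 1's facts
named_ada = [e for _, _, e in scan(ave, ("person/name", "Ada"))]
friends_of_2 = [e for _, _, e in scan(vae, (2,))]              # who points at 2?
```

The VAE order is what lets a query walk a reference backwards (from an entity to whoever refers to it) without scanning everything, which is the graph-traversal case mentioned above.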


EAV is the best schema. It's how you can do NoSQL in SQL.


Glad to see another contender in the EAVT field! Yay!


AP or CP?


Definitely CP. With Eva and Datomic, transactions are serialised through a single transactor node at a time.

The transaction log design is a fundamental design tradeoff that Eva/Datomic/Crux share, which means system throughput is limited by the throughput of a single process. The argument in favour of such a design is that most businesses & business applications don't actually experience transactional data volumes over 10K tx/sec.
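
Here is a minimal Python sketch of the single-writer idea, assuming nothing about how Eva, Datomic, or Crux actually implement it. The `Transactor` class and its methods are hypothetical: many clients submit concurrently, but one worker thread assigns transaction ids and appends to the log.

```python
import queue
import threading


class Transactor:
    """Serializes all writes through one worker thread (illustrative only)."""

    def __init__(self):
        self.log = []
        self._inbox = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        tx = 0
        while True:
            item = self._inbox.get()
            if item is None:           # shutdown signal
                break
            facts, done = item
            tx += 1                    # ids are assigned in one place only
            self.log.append((tx, facts))
            done.put(tx)

    def transact(self, facts):
        """Submit facts and block until the worker has committed them."""
        done = queue.Queue(maxsize=1)
        self._inbox.put((facts, done))
        return done.get()

    def close(self):
        self._inbox.put(None)
        self._worker.join()


transactor = Transactor()


def writer(name):
    for i in range(50):
        transactor.transact([(name, "counter", i)])


threads = [threading.Thread(target=writer, args=(f"w{n}",)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
transactor.close()

tx_ids = [tx for tx, _ in transactor.log]
```

The log ends up with a single gap-free total order regardless of how the writers interleave, and write throughput is bounded by what that one worker can process.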


EAV is generally an anti-pattern. How does Eva break away from its long, dark history?


I reject the premise of your statement. It's only a failure when implemented on top of relational databases which are (obviously) not optimized for it.

What are the failures of EAV except when applying it to a database optimized for something different?


Build Facebook on it, I suppose.



