I always fancied a database that worked like a dynamic version of Prolog, with facts that can be added over time, queries that can use inference, etc. I couldn't really find anything that worked like that, though: everything seemed to either have a fixed set of facts or not support inference. Looking for something like that is what first got me interested in differential dataflow.
The fact that this is bottom up makes me think that you probably can't go crazy with the number of rules.
Dynamic changes to the database have advantages and drawbacks. On the plus side, they let us incorporate new knowledge and rules, and we automatically benefit from built-in features such as argument indexing. However, running the same query twice can lead to different results, changes in the order of goals can affect the meaning of the conjunction, and we can no longer reason about different predicates in isolation. Dynamic changes of the database are said to be side-effects, since they affect the program in a way that is outside the logical interpretation of the given definitions.
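For example, with a dynamic predicate (a minimal sketch; fact/1 is just a made-up predicate here), the same query gives different answers before and after an assertz/1 call:

:- dynamic(fact/1).

fact(a).

?- findall(X, fact(X), Xs).
Xs = [a].

?- assertz(fact(b)), findall(X, fact(X), Xs).
Xs = [a, b].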
>> Dynamic changes to the database have advantages and drawbacks. On the plus
side, they let you incorporate new knowledge and rules, and you automatically
benefit from built-in features such as argument indexing. However, running the
same query twice can lead to different results, changes in the order of goals
can affect the meaning of the conjunction, and you can no longer reason about
different predicates in isolation. Dynamic changes of the database are said to
be side-effects, since they affect the program in a way that is outside the
logical interpretation of the given definitions.
I wonder if that is part of the benefits advertised by DDlog, the ability to
update a program without having to worry about clause ordering (which one does
not have to, with datalog and bottom-up evaluation) or generally side effects.
Very tentatively, I'm guessing that, since datalog programs lack negation, any
new facts or rules added to the (extensional) database will update the "output"
of the database (its Herbrand model) monotonically, i.e. you'll see new facts
coming out but no facts will "disappear". I'm a bit unsure because the DDlog
github is talking about functions and typed datalog and I'm not really sure how
that works now.
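To make the kind of monotonic update I mean concrete, a toy sketch in plain datalog (edge/path are just placeholder predicates):

edge(a,b).
path(X,Y) :- edge(X,Y).
path(X,Y) :- edge(X,Z), path(Z,Y).

% the model contains path(a,b); adding the fact edge(b,c) only adds new facts
% (path(b,c), path(a,c)); nothing that was derivable before disappears.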
Well, of course you can have negation and functions in "a" datalog language (there are different variants), but the point is that bottom-up evaluation is then no longer decidable.
Out of curiosity, what kind of negation do you mean? Negation as failure or classical negation? And what do you mean about limitations?
Hmm, I thought stratifying Datalog allowed bottom-up evaluation to remain decidable despite the presence of negation? And surely there are at least some kinds of functions that would also preserve decidability, even without stratification?
I admit I don't know very much about Datalog, and I know you know a substantial amount, so maybe I'm just confused. But I'd like to understand how.
Eh, to be honest I don't really know a substantial amount about Datalog. My
specialty is Inductive Logic Programming and I'm interested in Datalog because
its decidability makes it a convenient representation language for
machine-learned logic programs. A very simple version of Datalog is sufficient
for my purposes so I haven't really dabbled much in variations or more modern
versions.
So I might well be wrong about decidability depending on the absence of
negation. Now that I think of it again, the problem with negation, in particular
the negation-as-failure (NAF) in Prolog, is that it breaks the monotonicity of
inference. Monotonicity in this context means that the model of a logic program
(the set of facts that are immediate consequences of the facts and rules in the
program, and that are derived in a bottom-up fashion with a TP operator, in
let's say "traditional" Datalog) can only increase with the introduction of new
facts. With NAF, this is not the case: the introduction of a new fact may make a
previously derived true fact now false. So maybe the problem with negation, at
least NAF, is that it breaks the soundness of the inference procedure because
facts already derived as true become false when new facts are derived.
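The standard toy illustration (a sketch; the usual birds example):

bird(tweety).
flies(X) :- bird(X), \+ penguin(X).

% with just these clauses, flies(tweety) is derived; adding the fact
% penguin(tweety) makes flies(tweety) no longer derivable, so a previously
% derived fact is lost and inference is no longer monotonic.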
I give a (very crude, sorry) example of how Datalog's bottom-up evaluation works
in an earlier comment:
In short, bottom-up evaluation proceeds in discrete steps where at each step a
new set of facts is derived from the facts and rules known so far. Newly derived
facts are added to the program so the set of facts in the program increases
until no new facts can be derived. Without NAF, when a new set of facts is
derived and added to the program, the truth value of already derived facts
cannot change. But _with_ NAF, bottom-up evaluation may introduce a new fact in
step k that makes a fact derived in step k - j false. So now we have an unsound
derivation procedure.
I wonder if this unsoundness actually translates to undecidability. Suppose the
above happens - we introduce a new fact A and some existing fact B becomes
false. What can we do to avoid this? Well, we can't know the truth value of B
after the derivation of A before actually deriving A (because we don't know the
truth of A before we can derive it), so we can't avoid deriving A. What we can
do is go back and re-evaluate the truth of each derived fact, find that B is now
false, and remove it from the program. But what if removing B allows a new fact,
C, to be derived which was previously false because of B, and C is such that A
is now false? Well, A cannot be derived given C, and without A we must derive B.
Which means we have to get rid of C again and allow A back in. So now we're
stuck in a loop.
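Concretely, I think the loop I'm describing corresponds to a program shaped
like this (a, b, c standing in for A, B, C above):

b :- \+ a.
c :- \+ b.
a :- \+ c.

% I believe naive bottom-up evaluation with NAF flip-flops here and never
% reaches a fixpoint; under stable model semantics this odd cycle through
% negation has no stable model at all.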
I don't know if this is actually something that can happen or not, I was just
thinking through the problem right now given your question, so I may be talking
nonsense. In any case, if there is such a problem with NAF, then maybe modern
datalogs have a different negation scheme, allowing classical negation (as does
Answer Set Programming). But I really don't know about that, hence my question
to the OP about the kind of negation they mean.
Which was a genuine question btw! I really don't know how modern Datalogs work.
Anyway I'll go check a few sources and see if my sketch proof of undecidability
above makes any sense and I'll come back to let you know what I find :)
So this might take a while - there is a lot of literature on datalog variants with and without negation or recursion etc. I will have to check with some colleagues. I'll try to come back to you by Tuesday at the latest.
In the meantime you might find the following reference useful (or not, because it's typically thick with opaque notation):
It's a textbook on databases. There are several chapters on datalog, including negation and recursion, which may be of some help, but so far I can't form an intuition about when datalog is decidable with negation. I think it is, but with restrictions, e.g. stratification, as you say, or restricting to non-recursive programs, disallowing nonmonotonic updates (as in my sketch proof above), etc. Indeed, it's a bit frustrating because the literature seems to be endlessly fragmented into sub-languages with different properties that are difficult to summarise concisely, and I just could not find a straightforward answer of the form "this works" as far as I looked (which was a couple of hours since I left my previous comment).
As I say in my comment above, there is a substantial amount I _don't_ know about datalog :)
Ouch. There's 6 upvotes on this comment which I take to mean there's (at least) 6 people waiting to hear what I come up with.
Unfortunately, dear co-HN'ers, I've come back to you empty-handed (and two days late... oh god). I contacted a colleague who is much more knowledgeable about bottom-up evaluation than me but I haven't heard back yet, and that contact, plus my light reading of sources, is really 90% of the bandwidth I have to devote to this right now. I have a paper deadline on the first of May that necessarily has to take up most of my brain. If any of you 6 (or anyone else!) is still interested by that time, please email me at the address in my sig and I'll be happy to dig further then, because this is something I'm actually interested in, too.
You're not letting us down! We're collaboratively exploring a fascinating space of promising ideas!
Please don't feel obligated to answer the following before May, or even read it.
I didn't realize the space of Datalogs was so broad—I had thought that there was a single "stratified semantics" that was the standard. The stratified semantics that I knew about permits bottom-up evaluation with negation as failure to be decidable by putting negations of any predicate X/n into a stratum strictly higher than X/n (which is evaluated later), and non-negated consequences of X/n into a stratum equal to or higher than X/n's stratum.
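For example, if we have rules along these lines (a hypothetical sketch: foo/1 negates baz/2, and quux/1 depends positively on foo/1),

foo(X) :- \+ baz(X, Y).
quux(X) :- foo(X).
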
then we put baz/2 in a lower stratum than foo/1, and quux/1 in a stratum not lower than foo/1. That way, by the time we are trying to infer foo/1 facts and quux/1 facts, we've already finished inferring all the baz/2 facts. Then, there's no way that we can infer more baz/2 facts in the future that could invalidate our foo/1 or quux/1 inferences.
And this avoids the kind of non-monotonic loop you're talking about: if inferring fact A invalidates the inference of fact B, then B is in a strictly higher stratum than A, so we haven't inferred B yet at the time that we infer A.
I think that with unrestricted negation the inconsistency or nontermination problem you mention does exist, in which there exists no (consistent) model for a program (is that the right terminology?); the simplest example would be:
epimenides :- \+ epimenides.
The stratification restriction avoids this because it would require the stratum of epimenides/0 to be strictly greater than itself, which a standard CLP(FD) solver like SWI-Prolog's will easily tell you cannot be done:
?- use_module(library(clpfd)).
% library(error) compiled into error 0.00 sec, 17,872 bytes
% library(apply) compiled into apply 0.00 sec, 29,088 bytes
% library(assoc) compiled into assoc 0.00 sec, 36,240 bytes
% library(lists) compiled into lists 0.00 sec, 25,304 bytes
% library(pairs) compiled into pairs 0.00 sec, 9,040 bytes
% library(clpfd) compiled into clpfd 0.03 sec, 736,792 bytes
true.
?- X #> X.
false.
I think there's an additional problem as well, corresponding to Henkin sentences—programs for which models do exist, but for which there is no unique minimal model. For example, we could try to formalize George W. Bush's foreign policy:
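person(chirac).

% roughly: either you are with us, or you are with the terrorists
withUs(X) :- person(X), \+ withTheTerrorists(X).
withTheTerrorists(X) :- person(X), \+ withUs(X).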
Now we have two minimal models, one in which person(chirac), withUs(chirac) and one in which person(chirac), withTheTerrorists(chirac). (Thus we can explain Freedom Fries, one of the most ridiculous and terrifying aspects of the politics of the early 02000s.) Further inference rules might rule out one, the other, or both of these models.
In the same way, the stratification rule rejects this program: withUs/1 needs to be above (evaluated later than) withTheTerrorists/1, but also vice versa. I'm not sure how to state this so SWIPL's CLP(FD) can tell that it's unsolvable, but that's probably because I'm a total noob at logic in general:
?- X #> Y, Y #> X.
X#=<Y+ -1,
Y#=<X+ -1.
?- X #> Y, Y #> X, label([X, Y]).
ERROR: Arguments are not sufficiently instantiated
Of course it is not a hard problem to assign strata—a straightforward topological sort will immediately reject the loop.
The stratification rule is conservative, though; not only will it reject all the undecidable problems, it will also reject some programs where a unique minimal model does exist; as a trivial example:
>> In the same way, the stratification rule rejects this program: withUs/1 needs to be above (evaluated later than) withTheTerrorists/1, but also vice versa. I'm not sure how to state this so SWIPL's CLP(FD) can tell that it's unsolvable, but that's probably because I'm a total noob at logic in general:
?- X #> Y, Y #> X.
X#=<Y+ -1,
Y#=<X+ -1.
?- X #> Y, Y #> X, label([X, Y]).
ERROR: Arguments are not sufficiently instantiated
I'm not very sure about this either, I haven't used CLP(FD) much. triska, the OP in this thread, is the author of the CLP(FD) library and he could help but I guess he's probably moved on from this thread by now.
>> You're not letting us down! We're collaboratively exploring a fascinating space
of promising ideas!
Thanks, that's a great kind of attitude to have. dang (always a champion of
intellectual curiosity) will be proud of us :)
I did have a quick look through your comment and I think it's insightful and
more correct than not, as far as I can tell.
About this:
epimenides :- \+ epimenides.
That is a contradiction and so does not have any model, as you say. While
stratification would indeed avoid this, deriving an empty model (or failing to
derive a model) is correct behaviour because the formula is unsatisfiable.
However, the clause itself is a well-formed formula, certainly in First Order
Logic, so it's not an invalid program per se. In general, contradictions and
tautologies tend to be treated specially or omitted from discussion in logic
programming theory, I find, which makes sense because they're kind of obvious
corner cases that nobody wants to spend too much time on. "p if not p" should
always fail quickly.
Regarding minimal models, according to the book I linked, there are two datalogs
with negation, datalog¬ (which does not allow recursion) and datalog¬¬ and they
indeed eliminate the guarantee that a program has a unique minimal model under
fixpoint semantics, so while a program is still guaranteed to have _a_ fixpoint
that is reachable in finite time, this fixpoint is not guaranteed to be unique
or minimal.
As to my sketch proof of undecidability, that would actually not work in pure
datalog, where the deletion of facts is not allowed. Deletion of facts is
allowed under datalog¬¬, whose syntax accepts rules with negated _heads_ that
are interpreted as signifying that a fact should be deleted. There are different
options for the treatment of negated rule heads (e.g. deriving A and ¬A
simultaneously can be interpreted as a contradiction that makes the program
unsatisfiable) but now the guarantee of termination is lost because there is no
guaranteed fixpoint. This is all from chapter 14 of the Alice book and the
example given is of a program that flip-flops between two models, much simpler
than my three-atom case above, which I think after all is probably fictitious :)
>> I think there's an additional problem as well, corresponding to Henkin
sentences—programs for which models do exist, but for which there is no unique
minimal model.
I didn't know about Henkin sentences and I don't know how datalog (of any kind)
semantics deal with them. An interesting lead for further reading, thanks.
This is interesting and Scryer Prolog looks like a great project. I wonder if it's possible to run it in a DB-like mode - with durable changes, and accessible as a server.
I have been very interested in CHR (constraint handling rules) for this kind of activity. It was developed (I think) to provide a systematic basis for the various constraint libraries to 'speak the same language' instead of being implemented without knowledge of each other.
I don't see much literature out there on it (after the mid-2000s) but it always seemed a conceptually elegant way of doing forward-chaining (and inference) in the way of most rules engines. It only has a few simple rule types: simplification (which removes constraints), propagation (which adds new ones), and simpagation (a hybrid of the two).
The best part is that it embeds very easily in Prolog.
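For instance, the classic less-than-or-equal example (it should run as-is in SWI-Prolog with library(chr)) shows all three rule types:

:- use_module(library(chr)).
:- chr_constraint leq/2.

reflexivity  @ leq(X,X) <=> true.                  % simplification
antisymmetry @ leq(X,Y), leq(Y,X) <=> X = Y.       % simplification
idempotence  @ leq(X,Y) \ leq(X,Y) <=> true.       % simpagation
transitivity @ leq(X,Y), leq(Y,Z) ==> leq(X,Z).    % propagation

% querying ?- leq(A,B), leq(B,C), leq(C,A). propagates leq(A,C), then
% antisymmetry unifies the variables with each other and the store is emptied.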
The company I work for, Grakn Labs (grakn.ai), builds a database that does datalog + negation type rules! You can add/remove data as you go, and also add/remove rules in the schema over time.
The terminology the post here uses is different from ours, but roughly we do "backward chaining", which works from the query back to the set of facts that have been inserted - what the article calls "top-down".
We don't do this right now - optimising for incremental changes is actually better done by maintaining forward-chaining inferences (like the OP) rather than backward chaining, and then invalidating and updating just the inferences that have modified dependencies.
Which one you want often depends on the use case: if your data changes often, backward chaining is generally better, but if it's largely static you can get large performance wins by materialising all the inferences once, via forward chaining.
Isn't datalog with backward chaining just prolog? Datalog was known for forward chaining (bottom-up) while prolog was known for backward chaining, so "datalog with backward chaining" sounds like a misnomer.
You're not wrong - the other big difference between datalog and prolog is that datalog is a subset of prolog's functionality that enables forward-chaining to be safe (gets trickier with negation for example). I was mostly referring to the fact that Graql implements datalog semantics/capabilities, plus negation - but the execution strategy is backward-chaining.
This gives you dynamic facts, feed-forward rule application, and soon even dynamic rules.
It doesn't actually care much about the number of rules; what matters more is how much memory it costs to materialize them. There are delta-join techniques to do at least most of that in linear memory (w.r.t. input size), though, and DDlog has at least experimental support for them.
Aside from DDlog's custom syntax, how much semantic overlap is there with query languages of the Datomic/Crux/Datalevin/Datascript flavour, apart from the differential features? This one seems to be big on static type definitions, at least.
edit: I guess this explains some of the static nature: "A DDlog program is compiled into a Rust library"
You don't really use dynamic datalog queries with DDlog. Instead, if you can't afford to keep a materialized view up-to-date (which would turn your queries into simple KV-lookup operations), you write rules that handle your queries in a streaming fashion: a "prepare"-style statement at compile time, then stream tuples of parameters at runtime and listen for outputs.
But, do not worry:
Support for run-time rule updates is in the works and would allow for arbitrary dynamic datalog queries.
So could you use this to implement always-up-to-date materialized views in Postgres? It says it uses relational data for inputs and outputs. It says it operates in-memory but also that it starts up from a set of known facts (which sounds like it could just read the current matview at startup). This would be a really cool extension for someone to build!
Looks like it’s Postgres protocol compatible rather than running as a Postgres extension?
I've been keeping an eye on work that's been done to support this internally in Postgres. Looks like it's still moving along. I guess it's a pretty tricky puzzle (especially once you add aggregates). Depending on the performance implications, it would sit in a pretty sweet spot for some use cases. For example, at the moment I have a form of materialised view that's simply too slow to build from scratch. My compromise is to run it periodically, figure out which subset of the table needs refreshing, and update just those rows.
It doesn't run as an extension (at least that's not the typical deployment, and I'd be surprised if it did).
You can feed a CDC to Materialize and either feed the output back to Postgres, or just query it directly in Materialize (via a normal postgres client).
It's very likely that you can cut the typical maintenance latency down below one second. Currently that likely involves using a Kafka broker between the two databases, but feel free to plop in the community Slack [0].
What a horrible landing page. Is it a standalone product? Does it integrate with my SQL database? Is it a server-side application or a client-side thing? What is the pricing model? So many questions, zero answers.
I'm in the process of exploring Kafka to do something similar, and I think it will work, but it requires quite a complex setup (compared to an incrementally updated materialized view).
It's typically integrated with Kafka sources (and often also Kafka sinks), but you can just use any normal PostgreSQL client driver and normal SQL.
It does integrate with your postgres database, but logical replication target mode is still a TODO. Don't worry though, this only means the change data stream has to go through a Kafka broker. There should be a guide on this (if not, plop in the community Slack and ask).
The license allows for free single-node deployments; the SaaS variant is currently in beta, so you'll have to ask. On-premise clusters for production usage are "contact sales", but AFAIK that's due to the low sales volume (it turns out vertical scaling is easy and often enough), so don't let that scare you off.
>> It says it operates in-memory but also that it starts up from a set of known
facts (which sounds like it could just read the current matview at startup).
That "it starts up from a set of known facts" refers to the "bottom-up"
evaluation of datalog programs, in contrast with the top-down evaluation in
Prolog.
Consider for instance the following database of facts and rules:
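parent(maria,kostas).
parent(kostas,yannis).

grandparent(X,Y) :- parent(X,Z), parent(Z,Y).    % rule (1)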
Prolog's top-down evaluation starts with the "head" of rule (1),
grandparent(X,Y) and proceeds to its "body", (parent(X,Z), parent(Z,Y)) to
instantiate the variables {X,Y,Z} according to the facts in the database and
derive the new fact grandparent(maria,yannis).
Datalog's bottom-up evaluation instead starts with the facts in the program:
parent(maria,kostas), parent(kostas, yannis) and adds those to the set of
immediate consequences of the program (TP is called the "immediate consequence
operator"):
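TP₀ = {parent(maria,kostas), parent(kostas,yannis)}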
Now the "parent" facts in the body of the "grandparent" rule can be instantiated
as follows:
(parent(maria,kostas), parent(kostas,yannis)).
And finally the head of the rule can be instantiated to derive the same new atom
as with Prolog's top-down evaluation:
TP₁ = TP₀ ∪ {grandparent(maria,yannis)}
The advantage of bottom-up evaluation is decidability, as long as a program is
datalog, which means no function symbols other than constants (which are
function symbols of arity 0, i.e. with 0 arguments) and no negation. For
instance, given the following program with a left-recursive clause, datalog's
bottom-up evaluation terminates:
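parent(maria,kostas).
parent(kostas,yannis).

ancestor(X,Y) :- parent(X,Y).                    % rule (2)
ancestor(X,Y) :- ancestor(X,Z), ancestor(Z,Y).   % rule (3), left-recursive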
Given this program, datalog's bottom-up evaluation will first add the two facts
of "parent" to the immediate consequences of the program; then instantiate the
fact parent(X,Y) in the body of rule (2) to the two facts of "parent" in the
program; and finally instantiate the head of rule (2), deriving two new facts of
"ancestor":
TP₁ = TP₀ ∪ {ancestor(maria,kostas), ancestor(kostas,yannis)}
Then the two facts of "ancestor" in the body of rule (3) will be
instantiated to the facts derived in the previous step and a new fact derived
from the instantiation of the variables in the head of the rule:
TP₂ = TP₁ ∪ {ancestor(maria,yannis)}
And then the program will terminate because there are no more new facts to
derive.
By contrast, Prolog's top-down evaluation will first correctly derive the two
new "ancestor" facts from the head of rule (2) but then will try to evaluate the
first ancestor(X,Z) fact in the body of rule (3). This will take it back to the
head of rule (3) and evaluation will be stuck in an infinite recursion.
The trade-off is that bottom-up evaluation cannot handle negation or functions
(including the cons function used to construct lists) and so is, in that sense,
incomplete, whereas Prolog can do both. For this reason, datalog is preferred as
a database language that offers a limited ability for reasoning (certainly
richer than what is possible e.g. with SQL, which essentially only allows facts)
whereas Prolog is preferred for general-purpose programming.
I wonder how DDlog compares to Souffle. I think the biggest difference is that Souffle is batch-oriented. It is less clear how different the expressiveness, type systems, or recommended patterns are. The syntax of DDlog seems to be closer to Souffle's than to Datomic's with EDN.
Very interesting! I started playing with Rust, timely-dataflow and differential-dataflow last weekend, but it was a lot to learn all at once. This looks like it's solved or suggested a path to productionizing differential-dataflow without needing a separate rust service and lots of operation tooling. It would be cool to use something like this for Notion's user-defined databases and queries.
You'll be happy to hear that logic/rule updates at runtime recently transitioned from the "todo... at some time" to the "todo... working on it" state.
Do these databases ever grow beyond what you can afford to keep in RAM/swap?
I'm working towards high-throughput out-of-core support in differential-dataflow, but can't get much actual coding done in my off time. Understanding/concepts/discussions are not affected, so feel free to contact me.
I would love to have the option to suspend computation for unwanted outputs somehow (to disk) so I can scale way, way down with only partial data. My solution for “fits in RAM” is just “shard it”... but really I haven’t gotten far enough in my experiments to need to yet.
> You'll be happy to hear that logic/rule updates at runtime recently transitioned from the "todo... at some time" to the "todo... working on it" state.
What exactly do you mean by "logic/rule updates"? That in dd I can change/modify the actual dataflow?
Rego, the language used in Open Policy Agent, is based on and extends Datalog. It has gained a lot of traction in the past couple of years for evaluating authorization policies.
There's an indirect relationship between Rego and DDlog, at least people-wise. OPA comes from Styra, which was founded by Tim Hinrichs and Teemu Koponen. Teemu designed nlog, which is also a Datalog variant, for use in NVP, which was the network virtualization product at Nicira and was later renamed NSX after VMware acquired Nicira. Tim also worked with Teemu (and me) on NSX at VMware. And Teemu was one of the architects of OVN (the biggest open source application for DDlog), with Tim also having some influence. And Teemu also knows Leonid and Mihai (the principal authors of DDlog).
Some of the episodes of my OVS Orbit podcast may be relevant:
* Episode 5: nlog, with Teemu Koponen from Styra and Yusheng Wang from VMware (https://ovsorbit.org/#e5)
I don't know how much I can say since I work for VMware, but it's finding new uses all the time and I've personally brought it up with other teams.
I'd known of it for a while before seeing a presentation on how one service team used it. They had a stream processing problem where they were beginning to lag the input by up to an hour. With differential datalog the lag dropped to sub-second.
I'm told it's primarily used for "networking stuff and big data processing", but "log processing" and "network switches" were also mentioned as (I assume active) use-cases.
Section 4 of [1] lists applications, including OVN, but it appears to be used primarily for experimental or research projects at the moment. That should change given enough time… maybe?
Yes, and I'm really impressed with what we've been able to do with it in OVN lately. The patches aren't posted yet (probably Monday) but (with Leonid Ryzhyk's help) I managed to take one benchmark from about 110 seconds in the C version of OVN to about 35 seconds with DDlog.
We have other internal-to-VMware use cases for DDlog. The OVN use case is the one that's easiest to evangelize since it's all open source.
Is this something like LINQ in C#? Declarative programming over arbitrary structures? From skimming the examples I don't really see the advertised incremental aspect; could someone explain it a bit more?
It's partially inspired by LINQ, so the similarity you see is expected.
It's not really arbitrary structures so much, though you're mostly free in what record types you use in a relation (structs and tagged enums are typical).
The incremental part is that you can feed it changes to the input (additions/retractions of facts) and get changes to the outputs back with low latency. Alternatively, you can just use it to keep an index up-to-date, so you can quickly look up results by key (like a materialized view in SQL).
This [0] section in the readme of the underlying incremental dataflow framework may help get the concept across, but feel free to follow up if you're still not seeing the incrementality.
If you're talking about building the DDlog-to-Rust compiler, then it's not that hard. Really it's just a matter of installing Haskell then typing "stack build". But each release also comes with binaries for GNU/Linux, OS X, and Windows, e.g. for 0.38.0: https://github.com/vmware/differential-datalog/releases/tag/...