As a very long-term goal, I want to work on a project like this, but on a much larger scale.
I have a tiny prototype based on Go and Soufflé.
The overall goal would be to figure out classical error conditions like nil pointer dereferences.
If I can figure out that a pointer will be nil in some execution branch, there is no reason why a computer cannot do the same.
However, it is very hard work, and the part I am currently stuck on is creating a reasonable abstraction level where I can reason (in Soufflé) about code and not about an AST.
I personally would see this as a human-race-level upgrade. Imagine feeding your code to a CI that spits back something like: "you will have a panic at line 156 when your input is > 4".
Moreover, we would be able to check the "real code" starting from main down through all execution paths, which is much more powerful than current static analysis, which usually analyses a single function without considering how the function itself is invoked.
Besides the actual implementation difficulties, the other big elephant in the room is the amount of resources needed to run this kind of analysis.
(I am less concerned about that, as computational resources have exploded due to AI, and SAT solving is getting faster and faster.)
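To give a flavor of what I mean, the heart of the prototype boils down to a handful of Soufflé relations along these lines (a simplified sketch -- the relation names are just what I use, and real nil-ness tracking for Go needs far more cases):

    // Facts extracted from the Go AST (illustrative names):
    .decl assign_nil(v: symbol)              // v = nil
    .decl move(to: symbol, from: symbol)     // to = from
    .decl deref(v: symbol, line: number)     // *v or v.f at a given line

    // v may be nil if it is assigned nil directly,
    // or copied from a variable that may itself be nil.
    .decl maybe_nil(v: symbol)
    maybe_nil(v) :- assign_nil(v).
    maybe_nil(to) :- move(to, from), maybe_nil(from).

    // Report every dereference of a possibly-nil variable.
    .decl nil_deref(v: symbol, line: number)
    .output nil_deref
    nil_deref(v, line) :- deref(v, line), maybe_nil(v).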
No, static analysis can also analyze multiple functions. What you are looking for is called interprocedural analysis, and it is more expensive than single-function analysis (aka intraprocedural analysis). More generally, what you want is context-sensitive (knows the calling context) and probably flow-sensitive (knows the control flow) analysis. The more precise the analysis, the more expensive it is. There are points-to analyses that can be pretty precise, e.g. https://arxiv.org/abs/1801.09189
> I am less concerned about that, as computational resources have exploded due to AI, and SAT solving is getting faster and faster
Sadly, we usually have to be very concerned about this. These problems are usually NP-hard, so we typically have either heuristics (imprecise), exponential algorithms, or fixed-parameter tractable algorithms (practically speaking, O(2^c n) for a pretty large c), which are either imprecise or too slow. This is still an area of active research. The main problem is that even if your computational power gets 2x faster, you cannot solve problems 2x larger within the same amount of time. It doesn't scale... (but of course, we are usually dealing with sparse problems, so the scaling is not in LOC, for example)
> I personally would see this as a human-race-level upgrade. Imagine feeding your code to a CI that spits back something like: "you will have a panic at line 156 when your input is > 4"
> the part I am currently stuck on is creating a reasonable abstraction level where I can reason (in Soufflé) about code and not about an AST
I recommend aiming for an abstraction level similar to the one in the paper (it starts on page 9). Once you have that target, writing relations that convert an AST fact such as `binary_op("=", leftId, rightId)` into `Move(to, from)` is fairly straightforward!
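For example, the conversion itself can be a one-line Soufflé rule (a sketch; note that in Soufflé, unquoted lowercase identifiers in rules are variables):

    // Raw AST fact emitted by the frontend:
    .decl binary_op(op: symbol, leftId: number, rightId: number)

    // The abstract instruction the analysis actually reasons about:
    .decl Move(to: number, from: number)
    Move(to, from) :- binary_op("=", to, from).

From `Move` upward, none of the analysis rules need to know anything about the shape of the syntax tree.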
If we want to prevent dereferencing null pointers then it seems like our time would be better spent improving type systems and language syntax. Shift the issue left instead of detecting problems after code has already been written.
Racket has a Prolog implementation that is currently used in the compiler, or at least I think I heard that in one of their talks. It might be worth trying, since Racket's model is a language for making languages.
> For example when I’m traversing the AST when I’m looking at the node { type: "num", value: 10 }, without doing extra work to go back up the tree I have no way to tell that it belongs to the name “ten”.
There's a very easy way to do it: simply add a pointer from the child back up to the parent. It means you can't share AST nodes, but it's damn fast and, once debugged, very easy to maintain.
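In the Prolog-as-AST setting of the article, the equivalent trick is one extra parent/2 fact per edge, plus a recursive walk upward. A sketch with made-up predicate names:

    % Facts emitted while walking the tree:
    node(1, def, ten).     % the definition named "ten"
    node(2, num, 10).      % the literal { type: "num", value: 10 }
    parent(2, 1).          % node 2 hangs under node 1

    % Walk up until we hit the enclosing definition.
    enclosing_def(Id, Name) :- node(Id, def, Name).
    enclosing_def(Id, Name) :- parent(Id, Up), enclosing_def(Up, Name).

    % ?- enclosing_def(2, Name).   % Name = ten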
I've done this in my personal project, so a question for everyone: when I originally heard of using Datalog in compilers it seemed like a very good idea, like it was promising some real value somewhere. But the examples I've seen show you can do trivial type checking -- and I do that myself already; it's very, very easy. Perhaps I'm looking at this wrong? If I'd used Datalog I could knock out a few hundred lines of code scattered around the AST objects, but surely I can do more than that with Datalog? I feel I'm missing something, and the crowd here seems knowledgeable, so any good pointers (good = simple, because I'm a n00b) would be incredibly helpful!
(I am aware of some really advanced stuff like BDDBDDB (https://suif.stanford.edu/bddbddb/), so advanced I can't even follow it. I'm looking for something between the triviality of type checking and the complexity of full dataflow. The problem here may be a personal one, in that I might need to write a dataflow implementation to get a really good 'feel' for it, and then I might 'get' the value of using e.g. Datalog. I don't know, but any guidance is welcome -- this will feed into an actual working project.)
Oh, this is great. This article and a conversation with Marcelle led me down a rabbit hole of Prolog, miniKanren, and The Reasoned Schemer.
Not quite out yet.
I am thinking about using this idea of Prolog-as-AST to build an AI-first, reactive IDE. DM me at @eating_entropy on Twitter or email me at joshcho@stanford.edu if you are interested.
Consider using Datalog (the incredible subset of Prolog) for this perfect use case. Compared to Prolog, you get:
1. Free de-duplication. No more debugging why a predicate is returning the same result more than once.
2. Commutativity. Order of predicates does not change the result. Finally, true logic programming!
3. Easy static analysis. There are many papers that describe how to do points-to analysis (and other similar techniques) with Datalog rules that fit on a single page :O
Souffle[0] is a mature Datalog that is highly performant and has many nice features. I highly recommend playing with it!
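For a taste of the language, here is more or less the canonical complete Soufflé program -- graph reachability (a standard introductory example, nothing application-specific):

    .decl edge(x: number, y: number)
    .input edge                      // reads facts from edge.facts

    .decl reachable(x: number, y: number)
    .output reachable                // writes reachable.csv
    reachable(x, y) :- edge(x, y).
    reachable(x, z) :- reachable(x, y), edge(y, z).

Points 1 and 2 above in action: the two rules can be swapped freely, and every reachable pair is derived exactly once, no matter how many paths produce it.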
Souffle seems to suffer from the standard small-language issue of having no code examples front and center. Also, no "how is this better than X" or "example of usage in Y".
To those who have used Prolog: is the description sufficient to know what it does and whether it will be useful? I've never used Prolog and I don't know the lingo.
I was in the same boat when I wanted to try Mercury (another Prolog-like language, but with ML-like types).
Someone on Mercury's mailing list took pity on me, and told me to just straight up try some Prolog.
Prolog, at least the basics, isn't hard at all, and it takes maybe a day to get to the point of writing something that works. Of course, every language has its inconsistencies, dark corners, and unexpected turns, and sometimes some very deep and general pieces of wisdom that aren't on the surface. But really, if you want to understand any Prolog derivative, the best way I can think of is to learn some Prolog. It's the most efficient way to the goal, and it doesn't really take long.
I recommend starting with the simple example[0], then taking a look at the tutorial[1].
But I do agree that being familiar with Prolog is a big help in understanding what Souffle is. For a quick intro to Prolog, I recommend this video[2], which speaks from an "industry practical" point of view and introduces concepts that almost all apply to Datalog :)
So in the great scheme of programming languages, you had imperative languages like Fortran, eventually there was a paper “GOTO considered harmful” which ushered in a scaling-back of complexity to “structured programming” where you had named cells in which you could store values, and looping constructs like `while` and `for` to walk through them and manipulate them. Then we added on lexical closures over environments, which also take the form of “methods on objects” and “dynamic dispatch,” a bunch of different names for different forms of single-threaded OOP. Languages like GIL-ed Python and Ruby and Node.js continue this tradition today, while other languages add threading to have shared-state concurrent programming, with its vocabulary of locks and race conditions and cache invalidation and transactions and commits and consensus algorithms and now CRDTs...
But other visions of programming have coexisted with this main educational vortex, including logic programming like Prolog. Actually, if you have heard of any of these alternates, you have probably heard of functional programming, which can be ahistorically understood as saying "hey, these locks and races and all that, it's actually very complicated; maybe we could have a simpler model of programming for 90% of our work and only break out the Java when we really need it." And you're like "what will you replace it with?" and they respond "Template substitutions! It's all the power of lexical scopes, all the power of looping (via recursion), and when you want to express the really complex stuff, you use templates that have side effects when substituted; we can give those scary names." So you give me a sunbeam and a raincloud and I give you back a rainbow.
If that's the functional programming idea, then logic programming takes the same idea but adds unification and search. The idea is: "ok, but why do I have to fill in all the parameters to a template before I use it? I should just say 'if light shines on a droplet-cloud it makes a rainbow, a sunbeam is a light, a raincloud is a droplet-cloud; hey language, what are all the things that make a rainbow?'"
Functional programming was based on the function, the template, the “I take these args and produce a structure looking like such-and-so.” Logic programming is based on the Horn clause, “X verbs Y, for some entities X and Y that I maybe know about and some verb that I am being introduced to.”
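To make that concrete, the rainbow example spelled as Horn clauses (a toy sketch):

    light(sunbeam).
    droplet_cloud(raincloud).
    rainbow(Light, Cloud) :- light(Light), droplet_cloud(Cloud).

    % ?- rainbow(What, With).   % What = sunbeam, With = raincloud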
Basic things that logic programming does very well, OOP kind of struggles with. You can kind of understand this if you have written a SQL query: the verbs can be thought of as tables that foreign-key to X and to Y. If you want to write OOP to solve a logic puzzle, you have to choose explicit ways to search or brute-force... SQL DBs come with B-trees and indexes to solve a lot of these problems, whereas with Prolog you just write the pattern to match and the new pattern to emit.
It's a rather different take on what the building blocks should be.
Prolog is Turing-complete, and Datalog isn't. In light of this, I think it's a bit of a stretch to call Datalog a (logic) programming language. But people seem to do that.
Turing completeness is a bug, not a feature. After writing code in Agda for some years, I became convinced that Turing completeness does more harm than good. Non-TC languages have a killer feature: their halting problem is solvable, which means you can get arbitrarily strong guarantees from the compiler. The fact that you can't encode "while(true) { }" without reaching for "unsafe" doesn't make the language impractical; you can still implement algorithms without Turing completeness.
Moreover, using various forms of coinduction you can encode infinite computation even in non-TC languages.
Agda has been around for 24 years, according to Wikipedia, and it hasn't exactly revolutionized software development: I'm not aware of any notable software written in it, or of anyone so much as winning a programming competition while using it (and some of those competitions are designed to promote functional programming). So, frankly, I'm skeptical that Agda's trade-offs in the performance-convenience-correctness space are good.
Ok, ok I understand theoretically what this means in terms of programming languages. But what does it mean practically? What features are removed? How does this work out when you're implementing a program?
Turing completeness is not a requirement of a programming language! Just because you can't do infinite computation doesn't mean you can't do infinitely useful computation :)
That said, Souffle is Turing-complete due to the addition of constructors to the language.
> 2. Commutativity. Order of predicates does not change the result. Finally, true logic programming!
I'm not sure this is a good selling point. The fact that predicate order matters in Prolog allows performance optimization and more transparent debugging.
It's a great one! From a logical perspective, it's ideal for the order of constraints not to affect the outcome. The lack of this property becomes a noticeable quirk quite early on when working with Prolog.
That said, the order of constraints can affect performance. But since reordering doesn't affect the outcome, you are free to reorder them as you wish in an effort to get better performance :)
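The classic illustration is path/2 over edge/2 facts. Both versions below mean the same thing logically, but a standard Prolog engine loops forever on the left-recursive one, while Datalog's bottom-up evaluation handles either (a sketch):

    edge(a, b).
    edge(b, c).

    % Terminates in Prolog *and* in Datalog:
    path(X, Y) :- edge(X, Y).
    path(X, Z) :- edge(X, Y), path(Y, Z).

    % Logically equivalent, but the left-recursive variant sends a
    % standard Prolog engine into an infinite loop on backtracking,
    % while Datalog evaluates it without trouble:
    % path(X, Z) :- path(X, Y), edge(Y, Z).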
I remember the first time I saw a declarative language, I also immediately saw it as an AST. It’s so validating to see it being explored by other people too.
Asking this earnestly and not as a rhetorical question: what do I get from Prolog that I can't get from using SQL? Is it speed, is it ergonomics, is it the ability to do computations that are otherwise intractable? And if there are any such advantages, at what level of complexity of abstraction do they come up, and at which level would I not see the value of using prolog?
SQL doesn't have homoiconic structure like Prolog or Lisp, which would facilitate metaprogramming.
Prolog and Lisp were derived from first principles; SQL was designed to be a language that reads similarly to English (i.e. easy to learn for non-programmers).
Prolog and Lisp also have the concept of an "atom", which can represent many things; in parsing, atoms are used for terminal symbols.
Prolog has the advantage over Lisp that you can define custom operators and their precedence, so you can create easy-to-read domain-specific languages without excessive amounts of parentheses.
The built-in backtracking can also be useful for parsing.
Definite Clause Grammar (DCG) rules are also a standard feature; they are very similar to Backus-Naur form.
You're talking about the "how", but I'm asking about the "what" -- and I'm not really sure I agree with much of what you mentioned. SQL "reading similar to English" doesn't really mean much. Homoiconicity is syntactic sugar, and in practice you can do the same thing in any vanilla programming language you'd use alongside SQL anyway (query-builder-flavored ORMs). It's not really a killer app.
BNF and DCGs get a bit closer to what I'm asking, but even then I'm not sure they really add anything novel -- what advantages, if any, do you get from them over the relational algebra, whose set-theoretic underpinnings serve as the foundation for SQL? How is backtracking functionally different from, or more powerful than, joining?
Let me give an example by way of another formalism, which is the Church-Turing correspondence that establishes a correspondence between the lambda calculus and Turing machines. Is there any theoretically stronger expressive power one gets from Prolog over SQL, or is it an ergonomics question, similar to using say, Ruby over Java?
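Here's a small concrete illustration. In Prolog you state some facts and a single rule (a sketch; "bob" is an invented middle generation):

    parent(alfred, bob).    % alfred's parent is bob
    parent(bob, carl).      % bob's parent is carl

    % Kid's grandparent is GP if Kid's parent himself has parent GP.
    grandparent(Kid, GP) :- parent(Kid, P), parent(P, GP).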
Then query whether "carl" is the grandparent of "alfred":

    ?- grandparent(alfred, carl).   % true

Now, the interesting thing is that I can use the same rule to ask which grandkids carl has:

    ?- grandparent(X, carl).   % X = alfred (and any other grandkid)

Or I could ask which grandparents alfred has:

    ?- grandparent(alfred, Y).   % Y = carl (and any other grandparent)

Or I could ask for all combinations of kids and their grandparents:

    ?- grandparent(X, Y).   % X = alfred, Y = carl (and any other combination)
In SQL, on the other hand, you'd have to write each of these queries by hand.
A more powerful example is defining the rules of Sudoku and then asking Prolog to finish solving a board for you. I will not even begin to suggest how that would be done in SQL (partly because you simply would not use SQL for that).
> In SQL, on the other hand, you'd have to write each of these queries by hand
Nope. Can be written as a non-predicated recursive CTE. Just add a 'where' at the end to specify what you want.
The practical difference is in the way the search takes place. In SQL you would get the right results, but it might take a lot longer, because the database might (depending on what you're looking for) generate all possibilities and just pick the ones you want. Prolog, I think, is a bit more intelligent in its search.
That depends, of course, because it's an implementation detail, and predicate pushdown or whatever might regain that efficiency; but from experience I can say It Depends.
(I recently did something like this at work, so it is actually useful)
I have to admit I never really got into SQL, and maybe learning Prolog at university ruined me for it, but can someone explain why SQL is considered "relational" when you still have to explicitly write queries for basic stuff like this?
If it's just that you can define relations in the database schema, then pretty much any programming language would be considered relational.
It's relational because it ... sort of... uses relations, though it calls them "tables."
They're not "true" relations, though, because SQL allows duplicate tuples ("rows") and missing values (null), which the pure relational model does not permit.
It's also considered relational because it uses at least some of the terminology and functionality of relational algebraic operators (join, project, restrict (select), union, cross product, etc.). Though this sometimes leads to confusion: "select" in the relational algebra means something different from what a lot of people writing SQL queries think it means -- in the relational algebra, 'select' is a restricting/filter operator, but the way it appears in a SQL query implies that it's the thing choosing which attributes (columns) to retrieve, which is another operation entirely (project) in the relational algebra.
As for the explicitness, that's orthogonal to the question. Relational algebraic expressions are in fact quite explicit. Datalog queries (and maybe Prolog too, to some degree, but not always) can be essentially compiled down to relational algebraic operations (though they don't have to be). And so can SQL.
SQL is sort of... in between... the relational algebra and something like a "relational calculus", which figures more out for you and does things more automagically.
Sometimes the explicitness of SQL is in fact a feature. I've found Datalog queries a bit awkward to read at times; with the joins sort of automagically happening, it's sometimes tricky to fully grasp what's going on behind the scenes.
Regardless, one of the features of the relational algebra is that various operations can be reordered for efficiency without losing the semantic meaning. So behind the scenes, SQL query planners restructure things as they need and desire.
Their theoretical expressive power is the same; they're both Turing-complete languages. However, their practical expressive power is different: it comes from language ergonomics, standard libraries, and different default optimizations.
SQL with recursive CTEs is Turing-complete. So nothing stops you from writing compilers, rendering Mandelbrot fractals, parsing text, and training neural networks in SQL.
But it violates the relational algebra all over the place in ways that blunt its usefulness and hide its elegance.
Example: it adds nulls (weird three-valued logic, which breaks reordering of some relational algebraic expressions) and allows duplicate tuples (relations stop being sets of tuples).
Unfortunately SQL has done more recently to put people off the relational model than it has to bring them around. Though the majority of people who use it don't actually understand the relational model, so.
Datalog is also built on the relational algebra. Or at least, Datalog works on relations, and any Datalog program can effectively be "compiled down" to a relational algebraic expression.
SQL excels at data processing. It's not only a query DSL but also has INSERT, UPDATE, etc., making up a practical batch processing language with natural lock semantics, rollback, etc. SQL also pretty much enshrines a strict DDL/DML separation, has privileges and roles, and so on.
Prolog, OTOH, excels at planning, optimization, constraint solving, etc. Given that Prolog was originally designed for old-school grammar/logic-based NLP, it also has excellent built-in support for custom DSLs and recursive-descent parsing using definite clause grammars, making use of Prolog's logical variables (with nondeterminism/backtracking and retry) and unification (pattern matching on steroids).
Check out the in-browser Prolog demos for robotic planning/container logistics and machine learning/bioinformatics (not based on LLMs) at [1]. These aren't problems you'd tackle with SQL ;) but, beyond SQL's inappropriateness, they also require actual language support for debugging, packages, testing, integration, etc.
Unification. This is the central point of Prolog: it's the mechanism behind matching structured data, while in SQL you would have to implement the equivalent for each individual query, afresh, every time. Unification allows you to "not know" a lot of things about the shape of the data you want to match, whereas with SQL, in order to match something, you need to know precisely how the data is organized.
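A tiny illustration, using a made-up term shape:

    % One unification both checks the shape and extracts the parts:
    ?- ast(assign, lhs(Var), rhs(Expr)) = ast(assign, lhs(x), rhs(num(10))).
    Var = x,
    Expr = num(10).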
Another, unrelated aspect: I/O, system calls, all that stuff that in an SQL database would require you to go outside and implement functions that are loaded into your database in some database-specific way.
Parsing. SQL doesn't offer much in terms of working with strings. Writing so much as a parser for arithmetic expressions over integers would be a huge undertaking; no sane person would write such a parser except for some bizarre code-golf challenge. It's trivial to do in Prolog.
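For instance, a parser/evaluator for integer arithmetic as a DCG is only a dozen lines (a sketch for SWI-Prolog; integer//1 comes from library(dcg/basics)):

    :- use_module(library(dcg/basics)).

    expr(X)  --> term(T), expr_rest(T, X).
    expr_rest(Acc, X) --> "+", term(T), { A is Acc + T }, expr_rest(A, X).
    expr_rest(Acc, X) --> "-", term(T), { A is Acc - T }, expr_rest(A, X).
    expr_rest(X, X)   --> [].

    term(X)  --> factor(F), term_rest(F, X).
    term_rest(Acc, X) --> "*", factor(F), { A is Acc * F }, term_rest(A, X).
    term_rest(X, X)   --> [].

    factor(X) --> integer(X).

    % ?- phrase(expr(X), `1+2*3`).   % X = 7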
CLP(FD). This isn't part of standard Prolog, but it comes with most implementations. It allows you to easily write queries about numerical facts (e.g. find all factors of a given integer).
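E.g., the factors question with SWI-Prolog's library(clpfd) (a sketch):

    :- use_module(library(clpfd)).

    factor(N, F) :- F in 1..N, N mod F #= 0.

    % ?- factor(12, F), label([F]).
    % F = 1 ; F = 2 ; F = 3 ; F = 4 ; F = 6 ; F = 12.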
To continue in a similar spirit: modern Prologs come with a bunch of libraries. You can have an HTTP server in Prolog, or an image processing tool, etc. -- things that in SQL you'd have to implement outside the database and load as external functions.
Modules. Prolog code is more amenable to reuse and can be better organized, because you can put it in boxes, hide the things you want to keep private in those boxes, etc., while SQL operates entirely in some sort of global namespace.
The list isn't exhaustive; these are just the things I could think of in a short time.
Thank you for the thoughtful comment! Comparing these specific features to using, say, a vanilla general-purpose PL such as Python with an ORM and a SQL database, it looks to me like the integration of primitives such as parsing, constraint programming, and unification is far stronger. In particular, unification seems to be a material one, as you mentioned -- to do this in SQL, you'd have to do contortions around an EAV model, which could probably get you to the same area you would reach with Prolog unification, but probably much less efficiently. It does seem like the answer to my question is "ergonomics and efficiency", which is a useful thing to know.
I am not sure how you could meaningfully compare Prolog and SQL. Both are declarative languages, but that's where the similarities end. SQL is for interacting with databases, whereas Prolog is, more or less, a language that you can write regular programs in.
Perhaps the Prolog terminology of "queries" and "database" gave you the impression it has something to do with databases that store data. It doesn't. The query is over logical terms, and the database is a collection of Prolog facts.
In SQL you would have to join the "Edges" table with itself an unknown number of times. I suspect this is possible somehow, but it is definitely less natural.
The reachable predicate on your 'naturally undirected edges' treats them as directed ;-)
I would rather say that your edges are naturally directed, and then you can add a clause to make them undirected.
I get what you're trying to say, that in Prolog you are able to lookup structures by their shape, and not by some pre-defined path, as in other languages, so you can do both `edge(1, A)` and `edge(A, 1)`.
But the graph in your example is still directed, simply by virtue of predicate arguments being ordered.
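Concretely, the extra clause mentioned above: keep edge/2 directed and define a symmetric wrapper for traversal (a two-line sketch):

    connected(X, Y) :- edge(X, Y).
    connected(X, Y) :- edge(Y, X).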
> In SQL you would have to join the "Edges" table with itself an unknown number of times.
In fact, more recent SQL standards have added extensions for these kinds of recursive joins. (And some databases, like Oracle and DB2, have had recursive or transitive-closure operators for a long time.)
I don't think they're particularly elegant -- they're awkward like most things in SQL because SQL's syntax sucks -- but the ability is in fact there these days.
How are the two any different? Can you come up with any differences in expressive power between the first-order logic/Horn clauses that underpin Prolog and the relational algebra that underpins SQL?
I wonder whether you're accurately portraying the structural foundations of either Prolog or SQL.
Relational algebra is not Turing-complete. SQL gets Turing completeness by means of various additions/alterations (sometimes pretty weird ones at that). So the foundations are pretty different, and I think this is obvious if you try to start writing typical programs in SQL/Prolog.
Now, some people argue you might want a less expressive logic programming language, especially Datalog, so this isn't a knock-down argument, but it's a pretty big difference.
Can you double-check and/or give references? Because I specifically doubted my memory here, and everything I found online before posting matches what I thought -- standard relational algebra is not Turing-complete (hence relational queries are guaranteed to terminate).
I need to do some more checking, but you are right. The original relational algebra was a subset of first-order predicate logic; adding recursion, I believe, just gave it parity with first-order PL. So, fulsome apologies, and thanks for pointing out my error.
I get where you're coming from but your point seems flippant. My answer is "debate" -- and looking at the rest of my subthread, some commenters have raised some great concrete areas around Prolog's unique strengths, which I've found helpful!
First of all -- I want to thank you all for robust discussion on this subthread. I've learned a lot about the experience of other folks using Prolog and where they've gotten mileage out of its unique strengths. It's been eye opening to see for sure!
Part of the reason I wanted to ask this question is that I had a suspicion there may not be as large a gap between SQL and Prolog as there may initially appear. Said another way, I wondered if there was a Prolog (or Datalog) to SQL compiler used in production.
I recently stumbled upon Logica [1], which appears to be based on a predecessor (Yedalog) built at Google. I don't have experience using it, so I'm not sure how complete the implementation is nor how efficient it is. I'm curious if anyone has used it before and, if so, what their experience was. At first blush, it appears to bring the best of both worlds: the expressive power of Prolog/Datalog, combined with the standard operational tooling of SQL, which is a common core dependency in many if not most production software systems. Two questions:
1) How does Datalog compare to Prolog for the common problems you might use Prolog for? Is there anything you can do in Prolog that wouldn't be as easy in Datalog, and vice versa?
2) What issues might come up in a system that uses a Datalog-to-SQL transpiler? Given some of the answers in this thread about the efficiency of using Prolog for constraint solving, it seems like such an approach might lose some of the efficiency that makes it so powerful in the first place. On the other hand, if it doesn't, it seems like it would bring a killer app to SQL: the ability to do constraint solving in a nearly native fashion inside SQL.