Codeq (datomic.com)
339 points by twism on Oct 10, 2012 | 52 comments



The Smalltalk model (simplifying horribly as storing functions/code units instead of files) has significant and obvious advantages, and it's been around for 30 years.

The fact that the file model continues to dominate suggests to me that there may be significant drawbacks as well.

Possibilities:

* Losing the ability to use file-based tools costs too much. This is less a claim that the function model is bad and more that existing tools happen to be file-based.

* The model will eventually win; it's just taking a really long time for thoughts/tools to catch up.

* The model is better, but has not caught on for reasons unrelated to its utility. The model is Betamax.

* The file-based model is better


Garbage collection took around 40 years to become mainstream. There are lots of important concepts that the industry still ignores. I don't think that 'computer science being ignored by industry programmers for decades' is evidence of anything except the industry's own technical apathy.

Personally, I think the image model is a huge win for a single programmer, while the file model makes it easier to integrate changes from a big team. (There's not really a "DVCS for images" yet.) When I look around, I see that tools that have won tend to support big teams of less-efficient programmers.

So it's not really surprising to me that, by numbers, the file model dominates. That's not an indictment of the image model.

What do you mean by "better"? There are more Corollas than BMW 5's on the streets here, but does that mean they're better, or worse, or better for a particular use case? I'll take the image model for building a fast prototype any day, even if I have to hand it over to a big team writing C or Java for the final product.


You state that the image model is good for a single programmer, but not large teams.

How much of this do you feel is due to the lack of tools (VCS, IDEs, etc) equipped to deal with the image model, and how much is inherent to the models?


I was careful to say that the image model is great for a single programmer, but not that it was bad for large teams -- simply that large teams are a (current) strength of the file model. Large teams are built of individuals, after all, and there's no reason they can't use images on their own. There's just not much in the way of tooling (yet) to support integration at the image level.

Is this limitation inherent? I think that's impossible to say. How many people accurately predicted the importance of a DVCS, before ever seeing one built? Or for that matter, a garbage collector? In cases where it's not obvious, the tools drive our understanding.

I don't see anything inherent in image models that would prevent this, though languages don't seem designed in a way that would make this particularly easy. For example, right now we're collaboratively editing a (very simple) shared image (HN).


Where do you see the advantages of the image model for a single programmer? I ask because, when I've played around with image based languages, I hated them. However, I've encountered programmers vastly wiser than myself who seem to love it, so I suspect that I'm missing something important.


Perhaps this project shows that file vs image is, in fact, a false dichotomy. Instead of replacing our current file-based workflows with a monolithic image, this kind of tool could provide all the image functionality on top of the file structure. This model is no more difficult for large teams than the file-based one: the database/query engine/tools they describe have the same essential ingredients as Git, and one could envision a similar toolchain (push/pull/GitHub etc. for the database instead of the Git repo) emerging for a system like this. Best of both worlds? In many ways, it feels like the next step for the ideas that make Git great: why not add language semantics and query capabilities and take things to the next level?


As a happy user of vim + fugitive¹, I already treat git² as a bold abstraction of my code in exactly this fashion. I do not think of files as anything but eponymous containers of namespaces. Git blame, for instance, can track lines across file renames, and tpope's wonderful vim interface actually makes this easy to use.

I enjoy git on the command line, but _tight_ integration with the editor is an almost sublime experience.

So git's well-defined data structure already enables quite a bit of editor magic, but Codeq's extra layer of abstraction is very intriguing. Also, the fact that you can query the engine and get back simple data compares favorably to the mishmash of shelling out and git object parsing done by many tools today.
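
For a sense of what "simple data" means here, a minimal sketch, assuming a local Datomic free transactor and a codeq-imported db (the URI and the :code/name attribute are my reading of the post, not verified):

  ;; Query a codeq database and get back a plain set of tuples --
  ;; no shelling out, no parsing of git objects.
  (require '[datomic.api :as d])

  (def conn (d/connect "datomic:free://localhost:4334/git"))

  (d/q '[:find ?name
         :where [?n :code/name ?name]]
       (d/db conn))
  ;; => a set of tuples, e.g. #{["datomic.codeq.core/commit"] ...}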

WRT Smalltalk's image, code management is only one of its marquee features, so I think Smalltalkers would object to calling image vs files a false dichotomy. The image is more akin to a versioned virtual machine than an SCM. I've never known the magic of the image (or a Lisp machine), so I lack an opinion on which is superior.

¹ Drew Neil's excellent video introductions to fugitive: http://vimcasts.org/episodes/archive (April/May 2011)

² and other _content_ tracking DVCSs


> I've never known the magic of the image (or a Lisp machine), so I lack an opinion on which is superior.

None of the Lisp machines really had image-based development. The MIT ones all had an Emacs that worked on text source files. You'd boot into a system image, but that was more like a kernel binary, than the Smalltalk image with a class browser. Interlisp had a structured editor that worked on s-expressions in memory, kind of like the Smalltalk class browser, but these were saved to files (the system would automatically track which definitions were saved in which files, and could tell you about changes: http://larry.masinter.net/interlisp-ieee.pdf).


> WRT Smalltalk's image, code management is only one of its marquee features, so I think Smalltalkers would object to calling image vs files a false dichotomy.

Yeah, I agree. That was inaccurate on my part. What I said really only applies to the code management side.


I think the Smalltalk problem is that it is too different. You cannot integrate its features one by one into your workflow; you basically have to change everything. Porting the features one by one takes a long time, and sometimes the benefit seems low, because each feature only pays off in synergy with the others.

Another example is Plan9. Some of its innovations were ported (/proc, utf8, ...), but some things probably never will (everything is a file).


I think that Lisp actually strikes a great balance here. The data structures that represent programs are simple and unambiguous, and they serialize to text files absolutely trivially. We should be able to work with code-as-data while making use of text-based persistence and versioning.
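
For instance, in plain Clojure the round trip is a one-liner:

  ;; Code-as-data: a definition is just a list, and it survives
  ;; a print/read round trip without loss.
  (def form '(defn square [x] (* x x)))

  (pr-str form)                        ;=> "(defn square [x] (* x x))"
  (= form (read-string (pr-str form))) ;=> true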


Maybe if someone invented a "new" programming language where programs were valid JSON, without mentioning the obvious connection to Lisp, programmers could be tricked into using it.

Programmers seem to love adopting the features of Lisp, as long as they can re-assure themselves they are not actually programming in Lisp.


Srikumar K.S's jexpr is an experimental (and seemingly abandoned) attempt at doing just this:

http://srikumarks.github.com/jexpr/

  {do: [
        {fetchURL: "http://some.data.source/",
          timeout: 30.0,
           onfail: {raiseError: "URL not reachable"}},
        {fork: [
            {displayInElement: "element_id"},
            {cacheInLocalStorage: "prefix"}
            ]}]}
More here:

http://srikumarks.github.com/gyan/2012/04/15/j-expressions/

http://srikumarks.github.com/gyan/2012/04/14/creating-dsls-i...


Wow this is pretty great! Never knew about this and might play with it some. Thanks!


MISC might be of relevance here:

http://lambda-the-ultimate.org/node/2887

EDIT: And for the record, I actually think it's a brilliant idea.


I think your first possibility is most likely (which implies that the fourth is too, de facto).

When I started coding professionally I used IBM's VisualAge for Java (which was actually a multi-language tool based on Smalltalk), and it did exactly that: all Java classes were managed as code artefacts held in the IDE's internal repository. You had to export from the IDE if you wanted your standard files-in-directories source/objects. It worked very naturally, and when IBM moved to Eclipse instead I found it quite a step backwards in terms of working with my code. But the drawback was that you could do exactly what the VisualAge IDE supported and not much more.

The file-based model gives a level of low-cost interoperability that's pretty hard to beat.


The difference is that Codeq appears to be an index, while the system of record is still a file-based git repo. Eventually you could probably work in either representation, if there's reliable round-tripping.


Maybe I misunderstand, but I thought Codeq is more of an analysis tool. Your code would still be files in a git repo.


"The resulting database is highly programmable, and can serve as infrastructure for editors, IDEs, code browsing, analysis and documentation tools."

The mere mention of editors and IDEs suggests to me that they envision analysis as the gateway drug.


The structured model will win. It only needs a real high-quality editor. Not a hokey drag-and-drop toy, but a tool that scales to real-world problems and addresses the way code evolves through time.


I read it and re-read it, and I still don't understand what it does or what advantages it has over... whatever it's supposed to have an advantage over.


One advantage is that you can view your repo's history as a list of functions being added/removed/changed, instead of just diffs of which lines were removed and added. You could add editor support and look at all previous versions of a specific function you're working on.

More specifically, where git mostly looks at code as a collection of files and differences in individual lines, codeq breaks it down into a collection of semantic units (function definitions for Clojure; possibly class definitions, methods, etc. if other languages are supported). Your version control system would actually understand that a particular method had been moved from a subclass into a parent class, instead of just seeing lines deleted here and added elsewhere.
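
To make that concrete, here is a hypothetical sketch of the entity a single defn might become (attribute names are modeled on the blog post's description; treat them as illustrative, not the shipped schema):

  ;; One "codeq": a segment of a file tied to its location, the
  ;; name it defines, and its content-addressed source text.
  {:codeq/file  {:git/type :blob}                    ; the git blob it lives in
   :codeq/loc   "14 1 21 35"                         ; start-line/col end-line/col
   :clj/def     {:code/name "my.ns/my-fn"}           ; the named unit defined here
   :codeq/code  {:code/text "(defn my-fn [x] ...)"}} ; the source segment itself

A move of my-fn between files then shows up as the same :code/name pointing at a different :codeq/file, rather than as unrelated line deletions and additions.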


Great explanation. This is one of my biggest frustrations with git.


Here's my take:

I don't think this is an alternative to another product. Instead I think it's something fairly new and interesting: they built a tool which takes your tokenized source code (presumably only in Clojure) and your git log and makes a queryable database out of it. So you could query "who changed my kill_puppies function last year" or other semi-structured things, as opposed to a combination of git blame + ack + ctags.
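
A hedged sketch of what that query might look like in Datomic's Datalog -- the attribute names and the file-commits rule are illustrative guesses, not the shipped schema:

  ;; "Who changed kill-puppies, and when?" -- illustrative only.
  ;; Assumes `rules` includes a file-commits rule (in the spirit
  ;; of the blog post) joining a code segment's file to the
  ;; commits containing it.
  (d/q '[:find ?email ?date
         :in $ % ?name
         :where
         [?n  :code/name ?name]
         [?cq :clj/def ?n]
         [?cq :codeq/file ?f]
         (file-commits ?f ?c)
         [?c :commit/authoredAt ?date]
         [?c :commit/author ?a]
         [?a :email/address ?email]]
       db rules "my.ns/kill-puppies")

Filtering to "last year" would just be an extra predicate clause over ?date.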

Like they said in the article, this would be the underpinnings of a larger system; something to manage code deployments, or an IDE, or a code review tool.


Additional analyzers can be added. The vision is definitely not language-specific, and Rich has already done some design work for a Java analyzer.


The vision might not be language-specific, but the implementation will have to be. Furthermore, the language parsers will have to evolve as each language evolves. An advantage of existing versioning systems is that they are language-agnostic. Nevertheless, I am excited to see people pushing in this direction. Keep up the good work!


It appears to be a git enhancement so that you get semantic diffs instead of file line diffs. It also has a query language for browsing code. Much as git is often described as a tool for building a DVCS this would be a tool for building an IDE (it would provide all the code completion and project navigation stuff).


Hmmm, how about...

A smarter and faster version of git-bisect: given an anonymous function, working backwards through the codebase, when does the answer it returns change?

Or perhaps time warp integration - request a build of project foo but with library bar from the last tagged release version and function baz from a year ago.


I've waited a long time to see a tool that does not view my program code in terms of lines of text. Being a Lisp (Common, Scheme, Clojure) programmer, I always felt I'd much rather see a structural diff — what units of code changed, not which lines changed.

I'm so glad this approach is finally coming, and in what style!


Would love to see some examples of the query output here alongside the example queries.


Yeah, Stu told me I should do that :) In all cases the output from Datomic Datalog is just a data structure. In the case of the query in the blog post, it is just a collection of 2-tuples of date + source-code string. The source strings are largish and would have bulked things up, so I punted.

The rules don't produce output until incorporated in a query; you can think of them as akin to SQL views. However, they don't need to be installed in the db; you can pass them as an arg, as the query does.
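
To make the views analogy concrete, here's a sketch of passing a rule set as an input. Binding rules to the % input is standard Datomic usage; the rule body and attributes here are made up for illustration:

  ;; A rule set is just data, bound to the % input of the query --
  ;; like an SQL view, except nothing is installed in the db.
  (def rules
    '[[(committer ?c ?email)
       [?c :commit/author ?a]
       [?a :email/address ?email]]])

  (d/q '[:find ?c
         :in $ % ?email     ; $ = db, % = rules, ?email = scalar input
         :where (committer ?c ?email)]
       db rules "rich@example.com")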


I noticed that the analyzer's schema includes an analyzer revision, presumably as a way to allow newer analyzers to be re-run against older versions of the code.

This raises a question for Rich regarding Datomic and the notion of "derived facts" a la "Out Of The Tar Pit":

Datalog rules can be used to query by some trivial notion of derived facts at any point in time, but most derived facts are expensive to compute and thus should be cached and introduced by a new transaction. In the case of Codeq, this includes the full output of the analyzer. It seems like a natural extension of Datomic to support lazy calculation and caching of derived facts. I could even imagine some cluster scheduling of that work, in a sense producing a map-reduced, immutable, materialized view of sorts.
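
The caching pattern I mean, sketched by hand (the :analysis/fn-count attribute and repo-id var are invented; the attribute would need to be installed in the schema first):

  ;; Compute an expensive derived fact once, then assert it as
  ;; ordinary datoms so later queries read it directly.
  (let [db (d/db conn)
        n  (count (d/q '[:find ?cq
                         :where [?cq :clj/def _]]
                       db))]
    @(d/transact conn
       [{:db/id repo-id            ; entity the fact hangs off (hypothetical)
         :analysis/fn-count n}]))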

I realize I said a question was coming, but I'm having a hard time formulating one... which probably means that I don't understand the problem well enough. So, can you talk a little bit about how you envision Datomic evolving with respect to derived facts?


We don't have any support for materialized views at present, but they are on the list of enhancements to consider.


Obligatory Wikipedia link for those of us (like me) who don't know what materialized views are: http://en.wikipedia.org/wiki/Materialized_view

In short: "In a database management system following the relational model, a view is a virtual table representing the result of a database query. Whenever a query or an update addresses an ordinary view's virtual table, the DBMS converts these into queries or updates against the underlying base tables. A materialized view takes a different approach in which the query result is cached as a concrete table that may be updated from the original base tables from time to time. This enables much more efficient access, at the cost of some data being potentially out-of-date. It is most useful in data warehousing scenarios, where frequent queries of the actual base tables can be extremely expensive."


To those who don't see the point of this, I would place it in the context of a larger move towards code "babel" (for lack of a better term), i.e., a unified interface for common abstractions. Imagine that (as Steve Yegge suggests) the tooling available for statically-typed languages will eventually come to dynamic languages as well (think CEDET). It reminds me of the fact that most web server platforms still focus on emitting text instead of manipulating a DOM. That will change.

What I don't understand is why I need cookies enabled in Firefox to view the post. This is obviously a Blogger issue, as I've seen those little gears many times before. I can't imagine the great Rich Hickey intends it.


Nice work. Props on normalizing human names and use/mention distinction.

Are you going to stick to analysis, or support code transformations? If you're going to transform code, how will you avoid IDE refactor tool hell?

I was struck with the same idea (turning ASTs into git DAGs; normalizing code) back when I was first learning git, but the idea took me down a different path - writing a structured (no-plaintext) programming environment. I'll get to the version control portion soon enough, and I hope there'll be some lessons I can take away from Codeq!


I'm really enthusiastic about the newfound exposure that Datalog is getting! Can anyone inform us of other real world systems using it? What kind of limitations do you run into?

.QL (http://en.wikipedia.org/wiki/.QL) is a similar project for querying codebases, developed by Semmle, that gives an "OO+SQL flavor" to the queries.


LogicBlox (http://logicblox.com/) is using Datalog for real-world enterprise software development. At present, the implementation is far more advanced than any other. See http://www.logicblox.com/technical-reports/LB1201_LeapfrogTr... for example.


The link to your paper is broken.


This looks awfully similar to an idea I've had since 2009. I never got close to implementing it, but I also intended to target Clojure and then JavaScript. However, I thought about it more in the sense of a nicely structured global open-source library. The smallest unit of such a library would be a Lisp form. There would be dependency management, version control, documentation, and tests for each form. It would then be possible to query the library with plaintext queries like "SHA algorithm" or "vector 3d". The user would be presented with a list of matching forms and, by looking at their docstrings and tests, could choose the most fitting one. Checking out a form would automatically fetch all the forms it requires.

It is nice to see a similar idea actually implemented.


I'm just wondering who thought it was a good idea to put some share links hovering ABOVE the scroll bar. http://imgur.com/xF5UP

I don't know how many times I have lost the scroll handle behind it when this widget is used.


Awesome. Can't wait till it supports more languages.


Who said anything about the image model? This is about doing analysis of the git repo. Not replacing the git repo.


This tool sounds like a nice addition to "Light Table".


LT already has a basic form of this; it just interacts with the filesystem directly instead of going through git. I'll play with this some to see what I can make out of it :)


How about NO?)


How about contributing something to the discussion?


Isn't it obvious?

1. It is an IDE's job to keep track of changes to individual expressions in the code of a project.

2. This information should be stored separately from the source code files, as metadata on the project, not the individual files.

3. I don't need a solution for a problem I do not have.

4. The query language is ugly.

5. I do not want to use any "free" commercial service for a solution which could be implemented as an emacs-lisp package.

6. I see nothing in this blog post of any interest.

7. I have no over-excitement just because something comes from Rich Hickey.


Everything looks "obvious" or "not really that important" when you're looking UP the power curve.

I am not a programmer; I solve problems using symbolic representations. Anything that lets me reason about solving these problems at a higher level is a net gain for me. I've been wishing for something like this for the last 13 years. This is fantastic in my mind, and I want people to run like hell with it.


> 1. It is an IDE's job, to keep track of changes of individual expressions in the code of a project.

Is the IDE supposed to have its own revision history stored somewhere? I understand how it might be an IDE's job to recognize individual functions, but there's no way it is supposed to keep track of changes. That is entirely the source control's job.

> 2. This information should be stored separately from the source code files, as a meta-data to the project, not the individual files.

What information? This doesn't change your code in any way; it analyzes it and stores the resulting data in the Datomic db.

3, 4, and 6 are entirely subjective; it's rather clear plenty of others find this interesting and useful. I do agree that the query language isn't pretty.

5. Maybe someone will dislike Datomic enough to implement this as an emacs package. Codeq itself is open source; Datomic is only the storage backend.

7. Let's be straight: judging from the points you made, you certainly didn't go into this without bias. Some of us don't get angry just because it comes from Hickey.


The query language isn't pretty (that's what I thought at first, too!), but it's derived from Datalog/Prolog, AND it is simple (if not easy at first, since it's unusual). I'd rather it NOT look too much like SQL. Just like I don't confuse Ruby programming with Clojure.


> I have no over-excitement just because something comes from Rich Hickey.

Ahh, sour grapes and/or jealousy. Yes, it IS obvious now.



