Mercurial Ate Our Breakfast [with Revsets], But We Don't Mind

inerte · on July 28, 2010

I don't know if you talked with the Mercurial guys before starting SourceQL, but it seems your work overlapped with someone else's effort. Everyone could have finished the feature earlier and with better quality.

Next time, my suggestion is to announce what you are doing (to the project maintainers). Be part of the community instead of coming up with something "done" that no one knows about.

(sorry if I got the wrong impression about your involvement, but by your story I could not be sure!)

irskep · on July 28, 2010

You got the right impression about our involvement, but the wrong idea about the outcome.

When a professor asks you to do a semester project, you look for an area where work within your abilities can be done. This is what we came up with. We spent a few hours searching for similar projects and found nothing, and it would have been impractical to get in touch with every maintainer of every SCM to see if they were working on a feature like revsets.

We used Mercurial for two reasons: it is distributed, and it has a great hook API. There was no traffic on the mailing list that would have hinted to us that revsets were on the way. As far as I can tell, Matt Mackall himself was just as uncommunicative as we were. The only place I can find revsets mentioned prior to their introduction in Mercurial 1.6 was a mailing list post where he announced them, fully baked and implemented. Can we be expected to be more involved with the community than the primary maintainer of the project?

Anyway, we never felt comfortable with sharing our work that had a nonzero probability of becoming vaporware. We only recently realized that the ideas themselves might be worth something to others, and this is our way of introducing ourselves to the community. I haven't rejoined the Mercurial mailing list yet simply because I have been busy with work and it keeps slipping my mind, but I do want to get plugged in with them.

timtadh · on July 28, 2010

(the other author here)

In addition to what Steve said, this particular "feature" is really just the very beginning for us. Our ideas about what queryable version control means go way beyond the ability to select arbitrary sets of commits based on commit metadata. We want to be able to do higher-order queries. For instance:

For lines of code written by Johnny, what is the average life span of a line?

Basically, how long do lines of code written by a particular developer last on average? You could restrict this query to a particular subgraph forest of the repository (for instance a branch, or commits this quarter). These types of queries are aggregation queries; however, there are other classes of queries we would like to support as well. See some of the papers we wrote on scribd for more details. We will be writing another post soon covering our vision for version control query.
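To make the aggregation idea concrete, here is a minimal Python sketch of the "average life span" query. It is not SourceQL code: the `LineRecord` type, the `average_lifespan` function, and the sample data are all hypothetical, and it assumes line history has already been scraped into records that say which revision added a line and which (if any) removed it. Life span is measured in revision numbers purely for simplicity.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class LineRecord:
    """One line of code's history (hypothetical shape, for illustration only)."""
    author: str
    added_rev: int
    removed_rev: Optional[int]  # None means the line still exists at the tip


def average_lifespan(records, tip_rev):
    """Return {author: mean life span in revisions} over the given records."""
    totals = defaultdict(lambda: [0, 0])  # author -> [summed life span, line count]
    for rec in records:
        # A line that was never removed is counted as living until the tip.
        end = rec.removed_rev if rec.removed_rev is not None else tip_rev
        totals[rec.author][0] += end - rec.added_rev
        totals[rec.author][1] += 1
    return {author: total / count for author, (total, count) in totals.items()}


# Made-up sample data, not taken from any real repository.
records = [
    LineRecord("johnny", added_rev=1, removed_rev=4),
    LineRecord("johnny", added_rev=2, removed_rev=None),
    LineRecord("alice", added_rev=3, removed_rev=5),
]
result = average_lifespan(records, tip_rev=10)
```

Restricting the query to a branch or a date range would amount to filtering `records` before aggregating.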

irskep · on July 28, 2010

Author here, taking requests for clarification, suggestions for what to talk about next, etc.

masklinn · on July 28, 2010

Wouldn't the other obvious approach (to storing data in an easily queryable form) be to store it in an actual database? As Fossil does (and apparently Veracity as well, but I'm still not too sure whether it's only storing metadata in a DB or if it's storing everything there) for instance?

irskep · on July 28, 2010

Since it isn't all that slow to just ask Mercurial for batches of data, there isn't really a need to put it into a database, especially since there is an extra cost of updating the database for every change to the repository. The real problem is string searching and concise data gathering syntax, which is what we are focusing on.

In the case of Fossil, it may be possible to access the backend directly and do any selects or joins at that level. While it would be nice to access the data for any system with SQL, for instance, those languages are not suited for working with graph-oriented data, and so the implementation choice comes down to writing a language that transforms user queries into different forms for each system, or writing a language that accesses some underlying database that the repository data has been translated into.
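As a rough illustration of what "graph-oriented" means here, the following Python sketch selects every ancestor of a revision by walking parent links. This kind of reachability question is natural as a traversal but awkward as a fixed set of selects and joins. The `parents` map and the `ancestors` function are hypothetical, not part of SourceQL or of Mercurial's API, and this version includes the starting revision in the result.

```python
def ancestors(parents, rev):
    """Return rev plus every revision reachable from it through parent links.

    `parents` maps a revision number to a list of its parent revisions
    (two parents for a merge, none for a root).
    """
    seen = {rev}
    stack = [rev]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# A tiny made-up history: 1 and 2 both branch from 0, 3 merges them, 4 follows 3.
parents = {0: [], 1: [0], 2: [0], 3: [1, 2], 4: [3]}
merged = ancestors(parents, 3)
```

The depth of the walk depends on the data, which is why it does not map cleanly onto a query with a fixed number of joins.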

We had been taking the latter approach, but the entrance of revsets pushes us neatly into the former approach. We still have our code to do all the data scraping, so we could use it for example with svn or git (which also has its own subgraph selection mechanism).

There also remains the fact that since we are students, we sometimes like to implement things just for the hell of it. Like databases. We've already got B[+]trees and a parser library, which I'm sure have been written thousands of times, but it's certainly a useful exercise to implement them. Besides, Go is light on libraries right now and it's nice to help them out.

masklinn · on July 28, 2010

> Since it isn't all that slow to just ask Mercurial for batches of data, there isn't really a need to put it into a database, especially since there is an extra cost of updating the database for every change to the repository.

Oh absolutely, my point was that it's simpler if the repo's data is in a DB in the first place.

> those languages are not suited for working with graph-oriented data

Mmmm that's true.

mml · on July 28, 2010

solr?

irskep · on July 28, 2010

The only similarity between Solr and SourceQL is that they scrape data from version control systems. And they index strings. But the goals and access methods are totally different.