This looks like the technology behind Prior Knowledge (http://priorknowledge.com/), which is not surprising given that a lot of the same people are involved. Awesome that it's going open source.
It will be interesting to see how far the database analogy can be pushed. The key thing to realize is that BayesDB is based on a particular model, CrossCat (http://probcomp.csail.mit.edu/crosscat/). If your database table is an adjacency list that represents a graph, for instance, it's not going to perform very well compared to a more tailored model. On the other hand, a generic approach to high-dimensional imputation is very useful.
Eric Jonas here, former CEO of Prior Knowledge, cofounder of Navia, etc. etc. [and obviously not speaking for my employer] Tristan is right, our tech was based on CrossCat, a model originally developed in Josh Tenenbaum's lab at MIT. We fundamentally believe that structured probabilistic models are the future -- this is very much the direction that my latest research has taken, for example.
The reality is that almost any model that lets you build up a good joint density estimate of the data can be useful for this sort of tabular data. CrossCat assumes that your joint factors into several independent factors -- that is, p(x, y, z, w) = p(x, y) p(z, w), for example. Of course, because it's a Dirichlet process of Dirichlet processes, all of that is learned from the data. But that's not always the best approach; there are others, and it will be fun to see what else comes out of the research community.
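A toy sketch of that factorization may help. The probability tables below are made up for illustration and nothing here comes from CrossCat's code; the point is only that variables in different blocks come out independent while variables in the same block need not be.

```python
from itertools import product

# Made-up probability tables over binary variables, for illustration only.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}  # x and y are dependent
p_zw = {(0, 0): 0.2, (0, 1): 0.3, (1, 0): 0.3, (1, 1): 0.2}  # z and w are dependent

def joint(x, y, z, w):
    """p(x, y, z, w) = p(x, y) * p(z, w): two independent blocks."""
    return p_xy[(x, y)] * p_zw[(z, w)]

def marginal(**fixed):
    """Sum the joint over every variable not pinned in `fixed`."""
    total = 0.0
    for x, y, z, w in product((0, 1), repeat=4):
        values = {"x": x, "y": y, "z": z, "w": w}
        if all(values[name] == val for name, val in fixed.items()):
            total += joint(x, y, z, w)
    return total

# Variables in different blocks are independent: p(x, z) equals p(x) p(z).
print(marginal(x=1, z=1), marginal(x=1) * marginal(z=1))
# Variables in the same block are not: p(x, y) differs from p(x) p(y).
print(marginal(x=1, y=1), marginal(x=1) * marginal(y=1))
```

In CrossCat the grouping of columns into blocks is not fixed by hand like this; per the comment above, it is learned from the data.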
And Tristan's also right that this is not always a great fit for your data -- if it's graph based, or if your data are super-high-dimensional, or whatever, it's not a perfect fit. But it is a good start.
For a great review of the Dirichlet Process/CRP, the Indian Buffet Process, and other probability distributions over infinite structures, I highly recommend this great tech report from Zoubin and Tom:
I have some questions that I couldn't immediately answer from skimming either the BayesDB documentation or the paper linked from http://probcomp.csail.mit.edu/crosscat/
* The CrossCat paper seems to focus on binary features and categorical learning. How does BayesDB generalize that? Does it quantize continuous features first to make it all categorical, or does it generalize CrossCat somehow?
* How are we to think about what BayesDB is doing? Is the underlying model most similar to a graphical model? A Bayesian network? A Markov field?
* On an informal level, how is the factorization structure learned?
* What's the time and memory complexity in terms of number of features and examples for different operations? Is insertion constant time? Is it storing sparse contingency tables of some kind?
While the original CrossCat paper focused on binary features, it is in fact much more general. For example, CrossCat uses a Beta-Bernoulli model for binary features, normal-gamma for continuous, and Dirichlet-multinomial for categorical data.
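For readers unfamiliar with these conjugate component models, here is a minimal sketch of the posterior predictive for the two discrete cases. The function names and the prior values (a = b = alpha = 1) are assumptions chosen for illustration, not BayesDB's actual hyperparameters or API. The normal-gamma case is omitted; its predictive is a Student-t distribution.

```python
def beta_bernoulli_predictive(ones, zeros, a=1.0, b=1.0):
    """P(next value = 1 | data) under a Beta(a, b) prior on the Bernoulli rate."""
    return (a + ones) / (a + b + ones + zeros)

def dirichlet_multinomial_predictive(counts, alpha=1.0):
    """P(next value = k | data) for each category k, symmetric Dirichlet(alpha) prior."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]

# Seven ones and three zeros observed so far: (1 + 7) / (2 + 10).
print(beta_bernoulli_predictive(ones=7, zeros=3))
# A category with zero observations still receives some probability mass.
print(dirichlet_multinomial_predictive([5, 0, 1]))
```

Because both predictives depend only on counts, a cluster only needs to keep sufficient statistics for each column rather than the raw values.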
CrossCat is a generative Bayesian nonparametric probabilistic model. Informally, the generative process assumed by CrossCat is that the columns are clustered (into "views") according to a Dirichlet Process, then the rows within each view are clustered by another Dirichlet Process. Then, the data is generated by the datatype-appropriate component model for each cluster.
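The structural part of that generative process can be sketched with the Chinese restaurant process, which is the partition distribution a Dirichlet Process induces. This is a toy illustration, not CrossCat's implementation: the function names, the concentration parameters, and their default values are all assumptions, and the sketch stops before generating any cell values.

```python
import random

def crp_partition(n_items, alpha, rng):
    """Assign n_items to clusters with a Chinese restaurant process."""
    assignments, sizes = [], []
    for _ in range(n_items):
        # Existing cluster k has weight sizes[k]; a new cluster has weight alpha.
        weights = sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(0)
        sizes[k] += 1
        assignments.append(k)
    return assignments

def sample_structure(n_rows, n_cols, alpha_cols=1.0, alpha_rows=1.0, seed=0):
    """Sample a column-to-view map, plus one row clustering per view."""
    rng = random.Random(seed)
    col_to_view = crp_partition(n_cols, alpha_cols, rng)
    n_views = max(col_to_view) + 1
    row_clusters = [crp_partition(n_rows, alpha_rows, rng) for _ in range(n_views)]
    return col_to_view, row_clusters

col_to_view, row_clusters = sample_structure(n_rows=8, n_cols=5)
print("column -> view:", col_to_view)
for view, rows in enumerate(row_clusters):
    print(f"view {view} row clusters:", rows)
```

Note that each view gets its own row clustering, so the same row can sit in different clusters depending on which group of columns is being looked at.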
INFER, SIMULATE, and INSERT are all constant time, and most other operations scale linearly with the number of rows or columns, including inference. It doesn't store any sparse contingency tables or anything like that -- all it stores are CrossCat posterior samples.
What about the individual cluster models -- can they represent dependencies of variables within a cluster? For instance, I'm thinking about the "salary prediction" example. Is the salary variable considered to be conditionally independent of all the other variables, given the cluster assignment? Or can it learn something like an additive model, where categorical variables are associated with higher or lower salaries within a cluster?
Or to use another example, can it learn correlations between two continuous variables, to solve things like linear regression?
You are correct that each variable is considered conditionally independent of the other variables given the cluster assignment. CrossCat learns additive models and correlations between continuous variables by using many clusters (the clusters don't necessarily have meaningful real-world interpretations).
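A small simulation shows how this can work. The cluster centers and spreads below are made up, and this is not CrossCat itself, just a mixture in which x and y are drawn independently within every cluster. The correlation within each cluster is close to zero, yet the pooled data is strongly correlated because the cluster means lie along a line.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)
centers = [(-2.0, -2.0), (0.0, 0.0), (2.0, 2.0)]  # made-up cluster means along a line
points_by_cluster = []
for cx, cy in centers:
    # Within a cluster, x and y are drawn independently of each other.
    xs = [rng.gauss(cx, 0.5) for _ in range(1000)]
    ys = [rng.gauss(cy, 0.5) for _ in range(1000)]
    points_by_cluster.append((xs, ys))

all_x = [x for xs, _ in points_by_cluster for x in xs]
all_y = [y for _, ys in points_by_cluster for y in ys]

within = [pearson(xs, ys) for xs, ys in points_by_cluster]
overall = pearson(all_x, all_y)
print("within-cluster correlations:", [round(r, 2) for r in within])
print("overall correlation:", round(overall, 2))
```

More clusters give a finer piecewise approximation of the relationship, which fits the remark that the clusters need not have a real-world interpretation.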
Ah, good question. Currently, it's implemented to only use CrossCat for predictions.
However, the great thing about the Bayesian Query Language (BQL, BayesDB's extension of SQL) is that it can be implemented by any joint density estimator. So, you could implement BayesDB with Bayes net structure learning, kernel density estimation, or almost anything else instead of CrossCat, if you wanted.
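To make that concrete, here is a minimal sketch of an INFER-like operation that only needs joint probabilities from its backend. Everything named here (`EmpiricalJoint`, `prob`, `infer`, the `candidates` argument) is hypothetical and is not BayesDB's API; the stand-in estimator is just normalized row counts, and the sketch assumes exactly one missing cell per row.

```python
from collections import Counter

class EmpiricalJoint:
    """Hypothetical stand-in estimator: the joint is just normalized row counts."""

    def __init__(self, rows):
        self.counts = Counter(tuple(r) for r in rows)
        self.total = sum(self.counts.values())

    def prob(self, row):
        """Joint probability of one fully observed row."""
        return self.counts[tuple(row)] / self.total

def infer(model, partial_row, target, candidates):
    """Fill partial_row[target] with its most probable value under `model`.

    Returns (value, confidence), where confidence is the conditional
    probability of that value given the observed cells.
    """
    scores = {}
    for value in candidates:
        row = list(partial_row)
        row[target] = value
        scores[value] = model.prob(row)
    norm = sum(scores.values())
    if norm == 0:
        return None, 0.0  # the estimator has never seen anything like this row
    best = max(scores, key=scores.get)
    return best, scores[best] / norm

rows = [("eng", "high"), ("eng", "high"), ("eng", "low"), ("ops", "low")]
model = EmpiricalJoint(rows)
print(infer(model, ("eng", None), target=1, candidates=["high", "low"]))
```

Swapping in a different estimator would only require another object with a compatible `prob` method; `infer` itself would not change.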
Hi, I'm one of the lead developers on the project. It's a relatively young project, so keep an eye out for lots of substantial releases over the next few months.
So far, we have been focusing on smaller dataset sizes, such as 10,000 rows by 100 columns, but everything scales linearly (both query processing and offline inference) so you could use it for larger datasets too. The current version is all in memory, though, so you're limited there.
Right now you must import data from CSV, so no images, and you must do all preprocessing of your data before loading it in. We hope to add more and more of this kind of functionality in later releases. I'd love to hear suggestions! I recommend trying out the VM installation if you want to quickly play around with it.
Here's a suggestion for importing data: support HDF5. It's like CSV, but a lot faster and better. And the set of people who use HDF5 probably overlaps a fair bit with your target audience.
Thanks for your reply. So it's doing something akin to clustering on categorical variables. Might be able to connect to some kind of ontology learning...
I'm having trouble getting the examples to work on the VirtualBox VM. run_dha_example gives me "None" every time. I've run the client.execute with --timing=True and --pretty=False with the same results. Am I doing something stupid? :)
I got the VM up but can't get it to do anything useful. No next steps after getting the VM up? Also, it would be very nice to have a more granular install. I just spent the day fixing my system after attempting a native install on Ubuntu 13.10 (still have more to repair). A list of prerequisites would be better than the current unified install.
Once you have the VM up, you can try running demo scripts, located at ~/bayesdb/examples//.py after you've checked out and pulled master from ~/bayesdb and ~/crosscat.
Yes, the install process is a pain in the current release (sorry!), but the next release (almost ready) will be much more friendly and granular to install.
Hi, sorry for the trouble, and thanks for the bug report! If you git checkout master and git pull both the crosscat and BayesDB repos (at ~/crosscat and ~/bayesdb on the VM), this issue will be fixed. We are working on pushing a new VM where the proper commits will be checked out already.
Hi Jay,
I was talking with Anne just now and we need explanation on another level, I am afraid. I would like to understand BayesDB, and I wonder if there is a layman's summary available?
It was great seeing you at Thanksgiving at Beachcliff!
Mimi
The answer is both yes and no. In principle, longitudinal/panel data can be run through BayesDB, though BayesDB would have to infer the temporal structure. Also, we've toyed a little with forecasting by making a table where each row is a sliding window.
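One way such a sliding-window table could be built is sketched below. The function name and the sample readings are made up, and this is a guess at the general idea rather than what the BayesDB team actually did.

```python
def sliding_window_rows(series, width):
    """Turn a sequence into overlapping rows of `width` consecutive values."""
    if width < 1 or width > len(series):
        raise ValueError("width must be between 1 and len(series)")
    return [list(series[i:i + width]) for i in range(len(series) - width + 1)]

readings = [10, 12, 11, 15, 14]
for row in sliding_window_rows(readings, width=3):
    print(row)
# [10, 12, 11]
# [12, 11, 15]
# [11, 15, 14]
```

Each row can then be treated as an ordinary sample, with the last column acting as the value to predict from the earlier ones.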
That said, BayesDB is really about the classic multivariate statistics setting: each row is a sample from some population. We think that a streaming Bayesian database that models sequences of timestamped UPDATEs to a DB (with FORECAST in addition to INFER) is an interesting, distinct project, and one that we've done a little work on.
Contact us if you're interested in this kind of data and we'd be happy to talk more.
You can get predicted columns via https://github.com/no0p/alps ... Turns out that adding words like infer to the lex is kind of a big project in pgsql. Excellent work with the language aspect of it in BayesDB -- well thought out.