If you are struggling to understand the README, I highly recommend the book Statistical Rethinking: A Bayesian Course with Examples in R and Stan by Richard McElreath[1]. Although the examples are in R, the same concepts apply to Pyro (and NumPyro).
It's not so complicated that you need to read a whole book just to get a rough idea: it's just a cutesy way to specify a "plate model"[1] and then run inference using that model.
In some downstream applications, such as filtering data (say, good/bad), I am training simple NN classifiers on relatively small datasets. My personal confidence in the classifiers is therefore not very high: I'd like to reject things that are "definitely bad" and keep anything that may be good. Better yet, I'd like to put aside "maybe good" data for human verification and keep only the "definitely good" data.
In other words, I think I have a practical use case for calibrated confidence scores, which I definitely don't get from my NN classifiers. They are right a certain percentage of the time, which is great, but when they are wrong they sometimes still report high confidence. So it's hard to make a firm decision based on the result without manually reviewing everything.
So my question is: is this an appropriate use case for Pyro? If I blindly convert my NN classifiers into probabilistic classifiers and sample them appropriately, will I actually get reliable, useful confidence scores for this purpose? Is that the intended usage for this stuff?
MCMC is, AFAIU, prohibitively expensive for neural networks. If you are interested in incorporating uncertainty awareness and improving the calibration of your neural net classifiers in a somewhat scalable manner, I think the linearized Laplace approximation is a good place to look. I'd suggest you look at `laplace-torch`[1]; it's an easy way of doing that.
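Roughly, from memory (the names below follow the `laplace-torch` README as I remember it, so double-check against the actual docs), wrapping an already-trained classifier looks something like this:

```python
from laplace import Laplace  # pip install laplace-torch

# `model`, `train_loader`, and `x_test` are assumed to already exist;
# the network is trained as usual (MAP estimate) before this step.
la = Laplace(model, "classification",
             subset_of_weights="last_layer",   # last-layer Laplace keeps it cheap
             hessian_structure="kron")
la.fit(train_loader)                            # build the Hessian approximation
la.optimize_prior_precision(method="marglik")   # tune the prior via marginal likelihood

# Predictive class probabilities via the probit link approximation;
# these tend to be better calibrated than the raw softmax outputs.
probs = la(x_test, link_approx="probit")
```

The appeal is that you keep the network and training loop you already have; the Laplace step is a post-hoc wrapper.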
Never heard of that, I'll take a look, thanks. What do you mean by prohibitively expensive? I'm not talking about big networks here, so running them many times for inference is not a big deal, although I admit I'm not sure what numbers we are talking about. 100s to 1000s of times is feasible, though.
I'm not an expert on MCMC for BNNs by any means, but even with small networks I think it's a little tricky to get right. If my memory serves me right, this paper[1] focuses on small networks and goes over the issues and how to get around them.
For those with more experience: how does (Num)Pyro compare with PyMC? I haven't had the good fortune of working with any of these libraries since before Pyro (and presumably NumPyro) existed, back when PyMC3 still used Theano under the hood.
Are the two libraries in competition, or complementary? I've been playing with PyMC for a personal project and am curious what I might gain from investigating (Num)Pyro.
I would say that, at least for me, PyMC's main advantage was the developer experience. I just found model construction much more straightforward and better aligned with how I wanted to assemble the model.
I tried both a while back, but nothing too big or serious. One thing that NumPyro benefits from is JAX's speed, so it might be faster for larger models. Though PyTensor, which is the backend for PyMC, can apparently also generate JAX code, so the difference might not be drastic. The PyMC API also seemed to me easier to get started with for those learning Bayesian stats.
One thing I remember disliking about PyMC was the PyTensor API; it feels too much like Theano/TensorFlow. I much prefer using JAX for writing custom models.
You'll lose a lot of the PyMC convenience functions with NumPyro but gain a lot of control and flexibility over your model specification. If you're doing variational inference, NumPyro is the way to go.
You can use the NumPyro NUTS sampler in PyMC with `pm.sample(nuts_sampler="numpyro")`, and it will significantly speed up sampling. It is less stable in my experience.
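For a concrete picture, a minimal sketch (untested, PyMC 5-style; the toy regression data is made up for illustration):

```python
import numpy as np
import pymc as pm

# Toy data: y = 1 + 2x + noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=200)

with pm.Model():
    alpha = pm.Normal("alpha", 0, 10)
    beta = pm.Normal("beta", 0, 10)
    sigma = pm.HalfNormal("sigma", 1)
    pm.Normal("y", mu=alpha + beta * x, sigma=sigma, observed=y)

    # Same model, but sampled with NumPyro's JAX-based NUTS
    # instead of PyMC's default sampler.
    idata = pm.sample(nuts_sampler="numpyro")
```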
This is maybe not the place, but we did some apples-to-apples comparisons between PyMC, Dynesty, and the Julia Turing.jl package.
A little to my surprise, and despite my being a Julia fan, Turing really outperformed both of the Python solutions.
I think JAX should be competitive in raw speed, so it might come down to the maturity of the samplers we used.
I agree with you on the PyMC situation. There have been so many changes to the backend engine (from Theano, to TensorFlow, back to Theano, then JAX, and so on) that it gets a little confusing.
PyMC can use NumPyro as a backend. PyMC's syntax and primitives for declaring models are much nicer than (Num)Pyro's, as is the developer experience overall. But those come at the cost of having to deal with PyTensor (a fork of a fork of Theano), which is quite bad IMO, instead of just working with NumPy or PyTorch.
Related question: are there any algorithms / optimizations for probabilistic programming in an online context?
What I mean is that I have a model and I've run my inference on my historical data. Now I have new observations streaming in, and I want to update my inference in an efficient manner. Basically, I'd like something like a Kalman filter for general probabilistic models.
There is a class of probabilistic models that has exactly this property -- exponential family models. While pedagogic examples of such models tend to be very simple, they need not be. A huge class of graphical models falls in this category and can be very flexible. The underlying statistical model of a Kalman filter is in this class.
These models have what are called sufficient statistics, which can be computed from the data: s = f(D), where s is the sufficient statistic and D is the past data. The clincher is that there is a very helpful group-theoretic property:
s = f(D ∪ d) = g(s', d), where s' = f(D). Here D is the past data, d is the new data, and D ∪ d is the full, complete data.
This is very useful because you don't have to carry the old data D around.
This machinery is particularly useful when
(i) s is in some sense smaller than D, for example, when s is in some small finite dimension,
(ii) the functions f and g are easy to compute, and
(iii) the relation between s and the parameters (equivalently, the weights θ of the model) is easy to compute.
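As a toy illustration of that update (my own example, with made-up names): for a Gaussian with known variance, the sufficient statistic is just s = (n, Σx), and the conjugate posterior over the mean can be recomputed from s at any time without touching the old data.

```python
import numpy as np

def update_stats(s, d):
    """g(s', d): fold a new batch d into the sufficient statistic s = (n, sum_x)."""
    n, sum_x = s
    return n + len(d), sum_x + np.sum(d)

def posterior_mean(s, prior_mu=0.0, prior_var=10.0, obs_var=1.0):
    """Posterior over the mean, computed from the sufficient statistic alone."""
    n, sum_x = s
    post_var = 1.0 / (1.0 / prior_var + n / obs_var)
    post_mu = post_var * (prior_mu / prior_var + sum_x / obs_var)
    return post_mu, post_var

s = (0, 0.0)                              # start with no data
rng = np.random.default_rng(0)
for batch in (rng.normal(0.3, 1.0, 100), rng.normal(0.3, 1.0, 50)):
    s = update_stats(s, batch)            # old raw data never needs to be stored
print(posterior_mean(s))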
Even when models do not possess this property, as long as the models are differentiable one can do a local approximate update using the gradient of the parameters with respect to the data:
θ_new = θ_old + ∇M · d
(∇M being the gradient of the parameters with respect to the data, also called the score.)
With exponential family models, updates can be exact rather than approximate.
This machinery applies to Bayesian as well as to more classical statistical models.
There are also nuances, such as dropping some of the effect of old data under the presumption that it no longer represents the changed model.
The algorithmic family that in general seems called for is "sequential Monte Carlo" (SMC), or "particle filtering". While this is often presented for models where the latent dimension is small and the problem is about how something evolves over time, the "sequential" part can just be about the order in which the data are received (I think this traces back to "A Sequential Particle Filter Method for Static Models", N. Chopin, 2002).
On the PPL side, I think SMC has often been a secondary target, but there has been good work on making good, efficient inference more turn-key. I think this has often focused on being able to automatically provide better/adaptive proposal distributions. For example, "SMCP3: Sequential Monte Carlo with Probabilistic Program Proposals" by Lew et al., 2023 (Stuart Russell is among the contributors).
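To make the "sequential over data order" idea concrete, here is a bare-bones sketch (my own toy, not SMCP3 and not any particular PPL's API): a particle cloud over a static parameter is reweighted as each observation arrives, with multinomial resampling when the effective sample size drops. A real IBIS-style sampler would add an MCMC move step after resampling to restore particle diversity.

```python
import numpy as np

rng = np.random.default_rng(0)
n_particles = 2000
theta = rng.normal(0.0, 5.0, n_particles)   # particles drawn from the prior
logw = np.zeros(n_particles)                # log importance weights

def log_lik(theta, y, obs_sd=1.0):
    # Gaussian observation model: y ~ Normal(theta, obs_sd)
    return -0.5 * ((y - theta) / obs_sd) ** 2

for y in rng.normal(1.5, 1.0, 200):         # observations streaming in
    logw += log_lik(theta, y)               # fold in the new observation
    w = np.exp(logw - logw.max())
    w /= w.sum()
    if 1.0 / np.sum(w**2) < n_particles / 2:     # effective sample size check
        idx = rng.choice(n_particles, size=n_particles, p=w)
        theta, logw = theta[idx], np.zeros(n_particles)

w = np.exp(logw - logw.max())
print("posterior mean ~", np.average(theta, weights=w))
```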
One way I've seen this done in practice is to construct an offline model that produces an initial set of posterior samples, then construct a second, online model that takes posterior samples and new observations as input and constructs a new posterior. This probably wouldn't make sense computationally in a high-frequency streaming context, but (micro)batching works fine.
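A rough sketch of that pattern in NumPyro (names and toy data are mine; here the old posterior is summarized as a Gaussian prior for the online step, though in practice you might feed the samples in more directly, e.g. via a mixture or KDE prior):

```python
import numpy as np
import numpyro
import numpyro.distributions as dist
from jax import random
from numpyro.infer import MCMC, NUTS

def offline_model(y):
    mu = numpyro.sample("mu", dist.Normal(0.0, 10.0))
    numpyro.sample("y", dist.Normal(mu, 1.0), obs=y)

def online_model(mu_loc, mu_scale, y_new):
    # Prior is a Gaussian approximation of the previous posterior.
    mu = numpyro.sample("mu", dist.Normal(mu_loc, mu_scale))
    numpyro.sample("y", dist.Normal(mu, 1.0), obs=y_new)

rng = np.random.default_rng(0)
y_old = rng.normal(2.0, 1.0, 500)
mcmc = MCMC(NUTS(offline_model), num_warmup=500, num_samples=1000)
mcmc.run(random.PRNGKey(0), y_old)
mu_samples = mcmc.get_samples()["mu"]

# New batch arrives: rerun inference against the summarized posterior only.
y_new = rng.normal(2.0, 1.0, 50)
mcmc2 = MCMC(NUTS(online_model), num_warmup=500, num_samples=1000)
mcmc2.run(random.PRNGKey(1), float(mu_samples.mean()), float(mu_samples.std()), y_new)
mcmc2.print_summary()
```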
Update the model with the new data and recompute it (hopefully the unchanged parts are cached). Most probabilistic programming models are not black boxes like neural nets.
What is probabilistic programming actually useful for? Interpretable inference or something like that?
It seems like, if raw prediction of a data distribution is what you are interested in, explicitly specified statistical models are probably less useful? At least if you have lots of data and can tolerate a model with lots of 'variance'.
> What is probabilistic programming actually useful for?
You can think of a probabilistic programming language as a set of building blocks for building statistical models. In the olden days, people used very simple frequentist models based on standard reference distributions like the normal, Student's t, chi2, etc. The models were simple because the computational capabilities were limited.
These days, thanks to widespread compute and modern inference algorithms, you can "fit" a much wider class of models, so researchers now tend to build bespoke models adapted to each particular application they are interested in. Probabilistic programming languages are used to build those "custom" models.
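For instance, a "custom" model in NumPyro is just a Python function built from sample statements; here is a minimal, illustrative linear regression (my own toy, not from the README):

```python
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist
from jax import random
from numpyro.infer import MCMC, NUTS

def model(x, y=None):
    # Priors over the regression parameters.
    alpha = numpyro.sample("alpha", dist.Normal(0.0, 10.0))
    beta = numpyro.sample("beta", dist.Normal(0.0, 10.0))
    sigma = numpyro.sample("sigma", dist.HalfNormal(1.0))
    # Likelihood; swapping any of these pieces out gives a different bespoke model.
    numpyro.sample("y", dist.Normal(alpha + beta * x, sigma), obs=y)

# Toy data: y = 1 + 2x + noise
x = jnp.linspace(-1.0, 1.0, 100)
y = 1.0 + 2.0 * x + 0.3 * random.normal(random.PRNGKey(1), (100,))

mcmc = MCMC(NUTS(model), num_warmup=500, num_samples=1000)
mcmc.run(random.PRNGKey(0), x, y=y)
mcmc.print_summary()
```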
Yeah, interpretability is the main reason I know of. If you need to make decisions based on statistics of the data, I think it's a lot easier with explicit statistical models.
[1] https://www.goodreads.com/book/show/26619686-statistical-ret...