Good question. I touched upon this in the conclusion. Basically, if you run this in a streaming SQL database, such as Materialize, then you would get a true online system which doesn't restart from scratch.
I'm trying to learn here, so please pardon my ignorance. Wouldn't a pre-aggregated set affect the new aggregate result? I suppose you could store the avg, sum, and count, add the new value(s) to the sum and the new count to the count, and recalculate the average from that. Or store just the avg and count and re-aggregate as ((avg * count) + new values) / (count + new values count). But I didn't know if there's a better way to fold new values into a set of data that's already been aggregated.
Yep, that's exactly what I would do - effectively a functional approach where no memory is required besides the current aggregates and the incoming values.
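For concreteness, here is a minimal sketch of that incremental update in Python (toy function and variable names, not code from the post): keep only the running sum and count, and fold each new batch into them.

    # Minimal sketch: update a running average without revisiting old rows.
    # Only the running sum and count survive between batches.
    def update_average(running_sum, running_count, new_values):
        running_sum += sum(new_values)
        running_count += len(new_values)
        return running_sum, running_count, running_sum / running_count

    # Usage: the state carries over, so nothing is recomputed from scratch.
    s, n = 0.0, 0
    s, n, avg = update_average(s, n, [3, 5, 7])   # avg == 5.0
    s, n, avg = update_average(s, n, [10])        # avg == 6.25

The same idea generalizes to any aggregate that can be rebuilt from a small, mergeable state (sum, count, min, max, and so on).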
A big part of materializing datasets for performance is finding the right grain: one that is easy enough to calculate but also lends itself to nice things like being resumable via snapshots.
DuckDB is what I used in the blog post. Re-running the query simply recomputes everything from the start: I didn't store the intermediate state that would allow picking up from where the query left off. But it's possible!
Postgres has excellent support for WITH RECURSIVE, so I see no reason why it wouldn't. However, as I answered elsewhere, you would need to set some stateful stuff up if you don't want the query to start from scratch when you re-run it.
I for one would welcome a second blog post with the stateful example. I've been doing running totals (basically balances built by adding financial transactions per account), but I had to do it in client code because I couldn't figure out how to do a stateful resume in SQL.
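In the meantime, here is a rough sketch of what that stateful resume could look like, in Python with SQLite standing in for the database (the table and column names are invented for illustration): keep a per-account balance together with the id of the last transaction folded into it, and only process newer transactions on each run.

    import sqlite3

    con = sqlite3.connect("ledger.db")

    # Hypothetical schema: raw transactions plus a small state table that
    # remembers each running balance and the last transaction id it includes.
    con.executescript("""
    CREATE TABLE IF NOT EXISTS tx (id INTEGER PRIMARY KEY, account TEXT, amount REAL);
    CREATE TABLE IF NOT EXISTS balance_state (
        account TEXT PRIMARY KEY,
        balance REAL NOT NULL,
        last_tx_id INTEGER NOT NULL
    );
    """)

    def resume_balances(con):
        # Aggregate only the transactions newer than the saved watermark.
        rows = con.execute("""
            SELECT t.account, SUM(t.amount), MAX(t.id)
            FROM tx t
            LEFT JOIN balance_state s ON s.account = t.account
            WHERE t.id > COALESCE(s.last_tx_id, 0)
            GROUP BY t.account
        """).fetchall()
        # Fold the deltas into the stored state (needs a SQLite build with upsert support).
        for account, delta, max_id in rows:
            con.execute("""
                INSERT INTO balance_state (account, balance, last_tx_id)
                VALUES (?, ?, ?)
                ON CONFLICT(account) DO UPDATE SET
                    balance = balance + excluded.balance,
                    last_tx_id = excluded.last_tx_id
            """, (account, delta, max_id))
        con.commit()

Re-running resume_balances only touches rows above the saved last_tx_id, which is the same watermark trick one could bake into the SQL itself with a state table and a scheduled job.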
I suspected this. However, I wasn't able to grok the documentation well enough, and I wasn't able to find a convincing example. It seems to me that these Python compressors get "frozen" and can't be used to compress further data.
Hey there, great work. I used Cortex as an inspiration when designing chantilly: https://github.com/creme-ml/chantilly, which is a much less ambitious solution tailored towards online machine learning models. Keep up the good work.
Would you happen to know the memory complexity of this algorithm? I maintain an online machine learning library written in Python, called creme, where we implement online statistics. We have a generic online algorithm for estimating quantiles, so a specific algorithm for estimating medians would be welcome. I'm always on the lookout for such online algorithms.
The P^2 algorithm used in creme is interesting. For the median it would give a two-sided median deviation. Maybe this could be changed slightly to be symmetric and give the median and the MAD. I'll look into it.
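To make the online-statistic pattern concrete, here is a minimal sketch of a quantile estimate updated one value at a time. It uses a crude stochastic-approximation rule rather than the P^2 algorithm, and the class and parameter names are made up; the point is just that the state stays O(1) no matter how many values come in.

    class OnlineQuantile:
        # Toy online quantile estimate via stochastic approximation.
        # O(1) memory: only the current estimate and a counter are stored.

        def __init__(self, q=0.5):
            self.q = q            # target quantile; 0.5 is the median
            self.estimate = 0.0
            self.n = 0

        def update(self, x):
            self.n += 1
            if self.n == 1:
                self.estimate = x              # seed with the first value
                return self
            step = 1.0 / self.n ** 0.5         # decaying step size
            # Nudge the estimate up when x sits above it, down otherwise,
            # weighted so it settles near the q-th quantile.
            self.estimate += step * (self.q - (x < self.estimate))
            return self

        def get(self):
            return self.estimate

    # Usage: feed values one at a time; the full dataset is never stored.
    median = OnlineQuantile(q=0.5)
    for x in [1.0, 7.0, 3.0, 9.0, 4.0]:
        median.update(x)
    print(median.get())

A MAD-style estimate could perhaps be layered on top by running a second instance over the absolute deviations from the current median estimate, though that kind of approximation would deserve scrutiny before landing in a library.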
I'm currently participating in the "PLAsTiCC Astronomical Classification" Kaggle competition. The dataset is rather large (~5M rows) so I decided to write a simple tool to compute aggregate features online. I got really inspired and tidied things up over the weekend. Maybe this can be of interest to some of you.
PS: I'm aware that there are other similar projects out there such as Spark Streaming, but they all feel too bloated and difficult to grok.