Good question. I touched upon this in the conclusion. Basically, if you run this in a streaming SQL database, such as Materialize, then you would get a true online system which doesn't restart from scratch.
I'm trying to learn here, so please pardon my ignorance. Wouldn't a pre-aggregated set affect the new aggregate result? I suppose you could store the avg, sum, and count, add the new value(s) to the sum and the new count to the count, and recalculate the average from that. Or store just the avg and count and re-aggregate as ((avg * count) + new values) / (count + new values count). But I didn't know if there's a better way to fold new values into a set of data that's already been aggregated.
Yep, that's exactly what I would do - effectively a functional approach where no memory is required besides the current aggregates and the incoming values.
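For concreteness, here is a minimal sketch of that incremental update in Python (toy function and variable names, not code from the post): keep only the running sum and count, and fold each new batch into them.

    # Minimal sketch: update a running average without revisiting old rows.
    # Only the running sum and count survive between batches.
    def update_average(running_sum, running_count, new_values):
        running_sum += sum(new_values)
        running_count += len(new_values)
        return running_sum, running_count, running_sum / running_count

    # Usage: the state carries over, so nothing is recomputed from scratch.
    s, n = 0.0, 0
    s, n, avg = update_average(s, n, [3, 5, 7])   # avg == 5.0
    s, n, avg = update_average(s, n, [10])        # avg == 6.25

The same idea generalizes to any aggregate that can be rebuilt from a small, mergeable state (sum, count, min, max, and so on).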
A big part of materializing datasets for performance is finding the right grain: one that is easy enough to calculate but also lends itself to nice things like being resumable via snapshots.
DuckDB is what I used in the blog post. Re-running the query simply recomputes everything from the start: I didn't store the intermediate state that would allow picking up from where the query left off. But it's possible!
Postgres has excellent support for WITH RECURSIVE, so I see no reason why it wouldn't. However, as I answered elsewhere, you would need to set some stateful stuff up if you don't want the query to start from scratch when you re-run it.
I for one would welcome a second blog post with the stateful example. I've been doing running totals (basically balances built by adding financial transactions per account), but I had to do it in client code because I couldn't figure out how to do a stateful resume in SQL.
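In the meantime, here is a rough sketch of what that stateful resume could look like, in Python with SQLite standing in for the database (the table and column names are invented for illustration): keep a per-account balance together with the id of the last transaction folded into it, and only process newer transactions on each run.

    import sqlite3

    con = sqlite3.connect("ledger.db")

    # Hypothetical schema: raw transactions plus a small state table that
    # remembers each running balance and the last transaction id it includes.
    con.executescript("""
    CREATE TABLE IF NOT EXISTS tx (id INTEGER PRIMARY KEY, account TEXT, amount REAL);
    CREATE TABLE IF NOT EXISTS balance_state (
        account TEXT PRIMARY KEY,
        balance REAL NOT NULL,
        last_tx_id INTEGER NOT NULL
    );
    """)

    def resume_balances(con):
        # Aggregate only the transactions newer than the saved watermark.
        rows = con.execute("""
            SELECT t.account, SUM(t.amount), MAX(t.id)
            FROM tx t
            LEFT JOIN balance_state s ON s.account = t.account
            WHERE t.id > COALESCE(s.last_tx_id, 0)
            GROUP BY t.account
        """).fetchall()
        # Fold the deltas into the stored state (needs a SQLite build with upsert support).
        for account, delta, max_id in rows:
            con.execute("""
                INSERT INTO balance_state (account, balance, last_tx_id)
                VALUES (?, ?, ?)
                ON CONFLICT(account) DO UPDATE SET
                    balance = balance + excluded.balance,
                    last_tx_id = excluded.last_tx_id
            """, (account, delta, max_id))
        con.commit()

Re-running resume_balances only touches rows above the saved last_tx_id, which is the same watermark trick one could bake into the SQL itself with a state table and a scheduled job.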
I suspected this. However, I wasn't able to grok the documentation well enough, and I wasn't able to find a convincing example. It seems to me that these Python compressors get "frozen" and can't be used to compress further data.
Hey there, great work. I used Cortex as an inspiration when designing chantilly: https://github.com/creme-ml/chantilly, which is a much less ambitious solution tailored towards online machine learning models. Keep up the good work.
Would you happen to know the memory complexity of this algorithm? I maintain an online machine learning library written in Python, called creme, where we implement online statistics. We have a generic online algorithm for estimating quantiles, so a specific algorithm for estimating medians would be welcome. I'm always on the lookout for such online algorithms.
The P^2 algorithm used in creme is interesting. For the median it would give a two-sided median deviation. Maybe this could be changed slightly to be symmetric and give the median and the MAD. I'll look into it.
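To make the online-statistic pattern concrete, here is a minimal sketch of a quantile estimate updated one value at a time. It uses a crude stochastic-approximation rule rather than the P^2 algorithm, and the class and parameter names are made up; the point is just that the state stays O(1) no matter how many values come in.

    class OnlineQuantile:
        # Toy online quantile estimate via stochastic approximation.
        # O(1) memory: only the current estimate and a counter are stored.

        def __init__(self, q=0.5):
            self.q = q            # target quantile; 0.5 is the median
            self.estimate = 0.0
            self.n = 0

        def update(self, x):
            self.n += 1
            if self.n == 1:
                self.estimate = x              # seed with the first value
                return self
            step = 1.0 / self.n ** 0.5         # decaying step size
            # Nudge the estimate up when x sits above it, down otherwise,
            # weighted so it settles near the q-th quantile.
            self.estimate += step * (self.q - (x < self.estimate))
            return self

        def get(self):
            return self.estimate

    # Usage: feed values one at a time; the full dataset is never stored.
    median = OnlineQuantile(q=0.5)
    for x in [1.0, 7.0, 3.0, 9.0, 4.0]:
        median.update(x)
    print(median.get())

A MAD-style estimate could perhaps be layered on top by running a second instance over the absolute deviations from the current median estimate, though that kind of approximation would deserve scrutiny before landing in a library.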
I'm currently participating in the "PLAsTiCC Astronomical Classification" Kaggle competition. The dataset is rather large (~5M rows) so I decided to write a simple tool to compute aggregate features online. I got really inspired and tidied things up over the weekend. Maybe this can be of interest to some of you.
PS: I'm aware that there are other similar projects out there such as Spark Streaming, but they all feel too bloated and difficult to grok.