Hacker News | Lemaxoxo's comments

Thanks for the links, I wasn't aware of them. The Russians often seem to be a step ahead in the ML world.


That is an outdated view; it is a global and culturally diverse company.


Yes, I know some people look down on that. I hope it doesn't take away from the article's merits for you hehe.


Cheers, I learnt with the best :)


Good question. I touched upon this in the conclusion. Basically, if you run this in a streaming SQL database, such as Materialize, then you would get a true online system which doesn't restart from scratch.


In a non-streaming DB, what would prevent you from storing the result set and just using the last iteration to calculate the next?


This is definitely a possibility. What I meant to say is that the implementation in the blog post doesn't support that.


Makes sense: just an insert instead of a select. I used a similar approach with Newton-Raphson to implement XIRR in SQL and it worked well.


i'm trying to learn here so please pardon my ignorance. wouldn't adding to a pre-aggregated set affect the aggregate result? i suppose you could store avg, sum, and count, then add the new values to the sum and the new count to the count, and recalculate the average from that. or even just store avg and count and re-aggregate as ((avg * count) + sum of new values) / (count + count of new values). but i didn't know if there's a better way to process new values into a set of data that's already been aggregated.


Yep, that's what I would do: effectively a functional approach where no memory is required besides the current aggregate and iteration.

A big part of materializing datasets for performance is finding the right grain: one that is easy enough to calculate and can also do nice things like be resumable via snapshots.
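The (avg, count) update described in the comment above can be sketched in a few lines; this is a minimal illustration of the idea, not code from the post:

```python
# Incremental mean: store only (avg, count) and fold new values in,
# without re-reading the data that was already aggregated.
def update_mean(avg, count, new_values):
    """Return the updated (avg, count) after absorbing new_values."""
    new_count = count + len(new_values)
    new_avg = (avg * count + sum(new_values)) / new_count
    return new_avg, new_count

avg, count = 0.0, 0
avg, count = update_mean(avg, count, [1, 2, 3])
avg, count = update_mean(avg, count, [4, 5])
# avg is now 3.0, identical to computing sum([1, 2, 3, 4, 5]) / 5 in one go
```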


Did you try running it using DuckDB?


DuckDB is what I used in the blog post. Re-running the query simply recomputes everything from the start. I didn't store any intermediary results that would allow picking up from where the query stopped. But it's possible!
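One way to store that intermediary state is a checkpoint table holding the last processed row id and the running aggregate, so a re-run only scans new rows. A rough sketch (table and column names are made up, and SQLite stands in for DuckDB here):

```python
import sqlite3

# Checkpoint-and-resume sketch: one row of state instead of a full rescan.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE events (id INTEGER PRIMARY KEY, value REAL);
    CREATE TABLE checkpoint (last_id INTEGER, running_sum REAL);
    INSERT INTO checkpoint VALUES (0, 0.0);
""")

def resume_sum(con):
    last_id, running = con.execute(
        "SELECT last_id, running_sum FROM checkpoint").fetchone()
    # Aggregate only the rows that arrived since the last run.
    new_id, new_sum = con.execute(
        "SELECT COALESCE(MAX(id), ?), COALESCE(SUM(value), 0)"
        " FROM events WHERE id > ?", (last_id, last_id)).fetchone()
    running += new_sum
    con.execute("UPDATE checkpoint SET last_id = ?, running_sum = ?",
                (new_id, running))
    return running

con.executemany("INSERT INTO events (value) VALUES (?)", [(1.0,), (2.0,)])
print(resume_sum(con))  # 3.0: scanned ids 1-2
con.executemany("INSERT INTO events (value) VALUES (?)", [(4.0,)])
print(resume_sum(con))  # 7.0: scanned only id 3
```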


Does this work with Postgres?


Postgres has excellent support for WITH RECURSIVE, so I see no reason why it wouldn't. However, as I answered elsewhere, you would need to set up some stateful machinery if you don't want the query to start from scratch when you re-run it.
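WITH RECURSIVE is standard SQL, so the same query shape works across engines. A tiny example (run here through SQLite only for convenience; Postgres accepts the identical syntax):

```python
import sqlite3

# A recursive CTE that carries an accumulator (a running total of 1..5)
# from one iteration to the next.
con = sqlite3.connect(":memory:")
rows = con.execute("""
    WITH RECURSIVE counter(n, total) AS (
        SELECT 1, 1
        UNION ALL
        SELECT n + 1, total + n + 1 FROM counter WHERE n < 5
    )
    SELECT n, total FROM counter
""").fetchall()
print(rows)  # [(1, 1), (2, 3), (3, 6), (4, 10), (5, 15)]
```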


I for one would welcome a second blog post with the stateful example. I've been doing running totals (basically balances based on adding financial tx per account), but I had to do it in client code because I couldn't figure out how to do a stateful resume in SQL.


I think that is what is being suggested in the other comment. One would have to try! My instincts tell me the results would not be identical.
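For the running-balances case specifically, one common pattern is a window function seeded with a stored opening balance per account, so new transactions don't require re-reading old ones. A sketch with made-up schema and names (SQLite here, but the window-function syntax is standard):

```python
import sqlite3

# Running balance per account = stored opening balance + running sum
# of transactions, computed with a window function.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE tx (id INTEGER PRIMARY KEY, account TEXT, amount REAL);
    CREATE TABLE opening (account TEXT PRIMARY KEY, balance REAL);
    INSERT INTO opening VALUES ('a', 100.0);
    INSERT INTO tx (account, amount) VALUES ('a', 10), ('a', -25), ('a', 5);
""")
rows = con.execute("""
    SELECT tx.id,
           opening.balance
             + SUM(tx.amount) OVER (PARTITION BY tx.account
                                    ORDER BY tx.id) AS balance
    FROM tx JOIN opening USING (account)
""").fetchall()
print(rows)  # [(1, 110.0), (2, 85.0), (3, 90.0)]
```

Periodically folding the latest balance back into the `opening` table (and deleting or archiving the folded transactions) is what makes the computation resumable.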


I suspected this. However, I wasn't able to grok the documentation well enough, and I couldn't find a convincing example. It seems to me that these Python compressors get "frozen" and can't be used to compress further data.


Hey there, great work. I used Cortex as an inspiration when designing chantilly: https://github.com/creme-ml/chantilly, which is a much less ambitious solution tailored towards online machine learning models. Keep up the good work.


Would you happen to know the memory complexity of this algorithm? I maintain an online machine learning library written in Python, called creme, where we implement online statistics. We have a generic online algorithm for estimating quantiles, so a specific algorithm for estimating medians would be welcome. I'm always on the lookout for such online algorithms.


It's a fixed memory size, but it gets released when you move into the recursive part. Increasing the fixed part will give you a better starting estimate.

See here for quantiles: https://stackoverflow.com/questions/1058813/on-line-iterator...

The P^2 algorithm used in creme is interesting. For the median it would give a two-sided median deviation. Maybe this could be changed slightly to be symmetric and give both the median and the MAD. I'll look into it.
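For reference, the Stack Overflow thread mentioned above revolves around a very simple stochastic estimator (not the P^2 algorithm): keep a single number and nudge it up or down depending on which side of the estimate each new value falls. A sketch, with an arbitrary step size:

```python
import random

# Constant-memory quantile estimation by stochastic approximation.
# Only `est` is stored; the step size is an arbitrary tuning choice.
def online_quantile(values, q, step=0.01):
    it = iter(values)
    est = next(it)  # initialize with the first observation
    for x in it:
        if x > est:
            est += step * q
        elif x < est:
            est -= step * (1 - q)
    return est

random.seed(42)
data = [random.gauss(0, 1) for _ in range(100_000)]
est = online_quantile(data, 0.5)  # hovers near 0, the true median of N(0, 1)
```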


Are you planning to support multivariate statistics?


We plan to support cardinality estimations for multiple columns. Other than that there are no plans for multivariate stats in the near future.


Hello everyone,

I'm currently participating in the "PLAsTiCC Astronomical Classification" Kaggle competition. The dataset is rather large (~5M rows) so I decided to write a simple tool to compute aggregate features online. I got really inspired and tidied things up over the weekend. Maybe this can be of interest to some of you.

PS: I'm aware that there are other similar projects out there such as Spark Streaming, but they all feel too bloated and difficult to grok.
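The core idea (one pass over the rows, constant memory per group and feature) can be sketched in a few lines; this is a toy illustration of the approach, not the actual tool:

```python
from collections import defaultdict

# Online aggregate features: each group keeps O(1) state and is
# updated as rows stream by, so the 5M rows are never held in memory.
class RunningMean:
    def __init__(self):
        self.n, self.mean = 0, 0.0
    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n  # Welford-style update

features = defaultdict(RunningMean)
rows = [("obj1", 3.0), ("obj2", 1.0), ("obj1", 5.0)]  # (object_id, flux)
for object_id, flux in rows:
    features[object_id].update(flux)
print(features["obj1"].mean)  # 4.0
```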

