
It'll be polars and datafusion for me, thanks.


I remember 2 years ago someone proposed adding stream processing to datafusion, and PRs followed. But IMO stream processing is an entirely different beast, though some people could use the SQL engine of DF for it. There are Rust projects like Arroyo for that.


Creator of Arroyo here—we agree that stream processing is a different beast and needs different infrastructure from a batch engine like DataFusion.

Our approach has been to take pieces of DF (including the SQL frontend and expression engine) and embed them in our own dataflow and operators. This allows us to support low latency, distribution, watermark processing, and consistent checkpointing.

But the great thing about DF is that it’s designed as a toolkit for SQL-oriented data processing, so it’s relatively easy to pick and use just the pieces you need.
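
To give a sense of what picking just the pieces you need looks like, here's a minimal sketch that uses only the SQL frontend and planner to produce a logical plan without executing anything (this isn't Arroyo's actual integration, just an illustration; it assumes the datafusion and tokio crates, and the events.csv file and query are made up):

    use datafusion::prelude::*;

    #[tokio::main]
    async fn main() -> datafusion::error::Result<()> {
        let ctx = SessionContext::new();
        // Register a (made-up) CSV file so the planner can resolve the table.
        ctx.register_csv("events", "events.csv", CsvReadOptions::new()).await?;
        // sql() only plans the query; nothing runs until collect() is called.
        let df = ctx.sql("SELECT user_id, count(*) FROM events GROUP BY user_id").await?;
        // A custom engine could now translate this plan into its own operators.
        println!("{}", df.logical_plan().display_indent());
        Ok(())
    }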


I’ve been messing around with SQL and stream processing off and on for the last few months via https://github.com/zmaril/bpfquery and then https://github.com/zmaril/zquery, so I very much feel this comment. I didn’t want to build out my own stream processing architecture in bpfquery (it was getting pretty tough pretty fast), so I switched over to a datafusion backend in zquery in the hopes that it could do stream processing well. It can handle static data really well, much better than the home-grown half engine I made in bpfquery, but streaming SQL isn’t easily possible at the moment: everybody is building their own implementations and trying to upstream what they need, with no coherent whole from datafusion. I was looking into making an attempt with arroyo sometime, but I think the authors want that code to be used as a standalone binary and not as a library in something else, based on my last impression of it a while back. So maybe in a few years it’ll be as easy to make a streaming database as it is now to make a normal one, but that’s not the case currently.


I agree. So many disparate solutions. The streaming SQL primitives are by themselves good enough (e.g. `tumble`, `hop`, or `session` windows), but the infrastructural components are always rough in real-life use cases.

Crossing fingers for solutions like `https://github.com/feldera/feldera` to be wrapped in a nice database, `https://materialize.com/` to solve their memory issues, or `https://clickhouse.com/docs/en/materialized-view` to make reliable streaming consumption work.

Various stream processing frameworks often have domain-specific languages with a lot of limitations on how to express aggregations and transformations.
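
For reference, here's roughly the shape of a tumbling-window aggregation using those primitives, sketched as a query string you'd hand to an engine (Arroyo/Flink-style syntax; the schema is made up and the exact window functions vary by engine):

    fn main() {
        // Hypothetical streaming SQL: count clicks per user in 1-minute windows.
        let query = r#"
            SELECT user_id, count(*) AS clicks
            FROM events
            GROUP BY user_id, tumble(interval '1 minute')
        "#;
        println!("{query}");
    }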


> [...] `https://materialize.com/` to solve their memory issues [...]

Disclaimer: I work at Materialize

There have recently been major improvements in Materialize's memory usage, as well as the ability to spill some data to disk.

I find it pretty easy to hook up to Postgres/MySQL/Kafka instances: https://materialize.com/blog/materialize-emulator/


Yeah, I have a feeling something like polars for streaming would be super popular and useful, but it just hasn't happened yet. It's much easier to just do, say, Kafka and a long-running Python script and write out the transformations by hand than it is to use anything on the market right now. None of the current streaming processors want to be embedded, as far as I can tell; that's not where the money is. They all want to be paid to run it in the cloud for you and follow the VC playbook model. Which, fair! I do think there's a lot of space out there that isn't being occupied, though, and I hope somebody tries to fill it soon.

(As an aside, feldera doesn't want to be embedded into your app, and neither does materialize; clickhouse might just pull a great streaming library out of nowhere, though, they seem to be good at just doing stuff like that.)
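
For what it's worth, the hand-rolled version really is short, which is why it keeps winning. A minimal sketch with the rdkafka and tokio crates (the broker address, group id, topic, and the "transformation" are all made up):

    use rdkafka::config::ClientConfig;
    use rdkafka::consumer::{Consumer, StreamConsumer};
    use rdkafka::Message;

    #[tokio::main]
    async fn main() -> anyhow::Result<()> {
        let consumer: StreamConsumer = ClientConfig::new()
            .set("group.id", "hand-rolled")
            .set("bootstrap.servers", "localhost:9092")
            .create()?;
        consumer.subscribe(&["events"])?;
        loop {
            let msg = consumer.recv().await?;
            if let Some(payload) = msg.payload() {
                // The "transformations by hand": parse, filter, aggregate inline.
                let line = String::from_utf8_lossy(payload);
                if line.contains("click") {
                    println!("{line}");
                }
            }
        }
    }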


Sharing a few of my own:

# https://github.com/Igosuki/dotfiles/blob/master/git/.gitconf...

    [alias]
        grep = grep -Ii
        lalias = "!git config -l | grep alias | cut -c 7-"
        done = "!f() { git branch | grep \"$1\" | cut -c 3- | grep -v done | xargs -I{} git branch -m {} done-{}; }; f"
        assumed = "!git ls-files -v | grep ^h | cut -c 3-"
        lasttag = describe --tags --abbrev=0
        lt = describe --tags --abbrev=0
        dr = "!f() { git diff \"$1\"^..\"$1\"; }; f"
        lc = "!f() { git ll \"$1\"^..\"$1\"; }; f"
        diffr = "!f() { git diff \"$1\"^..\"$1\"; }; f"
        lb = "!f() { git branch -a | more; }; f"
        cp = cherry-pick
        st = status -s
        cl = clone
        ci = commit
        br = branch
        diff = diff --word-diff
        dc = diff --cached
        r = reset
        r1 = reset HEAD^
        r2 = reset HEAD^^
        rh = reset --hard
        rh1 = reset --hard HEAD^
        rh2 = reset --hard HEAD^^
        sl = stash list
        sa = stash apply
        ss = stash save
        logtree = log --graph --oneline --decorate --all
        lmine = "!f() { git log --branches --author=igosuki@gmail.com; }; f"
        purgeforever = "!f() { git filter-branch --prune-empty -d /dev/shm/scratch --index-filter \"git rm --cached -f --ignore-unmatch $1\" --tag-name-filter cat -- --all; }; f"
        updaterefsafterpurge = "!f() { git update-ref -d refs/original/refs/heads/master; git reflog expire --expire=now --all; git gc --prune=now; }; f"
        ec = config --global -e
        up = !git pull --rebase --prune $@ && git submodule update --init --recursive
        cob = checkout -b
        cm = !git commit -m
        save = !git add -A && git commit -m 'SAVEPOINT'
        wip = !git add -u && git commit -m "WIP"
        undo = reset HEAD~1 --mixed
        amend = commit -a --amend
        wipe = !git add -A && git commit -qm 'WIPE SAVEPOINT' && git reset HEAD~1 --hard
        bclean = "!f() { git branch --merged ${1-master} | grep -v \" ${1-master}$\" | xargs -r git branch -d; }; f"
        bdone = "!f() { git checkout ${1-master} && git up && git bclean ${1-master}; }; f"
        pr = pull --rebase
I'd advise binding things like pr, po, cp, --rebase, and --continue to keyboard shortcuts, though, if you are in an IDE.


I think you need some kind of autocomplete here to make it worthwhile


Why is the level so low in software engineering in general? Well, at least I know I'll never be out of a job.


Ok so LLM bots are on HN now... scary


How is your refactoring clean? Lmao, if this is clean code, I'm the Queen of England...


Everybody has different tastes and opinions about what "clean code" is supposed to look like (the best example of this is "for-loops" vs ".map()" - IMHO nested loops are usually more readable than a function chain that does the same thing, but other people have the opposite opinion).
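
To make that concrete, here's the same computation both ways (a toy Rust example; neither version is objectively cleaner):

    // Collect the squares of the even numbers, loop style.
    fn squares_of_evens_loop(xs: &[i32]) -> Vec<i32> {
        let mut out = Vec::new();
        for &x in xs {
            if x % 2 == 0 {
                out.push(x * x);
            }
        }
        out
    }

    // The same thing as an iterator chain.
    fn squares_of_evens_chain(xs: &[i32]) -> Vec<i32> {
        xs.iter().filter(|&&x| x % 2 == 0).map(|&x| x * x).collect()
    }

    fn main() {
        assert_eq!(squares_of_evens_loop(&[1, 2, 3, 4]), vec![4, 16]);
        assert_eq!(squares_of_evens_chain(&[1, 2, 3, 4]), vec![4, 16]);
    }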

IMHO the core problem of the situation described in the article is that the guy simply rewrote a piece of code without first talking to the author/maintainer/owner of that piece of code.


The thing is, the majority of people cannot evaluate this either, and we end up with code that has clean-code™ traits but still sucks.


Can you share how you would clean it up?


There is a Signal API already.


Sure, but that's completely different from requiring the Signal app to allow exchanging messages with Facebook/WhatsApp/iMessage/…


I read a while ago that the regulation will require them to build an API that other apps can access, so you could use an alternative app if you want to. I'm not sure if they will force them to receive messages from other services.


Pandas is still way too slow; if only there were an integration with datafusion or arrow2.


Polars is much faster; maybe that's interesting for you.
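
If performance is the draw, the lazy API is where it shines. A minimal sketch with the polars crate (assuming a recent version with the `lazy` feature; the data is made up):

    use polars::prelude::*;

    fn main() -> PolarsResult<()> {
        let df = df!(
            "user" => &["a", "b", "a"],
            "value" => &[1i64, 2, 3]
        )?;
        // Lazy plan: a grouped sum, optimized before execution.
        let out = df
            .lazy()
            .group_by([col("user")])
            .agg([col("value").sum()])
            .collect()?;
        println!("{out}");
        Ok(())
    }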


Yeah, I already know about Polars - good and lightweight, but datafusion is more advanced.


Foreign companies get fined billions for doing the same when it doesn't please Uncle Sam...

