How would this compare to ingesting the data into a locally running RDBMS like Postgres and applying some judicious indexing? Or maybe a local Spark cluster?
I'm tempted to benchmark this against better known tech, maybe anyone has some insight to share?
Think you'd want to compare with a columnar database since this is an analytic workload, not a transactional one.
And a 4-node Spark cluster with Parquet or Arrow files and a Scala job will compare favorably to this, because it's also a lazy evaluator and the benchmark problems here are embarrassingly parallel.
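For reference, a minimal sketch of what such a Spark job might look like in PySpark (the file path and column names are hypothetical, borrowed from the taxi dataset discussed in the article):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Spark builds a lazy query plan; nothing executes until an action like .show().
spark = SparkSession.builder.appName("taxi-bench").getOrCreate()

trips = spark.read.parquet("nyc_taxi.parquet")  # hypothetical Parquet file
(trips
 .groupBy("passenger_count")
 .agg(F.mean("tip_amount").alias("mean_tip"))
 .show())
```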
I've never benchmarked against Postgres, but I would be interested in the results. I once tried MonetDB, and it was orders of magnitude slower for simple calculations, so I stopped looking at RDBMSes after that.
I think they solve a different problem: smaller data, data/relational integrity, data mutations. Dataframes can take shortcuts here.
I am currently doing an internship as part of my masters degree where I am analyzing ~30 GB of data. I'm using Postgres + Python and it is working quite well, even on my 2014 MacBook Air.
It would indeed be interesting to see how this approach with Vaex compares to Postgres. Though, I would be quite sad giving up SQL in favour of Pandas DataFrame indexing and Python looping :)
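For what it's worth, a minimal sketch of that Postgres + Python pattern (connection string, table, and column names are all hypothetical): push the aggregation into SQL and pull back only the small result.

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://localhost/taxi")  # hypothetical database

# Let Postgres (and its indexes) do the heavy lifting; only the
# aggregated result crosses the wire into Python.
query = """
    SELECT passenger_count, AVG(tip_amount) AS mean_tip
    FROM trips
    GROUP BY passenger_count
"""
df = pd.read_sql(query, engine)
```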
I think GraphQL is easier in combination with front-end development, and you can tab-complete your way out. Early days for this sub-project, but I think it's very promising.
I found that Python is very slow when analyzing billions of entries if you do a function call per entry, because the overhead of a function call in Python is so large (it may be much slower than what's actually inside the function). Even JavaScript can do this much faster.
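A toy illustration of that overhead (timings will vary by machine): the same arithmetic, once with a Python-level call per entry and once as a single vectorized numpy expression.

```python
import time
import numpy as np

data = np.random.rand(10_000_000)

def f(x):
    return x * 2.0 + 1.0  # trivial body; the call itself dominates

t0 = time.time()
slow = np.array([f(x) for x in data])  # one Python function call per entry
t1 = time.time()
fast = data * 2.0 + 1.0                # one call, the loop runs in C
t2 = time.time()
print(f"per-entry calls: {t1 - t0:.1f}s   vectorized: {t2 - t1:.2f}s")
```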
For simply-vectorizable analysis, sure, but in my work I often have to apply a nontrivial transformation to the data en masse, and I'd rather define a single function that applies the transformation to each row than try to wrestle the problem into one of matrix multiplications and additions.
Indeed, numba, Pythran or cupy can be really useful. In the example from the article, it is 'simple' vectorized math, but in general any function can be added in Vaex. Functions that go through numba or Pythran usually release the GIL and can give you an enormous speed boost.
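As a rough sketch of the numba route (not Vaex's own integration; the function and target are illustrative): compile the per-row function so the loop runs as machine code across cores.

```python
import numpy as np
from numba import vectorize

# Compiled ufunc: the per-row function runs as machine code and,
# with target='parallel', releases the GIL and uses all cores.
@vectorize(['float64(float64, float64)'], target='parallel')
def radial(x, y):
    return (x * x + y * y) ** 0.5  # stand-in for a nontrivial transform

x = np.random.rand(10_000_000)
y = np.random.rand(10_000_000)
r = radial(x, y)  # applied per row at C speed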
Indeed, numpy for numerical calculations. For strings, we have our own data structure based on Apache Arrow, but we plan/hope to move to Apache Arrow (in combination with numpy), since that's kind of the numpy++ for data science work.
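For intuition, a small pyarrow sketch of why Arrow-style strings beat one-Python-object-per-string: the values live in one contiguous character buffer plus an offsets buffer.

```python
import pyarrow as pa

# One contiguous character buffer + an offsets buffer,
# instead of millions of individual Python str objects.
arr = pa.array(["taxi", "tip", "fare"])
print(arr.type)       # string
print(arr.buffers())  # validity bitmap, offsets, character data
```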
It is not similar to Dask, but similar to dask.dataframe. Dask.dataframe is built on top of Pandas, but that also means it inherits its issues, like memory usage and performance. (BTW, totally a fan of Pandas.)
Xarray is more about nd-arrays, less about ~tabular data.
Vaex is built from the ground up with the idea that you can never copy the dataset (a 1TB dataset was quite common). We also never needed distributed computing, because it was always fast enough, and thus never had to use dask (although we're eager to support it fully).
Also, vaex is lazy but tries to hide it from you. For instance, if you add a new column to your dataframe, it will only be computed when needed (taking up zero memory). In practice, though, you're not really aware of that. This means it feels more like Pandas (immediate results) than dask.dataframe (no .compute()/.persist() needed).
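A sketch of what that looks like, using vaex's bundled example dataset: assigning a column stores the expression, not the values.

```python
import vaex

df = vaex.example()  # small demo dataset shipped with vaex

# This defines a virtual column: only the expression is stored,
# so it costs no memory until something actually evaluates it.
df['r'] = (df.x**2 + df.y**2 + df.z**2)**0.5

print(df.r.mean())   # computed lazily, in chunks, on demand
```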
I would say they are all complementary, with small amount of overlap. Small data: use Pandas. Out of memory error: move to vaex. Crazy amount of data (100TB?) that will never fit onto 1 computer: dask.dataframe, or help us implement full dask support.
The first commit was Jan 2014, and it took between 50-80% of my time until 2018 I think, during which I mostly developed it myself. After that, it is more difficult to say how much time was spent on it, and it wasn't only my time. Although I do most of the development, ideas and discussions with data scientists can be more important than pure dev time. But it took some time :)
Correction - 99.97% of cash tips go unrecorded. Why pay tax if they never saw it happen?
Compare this to:
> 3.00% of the passengers that pay by card do not leave tips.