I'm a little confused by Arrow being included in the comparison.
What does Arrow have to do with Parquet? We are talking about the file format Parquet, right? Does Arrow use Parquet as its default data storage format?
But isn't Arrow a format too? As I understand it, Arrow is a format optimized for transferring in-memory data from one distributed system to another (ser-des), while also facilitating and optimizing certain set operations. From RAM in one system to RAM in another.
Moreover, since Arrow is a format, why is it being compared to databases like SQLite and DuckDB? If we're talking about formats, why not compare Arrow queries against Parquet data to DuckDB queries against Parquet data? https://duckdb.org/docs/data/parquet
Why not at least benchmark the query execution alone instead of startup and loading of data? For Arrow, isn't it assumed that there is an engine like Spark or Snowflake already up and running that's serving you data in the Arrow format? Ideally, with Arrow you should never be dealing with data starting in a resting format like Parquet. The data should already be in RAM to reap the benefits of Arrow. Its value proposition is it'll get "live" data from point A to B as efficiently as possible, in an open, non-proprietary, ubiquitous (eventually) format.
Exactly what of SQLite, DuckDB and Arrow is being compared here?
I would assume the benefits of Arrow in R (or DataFrames in general) would be getting data from a data engine into your DataFrame runtime as efficiently as possible. (just as interesting might be where and how push-downs are handled)
Perhaps I'm missing the trees for the forest?
No disrespect to the author... Seems like they're on a quest for knowledge, and while the article is confusing to me, it certainly got me thinking.
Disclaimer: I don't read R too good, and I'm still struggling with what exactly Arrow is. (Comparisons like this actually leave me even more confused about what Arrow is)
> What does Arrow have to do with Parquet? We are talking about the file format Parquet, right? Does Arrow use Parquet as its default data storage format?
This question comes up quite often. Parquet is a _file_ format, Arrow is a language-independent _in-memory_ format. You can e.g. read a parquet file into a typed Arrow buffer backed by shared memory, allowing code written in Java, Python, or C++ (and many more!) to read from it in a performant way (i.e. without copies).
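For example, with the R arrow package it looks roughly like this (a sketch, untested; the file name is made up):

  library(arrow)
  # Read the parquet file into an Arrow Table rather than an R data frame;
  # the columns live in Arrow's columnar memory, outside R's heap
  tab <- read_parquet("yrpc.parquet", as_data_frame = FALSE)
  # Convert (copy) to a regular data frame only when you actually need one
  df <- as.data.frame(tab)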
Another way of looking at it, if you have a C++ background, is that (roughly speaking) it makes C++'s coolest feature, templates, along with the performance gains from the resulting inlinable generated code, available in other languages. For example, you can write `pa.array([1, 2], type=pa.uint16())` in Python, which translates roughly to `std::vector<uint16_t>{1, 2}` in C++. But it's not quite that: Arrow arrays actually consist of several buffers, one of which is a bitmask indicating whether each item in the array is valid or missing (what was previously accomplished with NaN).
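The same idea from R, showing the validity bitmap at work (untested sketch using the arrow package):

  library(arrow)
  # NA becomes a null in the array's validity bitmap rather than a sentinel value
  a <- Array$create(c(1L, NA, 3L))
  a$null_count  # 1
  a$type        # int32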
While I'm not a huge fan of Arrow's inheritance-based C++ implementation (it's quite clunky to say the least), it's an important project IMHO.
Next, why compare Arrow with SQLite and DuckDB? Because it's what it's being used for already! For example, PySpark uses Arrow to mediate data between Python and Scala (the implementation language), providing access to the data through an SQL-like language.
Makes sense. I should have included this functionality in my description of the value Arrow brings:
> read ... into a typed Arrow buffer backed by shared memory, allowing code written in Java, Python, or C++ (and many more!) to read from it in a performant way (i.e. without copies).
Very powerful indeed.
You lost me here though:
> Next, why compare Arrow with SQLite and DuckDB? Because it's what it's being used for already!
What is already being used for what?
The example that follows that describes the advantages of PySpark (Python/Scala) using Arrow makes sense, but I'm having trouble understanding your assertion relating it to SQLite and DuckDB?
Let's say you have some data. You can choose to store it in a relational DB, like SQLite or DuckDB, or you can store it in a parquet file (and load it into an Arrow buffer).
And the point is that if you combine Arrow with, say, Spark, then as a user you can accomplish something similar to what you might accomplish with a relational DB. But you don't need to hassle with setting up a DB server and maintaining it. All you need is a job that outputs a parquet file, and uploads that to S3. And then Spark - through Arrow! - will allow you to execute queries against that DB.
Using Arrow + Spark, you get the ability to query a dataframe as if it were SQL, but you can still do pandas-style stuff, i.e. treat it as a dataframe. OTOH you lose the more esoteric SQL stuff like fancy constraints, triggers, and foreign keys.
>Next, why compare Arrow with SQLite and DuckDB? Because it's what it's being used for already! For example, PySpark uses Arrow to mediate data between Python and Scala (the implementation language), providing access to the data through an SQL-like language.
That's like comparing SQLite to Scala because Spark is written in Scala and exposes a SQL interface.
This isn’t a comp sci geek doing performance analysis. It’s a guy who wants to do stats, and then throws up 3 options and discusses tradeoffs - memory and time consumption. He got his little workflow from a few minutes to a second, and that is what matters to him.
I was also confused at the headline but felt the comparison made sense by the end of the article.
> Why not at least benchmark the query execution alone instead of startup and loading of data?
Because they don’t intend to have R running with 4 GB of data in memory constantly.
I’d assume there’s a SQLite/DuckDB instance running, but that its overhead is negligible. Maybe the test even includes starting the database server.
So the perspective is “I want some data on my hard drive, and I want to query it from R every once in a while, what are my options?”
And suddenly a file format becomes comparable to a database.
Another way to look at it is to consider the compiled arrow code in the arrow library as a database engine. Then arrow is just another alternative to SQLite.
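In R that might look something like this (a rough sketch, untested; assumes a recent arrow version that can push grouped summaries down to its compute engine, reuses the article's column names, path made up):

  library(arrow)
  library(dplyr)
  # Nothing is read yet; arrow only scans the parquet metadata
  ds <- open_dataset("yrpc_parquet/")
  # filter/group_by/summarise are pushed down to arrow's C++ engine;
  # only the aggregated result is materialised as an R data frame
  ds %>%
    filter(year == 2018) %>%
    group_by(reporter_iso) %>%
    summarise(exports = sum(trade_value_usd_exp)) %>%
    collect()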
I get what you're saying. This is their local workflow. Makes a little more sense looking at it this way. But...
In my opinion, the point about formats remains and the purpose of Arrow is lost.
I don't know enough about Arrow, but surely there is a better storage format than Parquet, although storage isn't a primary consideration for Arrow. The purpose of Arrow is to not have to convert from one format to another: data can be efficiently transferred from RAM across the wire to RAM again without any significant transformation.
Surely there is a storage representation for Arrow that delivers characteristics similar to its intended use? E.g.:
disk (arrow format?) -> RAM (arrow)
instead of
disk (parquet) -> RAM (parquet) -> CPU (transform) -> RAM (arrow)
There is a penalty (time and memory) dealing with parquet (or any intermediary format) and then transforming it to Arrow. What's the point of using Arrow if this is what you're going to do? Just use a parquet library instead of arrow (it's unclear to me what is actually performing the query in the Arrow step? Is it the R dataframe query, or did it push the query down to the Arrow data engine?).
After all, isn't this exactly how SQLite is being tested in this case? The original data file is loaded into SQLite and stored in SQLite's native file format providing all of the des-ser advantages SQLite provides out of the box. Not to mention the indexing that's defined as part of this preparation.
I think the best way of looking at this is arrow-format = feather-format
I think the format layouts are the same, with the main difference between them being compression. Compressed data can in many cases be faster to read/scan than uncompressed data (https://stackoverflow.com/questions/48083405/what-are-the-di...), so the CPU transform above is simply a decompress step, which is notably simpler than what SQLite does to transform the data into its own structures.
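A small sketch of what that looks like from R (the compression choice and file name are just examples):

  library(arrow)
  # Feather V2 is essentially the Arrow IPC format written to a file;
  # compression is optional and applied per buffer
  write_feather(mtcars, "cars.feather", compression = "zstd")
  # Reading back is mostly decompression plus pointer fix-up,
  # not a conversion into a different data structure
  cars <- read_feather("cars.feather")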
What we're looking at here in this test is a direct-data-access (DDA) pattern, which is important: you can avoid ETL caching steps, e.g. Parquet->Postgres (which have an ingestion time), if you can access the data quickly enough for your use case. And if the data is on, say, S3, you can have multiple (parallel) readers on the same data rather than a connection pool to a database.
It also allows joining different larger-than-memory datasets efficiently, and avoids much of the infrastructure costs for something like Presto (Athena has per-query costs instead of infrastructure costs).
What I'd have liked to see would be a Vaex/DuckDB benchmark, as the main differentiation between them appears to be SQL vs df/LINQ/dplyr/Rx semantics.
The Parquet libs I’ve used let you read the Arrow representation straight out of a Parquet file.
There’s also an Arrow “blob” format that can be used, but that’s not meant to be used as a full persistence format (as I understand it).
The value proposition is that increasing numbers of query/analysis “engines” support the Arrow in memory format which is fantastic for interoperability.
“I need a data frame library like Pandas but I’m not in Python” is no longer an issue.
I think it works best when you have multiple languages in one process. Eg a Java app passing data to an embedded python script, where both languages manipulate the same data structures in memory.
There isn't a daemon to speak of; they are shared libraries that you load and call functions on.
I don't know much about R, but if it considers the whole memory space of the R process, it probably counted the memory consumption of the libraries themselves as well.
The memory profiling only measures allocations on the R heap (i.e. objects managed by R's GC). It doesn't measure allocations by other libraries running in the same process, unless they use R's allocation machinery.
I assume Parquet has a close relationship with Arrow (project wise) and the parquet libraries I’ve used all supported reading the Arrow representation straight from the Parquet file.
I imagine they’re all getting compared because 2 of them are “embedded” relational DBs (even though their data structure layouts are optimised for different things) and the remaining option appears to present similar benefits and functionality in a different style.
I'd bet that doing the same with Pandas would require time and space similar to RDS (1). I really hope DuckDB makes it in the Python world, everything I read about it seems very promising. Using it myself for toy projects was pleasant, too.
Of course, since the memory allocation comes from Rprofmem (via benchmark::mark), this only measures allocations of memory for objects on R's heap. Allocations by C extensions (like DuckDB and SQLite) aren't tracked. They're surely _more_ space efficient than just reading everything into RAM, but perhaps by a smaller margin than shown here.
Here’s hoping that DuckDB will add support for spatial data, indexing, and query predicates. It would be great if this were a first-class feature instead of bolted on like SpatiaLite is to SQLite.
Are you a contributor? Is there a way to bring this up with the team? Perhaps a place to start would be to add support for storing OGC Simple Features and corresponding to/from conversion functions. I don’t have a lot of spare time, but may want to take a stab at a proof of concept if some developers could help orient me to the code.
DuckDB developer here - we absolutely welcome outside contributions. Feel free to open an issue or discussion on our github for a feature request, and we would be happy to point you in the right direction!
I thought pandas was in some sense evolving towards arrow? (For those who aren’t aware, they share a co-creator).
Edit: that said, a benchmark would be worthwhile, and I similarly hope the tidyverse evolves towards arrow speed, since they also share a co-creator.
Lots of confusion derives from the best-of-breed parquet readers for Python and R residing in the Arrow packages, mostly because Arrow is (and does) a lot of things.
There's:
* arrow, an in-memory format for dataframe-shaped things supporting fast computation, zero-copy sharing, etc.
* arrow Feather V1, an on-disk format for dataframe-shaped things
* arrow IPC, a (de)serialization format for arrow buffers and a protocol for sending them to/receiving them from other processes.
* arrow Feather V2, an on-disk format that's basically the IPC serialization written to file[1]
* arrow Flight, a protocol for requesting/sending/receiving data to remote machines that's basically gRPC layered over the IPC format
* arrow DataFusion/Ballista, nascent system(s) for local/distributed query execution over arrow buffers
* other subprojects I'm surely forgetting
* a (very good) C++ Parquet reader [2]/[3] developed under the auspices of the project
* libraries in / bindings to many languages, including R and Python, supporting interaction with (subsets of) the above.
It's only the last piece that's exposed to most data-science-y users, and thus identified with the 'arrow' name. Since those libraries are also very good, and hiding their abstractions well, those users are free to use the functionality relevant to them, be it dealing with parquet, feather, etc. without needing to understand how they work.
Not that this is a criticism of the project, or those users! Arrow encompasses lots of different functionality, which enables it to provide different things to different people. As a result, though, 'Arrow' connotes lots of different things (and different _sorts_ of things) to different users, which can cause some confusion if terms aren't fully specified, or even a bit misunderstood
DuckDB is faster than SQLite on an analytical type of query. That's expected, because DuckDB is column-based and designed exactly for this type of query.
The analysis would be more telling if it measured insert performance as well.
Kind of. The dataset is small, and he probably needs to create the database at most once per day but has to read it a lot more for his analysis/exploration.
It does work well.
In fact, it's probably more common for people to use external formats like Parquet or Arrow than its internal format, since the internal format isn't stable yet.
I am certain I'm in the wrong here but I'm struggling to understand Arrow's USP. I (originally) assumed it meant python/R users would be able to get around memory limitations when model-fitting but all the examples I've come across are just data manipulation and none of the main modeling packages support it. Those who are using it, what am I missing?
Arrow eliminates ser/de, and if all actors in the workflow use the format you could see drastic performance improvements for a wide variety of workloads. I've seen ser/de for multi-GB+ processes take up half of the total clock time of the task.
Arrow is a language-independent memory layout. It’s designed so that you could stream memory from (for example) a Rust data source to Spark/DataFusion/Python/whatever else with faster throughput and support for zero-copy reads, and no serialisation/deserialisation overhead. Having the same memory model ensures better type and layout consistency as well, and means that query engines can get on with optimising and running queries rather than also having to worry about IO optimisations.
I’m using DataFusion (via Rust) and it’s pretty fantastic. Would love to swap out some Spark stuff for DataFusion/Ballista stuff at some point as well.
As a data scientist, I also found this pretty confusing, so I spent some time trying to understand it better. I wrote it up as a blog post:
Demystifying Apache Arrow
https://www.robinlinacre.com/demystifying_arrow/
My use case is that since Arrow keeps all data types and dumps the in-memory table to disk, this allows me to back up my work and later reload the data and keep going. Loading and writing the data to disk is very fast. It’s much better than using HDF5 for me in that regard.
And in most cases, if you memory map (mmap on Linux/BSD, MapViewOfFile on Windows) it’s way faster than reading the file, because you only ever read what’s needed, and it stays in the page cache between invocations.
I enjoyed this comparison, thanks! Here is a related generally R-centric comparison that you might enjoy of DuckDB, dplyr, data.table, etc. applied to five data-sciency problems I wrote up a few months ago: https://github.com/bwlewis/duckdb_and_r
>dbSendQuery(con, "CREATE INDEX year ON yrpc (year)")
>dbSendQuery(con, "CREATE INDEX reporter_iso ON yrpc (reporter_iso)")
Would SQLite use these for the given query?
It seems they wouldn't help with grouping, and the grouping seems antagonistic to filtering. If you're trying for performance, you can avoid having the query engine refer back to the original table during processing by using a covering index that includes the data fields in addition to the query fields:
"CREATE INDEX covering (reporter_iso, year, trade_value_usd_exp, trade_value_usd_imp)"
or maybe even something like
"CREATE INDEX covering (reporter_iso, year, reporter_iso, trade_value_usd_exp, trade_value_usd_imp)"
Though I'm not sure how its query planner would do with a duplicate column in the index.
It was an offhand suggestion. I was thinking to keep them in the same order as the group-by while still keeping the field in front for the initial filter, but given a moment's thought I doubt it would make any difference, besides wasting some space in the index.
Does the R code for, e.g., SQLite actually build an SQL query?
This just looks like a tidyverse library comparison. I'd expect a benchmark using their own libs, rather than assuming tidyverse will have an optimal way of querying them.
Yes, it does. The `collect()` function at the end is responsible for executing the preceding chain in the database.
The fact that `tbl` at the beginning is called with a connection ensures that the following functions don't execute on a dataframe but instead build up a query.
R, kinda like Julia, executes different versions of a function, depending on the type of the first parameter.
From an API point of view, I think it is absolutely ingenious!
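A rough sketch of what that looks like, reusing the article's table and column names (`con` is the DBI connection from the article; untested):

  library(DBI)
  library(dplyr)
  q <- tbl(con, "yrpc") %>%        # lazy reference to the table, no data pulled yet
    filter(year == 2018) %>%
    group_by(reporter_iso) %>%
    summarise(exports = sum(trade_value_usd_exp))
  show_query(q)   # print the SQL that dbplyr generated
  collect(q)      # only now does the query actually run in the database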
There's an R package for that! Well, multiple really. The `sqldf`[1] package has been around for a while, there's `tidyquery`[2] and also one I wrote called `duckdf`[3]. All three support writing SQL queries to directly access dataframes.
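For example, with sqldf (a one-liner sketch):

  library(sqldf)
  # sqldf copies the data frame into a temporary SQLite DB behind the scenes
  sqldf("SELECT cyl, AVG(mpg) AS avg_mpg FROM mtcars GROUP BY cyl")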
TLDR: Arrow and DuckDB provide fast database aggregates compared with R's RDS format and, to an extent, SQLite.
It is unclear how much functionality is available for Arrow under R: any comments? It would also be interesting to see a similar benchmark for Python, which could include the embedded version of MonetDB -- an R package for MonetDB/e is not yet available.
Edit: amended the TLDR to reflect jhoechtl's and wodenokoto's comments. SQLite provided reasonably memory-efficient aggregates.
RDS is slow to load because it has to unzip and read everything into memory. All the others are fast to load because they have some sort of index into the data on disk (at the cost of being much larger at rest).
Everything else is fast to load compared to RDS, with arrow being the fastest because its index happened to be optimized for the test query.
> TLDR: Arrow and DuckDB provide fast and memory efficient database aggregates compared with R's RDS format and SQLite. Arrow and RDS were fast to load.
Nowhere did I read that the authors come to the conclusion that SQLite does NOT provide fast and memory-efficient database aggregates. Instead, they conclude:
> SQLite and DuckDB files consists in a single large file (3 GB and 2.5 GB each), but the indexes we created allow their respective packages to read a copy of the tables that has just the year and reporter_iso columns, and therefore allows very fast filtering to provide the exact location of what we need to read in the large tables.
Although this article focuses on dplyr, which provides a uniform API over various data sources, I guess the choice depends more on personal taste: do you prefer staying in R/Python or in SQL?
If you prefer SQL, using raw Arrow or in-memory data doesn't make sense, except that DuckDB does support SQL queries over these non-DB data sources as well.
Anyway, the R data ecosystem is very rich for both R and SQL fans, and as a Python user I feel a bit envious.
I'm a bit surprised the DuckDB file size is that large (5x Parquet, and nearly as large as SQLite). I haven't used DuckDB, but I thought it had columnar storage and so would have better compression.
DuckDB developer here, compression is still a work in progress and is being incrementally implemented and added to the system. For now the storage is still mostly uncompressed, but expect this to improve in the coming months.
DuckDB developer here, DuckDB has a query engine that can directly query external data formats (stored in CSV, Parquet, Arrow, Pandas, etc.) without first importing the data, but it also has its own columnar ACID-compliant storage format.
It can certainly be used in the same manner as DataFusion, but can also serve as a stand-alone database system. DuckDB aims to have much more comprehensive SQL support beyond only SELECT queries.
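For example, from R, querying a parquet file in place (a sketch; the file name is made up, column names taken from the article):

  library(DBI)
  library(duckdb)
  con <- dbConnect(duckdb::duckdb())
  # The parquet file is scanned directly; nothing is imported into DuckDB's own storage
  dbGetQuery(con, "
    SELECT reporter_iso, SUM(trade_value_usd_exp) AS exports
    FROM parquet_scan('yrpc.parquet')
    WHERE year = 2018
    GROUP BY reporter_iso
  ")
  dbDisconnect(con, shutdown = TRUE)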
Not that you asked for advice, but you should probably consider spinning off a secondary product from DuckDB with a sole focus on "reading data from parquet files and running aggregations as efficiently as possible". You can probably skip INSERT, UPDATE, and DELETE completely.
There is currently a gap in practical solutions for this pain point. You can use Spark or Airflow, but nothing that comes without a big infra price tag (you can do that with pandas, but you need a large instance to load the entire dataset in memory). I think the right product could even outpace what you currently have with DuckDB.
I don't know what DataFusion is, or what Arrow is, but DuckDB is, in my understanding (and use of it), an "almost" drop-in replacement for SQLite, except with better performance.
Anytime you might want to use SQLite as an actual RDBMS (not as an easy disk file format), drop in DuckDB instead and get much better performance. DuckDB is actually designed to support OLAP.
> Anytime you might want to use SQLite as an actual RDBMS (not as an easy disk file format), drop in DuckDB instead and get much better performance. DuckDB is actually designed to support OLAP.
Aren't RDBMS good at both OLAP and OLTP while not being the best at each?
I don't think that's the case, but they serve the same purpose anyway: getting the CSV or parquet files you got from the data warehouse into something you can use for analytics.
Memory-based storage is super fast and fine until your datasets reach a terabyte. So you have a few solutions left:
1. You fire up a fresh Postgres and store your data there. It's going to take quite a bit of time to load those 600M lines into the instance.
2. You have some kind of tool (DataFusion) that allows you to run SQL queries on those parquet files without going through a painful copy/insert process. Performance matters a bit, but not as much, because you're already avoiding a full dataset copy and a large-instance expense in the process. Even a 40% perf hit vs Postgres is absolutely acceptable, because you're winning on both fronts (total execution time and financial expense).
Yes, it's not the same thing, but they serve the same purpose. Honestly, it's not important whether DataFusion or DuckDB use the arrow memory layout or not. What matters is their ability to run SQL queries (or map-reduce workloads) on CSV/Parquet files _WITHOUT COPYING THEM_.
If you start comparing them to solutions that copy datasets, you haven't understood what problem they are solving. For that problem, use postgresql or bigquery.
DuckDB developer here, DuckDB does not use Arrow as its in-memory format directly but uses something relatively comparable and has interfaces for (quickly) converting data back and forth to Arrow. Expect a blog post on this soon!
Why aren't indexes being created on the RDS instance? I know it's out of scope based on the title, but also it's used for comparison in the final results table.
Might as well try Arrow Feather, there's an R binding for it. This Flatbuffers-based format can be memory mapped and it requires very little decoding. Super fast and light on resources.
Also, Parquet supports different encodings, some of which may be faster, denser, and/or consume less RAM, or all of the above.