> What does Arrow have to do with Parquet? We are talking about the file format ...

chrisjc · on Oct 27, 2021

Makes sense. I should have included this functionality in my description of the value Arrow brings:

> read ... into a typed Arrow buffer backed by shared memory, allowing code written in Java, Python, or C++ (and many more!) to read from it in a performant way (i.e. without copies).

Very powerful indeed.

You lost me here though:

> Next, why compare Arrow with SQLite and DuckDB? Because it's what it's being used for already!

What is already being used for what?

The example that follows that describes the advantages of PySpark (Python/Scala) using Arrow makes sense, but I'm having trouble understanding your assertion relating it to SQLite and DuckDB?

seertaak · on Oct 28, 2021

> What is already being used for what?

Let's say you have some data. You can choose to store it in a relational DB, like SQLite or DuckDB, or you can store it in a parquet file (and load it into an Arrow buffer).

And the point is that if you combine Arrow with, say, Spark, then as a user you can accomplish something similar to what you might accomplish with a relational DB. But you don't need to hassle with setting up a DB server and maintaining it. All you need is a job that outputs a parquet file, and uploads that to S3. And then Spark - through Arrow! - will allow you to execute queries against that DB.

Using Arrow + Spark, you get the ability to a dataframe as if it's SQL, but you can still do pandas-style stuff i.e. treat it as a dataframe. OTOH you lose the more esoteric SQL stuff like fancy constraints, triggers, foreign keys.

marcinzm · on Oct 27, 2021

>Next, why compare Arrow with SQLite and DuckDB? Because it's what it's being used for already! For example, PySpark uses Arrow to mediate data between Python and Scala (the implementation language), providing access to the data through an SQL-like language.

That's like comparing SQLite to Scala because Spark is written in Scala and exposes a SQL interface.