I'm trying to better understand why analysts and data scientists sometimes prefer to set up a data lake, which I define as: "a collection of files, in an open-source format like Parquet or ORC, in a blob store like S3". I often hear two reasons:
- Cost
- Supporting data science workloads
However, the emergence of data warehouses that separate compute from storage, like Snowflake, BigQuery and Redshift RA3, has drastically reduced the cost advantage of data lakes. And in my experience, data science workloads are generally compute-bound, so in principle you can just execute a SQL query from your data science environment (Spark, Python) against your data warehouse; the overhead of serializing and deserializing your data an extra time should be a tiny fraction of your overall runtime.
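For example, I picture the "data science against the warehouse" workflow being roughly as simple as this (a BigQuery-flavoured sketch; the project and table names are invented):

```python
# pip install google-cloud-bigquery
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")  # placeholder project

sql = """
    SELECT user_id, SUM(amount) AS lifetime_spend
    FROM `my-analytics-project.sales.orders`   -- hypothetical table
    GROUP BY user_id
"""

# The heavy lifting runs on the warehouse's compute; only the (already
# aggregated) result set is serialized over the wire into pandas.
df = client.query(sql).to_dataframe()
print(df.describe())
```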
I suspect that the reason some data scientists prefer data lakes is cultural rather than technical: basically, they have an irrational dislike of SQL and RDBMSs. However, I haven't done data science work myself in years, so maybe I'm just not getting it. Are there good reasons for using a data lake rather than Snowflake, BigQuery, or Redshift RA3? Please enlighten me!
"Raw" data lands in the lake S3 or similar object store. Processing pipelines take data, slice and combine, and push data back to the lake. Data in the lake will vary from the truly raw like log files to highly processed. Keeping the raw analogy we could say it's like the difference between a cow carcass and a carefully butchered filet mignon.
But you nearly always need a way to query and analyse the raw data, so you have an engine sitting on top, like Snowflake. Engines can help drive your processing, since it's much easier to write pipelines against them, and they also let people access the data at a higher level of abstraction.
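As a sketch of that "engine on top" idea (placeholder credentials, stage and table names), you can point Snowflake at the Parquet files already sitting in the lake and query them with plain SQL:

```python
# pip install "snowflake-connector-python[pandas]"
import snowflake.connector

# Connection parameters are placeholders.
conn = snowflake.connector.connect(
    account="my_account", user="me", password="...",
    warehouse="ANALYTICS_WH", database="LAKE", schema="RAW",
)
cur = conn.cursor()

# Expose the lake's Parquet files as an external table (names are made up).
cur.execute("""
    CREATE STAGE IF NOT EXISTS raw_logs_stage
      URL = 's3://my-company-lake/raw/app-logs/'
      CREDENTIALS = (AWS_KEY_ID='...' AWS_SECRET_KEY='...')
""")
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS raw_logs
      LOCATION = @raw_logs_stage
      FILE_FORMAT = (TYPE = PARQUET)
""")

# Anyone can now poke at the "carcass" without touching the pipeline code.
df = cur.execute(
    "SELECT value:event_type::string AS event_type, COUNT(*) AS n "
    "FROM raw_logs GROUP BY 1 ORDER BY n DESC"
).fetch_pandas_all()
print(df.head())
```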
It's not an either/or.