SQL is having somewhat of a moment in the big data world, thanks in part to the 'modern data stack' and new-age data warehouses like Snowflake and BigQuery.
However, there is a lot of pushback from 'traditional' data engineers who were trained on Spark/Scala. It's a bit of a hard sell to go from a strongly typed language to free-for-all text-based logic.
I think the following is needed for SQL to finally be accepted as a 'serious' contender:
Create a compiled SQL language (not pandas) that:
1. compiles to SQL and addresses some of the issues brought up in the post, like nested aggregations.
2. makes code reusable. E.g., apply year-over-year growth to any table that has the requisite columns; the compiler should check this in the IDE (see the sketch after this list).
3. makes materializations a first-class concept in the language, with no separate dbt layer.
4. provides a way to package and distribute libraries that you can import into your project.
5. ships with a unit testing framework that makes it easy to test logic without having to set up test tables in the database.
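To make point 2 concrete, here is a hedged sketch in Scala of what a compiler-checked, reusable transformation could look like. Rel, Col, and yoyGrowth are hypothetical names invented for illustration, not an existing library; the check below runs when the query is built, whereas the imagined language would surface it statically in the IDE:

final case class Col[A](name: String)
final case class Rel(name: String, cols: Set[String])

object YoyDemo extends App {
  // Reusable year-over-year growth, applicable to any relation that
  // carries the required year and metric columns (hypothetical API).
  def yoyGrowth(rel: Rel, year: Col[Int], metric: Col[Double]): String = {
    val needed = Set(year.name, metric.name)
    require(needed.subsetOf(rel.cols),
      s"${rel.name} is missing columns: ${needed -- rel.cols}")
    s"""SELECT ${year.name},
       |       ${metric.name} / LAG(${metric.name}) OVER (ORDER BY ${year.name}) - 1 AS yoy_growth
       |FROM ${rel.name}""".stripMargin
  }

  println(yoyGrowth(Rel("sales", Set("yr", "revenue")),
                    Col[Int]("yr"), Col[Double]("revenue")))
}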
Yeah, I do that via dbt by setting up mock data tables in the database and using a macro to use those as sources/refs when run in test mode.
However, what we are doing here isn't 'unit testing'; it's black-box integration testing. When I write equivalent code in Scala, I just test the logic via unit tests.
E.g., for logic that filters out orders that don't qualify, I extract a method in the Scala code and test that logic as part of the development lifecycle (sketched below). There is no dependency on a database.
The analog here is using Selenium or some UI testing framework to test whether a button turns blue when the order exceeds a limit. That's not unit testing.
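For concreteness, a minimal sketch of that kind of extracted, database-free test in Scala. Order, qualifies, and the 100 threshold are all made up for illustration, and plain assert stands in for a real test framework:

final case class Order(id: Long, amount: BigDecimal, cancelled: Boolean)

object OrderRulesSpec extends App {
  // The extracted business rule: a pure function, no database involved.
  def qualifies(o: Order): Boolean =
    !o.cancelled && o.amount > BigDecimal(100)

  assert(qualifies(Order(1, BigDecimal(150), cancelled = false)))
  assert(!qualifies(Order(2, BigDecimal(150), cancelled = true)))
  assert(!qualifies(Order(3, BigDecimal(50), cancelled = false)))
  println("all order-rule checks passed")
}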
> I extract a method in the Scala code and test that logic as part of the development lifecycle. There is no dependency on a database.
Unit testing seems to depend on where the unit of code being tested lives. At the middle tier, you may mock out parts of the code so the tests aren't reliant on external sources (APIs, databases, libraries, etc.).
It seems that unit testing database code would happen at the database layer: it's still a unit test, as the test isn't dependent on external sources.
I was going to say the same. With the prevalence of embedded databases, and how cheap it is to stand up a container for the non-embedded options, build-time testing of queries has never been easier.
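As an example, a build-time smoke test of a query against an embedded in-memory database, sketched in Scala. It assumes the H2 driver (com.h2database:h2) is on the classpath; the table and query are invented for illustration:

import java.sql.DriverManager

object QuerySmokeTest extends App {
  // H2 in-memory database: nothing to install, nothing external to hit.
  val conn = DriverManager.getConnection("jdbc:h2:mem:test;DB_CLOSE_DELAY=-1")
  try {
    val st = conn.createStatement()
    st.execute("CREATE TABLE orders (id INT, amount DECIMAL(10,2))")
    st.execute("INSERT INTO orders VALUES (1, 50.00), (2, 150.00)")

    // The query under test: count orders above a limit.
    val rs = st.executeQuery("SELECT COUNT(*) FROM orders WHERE amount > 100")
    rs.next()
    assert(rs.getInt(1) == 1, "expected exactly one qualifying order")
    println("query behaves as expected")
  } finally conn.close()
}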
It seems like you're suggesting that someone could design a functional-style programming language that compiles to SQL.
2 & 3 are my biggest pain points. I can't just extract functions like I can with a regular programming language. Instead, SQL queries get increasingly complex with no great tools to manage that.
For 3, products like https://materialize.com/ look interesting for being able to create derived materialized views that can efficiently be kept up to date.
I'm not familiar with Ecto, dplyr, or DBT, but I would love an ML-like language to replace SQL. I'm imagining being able to pass a table (or any table-oriented data, like a sub-query) to functions that would type-check columns and would return table-oriented or scalar data. I'm not sure if this is actually possible in practice, but one can dream.
For instance, imagine a "top 10" function that could be re-used on any table (apologies for my pseudo types and code).
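A sketch of what that might look like, in Scala-flavored pseudocode; Table, Col, and topN are invented names, purely illustrative:

final case class Col[A](name: String)
final case class Table(name: String, cols: Set[String]) {
  // Reusable "top 10": rejects tables that lack the ordering column.
  def topN(orderBy: Col[_], n: Int = 10): String = {
    require(cols.contains(orderBy.name), s"$name has no column ${orderBy.name}")
    s"""SELECT * FROM (
       |  SELECT t.*, row_number() OVER (ORDER BY ${orderBy.name} DESC) AS _row
       |  FROM $name AS t
       |) ranked
       |WHERE _row <= $n""".stripMargin
  }
}

// Reusable on any table that actually has the ordering column:
// Table("orders", Set("id", "amount")).topN(Col[BigDecimal]("amount"))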
Not 100% sure about what you're suggesting, but wouldn't it be easier to pass your functions to your table/sub-query?
And that's exactly what you're able to do in most modern data warehouse services, such as Snowflake. Inferences can be contained within internal/external user-defined functions.
This is very reminiscent of what made the big-data/map-reduce movement so notable: sending your query to the data instead of moving your data to the query; sending your model to the data instead of sending the data to the model.
I don't quite follow what you are saying. In some SQL engines you can use the row_number() function and derived tables (or CTEs) to get the top 10:
SELECT Id, col1, col2
FROM (
    SELECT
        Id,
        col1,
        col2,
        row_number() OVER (PARTITION BY columnkey1, columnkey2 ORDER BY anycolumnwilldohere DESC) AS _row
    FROM _table
) AS anytablealiaswilldohere
WHERE _row <= 10
My idea is you wouldn't have to cut and paste implementations like this. You would be able to just call the function with a column name and table name.
I'm currently doing work that uses Elixir's Ecto, which goes a great deal toward what I think you're aiming at. I can write my SQL in a familiar yet functional and composable style, while knowing what SQL will be produced in the end. Ecto, as I understand it, was inspired by Microsoft's C# LINQ. I've not worked with LINQ, but I've heard praise for it similar to what Ecto gets.
I'm saying this with most of my experience being in SQL.