As a software engineer who later learned SQL, I could not disagree more. Within the parameters it is designed for, SQL is a terrific language that makes exploring and manipulating data much easier than tools like Python or Scala. That doesn't mean I have no place for Python or Scala, just that I see a definite class of problems where an SQL interface is far superior.
I use Python/Pandas every day for data analysis and the like, and I would never dream of not writing most of the aggregation and filtering logic in SQL. If you're working with large datasets, there is absolutely no reason to pull unnecessary data into memory.
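For example (a minimal sketch - the connection string and the `orders` table here are made up):

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string and table names.
engine = create_engine("postgresql://user:pass@dbhost/sales")

# Anti-pattern: pull the whole table, then aggregate in memory.
# df = pd.read_sql("SELECT * FROM orders", engine)
# daily = df[df["status"] == "shipped"].groupby("order_date")["total"].sum()

# Better: the database filters and aggregates, and only the small
# result set ever crosses the wire into Pandas.
daily = pd.read_sql(
    """
    SELECT order_date, SUM(total) AS revenue
    FROM orders
    WHERE status = 'shipped'
    GROUP BY order_date
    """,
    engine,
)
```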
I'm not sure what the thinking is today - I would hope the same mindset applies, but maybe not - but back when I was using SQL, the idea was to let the database engine do everything it could with the data, before sending the results over the pipe.
That is, minimize the network bandwidth by putting the work on the DB engine.
This of course necessitated creating and understanding proper query-building practices. It was really easy to mess up if you didn't know what you were doing (i.e. inner selects, improper joins, etc.) and cause a combinatorial explosion that would consume all the RAM on the server and grind it to a halt.
That, or you'd bring back a load of data and then filter it on the "client" - better to let the DB server do that if you can. Of course, this was back when the clients were 486s and early Pentiums with maybe 8-16 MB of RAM. Today it's a bit different, but you still want to minimize the network traffic.
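To make the two failure modes concrete (a toy sketch - the `orders`/`customers` schema is made up):

```python
# Hypothetical tables: orders(customer_id, total), customers(id, region).

# Improper join: no join condition, so the engine builds the full cross
# product - rows(orders) * rows(customers) - and the server (or the
# 8-16 MB client of the era) drowns in it.
bad = "SELECT * FROM orders, customers"

# Client-side filtering: ships every row over the pipe, then throws
# most of them away in application code.
wasteful = "SELECT * FROM orders"  # ...then filter total > 100 locally

# Server-side: a proper join condition plus a WHERE clause, so only
# the matching rows ever leave the database.
good = """
    SELECT o.total, c.region
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE o.total > 100
"""
```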
> I would hope the same mindset applies, but maybe not - but back when I was using SQL, the idea was to let the database engine do everything it could with the data, before sending the results over the pipe.
We're a machine learning shop, and this is absolutely one of our core design principles.
Pulling into memory is an attribute of the implementation - you could write LINQ in C# (which is a great abstraction too) and not care about the fact that it's translated to SQL that runs server-side.
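The same idea exists in Python land, for what it's worth - a sketch with SQLAlchemy (the `users` table is made up). The query is composed as objects in the host language and only compiled to SQL when it hits the engine, so the filtering and sorting run server-side:

```python
from sqlalchemy import (
    create_engine, select, Table, Column, Integer, String, MetaData,
)

metadata = MetaData()
# Hypothetical table definition.
users = Table(
    "users", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String),
    Column("age", Integer),
)

# Composed as Python objects, much like a LINQ expression tree...
stmt = select(users.c.name).where(users.c.age >= 18).order_by(users.c.name)

# ...and only rendered as SQL when executed against an engine.
print(stmt)
# SELECT users.name FROM users WHERE users.age >= :age_1 ORDER BY users.name

engine = create_engine("sqlite:///:memory:")
metadata.create_all(engine)
with engine.connect() as conn:
    rows = conn.execute(stmt).fetchall()
```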
Depending on the use case, I'd argue Spark can be a better choice for aggregating/filtering than SQL. SQL is great for simple queries, but once my queries start getting into the many hundreds of lines, then I start to miss all of the complexity management features of a true programming language.
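A sketch of what that buys you in PySpark (the column names and paths are made up): each step is a named, independently testable function you can compose and reuse, whereas a 500-line SQL string is all or nothing:

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example").getOrCreate()

# Each transformation gets a name and can be unit-tested on its own,
# instead of becoming another layer of nested subqueries.
def shipped_orders(orders: DataFrame) -> DataFrame:
    return orders.filter(F.col("status") == "shipped")

def revenue_by_region(orders: DataFrame, customers: DataFrame) -> DataFrame:
    return (
        orders.join(customers, orders.customer_id == customers.id)
              .groupBy("region")
              .agg(F.sum("total").alias("revenue"))
    )

orders = spark.read.parquet("/data/orders")        # hypothetical paths
customers = spark.read.parquet("/data/customers")
result = revenue_by_region(shipped_orders(orders), customers)
```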
There are a few things, but not much where I've had a better experience in Spark compared to using something like Apache Pig with UDFs. Now, this part might be a matter of how things are set up where I work, but I find working with Tez for process management and debugging to be far easier than working with the process management built into Spark.
EDIT: when you read "process management" above, perhaps it's better to think "task management".