As a software engineer who later learned SQL, I could not disagree more. Within the parameters it is designed for, SQL is a terrific language that makes exploring and manipulating data much easier than tools like Python or Scala. That doesn't mean I have no place for Python or Scala, just that I see a definite class of problems where an SQL interface is far superior.
I use Python/Pandas every day for data analysis and the like, and I would never dream of not writing most of the aggregation and filtering logic in SQL. If you're working with large datasets, there is absolutely no reason to pull unnecessary data into memory.
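For example (a minimal sketch - the connection string and the `orders` table here are made up):

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string and table names.
engine = create_engine("postgresql://user:pass@dbhost/sales")

# Anti-pattern: pull the whole table, then aggregate in memory.
# df = pd.read_sql("SELECT * FROM orders", engine)
# daily = df[df["status"] == "shipped"].groupby("order_date")["total"].sum()

# Better: the database filters and aggregates, and only the small
# result set ever crosses the wire into Pandas.
daily = pd.read_sql(
    """
    SELECT order_date, SUM(total) AS revenue
    FROM orders
    WHERE status = 'shipped'
    GROUP BY order_date
    """,
    engine,
)
```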
I'm not sure what the thinking is today - I would hope the same mindset applies, but maybe not - but back when I was using SQL, the idea was to let the database engine do everything it could with the data, before sending the results over the pipe.
That is, minimize the network bandwidth by putting the work on the DB engine.
This of course necessitated creating and understanding proper query-building practices. It was really easy to mess up if you didn't know what you were doing (i.e. inner selects, improper joins, etc.) and cause a combinatorial explosion that would consume all the RAM on the server and grind it to a halt.
That, or you'd bring back a load of data and then filter it on the "client" - better to let the DB server do that if you can. Of course, this was back when the clients were 486s and early Pentiums with maybe 8-16 MB of RAM. Today it's a bit different, but you still want to minimize the network traffic.
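To make the two failure modes concrete (a toy sketch - the `orders`/`customers` schema is made up):

```python
# Hypothetical tables: orders(customer_id, total), customers(id, region).

# Improper join: no join condition, so the engine builds the full cross
# product - rows(orders) * rows(customers) - and the server (or the
# 8-16 MB client of the era) drowns in it.
bad = "SELECT * FROM orders, customers"

# Client-side filtering: ships every row over the pipe, then throws
# most of them away in application code.
wasteful = "SELECT * FROM orders"  # ...then filter total > 100 locally

# Server-side: a proper join condition plus a WHERE clause, so only
# the matching rows ever leave the database.
good = """
    SELECT o.total, c.region
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE o.total > 100
"""
```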
> I would hope the same mindset applies, but maybe not - but back when I was using SQL, the idea was to let the database engine do everything it could with the data, before sending the results over the pipe.
We're a machine learning shop, and this is absolutely one of our core design principles.
Pulling into memory is an attribute of the implementation - you could write LINQ in C# (which is a great abstraction too) and not care about the fact that it's translated to SQL that runs server-side.
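The same idea exists in Python land, for what it's worth - a sketch with SQLAlchemy (the `users` table is made up). The query is composed as objects in the host language and only compiled to SQL when it hits the engine, so the filtering and sorting run server-side:

```python
from sqlalchemy import (
    create_engine, select, Table, Column, Integer, String, MetaData,
)

metadata = MetaData()
# Hypothetical table definition.
users = Table(
    "users", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String),
    Column("age", Integer),
)

# Composed as Python objects, much like a LINQ expression tree...
stmt = select(users.c.name).where(users.c.age >= 18).order_by(users.c.name)

# ...and only rendered as SQL when executed against an engine.
print(stmt)
# SELECT users.name FROM users WHERE users.age >= :age_1 ORDER BY users.name

engine = create_engine("sqlite:///:memory:")
metadata.create_all(engine)
with engine.connect() as conn:
    rows = conn.execute(stmt).fetchall()
```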
Depending on the use case, I'd argue Spark can be a better choice for aggregating/filtering than SQL. SQL is great for simple queries, but once my queries start getting into the many hundreds of lines, then I start to miss all of the complexity management features of a true programming language.
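A sketch of what that buys you in PySpark (the column names and paths are made up): each step is a named, independently testable function you can compose and reuse, whereas a 500-line SQL string is all or nothing:

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example").getOrCreate()

# Each transformation gets a name and can be unit-tested on its own,
# instead of becoming another layer of nested subqueries.
def shipped_orders(orders: DataFrame) -> DataFrame:
    return orders.filter(F.col("status") == "shipped")

def revenue_by_region(orders: DataFrame, customers: DataFrame) -> DataFrame:
    return (
        orders.join(customers, orders.customer_id == customers.id)
              .groupBy("region")
              .agg(F.sum("total").alias("revenue"))
    )

orders = spark.read.parquet("/data/orders")        # hypothetical paths
customers = spark.read.parquet("/data/customers")
result = revenue_by_region(shipped_orders(orders), customers)
```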
There are a few things, but not much where I've had a better experience in Spark compared to using something like Apache Pig with UDFs. Now, this part might be a matter of how things are set up where I work, but I find working with Tez for process management and debugging to be far easier than working with the process management built into Spark.
EDIT: when you read "process management" above, perhaps it's better to think "task management".