SQL is one the most amazing concepts I've ever experienced. It's nearly 5 decades old and there is no sign of a replacement. We've created countless other technologies to store and process data, and we always seem to try to re-create SQL in those technologies (e.g. Hive, Presto, KSQL, etc).
I run a early stage company that builds analytics infrastructure for companies. We are betting very heavily on SQL, and Craigs post rings true now more than ever.
Increasingly, more SQL is written in companies by analysts and data scientists than typical software engineers.
The advent of the MMP data warehouse (redshift, bigquery, snowflake, etc) has given companies with even the most limited budget the ability to warehouse and query an enormous amount of data just using SQL. SQL is more powerful and valuable today than it ever has been.
When you look into a typical organization, most software engineers aren't very good at SQL. Why should they be? Most complex queries are analytics queries. ORMs can handle a majority of the basic functions application code needs to handle.
Perhaps going against Craig's point is the simple fact that we've abstracted SQL away from a lot of engineers across the backend and certainly frontend and mobile. You can be a great developer and not know a lot about SQL.
On the other end of the spectrum are the influx of "data engineers" with basic to intermediate knowledge of HDFS, streaming data or various other NoSQL technologies. They often know less about raw SQL than even junior engineers because SQL is below their high-power big data tools.
But if you really understand SQL, and it seems few people truly today, you command an immense amount of power. Probably more than ever.
"we always seem to try to re-create SQL in those languages (e.g. Hive, Presto, KSQL, etc)."
This is largely because of the number of non-programmers who know SQL. Add an SQL layer on top of your non-SQL database and you instantly open up a wide variety of reporting & analytics functionality to PMs, data scientists, business analysts, finance people, librarians (seriously! I have a couple librarian-as-in-dead-trees friends who know SQL), scientists, etc.
In some ways this is too bad because SQL sucks as a language for many reasons (very clumsily compositional; verbose; duplicates math expressions & string manipulation of the host language, poorly; poor support for trees & graphs; easy to write insecure code; doesn't express the full relational algebra), but if it didn't suck in those ways it probably wouldn't have proven learnable by all those other professions that make it so popular.
SQL is amazing IF you understand it. You need to think in sets of things. It's like functional programming paradigms or recursion; once you really truly "get it" you start to feel like a Jedi master.
Unfortunately the vast majority of SQL users aren't that proficient. It's also fairly hard to learn because it's something that you only pick up with experience and specifically longer time experience with a sufficiently complex data model.
I personally would never have gotten good at SQL if I hadn't stayed in my first two jobs for 5 years each working daily with the data-models and the domain.
I was fortunate that my early SE career experience involved a lot of SQL (and my first experience with a SQL-like query language happened under PICK).
But as my career has gone forward, I'm touching it less and less; today, I hardly touch it at all outside of my personal usage.
Much of that has to do with the fact that I'm now employed building and maintaining an SPA using javascript and nodejs where the backend is accessed thru a RESTful API; we never get to touch the actual database.
The few times I have seen some queries for that DB - albeit not in our API, though I could probably find them somewhere - all I can hope is that the SQL engine being used does some kind of query optimization on-the-fly, because there's so many inner selects that make me cringe it ain't funny (like I wonder if they've heard of joins and such).
Before, I was involved in a lot of PHP web apps and backend server automation, where I needed to use SQL a lot; I feel like I am getting rusty in it.
You could make a language that had all those benefits + almost all the things nostrademons mentioned. For example, Apache Spark on top of Scala does this pretty well (but it's significantly more complex than SQL is.) Pandas in Python and SAS also have many of these features (but also have a higher learning curve.)
SQL certainly does not suck as a language, unfortunately this is a very common perception, almost a (false) consensus within the developer community.
I changed careers from Finance - I was a ACA (akin to CPA) , wizard at Excel VBA etc - to software development, after 10 years learning, I’m now finally also proficient with SQL. I have also learnt other langauges, but SQL is by far my favourite, Yes it has some mistakes in it’s design - the biggest being SELECT before FROM, Yes some key words, esp windows functions, have a silly length - this creates an impression of verbosity but SQL is actually the opposite, it is far more expressive than any Object Oriented language I can think of.
1 line of code, say a window function with a filter clause in Postgres is the equivalent of pages of Java code.
Lateral Joins enable chaining, Views CTEs and Functions provide simple safely scoped composability. Postgres has incredible aggregation and analysis ability, built in out of the box. It’s String manipulation capabilities are wondrous, a 1 line StringAgg function can achieve the same as pages of VBA code - you may scoff but VBA with it’s ancient horrible editor is still necessary, if you are in Excel, is Javascript that much less verbose though?
Pure SQL Postgres Functions are amazing, you can chain them, you can easily make them pure, they can include fast recursion and conditional logic. I never delete data, just append, a bitemporal immutable functional db is so powerful, enabling time travel and rock solid data integrity. Mat Views or Indexes or summary tables are easily created and often automatically. They provide an efficient reliable cache layer, which can address most performance issues that may arise
You're confusing the language and the databases engine. I don't think that seasoned developpers think that relational databases sucks per se, since most of their constraints are technically justified, but don't like SQL as a language because it's grammar is completely awful, it's inconsistent (toward itself and databases engines), filled with specific perks and clumsy.
To illustrate on `select` queries, you start listing the attributes then the table, while on `updates` you start by specifying the table and then the attributes on which on operate.
This illustrate the grammar problem: in one case, you start by bringing what table you will use and set on which attributes, the other the attributes you need, while keeping in mind on which table name since you specify it after.
It's not really a problem, but developer tends to hate any kind of cognitive load, and this one source of load.
I'm not an ORM fan, but developper often use the programming language of their application to build SQL queries string, and ie. with such tools you always start by specifying the target table.
Personally I really wished that RDMS would provide another intermediate language, or better, data structure, to interface with them.
SQL is wildly powerful and important, but it's also got clear deficiencies for data traversal and manipulation.
I see it on a spectrum between declerative and imperative data structures, and where I think most people go wrong with it is trying to create a monolithic solution to a broad spectrum of problems. I think you need a graduated approach where each data layer is simplifying and satisfying the next, so you're using Tables, Procs, Views, and in-memory constructs in concert. The database is a powerful tool, and SQL is just part of that bigger puzzle :)
I can't agree more with this. As with anything we use in our applications, understanding and leveraging the strengths of our different tools is vitally important. If, for example, I needed a cartesian product of two datasets, SQL is the first thing that comes to mind due to how simple it is to write and the speed with which it will be processed in the database. On the flip side, I would never want to intermix frontend code in SQL, which crazy as that sounds, I have seen before. Everything has its place in a logical development flow.
It completely blows my mind when I see stuff like this. We're not all perfect programmers, I'm sure, but still. An alarm should go off in your head when the thought occurs to put front-end code into a query. The only exception I've encountered is if I want to inject some HTML formatting into a string for emails sent from SQL Server, but even then it's extremely limited in use, and I still think to myself, there must be a better way..
Starting with the end in mind (hi covey) is actually not such a bad idea. First state what result you want the query to produce,and then start describing where that data should come from.
Updates aren't different from select statements: first you state what you want, update a table with some new column values, and then you state where this data should come from, and what data you want to update.
And this is why it sucks. Which is fine, the language comes from a different era when our understanding of computer languages was much more primitive. It is understandable that mistakes would be made. What is unfortunate is that there has been little to no progress in improving on those mistakes in this problem space.
In the procedural world, you could also say that C has some mistakes in its design. However, we've gone to great lengths to try and improve on C, for example in semi-recent times with Rust and Go. SQL could really benefit from the same. SQL gets a lot of things right. So does C. But we can do better on both fronts.
Unfortunately, it seems that people regularly confuse SQL with the underlying concepts that SQL is based on. Because there is no basically no competition in this space, it is assumed by many that this type of problem can be expressed in no other way, and that you simply do not understand SQL well enough if you do not see it as perfect. I guess it comes from the same place of those who argue that if you make memory management mistakes in C, you just don't understand C well enough.
As a software engineer who later learned SQL, I could not disagree more. Within the parameters that it is designed for, SQL is a terrific language that makes exploring and manipulating data much easier than tools like python or Scala. That doesn't mean I have no place for python or Scala, but that I definitely see a class of problems where an SQL interface is far superior.
I use python/Pandas every day for data analysis and the like, and I would never dream of not writing most of the aggregation and filtering logic in SQL. If you're working with large datasets, there is absolutely no reason to pull unnecessary data into memory.
I'm not sure what it is today - I would hope the same mindset applies, but maybe not - but back when I was using SQL, the idea was to let the database engine do everything it could with the data, before sending the results over the pipe.
That is, minimize the network bandwidth by putting the work on the DB engine.
This of course necessitated creating and understanding proper SQL query building practices. It was real easy to mess up if you didn't know what you were doing (ie - inner selects, improper joins, etc) and cause a combinatorial explosion that would consume all the RAM on the server and grind it to a halt.
That, or bring back a load of data that you then filtered on the "client" - better to let the DB server do that if you can. Of course, this was back when the clients were 486s and early Pentiums with maybe 8-16 MB RAM. Today it's a bit different, but you still want to minimize the network traffic.
> I would hope the same mindset applies, but maybe not - but back when I was using SQL, the idea was to let the database engine do everything it could with the data, before sending the results over the pipe.
We're a machine learning shop, and this is absolutely one of our core design principles.
Pulling into memory is an attribute of implementation - you could write LINQ in C# (which is a great abstraction too) and not care about the fact that it's translated to SQL that runs server-side.
Depending on the use case, I'd argue Spark can be a better choice for aggregating/filtering than SQL. SQL is great for simple queries, but once my queries start getting into the many hundreds of lines than I start to miss all of the complexity management features of a true programming language.
There are a few things, but not much that I have had a better experience in Spark with as compared to using something like apache pig with UDFs. Now this part might be a matter of how things are set up where I work, but I find that working with Tez for process management and debugging to be far easier than working with the process management built into Spark.
EDIT: when you read process management above, perhaps it's better to think task management.
I understand your perspective, but I look at it a different way.
SQL is troublesome to some programming types because it seems alien to ask what you want instead of telling the computer what to do and I find most programmers, especially ASD-types (who I think have an edge for some situations, like writing certain code in a huge org like Google) find this an unfamiliar and strange way of thinking.
You're right about some of it (especially string manipulation, which is brutal) but I think you look at it like most programmers. SQL engages more of a simulative mindset—one popular with analysts—than the acquisitive mindset that most software developers outside of data science employ.
Most programmers who are at all familiar with functional programming, DSLs, configuration languages, or optimizing compilers are very well versed with asking the computer for what you want rather than telling it what to do. At least when I was there, this was a very large percentage of Googlers.
My issue with SQL is that a programming language should allow you to compose and name building blocks, and then recombine them to build ever more useful software. SQL doesn't have this structure. If you have a query that almost does what you want but you need to add one more filter, you need to reissue the query (modulo views/temporary tables, which is why I said "very clumsily compositional" rather than "not compositional"). If you have an expression that computes some quantity and then you want to use it with slight modifications in a lot of places, you usually end up copy & pasting it (modulo stored procedures). Modern programming languages have made this sort of abstraction really easy, but it's quite clunky in SQL. You can do it (by using stored procedures, views, triggers, subqueries, etc.), but then most of your application ends up written in SQL and it starts to feel like something out of the 1970s.
My preferred interface would be something like the relational algebra where relations are represented as typed values in the host programming language and operators are normal method calls (or binary operators, depending on language flavor) and importantly, intermediate results can be assigned to variables. I don't care what the particular execution strategy is of the query, but I do think it should be possible to refine, join, project, and subquery using values you've already defined.
> My issue with SQL is that a programming language should allow you to compose and name building blocks, and then recombine them to build ever more useful software.
When I write a complex query it's kind of like this.
I start with one table (viewed in my head as an excel style grid of results). I join another table then filter on the join / where clause. Again the output is effectively another table. Rinse repeat.
The difference is that SQL is very powerful at what it does, so you don't need to compose blocks in the way that you would in other languages. Just think of everything as a table. The result of every join or filtering clause is another table. That's your building block.
You are going to need significantly less SQL than Java / Python to get the same results from your data.
True, if you modulo all the features like views, stored procedures, functions, foreign keys, triggers there is no reusability in SQL...
On the serious side: It's not a bad idea to contain critical business logic in the database. It's shared by any app using the database. Foreign keys esp link and let you cascade changes with no extra code.
Less overall code. Higher guarantees.
Everyone agrees it's a good idea to use a `datetime` field for a `created` field instead of `varchar` and letting the application parse/extract it's own format in there.
It's also a good idea to add triggers that do specific things to keep your db in a consistent state upon actions.
Maybe adding/deleting a row needs to update a counter in a statistics table for quick access because count(*) on that table takes too long.
You could do that it in your app - but what happens if someone changes the data without going through your app.
Instead you add it as a trigger. Now you can be sure that the number in the stats table is always the same as the actual number.
Database changes have been painful for me in every job I have worked at. They often took me longer to write, longer to test, and longer to fix when I introduced bugs. In my experience, doing things in the database is great for simple actions, transformations done in huge volumes, preprocessing large datasets, or logic that is unlikely to change. It's bad for doing things which don't have those properties, especially the last property due to database changes being painful.
I don't disagree with him (though I don't necessarily agree as strongly with respect to triggers), but the counter argument is that it's extremely easy to put up a web service that offers an API and interacts with JSON. Now all the complicated bits are in the API service, and the multiple apps don't need to know anything at all about the underlying data store.
The counter to that is that it can be very difficult to get an API that is as comprehensive and flexible as SQL is, and eventually you'd find yourself re-implementing SQL if you ever tried. Certain reporting scenarios or mass data manipulation scenarios really require actual data store access in order to perform remotely well.
Your counter counter argument on not being as flexible as SQL is spot on. A DAL that sits in front of storage mechanisms is not a bad idea though.
I've found moderate success in doing the reverse and using communication channels from SQL => services either via NOTIFY or queues.
ie. Dispatching an event that signals a non SQL service to do an action. eg saving a file to S3, then updating the database when the action is complete.
This doesn't really work when retrieving data synchronously but in most cases if the data is stored elsewhere - then you probably want to access it via non db methods anyway (ie. for an S3 file, you'd use the url)
I don't like it either, but for most scenarios that this sort of thing works for a RESTful API or microservice or equivalent (or whatever you want to call it these days since I think the expiration date on those terms has elapsed) and not all data can be meaningfully coded without a structure like JSON or XML. Or, rather, you can, but, you're reinventing the wheel just like you would with an API trying to be as flexible as SQL. You may run into various system limits on URL length, too.
Events and queues are great, but not all requests work that way. If you're USPS and you're providing an address and ZIP code resolver, you've got different requirements than getting data from one system to cascade across a series of systems with a dozen different asynchronous widgets. You'd have a burden of using a format that your customers would prefer, too.
Having worked with three large JSON http microservice based projects in the past, I abhor the day they got popular. Relative productivity has definitely gone down due to the project overheads caused by them.
Not saying microservices don't work, they are great for specific use cases, but 9/10 people just want to microservice everything just to say they use microservices.
> If you're USPS and you're providing an address and ZIP code resolver
Not sure I see your point with this example - isn't a resolver like this just a fixed database of entries?
> Not sure I see your point with this example - isn't a resolver like this just a fixed database of entries?
Until you get to how well it handles misspellings and incomplete information. Addresses are also notoriously difficult to parse. There is some logic and ranking at work behind the scenes. The only times I've seen it consistently fail are when the city is incorrect, or it's a genuinely new address (new construction or address renumbering).
In any event, is a read-only service somehow less of a service? I'd wager read-only services see a lot higher demand than anything.
Not a down voter, but I imagine it could be because triggers are associated with write amplification and visibility concerns. They have their place, just like stored procedures.
This is a very real problem I bang my head against regularly. There just seems no way to achieve all three of readability, maintainability and performance in a large enough SQL codebase. You can piece together views right up until the moment the query planner forgets to push where clauses down. You can wrap a query in a function to guarantee the where clause is evaluated then and there, but now you have to maintain that _and_ your view/table. You can slowly rewrite your code bit by bit, adding complexity to force the query planner (in this supposedly declarative language) to behave the way you want it to, for identical results. You can wrap all of it in a materialized view so you don't have to care about performance anymore, right up until the point it takes 12 hours to refresh.
I've be very much in the market for something more modular than SQL, which had a much more customisable query planner so you could say "I don't care if you go away and compile this for an hour as long as you make it as quick as physically possible and then save the query plan forever".
Yeah, some of it is based on statistics, although in my experience, the relative distributions of rows don't massively change in my databases, so an ostensibly optimal query plan on day one isn't likely to be rubbish later. You can obviously imagine other people's mileage varying.
> My preferred interface would be something like the
> relational algebra where relations are represented
> as typed values in the host programming language and
> operators are normal method calls (or binary operators,
> depending on language flavor) and importantly,
> intermediate results can be assigned to variables.
Much if not all of what you're asking for can be done today at the database level with modern SQL affordances (CTEs, etc) and vendor-specific stuff like stored procedures. You can write functions that return tables/resultsets, assign those results to variables, etc.
There's not a host programming language database library that exposes those things in quite the manner you seem to be asking for, but that's not a limitation of these SQL-based RDBMSs themselves.
In the realm of what's possible right now, you could of course write your stored procedures directly in .NET languages (MSSQL) Python (Postgres) etc.
'With'-clauses (aka Common Table Expressions) allows you to name and compose subqueries much more cleanly.
Your preferred interface sounds like Linq. An IQueryable<T> interface represent a query and can be composed and assigned to variables, and the type T represent the type of the relation or projection. Linq composes a lot more elegantly than SQL itself. The instances does not represent intermediate results though, they represent composed queries, which are executed in one go on demand. But I think that is what you want anyway.
> 'With'-clauses (aka Common Table Expressions) allows you to name and compose subqueries much more cleanly.
These are great, but their implementation varies pretty significantly. Notably, MySQL didn't have them at all until v8.0, and PostgreSQL's CTEs are... wonky, for lack of a better term (I believe they use an "evaluate and store" method that's closer to a temp table, and as a result they don't optimize well and have been known to have strange side effects). Not all RDBMSs support recursive CTEs, either.
There are indeed a number of decent swipes at the idea, like Linq. But they all suffer from some impedance mismatch. It would be better if the native language underneath was written to support it more natively.
As you and others cite, there's a variety of features that have crept up on the idea over the years, but they're all bodging things on to the side of something they can't really change. I'd love to see the Rust of database querying to SQL's C++. I draw that comparison because I think a lot of the same structural problems are in play preventing it; SQL is just so good that it is actively prevented from being fundamentally improved. It can be incrementally improved, but not fundamentally.
Views - Create a general top level view, then build more specific views on views using more filters, to go down, ie more granular - then just join these with yet more views to combine, or aggregate to go back up. How is that not composable? If you hit performance issues, they are easily solved by using a few materialized views.
Also CTEs and User defined Functions (I use pure SQL functions but in Postgres you can easily use Python or Javascript instead)
At least in the DB we're using, Sybase SQLAnywhere, materialized views comes with a hefty price tag. They must be dropped and recreated every time you touch any of the base tables, like adding a column, which in turn requires any indexes on the materialized views to be recreated.
For a few of our customers, that meant that a 15 minute DB change (adding a column) turned into a several hour DB change (rebuilding materialized view).
If they're refreshed immediately, they also incur a penalty on the underlying tables, since any data changes essentially runs a trigger to see if the materialized view needs changes as well.
In our case we had one such immediate refresh materialized view which caused such a performance degradation of an underlying table that we had to find an alternate solution.
Materialized views for sure add some complexity and overhead, but I think that's true of any caching mechanism. They should probably be one of the last tools you reach for when trying to optimize queries, but in certain critical situations they can be super effective.
Well, for immediate refresh materialized views, how do you keep your non-sql solution up to date?
For manual refresh, I agree, either non-sql or just plain temp tables is a decent alternative. In our case the views do a lot of joins and subqueries, so we've mostly used temp tables.
I think you make a valid point about compositionality. SQL lacks elegant composition at the source/module/syntax level. Views and stored procedures are all stateful ways of achieving something similar by using them also involves migrating that state. I don't think you're saying SQL is not expensive which seems to be the rebuttal offered by siblings.
I am the author of a small parser/typechecker for a subset[0] of SQL, roughly matching that supported by SQLite, and I agree with you COMPLETELY. What follows is my rant directed at those who don't.
It is an ugly, verbose language. You can be very familiar with thinking in sets, and still not like SQL.
It's what we've got, and the SQL databases available are very, very good products. But I do wish a cleaner language could have won.
As you know, anyone who's used LINQ in C#, particularly by directly calling the extension methods .Select(...).Where(...).OrderBy(...),
sees how much better it is from a composability standpoint.
SQL is the anti-Lisp. Lisp's design is about 4 pieces of syntax and a few fundamental operations from which all else is built.
Conversely, nearly every operation in SQL is a tacked-on special case to the ridiculously complex SELECT syntax.
Filtering results? That's a clause of SELECT.
Ordering? Clause of SELECT.
Filtering after aggregating? Oh, that's a different clause of SELECT.
Once this philosophy has infected the brain of a SQL implementer, it spreads like wildfire.
That's why you even see custom syntax pop up even in good'ol function calls sometimes, like in Postgres: overlay('abcdef' placing 'wt' from 3 for 2).
SQL fans often talk about the beauty of relational algebra. Once you achieve relational enlightenment, SQL is supposed to be beautiful.
But if we wrote math like SQL, you wouldn't say 2 * 3 + 4. There would be a grand COMPUTE statement with clauses for each operation you could wish to perform. So you'd write COMPUTE MULTIPLY 2 ADD 4 FROM 3. Of course, the COMPUTE statement is a pipeline, and multiplication comes after addition in the pipeline, so if you wanted to represent 2 * (3 + 4) you'll push that into a sub-compute, like COMPUTE MULTIPLY 2 FROM (COMPUTE MULTIPLY 1 ADD 4 FROM 3).
SQL clauses could have been "functions" with well-defined input and output types, if the language designers had come up with a type system to match the relational algebra.
These could be pipelined, rather than nested, either with an OO-style method call syntax or a functional style pipe operator.
You understand the idea. Again, as you've mentioned, the LINQ methods[1] are a great resource for people not familiar with this style.
But, counterargument. Languages with minimal syntax and great power are often claimed to be unreadable.
Sometimes it's nice to have special syntax, to help give a recognizable shape to what you're reading,
instead of it being operator / function call soup.
So what did SQL accomplish by making everything a special case of SELECT?
Well, it's got this sensible flow to every statement. You see, the execution of SELECT logically flows as I've numbered the lines below.
SELECT
8 DISTINCT
7 TOP n
5 column expressions
1 FROM tables
2 WHERE predicate
3 GROUP BY expressions
4 HAVING expressions
6 ORDER BY ...
I sure am glad they cleared that up. If it had been a chain of individual operations I'd have been utterly baffled.
Ignoring completely the syntactical design, the lack of basic operations is a pain too.
Why can't I declare variables within queries?
I often would like to do something like:
select
let x = compute_something(...)
in
x + q as column1
x + r as column2
from ...
Instead, when I really need that, I end up wrapping the whole thing in an outer query and computing X in the inner query.
Hooray for SELECT, the answer to all problems!
Oh and yes, views and stored procedures and UDFs are no answer to the need for one-off local composability within queries.
Then you have the sloppy design of the type system in virtually all SQL dialects. SQL Server doesn't even have a boolean type.
There is no type you can declare for `@x` that will let you `set @x = (1 <> 2)`.
And don't even get me started on the GROUP BY clause, the design of which contorts the whole rest of the language.
If you GROUP BY some columns, you must not refer to any non-grouped columns in the later parts of your query (refer to my table above for which parts are "later").
Unless, that is, you are referring to them in aggregates. Then the HAVING clause was tacked on so that you'd have a way to do a filter -- the same thing as WHERE -- after the GROUP BY.
Does it all make sense once you understand it? Yes, in that you can see how you'd end up with this system if you were adding things piece by piece and never went back to redesign from square one.
Wow, I have a lot of ranting to do about SQL. I feel like I haven't even scratched the surface.
And hell, I still pick SQL databases every time I start a project! The damn language is useful enough and the products work great.
But it has all the design elegance of the US tax code.
You make some good points about SQL's shortcomings but some of the claims you make are infactual or seem intentionally misleading. Since you mentioned SQL Server, I have to chime in with some corrections to your statements
While you can't declare new variables within a SELECT statement, you can modify them with functions:
DECLARE @foo INT;
SELECT
@foo = dbo.ComputeSomething(...),
@foo + @bar AS Val1
FROM
...
Additionally, SQL Server does have a boolean type by another name: bit. It is 1/0/NULL and you can convert the string values TRUE and FALSE into their respective numerical values.
To modify your example to use proper SQL:
DECLARE @x BIT;
SELECT @x = 'TRUE' WHERE 1 <> 2
Or, to use a more real-world example, you'd probably use a CASE statement:
SELECT
@x = CASE WHEN 1 <> 2 THEN 1 ELSE 0 END
Yes, it's verbose. But it's possible. (SQL's slogan, probably, and understandably the root of a lot of your remarks.)
Finally, your GROUP BY / HAVING / WHERE example lacks an important distinction. HAVING <> WHERE. HAVING is specifically meant to deal with aggregate functions, while WHERE is a simple filter. You wouldn't use "HAVING PostId = @foo" any more than you might use "WHERE COUNT(*) > @bar". This is why it comes after the GROUP BY clause -- which you should think of as a "key" for the data that's being aggregated.
Look, SQL is not perfect or near perfect. As with any language, it could use improvements. But to criticize a language based on ignorance, misleading statements, or use cases that are not meant for the language is fairly dishonest, in my opinion.
If I use your variable example I get:
A SELECT statement that assigns a value to a variable must not be combined with data-retrieval operations.
I'm assuming there is a way to make it work, but then it seems like it'd be kind of a messy imperative-style approach. Would the value of @foo persist between each row's evaluation, so now my query would be capable of (intentionally or not) stateful execution? Would that interfere with the optimizer's ability to re-order evaluation or would it be undefined behavior?
Other than the special cased string value conversion, SQL server treats bit as just a very short integer type. You cannot supply a bit value alone to WHERE or CASE, you must use a comparison operator. And likewise, you cannot assign the result of a comparison expression to a bit, you must wrap it in CASE as you have demonstrated. In fact my own project, in an attempt to squeeze various dialects of SQL into a common type system, fakes booleans in T-SQL by inserting CASE expressions and (<> 0) comparisons as needed See example:
If a normal programming language had special places where boolean expressions could appear (like in if and while loops), and boolean expressions did not produce values in and of themselves, it would stand out as being extremely hacky.
Of course that is a complaint only for T-SQL, other dialects do have true bools. But it is emblematic of design undertaken without effort to unify fundamentals (The question not asked by T-SQL's designers: "What if predicate expressions were just like any other expression, and had a value?") and that is what I find distasteful about SQL in general.
HAVING is another example of that lazy, slap-more-syntax-on design. As you correctly point out, HAVING evaluates after the GROUP BY, WHERE before, and that's why HAVING can refer to aggregate results. But if "WHERE" was just an operator it could be used in the order it was needed without hassle. In LINQ, you write:
users
.Where(u => u.AccountIsActivated)
.GroupBy(u => u.Name)
.Select(g => new { Name = g.Key, NumberOfUsersWithName = g.Count() })
.Where(g => g.NumberOfUsersWithName > 1)
.Select(g => $"The name {g.Name} is shared by {g.NumberOfUsersWithName} users")
The .Where before the grouping is the same function as the .Where after it. No need for a special case.
Of course you could do the same thing in SQL without HAVING using a subquery:
select 'The name ' + sq.Name + ' is shared by ' + sq.NumberOfUsersWithName + ' users'
from
(select u.Name, count(*) as NumberOfUsersWithName
from Users u
where u.AccountIsActivated = 1
group by u.Name) sq
where sq.NumberOfUsersWithName > 1
But it's ugly, so they threw in HAVING and that lets you do it all in one query.
select 'The name ' + u.Name + ' is shared by ' + count(*) + ' users'
from Users u
where u.AccountIsActivated = 1
group by u.Name
having count(*) > 1
Understandable choice, because subqueries start to get unreadable fast. I believe this is due to the middle-out (select outer-stuff from inner-queries where more-outer-stuff) syntax, which gets hard to follow when nested more than a level or two. I find the front-to-back pipeline approach of LINQ or F#'s Seq module far easier to track as data gets filtered and transformed in the same sequence that the code reads.
Let's go back to my needlessly cruel and exaggerated example of "SQL math", in which all expressions are written as a fill-in-the-clauses COMPUTE statement with each operation being its own optional clause, in a fixed order. Suppose that it was decided that it's helpful to be able to MULTIPLY after adding, which normally comes _before_ multiplying in the fixed order, but people were sick of writing:
COMPUTE MULTIPLY 3 FROM (COMPUTE MULTIPLY 2 ADD 1 FROM 3)
So they came up with a new keyword for multiplying that goes _after_ the addition step.
They'd use a different word so you could tell the operators apart, but it'd basically be a synonym.
COMPUTE MULTIPLY 2 ADD 1 PRODUCT WITH 3
That's how HAVING feels to me. It's a band-aid that wouldn't be necessary or even desired if WHERE was a standalone operator on table types, just as our hypothetical PRODUCT WITH clause wouldn't be desired if we had a standalone * operator.
I get that SQL works and is extremely useful. Remember, I spent quite a bit of time implementing a library for working with it. But there are languages that make you think things like, "Oh, that's simpler than I thought. I can use this kind of construct to iterate over anything that implements the right interface. That was smart of the language designers". Then there are languages that make you think, "Huh. Another special case. Couldn't they have thought this through a bit more carefully before speccing out the language?".
SQL feels like the latter and yet I see people call it a beautiful language or their favorite language and I get a strong "stop liking what I don't like" impulse.
The code snippet I posted doesn't adequately demonstrate it but the DSL is strongly typed with full type inference at every step. So for example, it knows that the type of the table after the "join" is has a field "age" of type "int", and that filtering on it is okay. Trying to filter on a non-existent field will cause a compile (not runtime) error.
Second, the "steps" are actually plain functions. `#` is simply the reverse function application operator defined as `a # f = f a`. Which means you can abstract operations. For example, you can define a generic function that can take any relation with an age field and filters for age > 60. Thanks to strong types, using it incorrectly (i.e. on a relation without an age field of type integer) would cause an compile error.
Also, the language itself is a deeply embedded DSL. So "ordinary" code also compiles down to SQL. You can do things like loops etc. (for example, create a function that filters over a list of age limits rather than just one) and it will try to generate the best SQL possible.
of course the SQL generation is currently not implemented, so all this is theoretical, but I'm getting there!
There is also the approach of https://querycombinators.org which is inspired by functional programming but results in a declarative language. Queries are composed of functions that can be named and tested separately.
You may be interested in my project "Binate"; I have some incomplete notes at http://canonical.org/~kragen/binary-relations about it, exploring a possible design space in which we might find a more concise and compositional query language. Of course that's not in a host programming language; for that there do exist some things like SQLAlchemy and SQLObject which, as a bonus, can compile the resulting query to SQL. The most popular of these is Django's "ORM", which unfortunately is very limited.
I don't see views and subqueries as clumsy. Especially with sets, you think in creating new sets, and combining those into other sets.
In postgres i've created financial year reports, with monthly summaries per category just by having a few layers of views.
I think it's really valuable i can think in logical sets, and the database will takes all those layers of views and combine those into one optimized query plan.
But views and functions still require you to persist those objects in the database first. There's no such thing as a query "variable" that you can then re-use in multiple subsequent statements. Of course, you can use table variables or temporary tables to hold intermediate data, but those are eagerly evaluated, whereas functions, views and CTE's are lazily evaluated, and that allows for a massive performance boost (due to optimization).
I can see the appeal of such a construct in SQL. The requirement that reusable objects are persisted in the database requires a top-down design approach, and that doesn't really blend well with modern coding practices.
There is no limit to reuse of views??
You can create as many intermediate or base views as you like, within a view definition, CTE’s have local scope and are similar to variables, if you need schema scope, just define the query block as a view instead.
UDFs are simple to create are lazy and can be easily reused and composed, in Postgres they are first class objects, scalar UDFs can be called anywhere in SELECT and Set Returning UDFs in the WHERE clause, use more than 1 and they also automatically Lateral Joined - which is incredibly powerful and composable - when you get your head around them
No, CTEs do not have local scope, at least not like variables. CTEs are statement-scoped; no matter how many SQL statements you have in your current scope, your CTE vanishes after the first statement terminates.
> SQL is troublesome to some programming types because it seems alien to ask what you want instead of telling the computer what to do and I find most programmers, especially ASD-types (who I think have an edge for some situations, like writing certain code in a huge org like Google) find this an unfamiliar and strange way of thinking.
That's basically how most tools work outside of software development. When you use Google, you aren't telling it how to find your result; you're telling it what you want.
Meh, not quite. We've gotten so accustomed to phrasing our requests in search-engine-friendly terms that we don't realize we're doing it (e.g. cutting out all the small words).
You’re definitely right that SQL is alien to many procedural / object oriented / functional / etc developers.
Even with myself, who has worked with Oracle writing PL/SQL (which is an abomination imo) in past jobs still has to sit down and get into the right mindset before writing efficient SQL for more complex queries.
I think SQL is entrenched by network effects. It has a unique conceptual paradigm, which is why it is hard to learn. But any replacement is also likely to be conceptually strange, but noone is gonna put in the time in to learn something weird unless it has the adoption of SQL. This we are stuck on a local mini.a with SQL. (A pretty good one though)
It is not hard to learn - non programmers use it all the time.
Conceptually SQL is much more simple than programming, it basically reads like english:
SELECT customer, SUM(total)
FROM orders
GROUP BY customer
WHERE created BETWEEN '2018-01-01' AND '2018-12-31'`
ORDER BY SUM(total) DESC
Compare that to the programming necessary to implement the above:
totals = {}
for row in rows:
if row.created > '2018-12-31':
continue
if row.created < '2018-01-01':
continue
if row.customer not in customer_totals_2018:
totals[customer] = 0
totals[customer] = totals[customer] + row.total
def _sorter((customer, total)):
return -total
for customer, total in sorted(totals.items(), key=_sorter):
print(customer, total)
Not to mention the SQL version gets first hand knowledge on available indexes in order to speed up the query.
Now imagine adding an AVG(total) to both the SQL and programming version...
I don't think it's non-programmers that keep SQL around; it's that there hasn't been a replacement for SQL that is better enough to replace it. I think C suffers from the same problem.
I don't disagree. I just don't think we've figured out the next step. Other paradigms (MapReduce and graph databases are perfect examples) have introduced very interesting and clever ways to querying data, but nothing has replaced SQL. If anything, it seems like it will be additive on top of what SQL has already done.
I know a lot about data, but I can't solve that problem. To anyone out there, much smarter than me: this is a real problem. If you solve it, you'll cement yourself the history of computer science.
I think graph databases with time will eventually be eating some of the RDBMs market (used for the same purpose), but they haven't had the time to mature as SQL/RDBMs.
I’m dubious, Postgres added OO features when that was the fashion, then XML, then JSON. To my mind Graphs are still relations just indirect one, I use graphs a lot within Postgres no problem, it has native recursive CTEs which are surprisingly fast for most Graph queries. For v large graphs I cache the graph with an automatically generated transitive closure table or mat view. With Postgres you can do union distinct inside the Recursive CTE to prevent cycles and it’s remarkably fast, millions of rows under 1 minute.
I agree with you and the parent comment. I run a data science department and we rely on SQL. We even embed SQL queries into R scripts to pull the latest data. It isn't worth working with ORM for an analytics project when a few dozen lines of SQL can get you everything you need.
> On the other end of the spectrum are the influx of "data engineers" with basic to intermediate knowledge of HDFS, streaming data or various other NoSQL technologies. They often know less about raw SQL than even junior engineers because SQL is below their high-power big data tools.
I always ask a basic SQL question (involving sorting/grouping and one simple join) in my interviews for backend and/or data. IMO if you are calling yourself a Data Engineer and do not know SQL, then you haven't really worked with data, you've mostly just worked on data infrastructure.
(please excuse my typing. i have one hand to use ATM)
yup. we bet super hard on sql too. Its fantastic if you know what you're doing. We ca import arbitrary data and expose it as normal tables to our customers for analysis/transform/export along with having a silod access controlled place for them to see just their data. sql is a great technology, and is extremely flexible.
I love it. If you're looking to learn it and fundamentals, check out Jennifer Widom's course from Harvard. I can safely credit her w/ my career.
Certain subsets of the developer community are usually excellent with SQL. It's especially evident with people who have been working with C# for > 10 years because Microsoft have always heavily pushed SQL Server with their developers.
There are plenty of use-cases for SQL with developers: especially in batch processes such as invoicing. A well crafted SQL query can execute exponentially faster than iterative code takes a lot less time to implement.
This tells us that people think SQL is good, but it's unfortunate that this is attempted, because the good thing about SQL is how well it corresponds to its data structures. Attempting to imitate SQL's form, rather than its principal of correspondence, has been counter-productive and prevented the reproduction of its merits. Case in point: Cypher.
SQL is so long lasting because its math, sort of. It's a language for expressing data relationships in a very systematic logical and set theoretic way, not a random ad hoc hack.
Its syntax is a bit old fashioned and I do think efforts to make it more native to programming environments rather than a bolt on might be fruitful, but its concepts are timeless.
It seems to me that graph databases are far more efficient than relational ones for most tasks.
That’s because all lookups are O(1) instead of O(log N). That adds up. Also, copying a subgraph is far easier, and so is joining.
Think about it, when you shard you are essentially approaching graph databases because your hash or range by which you find your shard is basically the “pointer” half of the way there. And then you do a bunch of O(log N) lookups.
Also, data locality and caching would be better in graph databases. Especially when you have stuff distributed around the world, you aren’t going to want to do relational index lookups. You shard — in the limit you have a graph database. Why not just use one from the start?
So it seems to me that something like Neo4J and Cypher would have taken off and beat SQL. Why is it hardly heard of, and NoSQL is instead relegated to documents without schemas?
Chasing pointers (on the same medium) is usually slower than the access patterns that databases usually use.
First, a database can have hash indexes instead of btree indexes, so lookups can be O(1) too, but it turns out that btrees are often better because they can return range results efficiently, and finding the range in a btree is only logarithmic for the first lookup. If your index is clustered - if it covers the fields needed downstream in the plan - no further indirections are needed and locality is far better than a hash lookup. For example, a filter could be efficiently evaluated along with the index scan. And if your final results need to be sorted, major bonus if the index also covers this.
Second, it's best to think of databases doing operations on batches of data at a time. Depending on your database planner, different queries will tend to result in more row-based lookups (MySQL generally does everything that isn't a derived table, uncorrelated subquery or a filesort in a nested loop fashion) but others build lookups for more efficiency (a derived table in MySQL, or a hash join in Postgres). The flexibility to mix and match hash lookups with scans and sequential access - which are usually a few orders of magnitude faster than iterated indirection - means it can outperform graph traversal that is constantly getting caches misses at every level of the hierarchy.
The reason NoSQL and document databases suck from a relational standpoint is that they are bad at joins, and work better if you denormalize your joins and embed child structures in your rows and documents. Add decent indexes to NoSQL and they gradually turn back into relational databases - indexes are a denormalization of your data that is automatically handled by the database, but have nonlocality and write amplification consequences which can slow things down.
In terms of distribution, a relational plan can be parallelized and distributed somewhat easily if your data is also replicated - sometimes join results need to be redundantly recalculated or shuffled / copied around - and most analytics databases use this approach, though usually with column stores rather than row stores, again because scanning sequential access is so much faster than indirection. Joins don't always distribute well, is the main catch.
Depends a lot on the actual access patterns of your data.
Many recent web & mobile apps have a lot of screens where you just want to grab one blob of heterogenous data and format it with the UI toolkit of choice. Or if they do display multiple results, it's O(10) rather than O(1000) or O(1M). Chasing pointers is fine for use-cases like this, because you do it once and you have all the information you're looking for.
This is also behind the recent popularity of key/value stores and document databases. If all you need is a key/value lookup, well, just do a key/value lookup and don't pay the overhead of query parsing, query planning, predicate matching, joins, etc. When I was working on Google Search > 50% of features could get by with read-only datasets that supported only key/value lookup. You don't need a database for that, just binary search or hash indexes over a big file.
Oh I agree. Though I think having the rest of a relational DB is really nice should you need more than a simple key-value lookup, and not have to cobble it together through application-side joins and denormalization.
The application I work on in my day job does not match the key/value lookup idiom at all. User-defined sorts and filters over user-defined schema, and mass automated operations over data matching certain criteria. If you squint a bit, the app even looks a bit like a database in terms of user actions.
And even relational databases (at least row-oriented with primarily disk storage) have their limit here. With increasing volumes of data, it can't keep up. We can't index all the columns, and indexes can't span multiple tables. We increasingly need more denormalization solutions that convert hotter bits of data into e.g. in-memory caches that are faster for ad-hoc sorts and filters. Database first is a decent place to start, though having a first-class event feed for updates would certainly be nice...
Depends a lot on the actual access patterns of your data.
Yup. Thing is: with RDBMSs you are in control of both the storage patterns and the access patterns of your data. That's where a large part of the performance benefit comes from.
> 50% of features could get by with read-only datasets that supported only key/value lookup
Did you implement a storage pattern that was ordered by key (or hash(key) if you used hashing)?
A hash index is O(1) no matter what language it's called from. The reason most people don't use hash indices very often is that they do not allow you to do a lot of useful things that you often end up wanting to do, such as retrieving all values within a given range.
Graph databases are great for retrieving existing data with associated data.
The power of SQL comes from the fact that you can easily create new information out of the data: create new sets, group by certain features, aggregates on certain features.
It's a lot more powerful than just store and retrieve.
People forget that SQL isn't just Query (DQL), it's also definition (DDL), manipulation (DML), control (DCL), and transaction control (TCL), language [0].
Checkout that platform that a full-blown Oracle license can provide to your DBA... It's definitely more than just CRUD.
N4J's a lot slower at tons of common tasks. If you look into its underlying data model, it's clear why. It achieves the speed it does at certain graph-traversal operations by storing its data in such a way that it's highly specialized for those operations—one would expect this to come at a high cost for other operations, and sure enough, it does.
It also doesn't bother with tons of consistency guarantees and such that you (may) get from, say, PostgreSQL. Yet is still slower for many purposes.
In practice enterprise applications do a lot of relational operations.
Graph DBs might be valuable for a lot of problems, and it does feel like something like Neo4J would make a lot of sense for stuff like social networks, but for business records stuff isn't really that spread out
> SQL is one the most amazing concepts I've ever experienced. ...and we always seem to try to re-create SQL in those technologies (e.g. Hive, Presto, KSQL, etc).
It's not SQL that's the concept. The concept there is set theory/intersection/union and predicates.
That's why you think you are "recreating" those, because they can be mapped using the same concept
When people say "SQL" they often refer to a bunch of concepts, some better than others, all bundled together.
SQL gets you tables and joins, sure. But it also gets you queries that (some) non-programmers can write, outputs that are always a table, a variety of GUI programs to compose queries and display the results, analytic functions to do things like find percentiles, and tools to connect from MS Excel. And it often means you get transactions, ACID compliance, compound indexes, explicit table schemas, multiple join types, arbitrary-precision decimal data types, constraints, views, and a statistics-based query optimiser.
Of course it also gets you weird null handling, stored procedures, and triggers.
People behind NoSQL movement understand SQL better, than any people glorifying SQL here. In fact, I think this is a testament of how poorly most people actually understand SQL, they can't even see basic on the surface problems with it.
All I heard from NoSQL proponents was that "relational databases don't scale". These same people were usually dealing with data sets that would fit easily onto a single server.
No, I'm pretty sure you only heard it from your fellow RDBMS friends. I have never argued for scale myself, but plenty of times tried to debunk this exact stereotype.
I think there are people who went with NoSQL because it fit their specific needs better than a regular RDBMS, and people who went NoSQL because they found it easier to work with than a RDBMS (at first).
The latter group often didn't want to learn SQL and went with NoSQL's as a shortcut, not because they had an intimate understanding of the tradeoffs between the approaches and decided NoSQL was the better option.
There was the third group of resume driven developers who bought into the MongoDB hype around 6 or 7 years ago, which probably overlapped with the second group a bit.
5 decade old sql is nothing like modern sql with tons of proprietary extensions, partition, windows, collations, typecast, json and god knows what else. Your examples " Hive, Presto, KSQL, etc" are a proof of this, they are so vastly different from each other you cannot simply learn "sql" and expect to use those tools in any serious manner.
This is precisely the proof of opposite that sql has not stood the test of time.
I understood when OP said the "concept of SQL" that they referred to a pretty broad idea of querying tabular data in rows and columns with some structured language.
I find it hard to accept that ppl are talking about 'querying tabular data in rows and columns with some structured language' when they talk about sql. They are surely talking about the specific syntax.
It would be insane to expect somebody arrive at a perfect feature full working version of anything in the first try.
That's true for any piece of software. SQL is not done yet, the fact that you can put in neat little features still into it is a proof to how well it was designed.
Heck we still have Lisp evolving today. A lot of things got done right in the early days.
My first job out of university was on an analytics team at a consulting firm (big enough that you know them) that used MS SQL Server for absolutely everything.
Data cleaning? SQL.
Feature engineering? SQL.
Pipelines of stored procedures, stored in other stored procedures. Some of these procedures were so convoluted that they outputted tables with over 700 features, and had queries that were hundreds of lines long.
Every input, stored procedure, and output timestamped, so a change to one script involved changing every procedure downstream of it. My cries to use git were unheeded (would have required upskilling everyone on the team).
It was probably the worst year of my life. By the end of it I built a framework in T-SQL that would generate T-SQL scripts. In the final week of a project (which had been consistent 60-70 hour weeks), the partner on the project came in, saw the procedures written in my framework and demanded that they all be converted back into raw SQL. I moved teams a few weeks later.
The only good bit looking at it, is that now I'm REALLY good with SQL. It's incredibly powerful stuff, and more devs should work on it.
That happens in the wild unfortunately, I ve seen some sql-Frankensteins here and there but generally, if done properly, sql is the easiest to grok and understand the domain from. I was introduced to SQL circa 2000(by my dad) and I've been heavily using it since.
My bulk of experience in Sql was in MsSql. Recently I got into an Oracle shop and the dialect is a bit different though it didnt take too long to be productive in. However, being productive and such, if you ask me, oracle smells clunky and has way too many features, let's say it's not my cop of tea...
You can do some insanely complicated stuff in a stored procedure. A favourite was versioning and updating a set of data across multiple tables as an atomic transaction.
Also MS SQL let you throw a data object (say XML) at a stored procedure, convert it to a table structure with some XPath (another useful black art like regex) and use that as the input to a single INSERT.
>You can do some insanely complicated stuff in a stored procedure.
You’re giving me flashbacks to stored procs that directly invoke java methods, and batch scrips that load client supplied data from and FTP server into external tables.
Yeah, the business logic was all over the place. Everything needed to be reverse engineered any time there was a problem, or if you wanted to make any changes.
To make it worse, the app had about 300 scheduled tasks that were a mixture of batch files, SQL scripts and java classes. None of which had source code, most of which were slightly different and essentially operated as mysterious black boxes. We ended up having a copy of JD-GUI on all the production servers because we needed to decompile for debugging so often.
There's no reason you can't store your sql/tsql/plsql in version control.
We were doing this 20+ years ago, all code was in csv (we upgraded from rcs to csv), and we had a productized distributed scheduling system that would deploy all the sql scripts every night on a number of oracle databases running from aix, to solaris, to vms, to hpux, to irix, and later linux and windows NT.
Similar like you would now use jenkins to build, deploy and test your java apps.
Developers would never touch the test/production databases, only commit sql to csv. Develop local, sql text files, test on a local or shared development database, and then commit to csv.
What tooling do you need? They're just source code files that you can edit with a code editor or an IDE (DataGrip). In addition you need a file to deploy your sql to a database, this can be a sql or shell script, or a gradle file or even make.
How is this a problem? There are many code editors / IDEs / plain text editors in use with pretty much every other programming language, and code in those languages can be put in version control from all of them. What you're saying is equivalent to “Only Java¹ code from The One True Java¹ IDE with all these features can be put into git”. So which One True IDE is that, and where's your proof that only code from that IDE can be found on, say, GitHub?
> My cries to use git were unheeded (would have required upskilling everyone on the team).
What is the best practice workflow using git with SQL server views/procedures? Can you actually somehow track changes in the views/procs themselves so that if someone happens to run ALTER VIEW, git diff is going to show something?
You can version your scripts & DB as they get pushed with a common tracking number. Personally I think it's best to establish a workflow where manual changes to the DB would be pointless and likely to result in them being overwritten.
Outside of commercial tools dedicated to the purpose: you can query the DB for the content of the procedures/tables and compare them to a given set of scripts, or the most recent expected version. Auto-generated ORM models can be used to validate table/view composition for a given DB/App version, as well. Having these capacities baked into the versioning and upgrade process can do a lot over time to correct deviating schemas and train developers away from meddling with DBs outside the normal update procedure :)
It's mostly about using git to ensure a consistent snapshot as changes are made rather than updating one procedure at a time.
If you are asking about SQL text then a good formatter would make it easy to see the physical diffs. Otherwise there are various vendors and tools that parse SQL and show the logical diffs in queries and schemas, along with doing deployments, backups, syncing changes, etc.
Every conversation I've ever had came back to using RedGate SQL Compare to diff databases and TeamCity for CI.
You basically shouldn't allow anyone to modify anything without it being scripted (bonus points if it comes with a rollback and is repeatable for testing). Your scripts then all go into Git.
RedGate's SQL Change Automation (formerly ReadyRoll) is pretty slick. DbUp is a good, free alternative.
I second the rollback and repeatability bonus. Every script should leave the database in either the new state or the previous good state no matter how many times it's run.
I spent a year in a role where 50% of my duties was writing sql reports. These reports where usually between 500 and 1000 lines of sql a pop. Sometimes the runtime of the report was measured in hours, so learning efficient sql was important. The company had a lot of people that had been writing sql for awhile, and there were lots of cool code snippets floating around. I learned a lot in that year.
I've moved to writing backend code. I'm surprised most of my peers cannot write anything more complicated than a join. Most people are perfectly happy to let the orm do all the work, and never care to dig into the data directly. Every once in a while my sql skills save the day and several people in other departments contact me directly when they need excel files of data in our database we don't have UIs to pull yet.
It's often useful to treat SQL as an extraction/filtering language, and then use R or Pandas as a computation/reporting language.
I think of it as separating I/O and computation. SQL does enough filtering to cut the data down to a reasonable size, and maybe some preliminary logic. And then you compute on that smaller data set in R or Pandas -- iterating in SECONDS instead of hours. The code will likely be shorter as well, so it's a win-win (see examples in my blog post).
I can't think of many situations where hours of runtime is "reasonable" for an SQL query. In 2 hours you could probably do a linear scan over every table in most production databases 10-100 times.
For example, if your database is 10 GB, you could cat all of its files in less than 5 minutes (probably much less on a modern SSD). In 2 hours, you can do a five minute operation 24 times. I can't think of many reports that should take longer than 24 full passes over all the data in the database (i.e. pretending that you're not bothering to use indices in the most basic way). If it takes longer than that, the joins should be expressible with something orders of magnitude more efficient.
I've mainly worked with big data frameworks, but I think that almost any SQL database (sqlite, MySQL, Postgres) should be able to do a scan of a single table with some trivial predicates within 10x the speed of 'cat' (should be within 2x really). They can probably do better on realistic workloads because of caching.
This isn't meant to offend, rather as a point of consideration, but seeing your example use case being 10GB and then talk about big data frameworks makes it hard for me to take this advice seriously.
I might reach for that kind of tooling at the hundreds of TB to PB scale, but in our production applications we have _tables_ that are multiple terabytes. SQL is just fine.
Yes, we also have have queries that run in the timescale of hours and they are always of the reporting/scheduled task variety and absolutely vital to our customers. Long running reporting queries are pretty acceptable (and pretty much the norm since forever) outside of the tech industry and your customers won't balk at it.
It seems like you misunderstood what I wrote. I'm saying you should consider using R or Python if your reports are taking a long time, not big data frameworks.
Big data was a reference to thinking about the problem in terms of the speed of the hardware. If it's 1000x slower than what the hardware can do, that's a sign you're using the wrong tool for the job.
Getting within 10x is reasonable, but not 100x or 1000x, which is VERY COMMON in my experience. These two situations are very common:
1) SQL queries that are orders of magnitude slower than a simple offline computation in Python or R (let alone C++). The underlying cause is usually due to bad query planning of joins / lack of indices.
You might not have the ability to add indices easily, and even if you did, that has its drawbacks for one-off queries.
2) You need to do some computation that's awkward inside SQL. Statistics beyond basic aggregations, iterative computations (loops), and tree structures are common problems.
ETL pipelines that hop between different kinds of storage/computing platforms to exploit local maxima, like you're pointing out with R and SQL working in concert, is pretty common in companies working their way up to BigData and academia.
From the Enterprise side I think too many developers have an unfounded expectations around data storage technology. There's this unchallenged belief that monolithic datastorage that will solve thier problems across the entire time/storage/complexity spectrum. By bringing multiple tools to bear, instead, you end up with more purpose built storage but far less domain impedence.
Slapping a denormalized NoSQL front-end for webscale onto a legacy RDBMS can be a cheap win/win to maximize the capabilities of both. SQL + R is oodles better than R or SQL in isolation.
Your advice is pretty good, but I would definitely say that SQL doesnt have a strong relationship between lines of code and runtime, you can line break (and many do) your wide table's select into many columns and get there pretty quick.
If you are writing SQL regularly, understanding the basics of how the queries you write is not that hard for you engine of choice, and everyone should be required to understand the basics of reading an execution plan so they can find the right inflection points between data gathering and processing.
I regularly sigh write and maintain SQL procedures that are >10k LOC, and their runtime never would exceed minutes, much less hours.
If catting 10gb of files takes 5 minutes than catting 500gb of files takes 250 minutes. This is, of course, an over estimation of how long it takes to look at data on a disk. It's also the most trivial thing you can do with data, read it once.
I think we have a misunderstanding about the scale of data.
EDIT: Reading your post you mention that dataframes stores data in memory. Working with data in ram would provide a significant speedup. It just wasn't possible.
I'm not denying that you could have a query that takes hours and is within 10x the speed it should be, just saying that it's very common to have inefficient SQL queries for reporting. I would say it's almost the "default" state unless you do a bit of work.
There are many interesting queries that don't touch all the data in the database. The 'cat' thing was basically assuming the worst case, e.g. as a sanity check.
Stick the --+ORDERED flag on a random select sometime and look at what happens to the estimated query cost to see how easy it would be to fuck this up if you had to make all the optimisation choices yourself.
I had a similar role where I was writing boring LOB apps in a very gross language, but since we were using an SQL backend for all the data, I instead challenged myself to using bare templates in the actual programming language and writing all the extraction, transforms and logic into SQL selects and (when unavoidable) programmatic bits.
I also learned a crapload about obscure SQL since I would go to extreme lengths to achieve this. There was a lot of meta-SQL programming, where I would use SQL to generate more SQL and execute that within my statement, sometimes multiple layers deep. It was beautiful in its own way, expanding out in intricate patterns.
You might be interested in this project then https://www.getdbt.com/ it's a SQL compiler and executor which has many features like you're talking about (macros to generate SQL and so on). It makes it pretty easy to build up a complex DAG of SQL queries and transformations.
I never personally wrote meta-SQL but it was used at that office and I read it. I didn't see beauty. When I read the code and understood it I felt terror.
Are you asking for “sharing skills” as in something like StackOverflow, but for SQL?
I would think there already is a Stack Exchange for SQL, possibly several (for different RDBMSes/ dialects); go have a look there, if this is what you meant.
Are you asking for examples of advanced SQL skills?
In my experience, if you can grok lateral joins (aka cross apply), recursive CTEs, window functions, and fully understand all the join types, that's a gold star for understanding SQL!
I didn't add indexes mainly b/c for analytic warehouses that are columnar, indexes are less important / meaningless. Partitions though, that's important!
> Most people are perfectly happy to let the orm do all the work, and never care to dig into the data directly.
ORM all-too-often defines data structures from code.
Linus Torvalds wrote,
> I will, in fact, claim that the difference between a bad programmer and a good one is whether he considers his code or his data structures more important. Bad programmers worry about the code. Good programmers worry about data structures and their relationships.
SQL is a mind bender for me. I do a lot of work on Data, use python and pandas to do a lot of data magic, but the problem with me is my mind is too procedural in thinking.
- Step 1
- Step 2
- Loop through results in Step 2
- Curate and finish output.
I try very hard to transform the above steps into an SQL statement spanning multiple tables, but always fail and I usually fallback to python for manually extracting and processing the data.
Does anyone else face this problem?
Any suggested guides / books to make me think more SQL'ley ?
I took a course in prolog, really seemed to help get my mind in the right place.
Basically, instead of thinking in terms of how to get to the result, think instead of the result and figure out how to get there. For example, let's say I need to get a mapping of students to their classes. One way would be to get the students, then loop through that to get classes for each one. Another way to do it would be to ask the database to combine those sets of data for us instead and return the result (join students and class tables on the student ID).
Basically, find the data you want from each table, tell the database about the relationship between them (join on <field>), and set up your constraints. I guess you could think about the query as a function, and you're just writing the unit tests for it (when I call X with parameters Y, I expect Z).
> I took a course in prolog, really seemed to help get my mind in the right place.
I have the reverse problem: every time I motivate myself to learn Prolog I feel like I could just input the constraints in a db and use SQL to get my results.
(1) Think of a database as a "big pile of stuff in a room". Some of it's ordered in a sane way, some of it's not. There is "a person at the door" to the room preventing you from entering - this is the Database Engine, not to be confused with the Database itself. You need to get things out of the room, and you need to put things in the room. You're not allowed to enter. If you want something out of the room, you must ask for it (query) by specifying what the thing looks like and which part of the pile you think it's in. If you want to put something in the room, you must provide explicit instructions for where it must go. When you write SQL, you are either asking a question of or giving instructions to the database engine. The DB Engine is distinct from the Database (the big pile of stuff...which you never have direct access to).
(2) As others have pointed out, SQL is very much the mathematics of Set Theory put into a programming language. Stanford has a free class on Databases. I took it a few years ago (it may have changed), and most of it had no SQL at all. It was fairly very straight forward - things you can do with pencil and paper. It's free, so no pressure. Go sign up and do some of the homework assignments. Do them until you get them right (you can re-take wrongly answered problems immediately). It'll help break your procedural mindset. https://online.stanford.edu/courses/soe-ydatabases-databases
I've thought about the same problem and think I have an explanation. SQL inherently avoids operations that cannot scale. This means that while the same operation can be written much more simpler as code if you're doing with pandas or spark, trying to do it with SQL would be less forgiving if your underlying data schema is not optimal or if your logic isn't scalable. But more often than not, the SQL query will run faster and more reliably given similar amounts of compute power to the two engines.
Of course you can always screw up and write suboptimal SQL that will take forever to finish, but it's just harder to do unless you're really trying to be dumb about it.
So I do think there's still a lot of merit in trying to learn and use SQL more.
It also ignores operations of type (set of rows) => (set of rows) or (row) => (set of rows). and makes you jump through hoops of CTEs to implement them except in some special cases. Limited versions of these (flatMaps) work just fine in spark and have no problems scaling.
Try some functional programming. It will liberate your brain from thinking operationally. You will soon no longer think in terms of steps, loops or anything like that. You will get used to thinking about data transformations, which is what SQL is good at.
I was about to add exactly this. Setwise operations, map/transform, aggregate, avoidance of mutable state, filters, avoidance of if/for with break conditions, quite strong avoidance of exceptions.
I work a lot with data as well primarily I use SAS, R and SQL. I'm also very procedurally oriented (as in I prefer breaking task down into discrete chunks), R's functional way of chaining operations with pipe operator feels so alien to me.
I find myself having the opposite problem I probably use (abuse?) SQL more than should.
My world (industrial plant) is very DB/historian heavy everything speaks SQL it's pretty much the common tongue connecting everything. I think this is slowly changing some PLC/historians now offer a webservice which returns JSON objects via an ajax query - personally I vastly prefer SQL to ajax.
I'd estimate for a typical problem I'm working with maybe 90% is done in SQL vs 10% in R/SAS code.
When I need Regression, Principal Component Analysis, Time Series manipulation, Plots etc I have to break into dedicated language.
For most other things - extract, merge, filtering, high level aggregation (Count sum etc) type of operations using SQL feels more natural and expressive to me.
I used to think this too, and wondered what magical SQL perspective my brain was missing. I'd see these elegant but seemingly impenetrable queries and wonder why my brain wasn't in the "SQL dimension" yet.
But over time, and very heavy SQL reporting, I realized the procedural mindset still applies, it's just not expressed as such.
You need to think through the steps of how you want to retrieve, transform, aggregate etc. the data and build up the query to accomplish those steps. The resulting query doesn't look procedural, but the thought process was. Of course you need to know the sometimes opaque constructs/tricks/syntax to express the steps, which again, look like magic at face value or just reading docs.
I think this is why people struggle with understanding SQL. The thought process is still procedural, but the query isn't. You need to translate the query back into procedural terms for your own understanding, which is a PITA.
Yes, that's what I'm saying. The difficulty people have is that the thought process (solution) is still procedural but the query is expressed declaratively. Once you learn how to navigate that gap you get good at SQL.
I still struggle with thinking too "proceduraly" sometimes, but what really helps out with me is drawing / envisioning the sets of data as a series of circles and Venn diagrams.
It also helps to change the language you use in your inner monologue. Instead of thinking, "For each row in table A...", you should think, "For all the rows in table A that match on...".
If raw speed is not an issue you can always fall back on temp tables or a cursor for those cases where you might have extracted an initial dataset to then iterate over.
EDIT: A CTE (Common Table Expression) might be what you're thinking you need. You can do some fun recursive queries with them.
Highly recommend Dmitri Fontaine's book "Mastering PostgreSQL in Application Development". Really helps emphasize how you can leverage the power of SQL to minimize all the Python / Panda steps to get the same result.
I have the same issue, procedural thinking. I've made some small progress thinking in sets and working to re-phrase my goals in terms of sets of data, sometimes using window functions (row_number(), etc) when I need n-1 items from the query.
Some problems can to be assembled from CTEs / subqueries when you might have reached for a procedural solution to 'finish' the problem. Grasping CTEs was a big thing for me and now I'm wrestling with issues of style mixing CTEs with tiny subqueries and wondering if the increase in readability (for me) is correct.
I'd love to see suggestions for digging deeper on this too.
1) Think in terms of excel spreadsheets. At each step in your SQL you are creating a set of data. Doesn't usually matter how many rites it is, you just want the right columns.
2) Once you have a recordset, decide how the next recordset goes together. I would mentally visualize putting one worksheet next to the other and think through what columns I had to join together to get the rows I wanted.
3) Make sure you know what your first recordset should be. This is the base data that everything can be joined into. I refer to this as the domain of records.
If you're mathematically minded (really just basic set theory), maybe try learning relational calculus. You'll express database queries in terms of Cartesian products, selections and projections. These operations translate directly into SQL, so maybe it'll help you think clearly first, then apply the syntactic translation. Honestly, if I had the choice I'd rather just use the syntax of relational calculus, as I find SQL too verbose and convoluted.
I think what your are doing is fine. SQL is strongest at answering questions not processing data. I think metabase has the right approach: What question do you want to ask your data? If you want to process and transform your data, I think your tools you are using are great for that.
Not trying to play word games, but what is the difference between answering a question and processing data? Aren't they effectively the same?
Using another tool for processing data often results in recreating SQL mechanics at application level. E.g. select this data, retrieve it, loop and if this, then set that, etc. SQL does it way better, guaranteed.
Of course that's often required for technical reasons (scalability etc.) or processing that's too complex to implement at data layer, or just for cleaner design.
«what is the difference between answering a question and processing data? Aren't they effectively the same?»
I think it influences the mindset of the developer. As you say, “retrieve ... if this, then ... loop”. If you're in a “data processing” mindset, then you'll think of a problem like “Get the total number of car widgets in the warehouse” as fetch a widget row; if it's of type car, add number to total; loop until you've processed every row; there you have your total. If, OTOH, you're in an “asking questions” mindset, you'll go: What was the question again, exactly? Oh yes, get the sum of the number for all the widgets which are of type car widgets. Which is almost exactly the same as SELECT SUM(NUMBER) FROM WIDGET WHERE TYPE = 'CAR';.
Processing data is when you do it (in code); answering questions is when the RDBMS (i.e, its code) does it for you. :-)
(At least that's what I think the difference is _in terms of vvkumar's original question._)
Agreed that SQL is amazing at processing data! I would argue that a lot of people are trying to both process their data _and_ ask a question of it in the same statement. Separating those out is really important to make analytics more scalable.
We do >95% of our transformations with pure SQL and the queries are primarily in Looker.
I really like "Beginning database design" by Claire Churcher, and the follow-on book "Beginning SQL queries" . They sound like really basic books, and they definitelystart from the basics, but then build up steadily.
SQL is the most powerful query language ever invented and widely implemented over relation databases IMO (qualified heavily for lurking RDF zealots). Every time you see someone start to invent their own query language, I almost always mark is as folly (similar to when people invent their own configuration languages). Prometheus and GraphQL stand out as recent examples.
DBs like Kafka who recognize this and instead offer SQL on top of their things take the right approach IMO KSQL.
No idea why this is getting downvoted, people must really dislike Datalog.
A recent up-and-coming software tool is the Open Policy Agent[0] and it uses Datalog -- though it's a custom implementation. This comment felt familiar so I went back and looked and it's come up on HN before[1].
While I agree with you, I met a friend of a friend in Denmark who works for PDC who makes the claim that, outside of them, there's only one other respectably-sized company on the planet that is actively keeping Prolog alive.
If I'm going to bet my career on one or the other, it'd be SQL.
Slight aside, I think there are good lessons to be learned from LINQ still.
GraphQL doesn't even seem like the same kind of tool. GraphQL is what you expose as an api to users. It can only do a fraction of what SQL does and thats a very good thing. You certainly don't want users running their own SQL on your server.
I'd disagree -- I (and some other SQL/postgres zealots), would say that allowing users to run SQL on your own server is only a bad thing when you are running a database with insufficient configuration and/or security features.
We just might happen to live in a world where all the RDBMSes have that same failing, but I'd argue Postgres doesn't fail as bad in the are with things like row level security[0].
It's near trivial to exhaust resources when you have SQL access. That's not information disclosure, but it's darn painful if the set of users includes a sizable enough group of bad faith actors.
Some people will even do it in good faith. They’ll tank performance and for some reason it will never cross their mind that they could be causing the issue. Happens with any shared system without rate limiting.
I would 1000% not take my opinion as gospel on GraphQL -- I know how it works, but I have not written an implementation myself, and have not even spun up Apollo. With that heavy grain of salt, here's how I feel about GraphQL:
- It doesn't feel very interoperable nor flexible (it's a different "ecosystem") with regards to other software -- meaning that if it's much less "integrate with GraphQL" and more "build on top of GraphQL" or find the "GraphQL" way to do things. I find this to be different from approaches like OpenAPI/Swagger.
- Complicated GraphQL queries start to look like SQL to me, and I suspect that will become even more the case as time goes on, and people decide they want advanced things that are in SQL, like window functions.
- GraphQL locks you in to knowing the backend design from the frontend -- though this was mostly always the case with REST (need to know which endpoint to call), "true" REST/HATEOAS compliant backends held the promise of you just being able to ask for a shape of an object, and make decisions about whether to dive deeper as you go in a principled way. This kind of gets into the semantic web concepts -- but that's mostly a seas of never delivered promises so...
- It's features are mostly offered by tools like Postgrest (vertical/horizontal filtering[0], resource embedding[1]), which offers less complexity.
- GraphQL is going to very likely spend the next few years running into issues/features that the REST/HATEOAS has already solved (albeit not in a standardized way). I looked at a random issue[2] and this is definitely something that seems quite solved in the RESTish HATEOAS world (at the very least you don't have to wait for GraphQL to do something to solve it).
Basically, I wish someone had built the ease-of-use GraphQL offers on top of the HTTP+REST+HATEOAS model -- because it's wonderfully extensible. Excellent solutions often lose out to "good enough" solutions, and I just... don't want that to happen this time (I mean I don't think it will, it's pretty hard to beat out good ol' http).
We just got to a really good place with Swagger/OpenAPIv3 + jsonschema/hyperschema, in being able to create the standardized abstractions on top of HTTP (and bring with it all the benefits of HTTP1/2/3 as they arrive, and it feels like GraphQL is a distraction. The "entity" graph is just a rehashing of the relational database model and it doesn't feel like it offers much outside of a standardized way of presenting your DB -- if you want to present your DB to frontends why not just let them send SQL queries directly? No I'm not tone deaf, I know most devs don't want to write SQL, but it seems like they're about to get into bed with GraphQL despite the likelihood that it's just going to be another SQL once it has enough features.
That said, GraphQL is doing amazing things for developer velocity, and it deserves praise for that -- tools that grasp mind share this quickly usually offer a large amount of real benefits to the people using it, and I've seen that GraphQL does that.
Write a trivial app with it. I thought "oh yeah, looks ok, but not a huge benefit over REST", now I've started playing with it, I'm saying REST is the new SOAP.
SQL is nice on a surface level and helpful in practice.
Having a working intuition for relational databases is valuable on a deep level. I mean having a sense of how to organize the tables, what sizes are large and small, when to add what kind of index and what the size and speed limits are likely to be for a given data structure. That's extremely valuable.
BTW, we're preparing to move a postgres database that's a few TB from Heroku to AWS RDS. The catch is that we can't afford more than a few minutes of downtime. If this is in your wheelhouse, reach out! We'd like to talk.
Ok, we had a MySQL DB around 500GB, and not on Heroku, but we used the AWS Database Migration Service:
https://aws.amazon.com/dms/
We had just a few minutes of downtime.
I did have few issues around permissions, but an AWS support person walked me through the process and we got it up and running. The support was great, and we just have standard business support.
EDIT: DMS might be a no go, as jpatokal points out. While Heroku allows external connections, it doesn't allow super user access which, AFAIK, DMS requires.
Heroku Postgres doesn't support external replicas, which complicates migrations out considerably. If you can't tolerate more than a few min of downtime for writes, you pretty much have to build the replication at the app level.
> BTW, we're preparing to move a postgres database that's a few TB from Heroku to AWS RDS. The catch is that we can't afford more than a few minutes of downtime.
Would it be ok to serve stale data for an extended amount of time? If so then you could modify your application to write to AWS RDS but to keep reading from Heroku. Then you start copying over the data from Heroku to AWS RDS, and once the copying is done then you point your application to read from AWS RDS.
Another thing: If your tables have foreign key constraints then you need to create the tables in the new DB without them first, since the writes coming to the new DB could be referencing data from the old DB which have not yet been copied over. Once you’ve completed copying the data over you add the FK constraints to the new DB. Depending on how your application was made this could be a problem or it could be ok.
Depending on the size of the relations, something like adding a new FK might result in more than minutes of downtime due to an exclusive lock--although that's a heavily couched "might" given all the work the Postgres team has put towards online DDL.
If you're going Postgres to Postgres I enthusiastically recommend Bucardo.[1] I'm not very technical, but even I managed to get an ~1TB DB successfully replicating using it. In my case it was going from AWS RDS to AWS Aurora. Previously others had tried AWS DMS, but we kept having data integrity issues. And DMS cost us significantly more than Bucardo whose only cost was a medium size EC2 instance and data in/out.
I've been saying this for years, because I started doing web apps when we pretty much were sending raw SQL over to a database. Even after ORMs got bigger, I still stress knowing SQL even if you don't use it often, because it helps you understand the ORM.
And oddly enough, now I'm a data engineer. Everything I do, even when I'm not doing pure SQL, is influenced by years of experience in SQL. Either the languages of my big tools are still based on SQL in some fashion, or it simply helps to have an understanding of the ecosystem to figure out what's going on at scale.
Everything else has come and gone and come again, but SQL has help up pretty nicely. The only other skills that come close for me are working in languages that branched out of C (because the syntax isn't so different over time) and years of on again off again procedural languages (ColdFusion, vanilla Javascript, etc. leads to it being way easier to pick up Python).
I consider myself functional, but not proficient in SQL. Is there a reliable and easy way to assess one's skills in SQL? I feel that most of the new things I learn about SQL these days are database specific. I wonder whether I'm missing out on something, or do I already have the "core" SQL knowledge down, and everything else is special cases?
The core concepts are the most important things to know. How do SQL queries and their various clauses work. Tables, data types, indices, and constraints are all very important. With that knowledge you can handle most SQL databases, and begin to learn and work around their quirks. Yes, there are a lot of database specific details. Procedure languages can be useful such as PL/pgSQL for PostgreSQL and T-SQL for SQL Server.
I recommend you look at one of Markus Winand's presentations, such as this one : https://youtu.be/xEDZQtAHpX8
I felt it summarized well the modern foundation of SQL databases.
Seriously. Each time I have to do something in elasticsearch I'm constantly trying to reason why the creators went about it like this. Yeah hey let's serialize some nested JSON object that contains no less than 2,000 brackets.
Me too. I spent years complaining about databases and all the structure and rules and inflexibility. It was initially great to go to NoSQL backends where suddenly you're free. Now I really miss them :(
Hell Yes.
Although it takes some time to switch from Procedural or OOP to highly Declarative world of SQL, but man, its worth pursuing.
Only thing which really helped me build my confidence was solving more real world problems which involves 2NF and 3NF design, JOINs, Triggers, Indexing, Views & Materialized Views for denormalization, CTE and Recursive CTE's.
You may call me an extremist but from server side programming point of view with Postgres and FDW (Foreign Data Wrappers) which has ton of features other than SQL only thing i miss is HTTP server. :)
I’ve been a career analyst for about 12 years so I write SQL every day. I know it very well. I’ve developed a very strong love / hate relationship with it over the years.
On one hand, you write what you want to find in fairly common language and you almost always get back what you want if you do it correctly. In many ways, it’s like a precursor to Alexa but in written form. It’s super easy to pick up for non technical people.
On the other hand, it’s extremely difficult to code review, and on a very complicated piece of business logic, errors could mean the difference between hiring 10 more people or laying off 100. So almost always it’s just easier to re-write.
Imagine if engineers couldn’t understand each other’s work, and had to rewrite code every time someone leaves the team. It’s insane to me that this is standard practice.
> It’s super easy to pick up for non technical people.
I haven't found that particularly true. I've trained people in SQL and yes, they can do the basics pretty well, but it requires a particular skill to actually write effect complex SQL statements. Non-technical people, being non-technical, generally cannot do that.
> it’s extremely difficult to code review
I don't see why it's any more difficult than any other logic? It's always important to test.
Yes non technical people can’t write complex sql, but they can self service basic questions easier than learning Python or R for example.
And it’s difficult to code review SQL because complex business logic means complex data manipulations and those are hard to visualize and comprehend. You don’t have to think in a 3D space so much when you write python.
SQL is for data systems what IP protocol is for networks: it is the neck of the hourglass. You can build plenty of various things on it (the top of the hourglass: applications, reporting frameworks and so on) using various underlying technologies (the bottom: storage engines etc.) but you can't remove the neck without breaking the hourglass.
This is why SQL language and IP protocol are two most valuable things in computer world.
You've intrigued me with this comment mainly because I know SQL quite well but IP not at all. Is wikipedia[0] for IP a good reference to dive in a bit? Would you recommend something else to grok why IP is so fundamental?
There is not much to grok really. It's just that there's plenty of low level transmission protocols like Ethernet, PPP or CDP, there's plenty of even lower level physical media types like copper or fiber or pigeons. Then there's a lot of higher level things that rely on IP: TCP/UDP, ATP, SPX and so on, and then the entire universe built upon them. But in the middle there is just IP (v4 or v6) and that's it.
Try to do things you'd normally do in code. Many algorithms can be implimented in SQL.
DON'T DO THIS IN PRODUCTION but it's a good way to have fun with SQL and learn a few things.
Find an open source project in an area you're interested in that uses a lot of SQL.
There are a lot of great free resources out there, but I had recently shared https://sqlbolt.com with a friend that wanted to learn SQL and she found it useful.
More SQL learning awesomeness, thanks zappo2938.
It strikes me that the exercises identify the correct result very quickly. They must check against a required result, meaning it is possible for learners to enter a variety of different solutions. Any tips on SQL best practices?
In University, I needed one more CS elective course, and the ONLY class available that fit my schedule was SQL. I had no interest in SQL, wanted nothing to do with it, and only took the class under duress. Even to the point of admitting as much to the prof in a casual conversation (he was a good guy, easy to talk to).
Within two months into my first job out of school, I was assigned to implement a SQL parser as a modern new interface for an ancient proprietary database. Every job since, I've written tons of SQL queries. SQL rocks.
It's very interesting how many application developers shun SQL. I believe a lot of it is due to a pervading sentiment that SQL is unruly or inelegant. There are certainly quirks to the language, and it is a difficult transition to think in a declarative rather than imperative manner, but once you make the jump, it's an invaluable skill to have. Fortunately, I have the luxury of being able to manipulate our codebase from any level of the stack. What this means is that rather than jump through hoops on the API or frontend side to accomplish some task that would require reams of code, I can just get my output via a simple query, and it will perform faster in almost all cases.
From an analytics point of view, I can't imagine not using SQL. I've seen people pull reports from multiple websites, text files, etc., spend an entire day manipulating them in Excel, and still not get their data model working as expected, not to mention that it is very slow. A couple of queries with some temp tables and voila, magic happens. It really does make you look like a superhero when you can deliver more accurate results in a fraction of the time it originally took. I'm surprised there isn't more of a market for this skill, surely there's a lot of programmers from the 80's and 90's who have this skillset in abundance.
I'm a frontend web developer, but for a year I worked on a custom ETL system and wrote lots & lots of SQL. I still find myself using knowledge I learned back then and applying it today - for example, being able to review BI's queries is immensely useful. I'm also much less intimidated when I look at backend code, since a lot of it is interfacing with a database in some way.
I think SQL is one of those essential "secondary" skills for developers.
Reading the comments here is like someone with years of experience with jQuery saying how jQuery is simple and powerful and nothing will replace it. SQL is to relational databases what jQuery is to the DOM, only shittier (maybe like mootools) and refuses to die, probably because all these SQL experts aren't really good programmers. Reading some compare SQL to 70s style procedural code tells me they haven't moved on from expert beginner territory.
I find that programmers with a chip on their shoulder about SQL have poor SQL proficiency. And I don't mean "poor" in the sense of not knowing about window functions. I mean coding a query inside of an enumerator and wondering why they get feedback about poor performance in production.
SQL has opened a lot of doors for me. It's a starting point, not an end though. Learning about dictionaries, lists, and other data structures has proven valuable and compliments SQL and tabular datasets very nicely. I got into those areas by working with SQL generators (ORM's) written in Python (airflow).
My thinking on SQL has evolved and lately I see it as a set definition tool. "Do action X on dataset Y." It's really useful for understanding data structures and data meaning too.
I've worked with more than a few SDE's who look down on SQL, but it's a really good tool when used properly, and it cuts across many technologies. Writing code to write SQL can be very powerful. And sometimes coded or scripted data wrangling without SQL is very useful too.
15 years ago SQL knowledge was not that widespread and it was easy to get tagged as a report writer. Today, a lot more business, product, finance, and accounting people are really strong with SQL, and rely heavily on exporting data to excel for further analysis. Knowing how to answer business questions, get insights out of the data, and define or categorize sets of data are all enhanced by SQL. Report writing is not as much of a thing anymore because people want to view the data in diverse ways.
The barrier to entry is low with SQL, but learning it well takes time and some mistakes to get efficient and precise with it. 15 years later I am still learning new uses for it. One example is JSON querying and transformation which is supported by hive, presto, and some other compute platforms. It's easy to mix and match JSON, arrays, and tabular data structures, in one or more tables, from the same SQL query.
Presto's set of functionality ( ANSI-compliant SQL plus so much more) is a pleasure to use as a data scientist who does a bunch of analysis. JSON, arrays, maps, dates, HLL, spatial stuff, rich aggregation functions, etc.
The only real problem with it is that it's easy to get into some thorny-looking transformations because there is so much to work with.
One way I've solved for this in presto (Athena) is creating a table that flattens the JSON, and views on the table to un-nest arrays or show different grains of data for different use cases. Then transformations and joins to other tables become very simple.
This works well for my current area because the JSON has a consistent and well defined schema.
I recently had a candidate that had very good Scala skills and very poor SQL skills. Despite Scala and spark being powerful, you can over complicate a solution in those languages that could otherwise be easily solved in SQL. That’s interesting to me because I see SQL as a fundamental tool that you build on top of and some people these days seems to skip it.
Do you know how good is the SparkSQL optimization in e.g. Databricks? Conversely I have also seen SQL used in overcomplicated ways to replace a more general programming language.
When scripting in SQL I have seen that in certain SQL flavors there is no good way to split a large file into k smaller ones for a fixed k without creating a new column 1-<number of records>, selecting k new tables modulo the number in the new column, then writing the k files. The way this is implemented is usually naive so the performance is piss poor due to writing and subsequently reading so much data.
If this candidate was very good at Scala, assuming they were a Spark programmer, they could probably do lots of dataframe operations that were as good or better than SQL commands regarding performance no?
I never encountered SQL or at all in my CS education. I got pretty good at Scala for Scalding MapReduce before writing my first SQL query. Learning the basics only took a few days after that.
If someone has enough understanding of data and CS fundamentals to be very good at Spark/Scala, learning SQL isn't going to be much of a barrier for them. I find that the bigger problem is usually people who have a very strong understanding of SQL but no clue as to the resources needed to perform different kinds of operations.
I wonder if we'll one day get something like SQL, but that is based on hypergraphs instead.
Correct me if I'm wrong; I understand you can represent any relational model in hypergraph. And a lot that aren't representable in relational model, but are natural in hypergraph form.
SQL is pretty great. I've been doing advance SQL in Postgres and BigQuery for quite a while and it's one of the most satisfying experiences.
Once you wrap your head around basics such as GROUP BYs, JOINs, CASE statements etc, you move on to advance concepts such as Window functions and there suddenly a new world of possibilities opens up to you for analytics! I've dabbled a bit in PL/pgSQL, but the syntax is way too arcane.
SQL can also be counter-intuitive sometimes. I rewrote a particular PG query and reduced time from ~30minutes down to 2 seconds by adding a subquery. :)
And no, ORMs can't really do what SQL can do in an equivalent manner.
I remember interviewing at FAANG and being asked to code up various tree traversal algorithms... and moments later I would be asked to write window function aggregations in SQL. And it was like this for all the interviews with that company - it was fairly bizarre as I wasn't sure what the aim there was. I understand that SQL is omnipresent, but surely people with algorithmic knowledge would be able to pick up SQL in hours or days, while the opposite doesn't quite hold.
(Oh and I agree with the article and I do like SQL for all sorts of workloads, that's not the point.)
> I understand that SQL is omnipresent, but surely people with algorithmic knowledge would be able to pick up SQL in hours or days, while the opposite doesn't quite hold.
As someone who knows essentially zero SQL, is this really true? How long would it take me to learn it well enough to be competent with it?
In my experience, "algorithmic knowledge" doesn't indicate that you'll be any damn good at SQL.
I think the time it takes depends on what competence looks like for your use case. If we're talking about just retrieving records that match certain criteria, you might get over the hump of understanding joins faster than a non-programmer, and you could probably write 90% of the queries you'll ever need in a week or less.
Reporting is different. Designing schema is different. Troubleshooting data issues in a poorly designed database is way different. Tuning queries and indexes for performance is way different. Not all use cases include those things.
I learned SQL around 1996 or 1997 and had an absolutely fantastic college course or two involving it.
I've mostly been a backend developer... having the ability to go in and fix SQL and make things run in < 1 second instead of 5-10 minutes has been one of the best skills I could have picked up.
It is sad that so often "self described 10x programmers" build solutions to go around SQL that are horrible failures. Poor use of ORMs, weird abstractions at the application layer that force developers to use poor data access patterns, unnecessary "locking" at the application layer, unnecessary "existence checks" at the application layer, processing objects 1-by-1 in the application. All these things lead to terrible performance and huge wastes of application memory/IO.
I love some of the NoSQL solutions too as in some cases they can force teams to use better data organizations/patterns and scale so well. Those patterns are often possible in an RDBMS but the system doesn't guide a team to using those patterns. The way CQL in Cassandra forces you to think about data organization is great for example.
I love SQL. I love SQL for Data Warehousing even more. There’s a school of thought for DW that goes back many years. Many dimensional modeling practitioners know who Ralph Kimball is and also know the Inmon vs Kimball battles. To this day, star schema as a design pattern has helped dozens of analysts in my company abstract the relationship between measures (facts) and how to slice them (dimensions).
Well, I'm not a fan of SQL. To me the math underpinnings are not well exposed, probably because the language doesn't "see" obvious relations and that because of this, joins must be performed explicitly, over and over again, and come to dominate the query.
Isn't it obvious that when a column has the name of a table that the idea is that the columns of that refactored out tables become available as if it were a part of the main table?
When the query works regardless of the structure of the table, the database can be refactored with ease. Most such queries would simply specify column names, and a filter to apply on the records. Any joins needed, and which result from the structure of the database, would be inferred.
Such a "NoJoinDB" would clearly boost productivity, and lots of applications could be written with no explicit join at all. A language like SQL seems to hide the simplicity of most queries!
Please comment about the validity of this perspective.
Another aspect of SQL that I have come to appreciate is it's underlying philosophy: state what you want at a high level with a strict syntax; in exchange, your backend will make aggressive optimisations for you.
This sort of paradigm really made me understand that writing 'high-level' code and performant code are not mutually exclusive.
The reason SQL is great in my opinion is partially what the author highlights. In addition I think one of the reasons is that it is a very mature technology that baked over decades. I first was exposed to ANSI-92 SQL in school and that was already like the 2nd or 3rd revision of the standard. I used it in my first couple of jobs and the skill applied everywhere, Oracle, mySQL, SQL Server and there was even something called Access that understood it and Access was widely used amongst small businesses.
Knowing SQL to me was and even now is extremly helpful to understand data and how to tackle data problems. There weren't many other technologies that had such profound long term applicability, besides maybe knowing and understanding HTML and C. Many new things are just variations and improvements of those core technologies, and those can easily be understood and learned with good foundational knowledge.
You can be a good developer and not know bash. Bash is, in some ways, programming horror. But SQL is pretty fundamental. It's a thing onto itself. There are plenty of shells and shell-like languages but there is really only one SQL.
wrt to shells - the combination of find/grep/sed/awk for locating and transforming textual data makes for its own kind of extremely powerful query language. I've always valued being able to think in pipelines made up of these tools.
And more recently, we see SQL abstractions overlaid on top of things like Kafka streams with messages encoded in Avro thanks to technologies like Spark structured streaming.
SQL is not only relevant to traditional databases. I never got into C#, but LINQ looks really interesting.
The thing is that most of the applications do need the complex data and thus do need need deep SQL knowledge. This is why NOSQL (I think) databases became popular. A DBA is called an "administrator" for a reason - most of the time he spends making sure the data is replicated and backed up/restored and really you need her help when things go to shits. It happens a lot but still in the majority of cases her role is invisible as the devs are trying to abstract the persistent layer (Hibernate and all other ORM frameworks, for java for example) where SQL-compliant database are just a part of the corporate landscape - not because I need it, but because it is there and all the mentioned processes are there.
I fell in love again with SQL while working in my current position. What _really_ made me fall in love was the project dbt[0] which is a SQL compiler and executor. You can build a DAG of transformations all the way from your raw data to tables ready for viewing in your BI tool or consumption by your ML model. I'm still amazed at the things I can do with SQL alone without having to bring Python into the picture.
Also, I had the pleasure of meeting Craig at PyTennessee a few years back. Really great guy and yes, he does seem to wear a hat all the time!
Same experience here. dbt has been a pleasure to work with, and putting it on Redshift with Redshift Spectrum has proven a great pipeline for our startup.
With all the data scientist thought pieces out there, I really wish there was more in-depth discussion of SQL and highlighting of the other shenanigans you can do with query optimization, subqueries, and even analytic/window functions. (although that last one is apparently the Hello World equivalent in data sci: https://twitter.com/minimaxir/status/1094430576092704768)
As a data scientist, I spend a lot more time building queries than working with models.
Ha I could say the same thing about my $EDITOR, which happens to be vim. I learned how to use it about 2 decades ago and I still use it almost daily since. Some things are really worth investing your time learning them.
I think the power comes from knowing something -- _anything_ -- well.
I feel the same about vim and common unix tools (bash, sed, awk etc). If you can use these effectively, these can be very effective productivity tools. The learning curve is steep, just like learning to ride a bicycle, but once you do, it is difficult to imagine life without it.
There's not much going on in this post, so I assume the overall high rating is due to SQL being so damned fantastic.
I commented to a friend the other day that my favorite thing to do on any project is project-wide query optimization. It's like pulling weeds or power washing.
Arguably _every_ application is simply a means of manipulating data. All the other parts are important, of course, but the data comes first. And SQL is a hell of a UI for data. I've tried quite a few other query languages (or structures), even inventing my own once or twice, but none come close.
If you're early in your career and looking for something useful to learn, SQL's a good bet. It will probably outlast every other part of the tech stack you now use. Everything the author says about SQL is true.
But, like asphalt for roads and aluminum for airplanes, it still can stand a lot of improvement. By the way, it's just as important as asphalt and aluminum, and consequently just as hard to change. That's the curse of the customer base.
I wish the SQL world could agree on standards for string manipulation, and stored function / proc programming.
It still amazes my that Don Chamberlain was my dad's college roommate and co-worker at IBM. I always try to weasel college debauchery stories out of him, but as yet have been unsuccessful.
I’ve met far too many devs who hide behind ORMs instead of writing raw SQL and it makes me sad. There’s so much one gives up by allowing an ORM to handle everything for you.
As a dev, I have to use a query builder. It's frustrating because it's ten times more complicated than SQL and slower to execute, with basically zero advantages.
Not really interested in naming and shaming because the lead dev trolls the internet to find people that don't like the framework it so he can publicly mock them.
As for PyPika, I wouldn't be fond of that, but that at least has an ability to get the raw SQL being built. There's no such option in my framework (and I've looked for hours combing the source code trying to find it), so debugging why the data coming out is wrong is kind of a nightmare (especially since the documentation is incoherent).
I can't remember a single job interview where I was asked SQL questions. But I can say that I conducted dozens of interviews for developers and almost always asked for some simple queries.
The first question would be for the statement retrieving an entire table without any restrictions or conditions and 50% would fail giving that query. If somebody would manage to actually build a query involving a JOIN, WHERE and a GROUP BY then that interviewee is pretty much hired ...
This thread has really opened my eyes. I was introduced to SQL on the first day of my first internship, it threw me for a loop and while the job was application development, the industry was finance so we were a very SQL heavy shop. That first "professional developer" experience burned into my memory the idea that all developers knew SQL, to me it was just an implicit thing that everyone knew -- even if they didnt work with it every day.
I agree with the core idea. I've seen all kinds of people use SQL in their jobs even if it's not every day: marketing, finance, developers obviously, etc.
>You seem like a superhero. You seem extra powerful when you know it because of the amount of people that aren’t fluent
That's exactly what I said to newer developers about SQL. Another two examples of this sort of skills are git and regex. Most people knows how to use them but don't dive into them that deep and you just need to spend very little time -hours, or days at most- and you will better than most people at them.
The most important skill that knowing SQL has given me is the ability to think in terms of sets and declarative ways of defining what I want the end result to be (I believe the fancy term is the "projection" :) ).
Yes, it can appear to be magic to people who don't think that way, but doing things with window functions etc and understanding what the output of EXPLAIN means have helped unbelievably when improving performance etc.
I consider myself lucky to have begun my career in earnest at a time just before ORMs became the norm in web development. I wouldn’t say I am an expert, but wow has the sql experience developed early on paid dividends over and over throughout my career. I feel like postgresql in particular is one of the most powerful pieces of software I have ever worked with. You really don’t need much more to build a robust backend!
"like the other mechanistic software theories, the relational database model is a pseudoscience; that it is worthless as a programming concept; and that the relational systems became practical only after annulling their relational features, and after reinstating – in a more complicated form, and under new names – the traditional data management principles." --Software and Mind, pg. 676
While I find SQL useful too, there are other things I value even more than my SQL skills.
Being able to use a POSIX Shell with tools like curl is certainly one of them as it enables me to connect different technologies on a very practical level.
Furthermore, understanding the basic functional programming principles is invaluable if you want to build clean algorithms (doesn't matter if you use Excel, C or Lisp).
ROW_NUMBER(), LAG() and MERGE are the three most powerful operators I use in SQL that are out of the average Joe's knowledge. They are simple to learn yet extremely useful.
Also, after 25 years in the IT business I still can't write a proper PIVOT clause without googling it first. Shame on me.
(btw I'm a MSSQL nerd. Been in Oracle/ODI world for a couple of years but it was too scary)
The only thing I'll disagree with is "SQL is permanent". Nothing is permanent. SQL will live for a long time, but eventually it will be displaced by something else once something displaces RDBMS as the defacto data storage standard of the world and chooses some other interface. That being said, learn SQL, it really is a very very useful thing to know.
Can't agree more. SQL is the one skill that I continue to use across a 15-year span while many others have come and gone - SAS, SPSS, Clementine and name-your-latest-plug-and-play-analytics platform.
It is a skill with the highest bang for buck and is a great way to introduce non-programmers to a basic querying language. It is intuitive, easy-to-read and unbelievably versatile.
As an app developer & architect I'm continually surprised by how often I bump into other techs on projects where BIG EFFORT is being put into all kinds of crazy ORM shenanigans, or subsystem development, or integration because people didn't understand how the underlying database platform could be managed through SQL or how the underlying data could be manipulated through SQL.
Developing an data-driven app without understanding how your data-management solution works is like using just one rollerskate. Sure, you can get places, but...
I believe, that the way data is structured in Prolog is easier and more flexible than in SQL. Sadly I was busy with other things, so I can't tell for sure how it looks for bigger projects and I haven't tried Datalog yet. But I wouldn't consider SQL as a holy grail, after I run into various issues when dealing with not so relational data.
It still surprises me, although it shouldn't, that many developers try to 'hide' SQL through (in mi opinion, convoluted) ORMs. Although I see the apparent beauty in the simplicity they provide to front-end developers, to me, additional layers will always add complications down the line.
My two cents (I haven't read the blog yet so there is that): everybody thinks it's important but when you mention table inheritance or materialized views then the loudest (how to call them not to hurt their feelings...maybe proponents(?)) are the most clueless of all the folks.
I know SQL. But I tend to download data from SQL databases and then use Python and Pandas to deal with it, even though most things could be accomplished in SQL more efficiently.
Maybe Python is more convenient, cognitively? Pandas and SQL are very similar in concept, though.
Understanding relational model is a valuable skill. SQL is useful but it's just the querying language. Understanding the relational model let you use SQL correctly.
Communication. Particularly the ability to cross the boundary between technical talk and non-technical talk.
People like other people who are easy to talk with. Also you will be more likely to build what is actually needed, rather than what someone thought they needed.
One of the most useful books I've ever read was "Thanks for the Feedback" [1]
I've read a lot of the more standard books like "How to Win Friends and Influence People", etc. I feel like this book is similarly valuable, but a lot less well known.
Crucial Conversations was recommended to me on HN, it's also super-useful - on how to talk about difficult/important things, whether in relationships or organizations or everyday life, where most people would be afraid and say nothing. How to speak difficult truths and ensure everyone feels good about it. Not many people would be advanced enough to not get a lot from that book, I think.
HTML is the quickest way to share something permanently. And its client is the most widely available (the browser).
Many people seem to use something like Dropbox or Google Docs now, but my opinion is that HTML still has many benefits, once you know it. I rsync my HTML to a shared hosting account.
In addition to bash, I'd suggest learning more about Unix in general.
I saw this page a few weeks ago. I haven't gone through this particular course, and I don't necessarily endorse it, but I'll just say that the subject matter could have been largely the same 20 years ago. Unix has proven to be an effective way of working and collaborating for a long time.
The need for that skill isn't going away anytime soon. You could argue that a large portion of jobs are munging Excel spreadsheets, and data frames are more of less the programmer's way of accomplishing the same tasks.
You could be a snob and "delegate" those tasks to others, but IME if you really want to get something done that other people can't, you have to roll up your sleeves a bit. (Just like this article suggests, e.g. SQL is useful for both programmers and product managers.)
Yeah, I use AWK constantly, not just for reading files line by line, but whenever I need to e.g. create data in some format, like CSV or XML or for my programs, or to write repetitive code in a program or LaTeX file (e.g. for a lot of images) etc. It's just so quick and easy, such a versatile tool, a pleasure to use. I learnt sed and grep too but rarely use them.
I only got into AWK much later than grep/sed but I was glad to discover some of awks power. Sed is still and will always my first love, but more and more I reach for awk to learn it as there is still so much I don't know.
I'd add that one should also learn LL and (G(LA))LR grammars. Regex are handy for quick hacks, but anything non-regular becomes hacky and overcomplicated compared with a self-describing grammar. PCREs are over-engineered.
There seems to be an equivalent of the "NoSQL" crowd in parsing which appears to be "NoLALR". I can certainly see that other kinds of parsers have their use cases - but LL and LALR parsing have sound theory which is well understood, and they are certainly not obsolete.
This is just an industry getting bigger, more people gain more experience and start to question all the dogmas. LR was taught dogmatically, but was never realistically any good or even useful in practice, hence "NoLALR" was born.
Having a solid working toolkit of SQL knowledge really will make you seem very useful on a team.
I do say "seem" though, in that if you weren't around it's not like the team would fail without you. It's not hard to learn SQL or Google around for a little bit to figure out what you're trying to accomplish.
Sort of like that one guy on your team who's a wizard at regular expressions. He seems like a life saver in the moment, but you'd be able to figure it out without him, too.
Speaking as not an SQL guru -- isn't this true for every skill? Theoretically anyone can figure anything out with enough time, the value is of specialists having better mental maps for what solutions apply, when you need them.
But you're implying then that all skills are equally hard to get a handle on, and that's just not true.
Working with sql is fairly simple even for those who aren't experts. But (as the article mentions) it's not always something a dev has done before or consistently so they'll often feel relieved if there's someone who's confident about it around.
>Working with sql is fairly simple even for those who aren't experts.
Its not though. There are windows, subqueries, joins , various functions, proprietary extensions, partitions, aggregations ect ect. Even if you get all the down, figuring out a performant version of the query is another hurdle, the whole 'simple' abstraction falls apart once you start tuning your query.
Sure, you can get away with some basic SQL knowledge in simple CRUD applications, but any kind of complex reporting and analytics software with a complex data set is going to take a bit more than simple SQL. It's a career specialization, there are people getting paid astronomical amounts of money for their knowledge of SQL alone, and it's their full time job. Some queries are hundreds of lines long, conjure up entirely new databases from the query itself, and trigger complex procedures across the entire application, all within SQL. It can be as complex as any application logic. Simple SQL is pretty easy to learn, but I wouldn't say you "have a handle" on it if all you've got is the basics.
Yes, but I'd say helping out and saving time is just being useful, not "seeming" useful. I suspect like any domain, SQL and DBs have a fractal of complexity, so reading is different from writing is different from architecting, and expertise is useful at each step.
Hmm... I'm not sure. I've seen people take hours or even days (in the case of more junior developers) trying to sort out these kind of issues. It's pretty helpful if you have (or are) someone who can waltz over and fix it in a couple of minutes.
Complex git problems are another one that can really throw people sometimes.
Oh it's definitely helpful! And it's something I should finally take the time to dedicate a weekend to shoring up my knowledge.
Just every time I have to deal with a sql database, my lack of knowledge (at least to this point) has never been a show stopper. It takes me a little more time than a real pro to get my statements figured out, but it's rarely so complex to require canvasing the shop for a true expert.
This is pretty much where I'm at. Though I'm on a mission to really nail fundamentals this year and SQL plus query tuning is the next stop for me. I've got 20 hours of courses related to SQL fundamentals lined up Pluralsight and another 30 odd hours SQL Server / Query Tuning.
The half-life of the knowledge is so long it's just a big boost for a really long time to become an order of magnitude better at it.
Normally if you need a guru it's because you are at a level of SQL complexity that can't be navigated without intuitive knowledge of SQL and it's various nuances.
well it might be a bit too late when you start googling if you didn't have a person on the team that understands your database well. I am always surprised to see that very few people care enough to learn about isolation levels for example.
> Learning SQL once will allow you to re-use it heavily across your career span without having to re-learn.
Like all good abstractions, SQL is the practical expression of a mathematical theory. In the case of SQL, you use Zermelo-Fränckel (ZF) set theory to reason about data sets.
While it is easy to come up with merely conjectural implementations for haphazardly doing things -- arbitrary trial and error, really -- it is notoriously hard to develop a systematic axiomatization such as ZF. Mathematics do not change and are hard to defeat. That is why math-based knowledge and skills are able to span an entire career without having to re-learn.
SQL got kicked off in all earnest when David Child's March 1968 paper, "Description of A Set-Theoretic Data Structure", explained that programmers can query data using set-theoretic expressions instead of navigating through fixed structures.
In August 1968, Childs published "Feasibility of a Set-Theoretic Data Structure. A General Structure Based on a Reconstituted Set-Theoretic Definition for Relations".
Data independence, by relying on set-theoretical constructs, was explicitly called out as one of the major objectives of the relational model by Ted Codd in 1970 in his famous paper "A Relational Model of Data for Large Shared Data Banks" (Communications of the ACM 13, No. 6, June 1970).
Using Codd's work, Donald Chamberlin and Raymond Boyce at IBM, developed in the early 1970s what would later turn out to become SQL.
Relational algebra is a much closer foundation for SQL than ZFC (or plain ZF, because for finite sets the difference doesn't matter). You can say SQL has ZFC as its foundation but that's not very useful. In the same way you can say ZFC is the foundation of everything in mathematics. SQL doesn't even use anything beyond naïve set theory because we are dealing with finite sets, let alone the full power of ZF.
> In the same way you can say ZFC is the foundation of everything in mathematics.
Yes, technically, that is indeed true of every Turing-complete axiomatic theory.
You could conceivably even use the SKI combinator calculus as the foundation of everything in mathematics. Still, something tells me that a database query language in widespread use, based on the SKI combinator calculus is rather unlikely to emerge (certainly not at this point).
Hence, since ZFC is one of the many alternative, equivalent foundations of everything in mathematics, picking ZFC still amounts to making a choice, no matter how small ...
> SQL doesn't even use anything beyond naïve set theory ...
Agreed. However, we also have to consider what situation we came from. David Childs objected to the use of custom, application-specific code, data types and functions, because these custom structures could (more easily) be represented by one, general abstraction. Given the fact that back then, people were not even using naive set theory, his considerations were clearly a step forward in the right direction.
Just to say there is "Project:M36 Relational Algebra Engine" [1], which says about itself:
Introduction
Project:M36 implements a relational algebra engine as inspired by the writings of Chris Date.
Description
Unlike most database management systems (DBMS), Project:M36 is opinionated software which
adheres strictly to the mathematics of the relational algebra. The purpose of this adherence is
to prove that software which implements mathematically-sound design principles reaps benefits
in the form of code clarity, consistency, performance, and future-proofing.
Project:M36 can be used as an in-process or remote DBMS.
You can get to the opposite conclusion with some perspective. Just two decades ago SQL was pretty much a mandatory skill for new programmers. Everyone was learning it, it was in every programming job requirement in any way related to storing some data, etc. Basically databases and SQL were synonymous. But today a Citus guy writes a post attempting to convince people to learn some SQL (Citus is an SQL database company). And last time I used SQL was when Chromium dropped my cookies after an update and I was too lazy to relogin to all the websites (it stores them in an SQLite database). I guess things are not looking so promising for SQL.
Maybe we'll get to a better query language eventually. Distributed systems have some fundamental incompatibilities with SQL and theoretically could standardize on a different language, one that is not just about the data anymore, but that also lets users express and choose various performance and consistency trade-offs. I mean a query shouldn't, for example, invoke a transaction and consensus algorithm when all you need is to increment a view counter or give a one star rating or insert a log entry, those things can be propagated eventually without destroying latency and throughput and without complicating anything with transactional semantics.
SQL is just fine. If incrementing a view counter, giving a one star review or inserting a log entry is important to you, you absolutely want that wrapped in a transaction. If you don't think so now, something will force you to think so at some point in the future. If you wait that long, it's probably too late.
The direction this line of thinking trends in is not thinking or caring about what data you're storing or caring about how it scales. If you're successful enough, this becomes an enormous problem. It literally is a problem that sinks otherwise successful companies.
Not thinking about your data model, not caring about the interactions of your data and not caring about the reliability of your data is a very expensive problem. It's also one where you don't know better until you do. Take some advice from the rest of us, please. We're saying this for your benefit, so that future you doesn't follow in our footsteps.
> If incrementing a view counter, giving a one star review or inserting a log entry is important to you, you absolutely want that wrapped in a transaction.
You absolutely don't. You want it as CRDT operations, not transactions.
That still depends. You're still making a choice of priorities. You're making a choice that you're okay losing some data when failures happen.
For the mentioned use cases CRDT is probably an okay choice, but not always. In some use cases/industries, losing log lines or having inaccurate counts is absolutely not okay. There can even be regulatory concerns here. And fines!
It's also possible to scale traditional databases out pretty far. It's difficult and expensive, but in probably 98% of cases you can do it. If your company has to, it should be able to afford it anyway.
I run a early stage company that builds analytics infrastructure for companies. We are betting very heavily on SQL, and Craigs post rings true now more than ever.
Increasingly, more SQL is written in companies by analysts and data scientists than typical software engineers.
The advent of the MMP data warehouse (redshift, bigquery, snowflake, etc) has given companies with even the most limited budget the ability to warehouse and query an enormous amount of data just using SQL. SQL is more powerful and valuable today than it ever has been.
When you look into a typical organization, most software engineers aren't very good at SQL. Why should they be? Most complex queries are analytics queries. ORMs can handle a majority of the basic functions application code needs to handle.
Perhaps going against Craig's point is the simple fact that we've abstracted SQL away from a lot of engineers across the backend and certainly frontend and mobile. You can be a great developer and not know a lot about SQL.
On the other end of the spectrum are the influx of "data engineers" with basic to intermediate knowledge of HDFS, streaming data or various other NoSQL technologies. They often know less about raw SQL than even junior engineers because SQL is below their high-power big data tools.
But if you really understand SQL, and it seems few people truly today, you command an immense amount of power. Probably more than ever.