Notice that this solution is composed of the first two queries, however composing in SQL kills your performance, and this is generally not the right way to structure the query.
It doesn't actually "kill" performance at all, as a typical SQL optimizer will rewrite the query to remove the subquery, or else to push the outer predicate down into the subquery (which for this query accomplishes basically the same thing).
I was thinking in terms of an entire query being a function that returns data.
A query returns a relation, and that relation can be used as the input relation to another query, either via a subquery or via views. That seems like pretty powerful composability to me.
Can you elaborate on what you mean? The presence of duplicates is orthogonal, if that's what you're referring to.
The point is that the output of a query can be consumed by another query (either via views or subqueries), which enables query composition and logical data independence.
SQL databases aren't truly relational databases.
People who get hot and bothered about that are almost invariably cranks, in my experience.
People who get hot and bothered about that are almost invariably cranks, in my experience.
Honestly I think you, being a database expert, don't fully appreciate how bad the situation really is. Relational databases are almost universally not understood by programmers. Why? I think it might have something to do with the fact that programmers can't actually use the relational database as a relational database.
They have to programmatically construct a string (seriously, wtf?) to do some relational processing "over there" in the scary database and then serialize the result and import it into a data structure in their language--the complete destruction of the relational idea.
Using a relational database should be similar to using jQuery in javascript programming. The anti-"relational database" backlash is huge right now. And since most of the people involved don't understand what a relational database is they don't understand that it isn't relational databases that they have a problem with.
Have you seen this new thing called Redis? It's "like a key-value store except values can be sets." Wait, it's a "structure server." Pretty soon someone is going to have the brilliant idea of letting the keys be sets also.
I predict that within 3 years the anti-relational people are going to invent the relational database, without realizing it, and be amazed at how awesome it is. Because for the first time in their lives they'll be able to actually use an actual relational database.
That's my database rant.
P.S. As for the SQL != Relational people like Chris Date and Hugh Darwen I don't see how they could be considered cranks. Bitter old men? Maybe. But cranks? No.
I think part of it, also, is that a lot of peoples' exposure to relational databases is via MySQL and PHP, which is ... not representative.
Most people don't understand why the N-normal forms are significant, etc., because they learned SQL from a simple "how to save and load rows" kind of tutorial, rather than learning it from the big relational ideas outward. (I'm a "learn how it works as a whole" guy, but I recognize that most people aren't.) If you don't know why they matter, it just seems like a bunch of bizarre busywork, and then you get lousy performance.
Because a relation is a function that maps tuples to logical values (true, false, null, etc). We just happen to define these function as table lookups so duplicate rows are indeed not a problem. Thinking of tables as sets instead of as functions causes the confusion.
Suppose I want to define a function that squares integers. Well, I can either define it as a formula:
Square i => i * i
or I can list all of the integers in a table
1 1
2 4
3 9
4 16
etc...
Both of those are perfectly valid definitions for the Square function. The signature of the function is Int -> Int. If a row is listed more than once (e.g. the row 3 9) it still doesn't affect the definition of my function. 3 still maps to 9.
Now suppose I want to define a relation (remember, a relation is a function) of employees that work in my company. In a typical RDMS I might create an Employee "table" with the following schema:
emp_id, first_name, last_name.
The signature of this function is Int -> String -> String -> Bool
It just so happens that people (tuples) explicitly listed in the table map to true and anyone else maps to false.
Conceptually it's the same as if I had a function called Employees where I could do something like this:
Employees(12, 'bob', 'morton')
which would return true or false depending on whether or not Mr. Morton is an employee in the company. The difference is that in SQL, what is returned is not the result of the function but the definition of the function itself which just happens to look like... a table.
That's backwards. In a Venn diagram relations are the big circle and functions are the smaller circle inside it. A function is a deterministic relation.
If you mean to say a table is a function, then yes. All relations, including functions, can be represented using a table.
If a row is listed more than once (e.g. the row 3 9) it still doesn't affect the definition of my function. 3 still maps to 9.
This is the issue really. I think what you're saying is:
{1,2,3,3} == {1,2,3}
That's true mathematically, but that's not what's happening in SQL. Try creating the table to represent the square function. Now try inserting (3,9) into it twice. SQL won't let you because it forces tables to represent relations. Not queries though.
In my example {1,2,3} is the direct representation of the value of the set and {1,2,3,3} is a different representation. The tables and queries in SQL are direct representations. If a SQL query looks like |1|2|3|3| then what you actually have, mathematically, is {{1,2,3},{3}}, aka multi-set or bag.
{{1,2,3},{3}} != {1,2,3}
That's what's happening when a row is listed more than once. It does affect the definition of your function.
> That's backwards. In a Venn diagram relations are the big circle and functions are the smaller circle inside it. A function is a deterministic relation.
That's partially right. In a Venn diagram, whenever you see a little circle S contained in a big circle T what you have is a relation that says S is a subset of T. Specifically this is a relation that maps a set of sets to {TRUE, FALSE}. Fundamentally we have sets and mappings[1] between sets. A relation is a mapping between some set and a set of logical values. So relations are mappings but not all mappings are relations. In the case of SQL a relation is a well-defined mapping so we call it a 'function'.
> If a SQL query looks like |1|2|3|3| then what you actually have, mathematically, is {{1,2,3},{3}}, aka multi-set or bag.
No sir. What you have, formally, is a programmatically generated relation/function (under a closed-world assumption) in which one of the entries in the mapping has been overly specified. Your relation maps the set {1,2,3, 4 ...} to the set {TRUE, FALSE}. If you pass it the value 1 or 2 or 3 you get TRUE. If you pass it the value 4 you get FALSE. The domain of this function is {1,2,3, 4, ...} and the codomain (or range) is {TRUE, FALSE}. Having an entry listed twice in the definition is similar to the following:
f(x) = 0 when x = 0 and x * x if x is real.
That function has been overly specified. Now if we had a table with entries like this:
1 Fred Jones
2 James Smith
3 Jane Doe
4 Alice Baker
3 George Washington
then we have a problem since the 'function' is now ambiguous (and hence not a function).
Now, back to my employee table example. If I have a table called Employees defined as:
where EMP_ID is the primary key then what I have is a composition of functions (a relation in SQL is always a function)
One function, we'll call it G maps full names to Integers and has this 'signature': S x S -> Z where S is the set of strings and Z is the set of integers.
The other function, we'll call it H maps Integers to {TRUE, FALSE} so my table is actually equivalent to the function H of G.
Now I realize that this is all abstract and a bit pedantic but it is nonetheless what SQL and relational algebra are all about even if we don't ordinarily think of it like this.
[1] Formally, mappings, functions, transformations and operations are all the same things just used in specific contexts.
I think what he's saying is that a relation can be seen as a function that returns whether or not something is present, rather than a set of the results. I think.
Notice that this solution is composed of the first two queries, however composing in SQL kills your performance, and this is generally not the right way to structure the query.
It doesn't actually "kill" performance at all, as a typical SQL optimizer will rewrite the query to remove the subquery, or else to push the outer predicate down into the subquery (which for this query accomplishes basically the same thing).
I was thinking in terms of an entire query being a function that returns data.
A query returns a relation, and that relation can be used as the input relation to another query, either via a subquery or via views. That seems like pretty powerful composability to me.