Author here. I was thinking in terms of an entire query being a function that re...

neilc · on April 22, 2009

How is your example SQL "not composable"?

Notice that this solution is composed of the first two queries, however composing in SQL kills your performance, and this is generally not the right way to structure the query.

It doesn't actually "kill" performance at all, as a typical SQL optimizer will rewrite the query to remove the subquery, or else to push the outer predicate down into the subquery (which for this query accomplishes basically the same thing).

I was thinking in terms of an entire query being a function that returns data.

A query returns a relation, and that relation can be used as the input relation to another query, either via a subquery or via views. That seems like pretty powerful composability to me.

davidmathers · on April 23, 2009

A query returns a relation

Actually it doesn't. That's one of the problems with SQL and part of the reason SQL databases aren't truly relational databases.

neilc · on April 23, 2009

Actually it doesn't.

Can you elaborate on what you mean? The presence of duplicates is orthogonal, if that's what you're referring to.

The point is that the output of a query can be consumed by another query (either via views or subqueries), which enables query composition and logical data independence.

SQL databases aren't truly relational databases.

People who get hot and bothered about that are almost invariably cranks, in my experience.

davidmathers · on April 23, 2009

People who get hot and bothered about that are almost invariably cranks, in my experience.

Honestly I think you, being a database expert, don't fully appreciate how bad the situation really is. Relational databases are almost universally not understood by programmers. Why? I think it might have something to do with the fact that programmers can't actually use the relational database as a relational database.

They have to programmatically construct a string (seriously, wtf?) to do some relational processing "over there" in the scary database and then serialize the result and import it into a data structure in their language--the complete destruction of the relational idea.

Using a relational database should be similar to using jQuery in javascript programming. The anti-"relational database" backlash is huge right now. And since most of the people involved don't understand what a relational database is they don't understand that it isn't relational databases that they have a problem with.

Have you seen this new thing called Redis? It's "like a key-value store except values can be sets." Wait, it's a "structure server." Pretty soon someone is going to have the brilliant idea of letting the keys be sets also.

I predict that within 3 years the anti-relational people are going to invent the relational database, without realizing it, and be amazed at how awesome it is. Because for the first time in their lives they'll be able to actually use an actual relational database.

That's my database rant.

P.S. As for the SQL != Relational people like Chris Date and Hugh Darwen I don't see how they could be considered cranks. Bitter old men? Maybe. But cranks? No.

silentbicycle · on April 23, 2009

Good point, about redis.

I think part of it, also, is that a lot of peoples' exposure to relational databases is via MySQL and PHP, which is ... not representative.

Most people don't understand why the N-normal forms are significant, etc., because they learned SQL from a simple "how to save and load rows" kind of tutorial, rather than learning it from the big relational ideas outward. (I'm a "learn how it works as a whole" guy, but I recognize that most people aren't.) If you don't know why they matter, it just seems like a bunch of bizarre busywork, and then you get lousy performance.

davidmathers · on April 23, 2009

The presence of duplicates is orthogonal, if that's what you're referring to.

Yes. How is it orthogonal?

noblethrasher · on April 23, 2009

Because a relation is a function that maps tuples to logical values (true, false, null, etc). We just happen to define these function as table lookups so duplicate rows are indeed not a problem. Thinking of tables as sets instead of as functions causes the confusion.

davidmathers · on April 23, 2009

Dude, noblethrasher, what you just said does not make sense. I have no idea what it means or how to parse it.

noblethrasher · on April 23, 2009

My bad.

Suppose I want to define a function that squares integers. Well, I can either define it as a formula:

Square i => i * i

or I can list all of the integers in a table

1 1

2 4

3 9

4 16

etc...

Both of those are perfectly valid definitions for the Square function. The signature of the function is Int -> Int. If a row is listed more than once (e.g. the row 3 9) it still doesn't affect the definition of my function. 3 still maps to 9.

Now suppose I want to define a relation (remember, a relation is a function) of employees that work in my company. In a typical RDMS I might create an Employee "table" with the following schema:

emp_id, first_name, last_name.

The signature of this function is Int -> String -> String -> Bool

It just so happens that people (tuples) explicitly listed in the table map to true and anyone else maps to false.

Conceptually it's the same as if I had a function called Employees where I could do something like this:

  Employees(12, 'bob', 'morton')

which would return true or false depending on whether or not Mr. Morton is an employee in the company. The difference is that in SQL, what is returned is not the result of the function but the definition of the function itself which just happens to look like... a table.

davidmathers · on April 23, 2009

remember, a relation is a function

That's backwards. In a Venn diagram relations are the big circle and functions are the smaller circle inside it. A function is a deterministic relation.

If you mean to say a table is a function, then yes. All relations, including functions, can be represented using a table.

If a row is listed more than once (e.g. the row 3 9) it still doesn't affect the definition of my function. 3 still maps to 9.

This is the issue really. I think what you're saying is:

  {1,2,3,3} == {1,2,3}

That's true mathematically, but that's not what's happening in SQL. Try creating the table to represent the square function. Now try inserting (3,9) into it twice. SQL won't let you because it forces tables to represent relations. Not queries though.

In my example {1,2,3} is the direct representation of the value of the set and {1,2,3,3} is a different representation. The tables and queries in SQL are direct representations. If a SQL query looks like |1|2|3|3| then what you actually have, mathematically, is {{1,2,3},{3}}, aka multi-set or bag.

  {{1,2,3},{3}} != {1,2,3}

That's what's happening when a row is listed more than once. It does affect the definition of your function.

noblethrasher · on April 24, 2009

> That's backwards. In a Venn diagram relations are the big circle and functions are the smaller circle inside it. A function is a deterministic relation.

That's partially right. In a Venn diagram, whenever you see a little circle S contained in a big circle T what you have is a relation that says S is a subset of T. Specifically this is a relation that maps a set of sets to {TRUE, FALSE}. Fundamentally we have sets and mappings[1] between sets. A relation is a mapping between some set and a set of logical values. So relations are mappings but not all mappings are relations. In the case of SQL a relation is a well-defined mapping so we call it a 'function'.

> If a SQL query looks like |1|2|3|3| then what you actually have, mathematically, is {{1,2,3},{3}}, aka multi-set or bag.

No sir. What you have, formally, is a programmatically generated relation/function (under a closed-world assumption) in which one of the entries in the mapping has been overly specified. Your relation maps the set {1,2,3, 4 ...} to the set {TRUE, FALSE}. If you pass it the value 1 or 2 or 3 you get TRUE. If you pass it the value 4 you get FALSE. The domain of this function is {1,2,3, 4, ...} and the codomain (or range) is {TRUE, FALSE}. Having an entry listed twice in the definition is similar to the following:

  f(x) = 0 when x = 0 and x * x if x is real.

That function has been overly specified. Now if we had a table with entries like this:

  1 Fred Jones

  2 James Smith

  3 Jane Doe

  4 Alice Baker

  3 George Washington

then we have a problem since the 'function' is now ambiguous (and hence not a function).

Now, back to my employee table example. If I have a table called Employees defined as:

EMP_ID (int), FIRST_NAME (string), LAST_NAME (string)

where EMP_ID is the primary key then what I have is a composition of functions (a relation in SQL is always a function)

One function, we'll call it G maps full names to Integers and has this 'signature': S x S -> Z where S is the set of strings and Z is the set of integers.

The other function, we'll call it H maps Integers to {TRUE, FALSE} so my table is actually equivalent to the function H of G.

Now I realize that this is all abstract and a bit pedantic but it is nonetheless what SQL and relational algebra are all about even if we don't ordinarily think of it like this.

[1] Formally, mappings, functions, transformations and operations are all the same things just used in specific contexts.

silentbicycle · on April 23, 2009

I think what he's saying is that a relation can be seen as a function that returns whether or not something is present, rather than a set of the results. I think.

pradocchia · on April 23, 2009

Yeah, but any sane database hacker will make sure each SELECT statement returns a relation. It's too hard to reason about otherwise.

That said, SQL sucks.

pradocchia · on April 23, 2009

I was confused by this section too. You state:

> Notice that this solution is composed of the first two queries, however composing in SQL kills your performance, and this is generally not the right way to structure the query.

That sounds like an optimizer issue, not a problem with SQL per se. I often structure my queries that way--it's only sane.

What platform are you using?

DougBTX · on April 22, 2009

The SQL does have much the same syntax, IMHO:

    SELECT * from X where P(x)
    intersect
    SELECT * from X where Q(x)
    ==
    SELECT * from X where P(x) and Q(x)

I did enjoy the rant though. It seems like ORMs like Hibernate are moving to have as much complexity as the database they map...

smhinsey · on April 23, 2009

> It seems like ORMs like Hibernate are moving to have as much complexity as the database they map.

this seems to be a result of the desire to support the increasingly sophisticated concepts provided by the underlying databases, creating an environment in which the ORM's complexity is a function of the complexity of the databases it supports. i'm not sure this is such a bad thing as long as the additions are thoughtfully made.

i do wonder how long it will be until we start seeing the next generation of tools that make hibernate and spring look as crusty as j2ee did when they came out.

moe · on April 23, 2009

next generation of tools that make hibernate and spring look as crusty

I might be biased but spring and hibernate have looked crusty to me from the get-go. Talk about solutions looking for a problem - or rather trading in one problem for two problems each.

yummyfajitas · on April 23, 2009

My point is that the queries aren't composable, but the predicates are. Think of predicates as basic operations, and "SELECT * FROM X WHERE predicate" as a functor.

Then "composing" Select(P) with Select(Q) is identical to Select(P "composed with" Q).

Composition works, you just need to think of it the right way.