Hacker News new | past | comments | ask | show | jobs | submit login
The Database Rant (arohner.blogspot.com)
44 points by arohner on April 22, 2009 | hide | past | favorite | 48 comments



    SQL was invented in the 70s for business analysts to write reports without having to go bug the programmers.
So what? Unix was invented in the 70s because Ken Thompson wanted to play Space Travel. [1]

    Queries are not composable.
Per your example, you want to throw out SQL because using AND and OR doesn't feel right. I'd love to see you compose a 10-line SQL query. Also, views ARE query composition.

    PL/SQL is primitive by today's standards, yet it's the only game in town for certain classes of features, like stored procedures and triggers. 
Untrue. Oracle, the creators of PL/SQL, now let you write stored procedures in any JVM or .NET language (so Jython, Clojure, etc.). MS SQL supports .NET stored procedures. PostgreSQL supports Tcl, Perl and Python. Also, your alternative suggestion of a simple and stupid data store DOES involve fetching all the data from the database into RAM and operating on it in your application. You can already do that with SQL.

    It's not obvious which queries are slow.
Untrue. MySQL can log slow queries and help you analyze them in real time using tools like mytop. I assume similar tools are available for all databases. When you write code it's not always obvious if it will be slow, either.

    No back door for map, filter, reduce
Isn't the WHERE clause filter? Isn't GROUP BY very similar to reduce? SELECT UPPER(firstname) is map? With the ability to define custom functions in different languages that's a fairly powerful tool. Also, comparisons between MapReduce and SQL are misguided, because MapReduce does not operate on tabular data.

Finally, your solution is based entirely on the idea that people will prefer Lisp to SQL.

SQL is not perfect for everything, but if you want to suggest a better alternative, start with a solid argument.

[1] http://en.wikipedia.org/wiki/Unix#History


if you want to suggest a better alternative

How about an RDBMS that exposes a relational algebra api and then people can experiment with developing alternative languages/interfaces instead of having a single designed by committee string-eval based atrocity.


What percent of the top N (your pick of N and your choice of big-ness metric) websites do you suppose are using predominantly string-eval'd SQL? I don't think it's very many; I'd wager that most big shops are quite happily using SQL, using stored procedures, views, and parameterized queries far more often than using "jam SQL into an nvarchar and execute the resulting string".

SQL may have its warts, but in the world of "I gotta get things done and not sweat data persistence", I think it's pretty wildly, successfully, and fairly universally used, almost like the cockroach of the data center.

When I look out and think about what's going to kill our ability to take orders, scale to new levels of growth, or to have development be efficient at developing new features, SQL the language doesn't make my top 20, and probably doesn't even rise to the level of "all other receiving votes". MS-SQL the database engine probably does make my top 10, but that's only because we're continually chopping away at the items on that list. It does take us a while to train brand-new devs on how to write effective SQL, but it takes us time to train them on how to write C#, how to write good test cases, how to structure code for testability and write unit tests, etc. IOW, it's just one of a laundry list of needs, and a proposed alternative interface would also be something on that list to teach them.


> How about an RDBMS that exposes a relational algebra api

MonetDB does this, though rather low-level:

http://monetdb.cwi.nl/projects/monetdb/MonetDB/Documentation...


I should have been more clear in the article. There are two different ways SQL is not composable.

1) predicates can be composed, but only by resorting to string manipulation (i.e. str-join([a,b,c], " AND "), which sucks.

2) Views and subselects CAN compose, but only a few levels. Every DB I've ever seen has huge performance problems with just a few levels of nesting. Imagine your programming language only allowing 3-4 stack frames before it just dies. That's not acceptable in your programming language, so why is it acceptable in your database?

    It's not obvious which queries are slow
Any decently experienced programmer can determine the correct big-O complexity just by reading the code. With databases, you have to read the query, and understand what optimizations the current version of your DB is capable of making.

    No back door for map, filter, reduce
The reason I want that feature is for the case where the DB pukes on your query. What happens when your database is not capable of generating an efficient query for the data you want? Pulling everything into RAM isn't an acceptable solution either. When the DB pukes, you should be able to say "fine, let me give you the exact plan". Then I expect the DB to go off and execute that plan, and give me a stream of data using limit, offset, etc.

The DB allows you to define custom predicates. It doesn't solve the problems of needing to do string manipulation to piece together a query, and it doesn't solve the performance problems of nesting views and subselects.

I never compared SQL to MapReduce. MapReduce is a project created by Google. map, reduce are functions for working on sequences of data created decades before Google was founded.


What happens when your database is not capable of generating an efficient query for the data you want?

Then you use hints or a stored outline/plan stability. This feature has been in Oracle since version 8 over a decade ago.

Clue: MySQL is a looonng way behind the state of the art, don't base your opinions of RDBMS technology on it.


Off topic, I wonder if it's a subjective bias among hardware producers against large amounts of RAM. Cost-wise, I don't think there is a problem to have multi-tera servers or desktops with tens or hundreds of gigs. And yet, you can't buy them. Even though as a software model it makes a lot more sense to keep all the data in ram and only ship the changes to an external memory.


Well, you can buy dedicated devices with lots of RAM (look up violin memory). But, cost-wise, they are quite expensive.


I'm a bit annoyed they don't offer prices up-front, but a very nice product. I wonder if you can use a violin device as primary memory or you can only mount it as a fs.


I think they are mostly used as external storage, sometimes backed by spindles as a very fast SAN.

If you need huge amounts of local RAM then I think you're in the realm of the IBM zSeries and other mainframe devices. Nice stuff indeed, when you have the spare change...


Please quote with '>', not by indenting.


My convention is to quote the submitted article with quotation marks, like this:

"SQL was invented in the 70s for business analysts to write reports without having to go bug the programmers."

And then I quote the parent comment with the HN convention of preceding and following the quoted text with asterisks to italicize it, like this:

Please quote with '>', not by indenting.

But, yeah, definitely not by indenting.


Yeah, I didn't mean to really enforce '>' so much as to discourage the indenting (which is causing annoying horizontal scrolling on my side of things).


Current databases don't give the developer a way to bypass the optimizer.

Sure they do. Oracle and other commercial DBs provide "planner hints", which essentially force the optimizer to use a particular access method / join method for a given part of the plan.

why does SQL decide that declarative style is the One True Way to get access to your data

Well, part of the reason for a query optimizer is that if the physical representation of the data changes, the queries don't need to change. So that if you add or remove an index, you don't need to adjust all the queries on the table, for example. Obviously the query planner is imperfect, but planner hints are fairly common in practice. The further you stray from declarative queries, the more application changes are needed when physical database properties change.

Datatypes don't match your PL

If you want a strongly-typed database system that can be accessed from any language, you need the DB's type system to be distinct from any of the client language's type systems. This isn't always the right tradeoff, of course.


part of the reason for a query optimizer is that if the physical representation of the data changes, the queries don't need to change.

And that's a great reason to use the declarative query as the default. What I'm complaining about is when you need to eek out the last bit of performance out of a query, there should be another option. Sort of like dropping into assembly to optimize a tight loop. Of course you don't want to write the entire program in assembly.

you need the DB's type system to be distinct from any of the client language's type systems.

I don't agree. Just because your DB's type system uses say, Java, doesn't mean that other languages can't access the DB. Non-Java languages would access the DB the same way they do today, but at least the Java developers wouldn't have to convert.


I was all prepared to rant about how wrong he was, but he convinced me.

I've had all kinds of problems with PostgreSQL because its planner would choose a poor query plan in certain situations. Specifically, we don't have a good way of estimating random page fetches for index scans on multi-column indexes.

I also had this weird problem suddenly occur. When one of my tables got to a certain size (about 40 million rows), Postgres started giving a completely different query plan that it thought was faster. The query execution time went from a few minutes to many hours. It took me all day to debug this problem because you simply don't expect the query plan to suddenly change like that.

To tell Postgres that you want the query to run in a different way, you must turn off certain things, like disabling sequential table scans or bitmap scans. You can't tell it specifically how you want it to execute. The reason is pretty simple: SQL is a declarative language. You declare want results you want, and the server figures out how it'll get you there.

I think the author is right. We need a simple language where we can tell the server exactly how we want to get the result. This is essential, because, like the author said, finding the optimal query plan is NP-hard. In practice, most programs run only a few queries. It'd be worth the time invested to try out a few query plans and find out which one is the best and stick with that over having the SQL server replan the query almost every time you submit it.

There could even be third party planners where it could take SQL input and return the resulting executable s-expr if you want to retain that functionality. This way, the development of planners remains separate from the development of the database.


Any attempt to throw away SQL is misguided. If you are going to do server side development, you must know SQL. Thinking in sets is hard work, but you must learn it. For the most part fancy ORM's just get in the way. The solution is not less SQL; it's more SQL. Stored procedures are your friends. When I write a complex query, I develop and test it entirely in SQL, before I write a line of procedural code. It's way easier.

The thing about query optimization is that the optimizer stops when it thinks going on will take longer to perfect it than executing what it has. If you have a query that takes too long look at the query plan. Break up the query into smaller queries and use temporary tables. Give it hints.

Dates are often poorly handle in language libraries. Store them as UTC in the database and pound them into shape in code.

Apologies for the counter rant.


Any attempt to throw away SQL is misguided.

Maybe not throw away, but why should we have only 1 relational language? Functional programmers can choose between scheme, ocaml and haskell. Why should we be stuck with what is in so many ways the COBOL of relational languages as our only option?


What I meant was that, as far as I have seen, the tools that try to replace SQL by some OO intermediary, give you poorer abstractions that limit what you can do. SQL seems clunky, but I think the core is quite well done considering when it comes from.

OTOH a relational relational database doesn't fit all needs. They scale pretty well, but when you get to gigantic data sets, they don't do so well and things like map reduce do a better job. Or cases where you are streaming massive amounts of data and you need performance, not ACID transaction reliability. Actually, we are more in agreement than disagreement.


They scale pretty well, but when you get to gigantic data sets, they don't do so well and things like map reduce do a better job.

I don't think the evidence supports that. See the recent MR vs. parallel DB comparison, for example: http://database.cs.brown.edu/sigmod09/

There isn't a good massively parallel open source SQL implementation right now, but that is just a matter of engineering -- SQL databases have been shown to scale to massive data volumes (e.g. Fox Interactive have ~200 TB of user data in Greenplum, which is a parallel DB based on Postgres).


Well, replacing SQL with a better declarative, relational language is an option. I happen to really like the relational paradigm, but the syntax etc. of SQL has plenty of historical cruft.

(As a bonus, ditching MySQL would probably help shrug off a lot of inaccurate assumptions about real RBDMSs.)


His point about composability is not a complaint about what sql does. It's merely a complaint that he thinks that function composition should be used to combine predicates, but that isn't how sql works.

I.e., he wants "SELECT * from X where P(x)" to be a function acting in X. It isn't a function (or at least should not be thought of that way), it's a description of a set:

{x : x in X, P(x)}

The predicates ARE composable:

{x : x in X, P(x)} intersect {x : x in X, Q(x)} == {x : x in X, P(x) and Q(x)}


Author here. I was thinking in terms of an entire query being a function that returns data.

Also notice that your set notation demonstrates composability, but the equivalent SQL does not.


How is your example SQL "not composable"?

Notice that this solution is composed of the first two queries, however composing in SQL kills your performance, and this is generally not the right way to structure the query.

It doesn't actually "kill" performance at all, as a typical SQL optimizer will rewrite the query to remove the subquery, or else to push the outer predicate down into the subquery (which for this query accomplishes basically the same thing).

I was thinking in terms of an entire query being a function that returns data.

A query returns a relation, and that relation can be used as the input relation to another query, either via a subquery or via views. That seems like pretty powerful composability to me.


A query returns a relation

Actually it doesn't. That's one of the problems with SQL and part of the reason SQL databases aren't truly relational databases.


Actually it doesn't.

Can you elaborate on what you mean? The presence of duplicates is orthogonal, if that's what you're referring to.

The point is that the output of a query can be consumed by another query (either via views or subqueries), which enables query composition and logical data independence.

SQL databases aren't truly relational databases.

People who get hot and bothered about that are almost invariably cranks, in my experience.


People who get hot and bothered about that are almost invariably cranks, in my experience.

Honestly I think you, being a database expert, don't fully appreciate how bad the situation really is. Relational databases are almost universally not understood by programmers. Why? I think it might have something to do with the fact that programmers can't actually use the relational database as a relational database.

They have to programmatically construct a string (seriously, wtf?) to do some relational processing "over there" in the scary database and then serialize the result and import it into a data structure in their language--the complete destruction of the relational idea.

Using a relational database should be similar to using jQuery in javascript programming. The anti-"relational database" backlash is huge right now. And since most of the people involved don't understand what a relational database is they don't understand that it isn't relational databases that they have a problem with.

Have you seen this new thing called Redis? It's "like a key-value store except values can be sets." Wait, it's a "structure server." Pretty soon someone is going to have the brilliant idea of letting the keys be sets also.

I predict that within 3 years the anti-relational people are going to invent the relational database, without realizing it, and be amazed at how awesome it is. Because for the first time in their lives they'll be able to actually use an actual relational database.

That's my database rant.

P.S. As for the SQL != Relational people like Chris Date and Hugh Darwen I don't see how they could be considered cranks. Bitter old men? Maybe. But cranks? No.


Good point, about redis.

I think part of it, also, is that a lot of peoples' exposure to relational databases is via MySQL and PHP, which is ... not representative.

Most people don't understand why the N-normal forms are significant, etc., because they learned SQL from a simple "how to save and load rows" kind of tutorial, rather than learning it from the big relational ideas outward. (I'm a "learn how it works as a whole" guy, but I recognize that most people aren't.) If you don't know why they matter, it just seems like a bunch of bizarre busywork, and then you get lousy performance.


The presence of duplicates is orthogonal, if that's what you're referring to.

Yes. How is it orthogonal?


Because a relation is a function that maps tuples to logical values (true, false, null, etc). We just happen to define these function as table lookups so duplicate rows are indeed not a problem. Thinking of tables as sets instead of as functions causes the confusion.


Dude, noblethrasher, what you just said does not make sense. I have no idea what it means or how to parse it.


My bad.

Suppose I want to define a function that squares integers. Well, I can either define it as a formula:

Square i => i * i

or I can list all of the integers in a table

1 1

2 4

3 9

4 16

etc...

Both of those are perfectly valid definitions for the Square function. The signature of the function is Int -> Int. If a row is listed more than once (e.g. the row 3 9) it still doesn't affect the definition of my function. 3 still maps to 9.

Now suppose I want to define a relation (remember, a relation is a function) of employees that work in my company. In a typical RDMS I might create an Employee "table" with the following schema:

emp_id, first_name, last_name.

The signature of this function is Int -> String -> String -> Bool

It just so happens that people (tuples) explicitly listed in the table map to true and anyone else maps to false.

Conceptually it's the same as if I had a function called Employees where I could do something like this:

  Employees(12, 'bob', 'morton') 
which would return true or false depending on whether or not Mr. Morton is an employee in the company. The difference is that in SQL, what is returned is not the result of the function but the definition of the function itself which just happens to look like... a table.


remember, a relation is a function

That's backwards. In a Venn diagram relations are the big circle and functions are the smaller circle inside it. A function is a deterministic relation.

If you mean to say a table is a function, then yes. All relations, including functions, can be represented using a table.

If a row is listed more than once (e.g. the row 3 9) it still doesn't affect the definition of my function. 3 still maps to 9.

This is the issue really. I think what you're saying is:

  {1,2,3,3} == {1,2,3}
That's true mathematically, but that's not what's happening in SQL. Try creating the table to represent the square function. Now try inserting (3,9) into it twice. SQL won't let you because it forces tables to represent relations. Not queries though.

In my example {1,2,3} is the direct representation of the value of the set and {1,2,3,3} is a different representation. The tables and queries in SQL are direct representations. If a SQL query looks like |1|2|3|3| then what you actually have, mathematically, is {{1,2,3},{3}}, aka multi-set or bag.

  {{1,2,3},{3}} != {1,2,3}
That's what's happening when a row is listed more than once. It does affect the definition of your function.


> That's backwards. In a Venn diagram relations are the big circle and functions are the smaller circle inside it. A function is a deterministic relation.

That's partially right. In a Venn diagram, whenever you see a little circle S contained in a big circle T what you have is a relation that says S is a subset of T. Specifically this is a relation that maps a set of sets to {TRUE, FALSE}. Fundamentally we have sets and mappings[1] between sets. A relation is a mapping between some set and a set of logical values. So relations are mappings but not all mappings are relations. In the case of SQL a relation is a well-defined mapping so we call it a 'function'.

> If a SQL query looks like |1|2|3|3| then what you actually have, mathematically, is {{1,2,3},{3}}, aka multi-set or bag.

No sir. What you have, formally, is a programmatically generated relation/function (under a closed-world assumption) in which one of the entries in the mapping has been overly specified. Your relation maps the set {1,2,3, 4 ...} to the set {TRUE, FALSE}. If you pass it the value 1 or 2 or 3 you get TRUE. If you pass it the value 4 you get FALSE. The domain of this function is {1,2,3, 4, ...} and the codomain (or range) is {TRUE, FALSE}. Having an entry listed twice in the definition is similar to the following:

  f(x) = 0 when x = 0 and x * x if x is real.
That function has been overly specified. Now if we had a table with entries like this:

  1 Fred Jones

  2 James Smith

  3 Jane Doe

  4 Alice Baker

  3 George Washington
then we have a problem since the 'function' is now ambiguous (and hence not a function).

Now, back to my employee table example. If I have a table called Employees defined as:

EMP_ID (int), FIRST_NAME (string), LAST_NAME (string)

where EMP_ID is the primary key then what I have is a composition of functions (a relation in SQL is always a function)

One function, we'll call it G maps full names to Integers and has this 'signature': S x S -> Z where S is the set of strings and Z is the set of integers.

The other function, we'll call it H maps Integers to {TRUE, FALSE} so my table is actually equivalent to the function H of G.

Now I realize that this is all abstract and a bit pedantic but it is nonetheless what SQL and relational algebra are all about even if we don't ordinarily think of it like this.

[1] Formally, mappings, functions, transformations and operations are all the same things just used in specific contexts.


I think what he's saying is that a relation can be seen as a function that returns whether or not something is present, rather than a set of the results. I think.


Yeah, but any sane database hacker will make sure each SELECT statement returns a relation. It's too hard to reason about otherwise.

That said, SQL sucks.


I was confused by this section too. You state:

> Notice that this solution is composed of the first two queries, however composing in SQL kills your performance, and this is generally not the right way to structure the query.

That sounds like an optimizer issue, not a problem with SQL per se. I often structure my queries that way--it's only sane.

What platform are you using?


The SQL does have much the same syntax, IMHO:

    SELECT * from X where P(x)
    intersect
    SELECT * from X where Q(x)
    ==
    SELECT * from X where P(x) and Q(x)
I did enjoy the rant though. It seems like ORMs like Hibernate are moving to have as much complexity as the database they map...


> It seems like ORMs like Hibernate are moving to have as much complexity as the database they map.

this seems to be a result of the desire to support the increasingly sophisticated concepts provided by the underlying databases, creating an environment in which the ORM's complexity is a function of the complexity of the databases it supports. i'm not sure this is such a bad thing as long as the additions are thoughtfully made.

i do wonder how long it will be until we start seeing the next generation of tools that make hibernate and spring look as crusty as j2ee did when they came out.


next generation of tools that make hibernate and spring look as crusty

I might be biased but spring and hibernate have looked crusty to me from the get-go. Talk about solutions looking for a problem - or rather trading in one problem for two problems each.


My point is that the queries aren't composable, but the predicates are. Think of predicates as basic operations, and "SELECT * FROM X WHERE predicate" as a functor.

Then "composing" Select(P) with Select(Q) is identical to Select(P "composed with" Q).

Composition works, you just need to think of it the right way.


Have you looked at http://en.wikipedia.org/wiki/MonetDB ?

It exposes some of the DB guts at a different layer while leaving a SQL "porcelain" feature available for people who desire that.


It sounds like his ideal database already exists; AllegroCache. It's an object database, but you can easily restrict yourself to certain data types to make it "relational", if that's what you're after.

Some of the open source databases are almost as featureful; Elephant for CL and KiokuDB for Perl come to mind.


I never went through this phase myself, but I see a lot of developers stuck in it. It's this sort of exasperated anger towards a technology that the author has tried his best to understand but failed.

For some reason some people never really wrap their head around Relational Databases, Normalization, and SQL. They seem to not be wired for it. And it understandably pisses them off.

It's people just like this author that are behind all those terrible Object Relational Mapper projects that try to pretend that your data isn't really data. It's just an attempt to escape the obligation of truly understanding how databases work and how they can help you. It's sadly ironic when you think about it.


No, I'm pretty sure I understand the technology. I've written thousands of lines of queries, triggers, stored procedures. I've read the source to Postgres.

In the article, I specifically say ORMs are a terrible idea. I specifically say Relational databases, data integrity, foreign key constraints, declarative queries and ACID are all good things. I forgot to mention normalization, but that's good too. My point is that SQL is a terrible interface to get to all of the good things about RDBMs.


I recently made a GUI for a search utility which is filled in, and then it builds an SQL query from the GUI, and then displays the results of the query in another GUI.

My point is that you can map a search to whatever language you like - whichever is best for that utility.

Plus a friend told me about Linq, which is an interesting evolution from SQL. It lets you break up queries over several cores (though that's just because it's iterative in nature).


how about a 10+ table join on f(any function)?


Those who complain about SQL just haven't used Oracle, fully.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: