Hacker News
All Abstractions Are Failed Abstractions (Linq vs Raw SQL) (codinghorror.com)
37 points by pchristensen on June 30, 2009 | 33 comments



There's no way fetching IDs and then fetching each row one by one is faster than a SELECT * with the same WHERE clause, especially once you account for round-trip times and whatnot. His measurements are wrong (e.g. due to caching) and/or he's confused.


He's not selecting each row one by one. He's getting 48 records back using two queries. The first query gets the IDs of the 48 most recently updated posts; the second selects the entire row associated with each ID:

1) select top 48 Id from Posts

2) select * from Posts where Id in (<IDs selected by the previous query>)

Unless the second query is 'abstracted' into 48 individual queries (which would make for an interesting abstraction article in itself), your accusation is completely baseless.
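For anyone who wants to see the two-query shape concretely, here's a minimal sketch using Python's sqlite3 (the table is a made-up stand-in for the Posts table above; SQLite uses LIMIT where SQL Server uses TOP):

```python
import sqlite3

# In-memory stand-in for the Posts table discussed above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Posts (Id INTEGER PRIMARY KEY, Title TEXT)")
conn.executemany("INSERT INTO Posts (Id, Title) VALUES (?, ?)",
                 [(i, f"post {i}") for i in range(1, 101)])

# Query 1: fetch just the 48 ids.
ids = [row[0] for row in
       conn.execute("SELECT Id FROM Posts ORDER BY Id LIMIT 48")]

# Query 2: fetch the full rows for those ids with a single IN query,
# using placeholders rather than literals so the statement stays cacheable.
placeholders = ",".join("?" * len(ids))
rows = conn.execute(
    f"SELECT * FROM Posts WHERE Id IN ({placeholders})", ids).fetchall()

print(len(rows))  # 48 rows fetched with exactly two queries
```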

I think Jeff's notation in the SQL queries was a bit confusing, but if you actually read the article his intention is clear and his point is valid. I'm baffled by the retaliation here on HN. Please rebut me and restore my faith in the HN community.


He says: "Selecting all columns with the star (*) operator is expensive, and that's what LINQ to SQL always does by default."

Doing a select * is expensive relative to what? Assuming your WHERE clause touches only the index key, selecting any one column from the leaf of the index costs the same as selecting all columns in the leaf. The real expense is the disk read, if one is necessary.

Also, if you select the entire object, LINQ does a select * because it has to populate every field of the object. If you only need an id column, just select the id column.

Then he says: "Now, retrieving 48 individual records one by one is sort of silly, because you could easily construct a single where Id in (1,2,3..,47,48) query that would grab all 48 posts in one go. But even if we did it in this naive way, the total execution time is still a very reasonable (48 * 3 ms) + 260 ms = 404 ms."

He says he could use an IN list, but he never does. He assumes that selecting one id is the worst case and multiplies it by 48 to get a total. The true time would depend on whether the index pages were in the database's cache. It also depends on whether the query uses parameters or literals, because then compiling/optimizing time enters the picture too.

Specifically, what point are you referring to that is valid?


"Now, retrieving 48 individual records one by one is sort of silly, because you could easily construct a single where Id in (1,2,3..,47,48) query that would grab all 48 posts in one go. But even if we did it in this naive way, the total execution time is still a very reasonable (48 * 3 ms) + 260 ms = 404 ms. That is half the time of the standard select-star SQL emitted by LINQ to SQL!"

His "naive way" is the 48 individual SELECTs, and he's claiming that's faster than one SELECT *.

His point about SQL vs. LINQ may or may not be valid --- I don't know since I don't use LINQ --- but his argument and measurements are certainly flawed.

The nice thing about software such as a database is that, if you spend enough time, it's not some magical black box. If you read some books and papers, go to school, or read some source code, you can actually understand what goes on "under the hood" and you don't have to resort to voodoo measurements. So, if you know something about RDBMSs, you know that once a query retrieves some rows from disk they're cached, so any queries issued shortly afterward are served from memory. That's probably what happened here, since 3 ms is less than the disk's seek time, not counting networking and processing overhead...

If you're absolutely unwilling to learn something about how a piece of software works and feel compelled to measure it instead, then at least take meaningful measurements: averages, medians, deviations, plots...
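In that spirit, even a throwaway harness like this one (a sketch; the lambda is a trivial stand-in for the real query under test) reports mean, median, and deviation instead of a single suspicious number:

```python
import statistics
import time

def measure(fn, runs=50):
    """Time fn() repeatedly and summarize, so one warm-cache outlier
    doesn't masquerade as the typical cost."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)  # ms
    return {
        "mean_ms": statistics.mean(samples),
        "median_ms": statistics.median(samples),
        "stdev_ms": statistics.stdev(samples),
    }

# Stand-in workload; swap in the actual query you want to measure.
stats = measure(lambda: sum(range(10_000)))
print(sorted(stats))  # ['mean_ms', 'median_ms', 'stdev_ms']
```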

As a final note, as a developer whose startup is writing a soon-to-be-released high-performance replicated key-value store: if their site is such that performance matters, these are exactly the kind of queries that could be offloaded to a key-value store or even memcached and recomputed asynchronously in the background every minute.


He's not selecting each row one-by-one.

Yes, he does: "even if we did it in this naive way, the total execution time is still a very reasonable".

He's getting 48 records back using two queries.

This is equally silly. The two queries would be equivalent to the single one, except for the extra round trip, the extra traffic (the ids are retrieved twice instead of once, and also sent back to the server), and the overhead of concatenating and then parsing all the ids in the second query.


I concede. I got confused by the original post's wording; thanks for clearing it up!


I completely agree. Unless the database is doing something odd (ok, not out of the question, I guess) like ignoring an index on the id column, it just isn't logical that performing 48 more queries would be faster than letting the database engine run it all at once.


There are so many levels to performance that measuring the surface and making a generalization is practically useless. I find the comments on the article even more baffling. Here are a couple of relevant questions:

1. What is the scope of the data context used for these queries? Caching that is keyed by id is done within the scope of the data context: http://blogs.msdn.com/dinesh.kulkarni/archive/2008/07/01/lin...

2. Is the code being run co-located with the database server? If so, this is atypical and these times do not include network overhead.

There are many more questions...


But even if we did it in this naive way, the total execution time is still a very reasonable (48 * 3 ms) + 260 ms = 404 ms

I can't read things like this without wondering why people keep reading him. He is so wrong so often that I... it's just beyond me.

Query execution time is just one part of performance. You still have to create the connection and transfer all the data across the wire. Sending N records in 1 big package is much more efficient than sending them in N+1 individual packages.

Run a performance measurement on the actual code execution and I am almost positive you'll see worse performance than the query execution times alone would indicate.
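The batched-vs-individual comparison is easy to check even without a network in the way. A rough sqlite3 sketch (in-process, so any gap here only widens once real round trips are added):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Posts (Id INTEGER PRIMARY KEY, Body TEXT)")
conn.executemany("INSERT INTO Posts VALUES (?, ?)",
                 [(i, "x" * 500) for i in range(1, 1001)])
ids = list(range(1, 49))

# N individual primary-key lookups (the "naive way").
start = time.perf_counter()
one_by_one = [conn.execute("SELECT * FROM Posts WHERE Id = ?", (i,)).fetchone()
              for i in ids]
t_many = time.perf_counter() - start

# One query with an IN list.
start = time.perf_counter()
batched = conn.execute(
    "SELECT * FROM Posts WHERE Id IN (%s)" % ",".join("?" * len(ids)),
    ids).fetchall()
t_one = time.perf_counter() - start

# Same 48 rows either way; compare t_many against t_one yourself.
print(len(one_by_one), len(batched))
```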

From now on, I'm flagging every codinghorror post.


It really depends, though. If you're using MySQL with MyISAM tables, you can easily come up with a sequence of events where N+1 requests are much faster due to table locking, especially once joins get involved. In that case, if you can split your query into N+1 requests by primary key and avoid the join that locks both (or worse, many) tables for a little while, things will be much smoother and faster in production.

Now, of course, if you can do that you really should be thinking about memcached and/or a key/value store with possibly a distributed index rather than a relational db anyway.


I wouldn't go as far as flagging every codinghorror post... I would limit it to just the ones related to actual programming code/coding paradigms... which is admittedly almost all of them.


Jeff is really doing us all a disservice here. Now I'm going to have junior programmers fighting me, citing published evidence that running a thousand queries is actually faster than running a single query with a thousand rows returned.


A strange article, even by Atwood standards. The anecdote doesn't match the moral -- nothing would be altered by removing Linq-to-SQL from the equation.


I agree with you, I don't understand what point the author is trying to prove.

I truly think that mixing code and queries is acceptable only for small projects. Not a single word about stored procedures, which are, in a way, an abstraction of your database logic.


Every time I read one of these posts, which admittedly is only via HN lately, the suspicion that I am being trolled grows. This almost seems to be borne out by the numerous corrections that are almost always to be found in the comments.

More seriously; in some ways, I feel bad for the position Jeff Atwood is in. On the one hand, I don't think he really intends to speak with the authority implied by his prominence, but on the other hand, he has his prominence.


He appears to be purposefully trolling in this post: http://www.codinghorror.com/blog/archives/001257.html

However, he never admits it. I suspect that although many of his posts might not start out as trolls, he is quite happy to engender a loud reaction without any real concern as to the accuracy or utility of what he writes.


So the Peter Principle applies to blogs, too?

I agree, actually.


There is a certain level of functionality available in all relational databases, and LINQ to SQL has been designed to be isomorphic to that functionality. The upshot of that (plus the very powerful reflection classes in .NET) is that the only problem here is that Microsoft hasn't optimized the expression-tree-to-SQL output as much as possible. After another update or two to the .NET framework, the performance should be virtually identical; it may even be better with LINQ to SQL, since it may 'think' to optimize queries in ways that you wouldn't.

On a side note, the author also neglects to mention that LINQ to SQL is a great way to make your code more portable to other platforms (Mono) or other databases (MySQL, PostgreSQL, etc.), so you don't have to learn the intricacies of each database's SQL syntax.


Only one problem there: Microsoft has pretty much EOL'd LINQ-to-SQL in favour of Entity Framework.

So, the performance improvements and code improvements that you speak of will never get done.


Linq to SQL produces expression trees. These expression trees are interpreted by the query provider. In the case of Linq to SQL the query provider produces SQL for the SQL Server dialect. The SQL can be optimized. This SQL is sent to SQL Server which then parses and optimizes the execution plan for the SQL (or not if it is already in the procedure cache).

Are you saying that the Linq expression to SQL optimization will replace the SQL Server SQL plan optimization?

The gory details of producing a Linq query provider are covered in this series of blog posts: http://blogs.msdn.com/mattwar/pages/linq-links.aspx


No, I'm not saying that Linq to SQL will replace the SQL Server plan optimization; I'm merely pointing out that a well-optimized (and perhaps dynamic, using reflection) query provider could probably generate better, more efficient SQL queries than 99% of developers in 99% of use cases. You might miss out on some potential optimizations simply because you're trying to get the code working instead of spending hours shaving a few milliseconds off your query's execution time (plus, nobody's perfect!).
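A toy version of what a query provider does (walking an expression tree and emitting parameterized SQL text) might look like this. The node shapes here are invented for illustration and bear no resemblance to the real LINQ expression-tree API:

```python
# Hypothetical mini expression tree: ("eq", column, value) or
# ("and", left, right). A real provider walks System.Linq.Expressions nodes.
def to_sql(expr, params):
    op = expr[0]
    if op == "eq":
        params.append(expr[2])       # collect the value as a parameter
        return f"{expr[1]} = ?"
    if op == "and":
        return f"({to_sql(expr[1], params)} AND {to_sql(expr[2], params)})"
    raise ValueError(f"unsupported node: {op}")

def select(table, where):
    params = []
    clause = to_sql(where, params)
    return f"SELECT * FROM {table} WHERE {clause}", params

sql, params = select("Posts", ("and", ("eq", "AuthorId", 7),
                               ("eq", "Deleted", 0)))
print(sql)     # SELECT * FROM Posts WHERE (AuthorId = ? AND Deleted = ?)
print(params)  # [7, 0]
```

An optimizing provider would rewrite the tree (folding constants, pruning always-true branches) before emitting text, which is exactly the layer the parent comment says Microsoft could improve.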


Now I'm starting to understand why HN readers oppose Coding Horror.


Please explain why, so we don't have to actually read the article ourselves. You could be doing other community members a great service by saving them the time to read articles that, according to your comment, aren't worth reading. However, your comment is useless without explaining why we needn't bother to read the article.


The article is full of messy and wrong arguments that are hard to summarize.


It's well named, I have to admit. Whenever he writes something like this I really do feel a kind of horror. I had to look up "Linq", by the way.


I'd be curious to hear the reasoning of the users who 'upvote' Jeff Atwood articles such as this one.


I upvoted this article because the conversation on this page is interesting and provides a lot more information than the original article. I always look at the comments before reading an article on HN, and often feel that they are more valuable and insightful than the linked page. This is the case here.


To everyone who's bashing: he's just saying that the relationship under the covers of an abstraction isn't straightforward. He's still using Linq2Sql and still returning datasets in one query. He just wants people to be aware that even with a good abstraction, there are still complex considerations.


Yeah, but at the same time, he's being misleading about the effects of the N+1 query problem. It's a problem for a reason.

Anyone experienced enough in web development will maybe get his point, but the newbies who read his blog will not, and they will fight more senior developers on this topic, all because "Jeff Atwood said it".

That's what the bashing is centered on.


Someone should tell these newbies that "Jeff Atwood said it" is actually an argument against whatever he said.


Someone should, indeed.


I'm sure Google and other large dev shops will be chomping at the bit to buy the on-premises version of Stack Overflow once they see how sharp Jeff is.


The author doesn't know/understand SQL and he should be kept away from databases.




