Hacker News new | past | comments | ask | show | jobs | submit login
Show HN: SQLCheck – Automatically identify anti-patterns in SQL queries (github.com/jarulraj)
61 points by jarulraj on Aug 15, 2017 | hide | past | favorite | 26 comments



The tests in this read like someone who has only used one SQL dialect, and wanted another project to show off C++ abilities. Its full of bad advice that does little to help developers write better code.

CheckJoinCount: “Consider splitting up the complex query into many simpler queries, and reduce the number of JOINs” - And offload the additional work to where, to the client?

CheckSpaghettiQuery: Character count > 500, again "Split up a complex spaghetti query into several simpler queries"

"Rewriting the query’s HAVING clause into a predicate will enable the use of indexes during query processing." - On what engines, and in what circumstances?


The join one is particularly bad as it can make your query run much slower as running multiple queries is almost never a win. A check for the opposite (multiple queries when one is needed) would be better as then the query planner in the SQL engine can do a better job and you have fewer round trips.

While the tool is a good idea, there need to be some proof that each of the patterns it's looking for are actually bad.


Yup, the "CheckSpaghettiQuery" is a recipe for disaster.

I have a query here that runs on a raspberry pi running postgresql with 7 GB of data (chat logs in a postgres table, with fulltext indexing) in about 60ms.

After splitting it up far enough that the system doesn’t complain anymore, the query takes around 30 hours (I know because that’s what I started with before I merged the queries).

This is just a dumb idea in general. You can always filter the data faster in the database, and you reduce unnecessary network round-trips.


I agree that executing fewer SQL queries is often better for performance.

However, it is possible that an elaborate SQL query that has to use many joins, correlated subqueries, and other operations can be harder for the SQL engine to optimize and execute quickly than a set of more straightforward queries.

Furthermore, these queries are simply hard to write, hard to modify, and hard to debug. I am hoping that these checks would serve as hints to developers to simplify and modularize complex SQL queries.


> CheckJoinCount: “Consider splitting up the complex query into many simpler queries, and reduce the number of JOINs” - And offload the additional work to where, to the client?

I'm sure, like many, I've seen both extremes of this. It seems like most developers I've worked with have stopped and generally thought about what they were doing with their queries, but there are those few that leave you bewildered.

1) Developer who produces a software package used by a fair number of counties across the US. They're essentially one of, if not the only, company working in that field. We acted as a go-between between the state and the software company.

There are a set of standard reports included with the software, each of which amounts to little more than a very basic PDF wrapper around an SQL query. A huge, ugly turd of a query. One that they created through some SQL GUI, whenever they created the report.

They hadn't profiled any of them or spent any time tuning them. For a large number of them it took quite a while, even with relatively small numbers of records. The state my employers worked with didn't do things on a small scale. In one query's case it would take upwards of 45 minutes to produce a report under optimal conditions. The indexes were great for the general workload, but lousy for producing the report via that query.

They also clearly hadn't thought about the end user experience. Nothing on the page to indicate progress (hard to demonstrate progress when you're waiting on one monolithic query). It wasn't set up as a nice asynchronous report production either.

While I didn't really have time for it, we wanted to be as helpful to the agency in question as possible, so I took a stab at one report, using a quick python script. Breaking the underlying query apart into a few distinct components, and paying attention to the indexes resulted in 1) producing the report in under a minute. 2) progress bars.

Trying to wrestle that change in approach into the monolithic query was way too complicated for pretty much zero benefit.

2) At another company, a co-worker kept having to reject a deployment request from one particular developer (this was quite a while ago, the company used the waterfall method).

Littered throughout their code was "SELECT * FROM table", followed by implementing all filtering logic inside their code. It got to the point where we celebrated the dev at least putting code forwards that included a WHERE clause: "SELECT * FROM table WHERE", even if we still had to reject it (due to the breaks-on-every-schema-change "SELECT *"). Over and over again we demonstrated that the database server was quicker than his code, and really hoped he'd eventually learn. That was one of the few times I really, really, really wished a team would either use an ORM, or fire the developer.


Thanks for sharing these interesting anecdotes.


https://github.com/jarulraj/sqlcheck/blob/master/docs/logica...

Strongly disagree.

Table columns should always be prefixed with table name. So users.user_id is one of the most hated patters for me. user.id looks so much cleaner. No one in their right mind should ever write the column name by itself. Especially with queries that use more than one tables.


It's pretty annoying to do joins and have to alias all your columns because their names are ambiguous. Also in my experience most people use very terse table aliases so it's not always obvious what "u.ID" "uc.ID" "puc.ID" etc. are referring to.


You don't need to alias "all" of your columns, just about 1 out of 10. And I don't understand the complaint about terse table aliases being non-obvious. The answer key is right below the select clause.


This doesn't allow for finer semantics. Just "u.id" tells us what the thing is, but not its role in the schema.

I don't think it's very unusual to do this. The most obvious example is where there are multiple references to the same entity. For example, an Order table might contain two references to a user - one for the user who placed the order, and another for the person who approved the order.

But it's more general than this. Just saying that it's the user that's associated with the thing is frequently not enough to convey understanding, and naming with the role that the relationship is describing is helpful.


I also prefer using id instead of user_id or company_id. It does mean you'll have to escape it (in MySQL at least) if you are writing pure SQL.

Take the following example:

           SELECT s.`id`,
                  s.label,
                  s.`group`,
                  g.name AS group_name,
                  s.tags,
                  s.ssh_username,
                  s.ssh_port
             FROM servers s
  LEFT OUTER JOIN groups g
               ON s.group = g.`id`
         ORDER BY s.label ASC
In an unrelated note, how does everybody else format their SQL? The above format is what I typically use.



I agree with you in general but I avoid table names or aliases when the query uses only one table because that adds no clarity. Also, table names are often much longer than "user" so aliases can be useful. I do agree that aliases like a, b, c should be avoided and if an alias is required it should be descriptive.

On my team we have a lot of common table aliases which are used by convention. We understand what they mean and it gets the point across without using 20 characters.


Thanks for sharing your perspective. I will add a new check to ensure that table columns are always be prefixed with table name. However, I still think that it is worthwhile to not use a generic primary key name (at least in the common case).


How about "do not use offset for pagination"?

http://use-the-index-luke.com/no-offset


I feel like the author unjustly brushes aside the main benefit of using offsets for pagination (that is: the ability to query a specific page of results). Yeah, most page-based UIs do a lame job at it, but it's nice to be able to say "yeah, the record you want is on the fourteenth row of page 9".

If that's not a relevant use case for your application, though, then by all means use key-set pagination.

As a side note, I happen to dislike most implementations of infinite scrolling (especially when the keep pushing down the page footer, making it impossible to actually read or click on anything down there... GRRRR!).


Sure :) Any other interesting suggestions?

I must mention that I greatly admire your pragmatic books on improving database performance, and they were a source of inspiration for the development of this tool.


A complicated solution to something that is not a problem, at least for my lowly business CRUD apps.


Interesting.

Some bones to pick: https://github.com/jarulraj/sqlcheck/blob/master/docs/query/...

First, dat implicit, inner join. Second, it could be written more clearly that, in this case, the reason you want to use a join is because you are not sure that the sub-query returns at most one row.


One good example is joining to a foreign data wrapper (FDW). Oftentimes, PostgreSQL can freak out about this because the statistics might be off/vague. So instead of making a small query to the FDW, it will do a full table scan :S

A subquery explicitly tells the database "hey, don't get the whole FDW, just this part that I want"


Can you please open an issue or send a PR to update the documentation? Thanks!


Yup sure thing!


If you know any devs at Tableau, please forward this to them.


Thanks for the suggestion. I unfortunately don't, but this tool can certainly be integrated with ORMs and in other BI stacks.


Sounds good in theory. but I wonder how well it works for any SQL dialects. I write SQL in a dialect like T-SQL usually. The tool would probably not recognize the brackets or different join syntax.


By design, the tool is dialect-agnostic.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: