Hacker News new | past | comments | ask | show | jobs | submit login
Big Bird: killing SQL injection with graph homomorphisms (dustri.org)
35 points by pabs3 on June 28, 2022 | hide | past | favorite | 27 comments



The right way to avoid SQL injection is of course to use parameterised queries. Which also has the nice side effect of speeding up the queries too.

While attending various security vendor demos of static analysis products, they all used SQL injection as their example, and all of them recommended input validation as the mitigation rather than parameterised queries. You'd think they would know better.

It's a cute idea to compare the AST of the query to see if it changes, but then you have to define all valid queries - a whitelisting approach. I'm struggling to see why you would want it unless you had control over a database but had no control over the code that issues the queries.


Sadly it is not always possible in general to completely avoid constructing queries programmatically (which always comes down to string concatenation at some level of the stack) as parameterized queries have some limitations.

For example engines don’t always support a list parameter, meaning an “in” clause with varying number of entries can’t work. (Some might allow table variable parameters as a workaround, but not all).

Another common reason for needing dynamic query construction varying sort by. Once again many databases don’t let you do “order by ? ?” (First being column name, second ASC/DESC).

But a good query builder abstraction with knowledge of your specific DBMS, can help avoid injection here, especially if it does a sanity check of reparsing the generated query and comparing it against an AST it built of the target query. Even when using such a system though, it can make sense to have the query builder build a parameterized query, and use that for passing as many parameters as possible, just as defense in depth. (Avoids bugs in escaping code of query builder, for example.)


ASC/DESC doesn’t technically need parameterization. The point about parameterizing and not query building is about not query building directly from user inputs. It’s fine to build the query fine when you control every part of the input used to build the string. Parameterization is only really for when you need to process a user string, particularly. One mitigation for the column name is to validate it’s in the enumeration of available column names first.

Now of course, if you’re trying to be super strict about parameterization so that static tools / linters / code policy during review don’t complain that’s one thing. But it’s not an invalid choice to utilize domain knowledge to simplify the problem if your tooling doesn’t let you do the right thing.


Right. Concatenating in insufficiently constrained user inputs are dangerous (and trying to constrain a general string enough is usually too risky).

If you can prove that a user input is a harmless value (a valid column name for the table for an order by, an integer, etc) then concatenation is safe. On the other hand, it is still not ideal have a bunch of code where you are directly concatenating strings to generate SQL, since people making later changes are likely to end up modifying code to concatenate unsafe input.

Hence why having a good query builder api can be valuable. It can hide away the string concatenation, and if well designed, can force that a value but in the order by position is a column name, etc, ideally while making it hard or impossible to have insufficiently constrained user input thrown in.

For example, if a language supports distinguishing string literals (or compile time constant strings), and variable string values, it is possible to create a query builder api where arbitrary user provided strings simply cannot become part of the query. (e.g. The only paths that let non-constant strings into the query are those that highly constrain the strings to provably safe values).


It reminds of the quote about Americans sometimes falsely attributed to Churchill saying they'll always do the right thing, "once they've tried everything else" (since it isn't a real quote you'll find a variety of phrasings)

Parametrisation is the Right Thing™ and yet people love inventing new half-arsed workarounds when they could just use parameterisation.

Elsewhere in security, historically you would find any number of hacks to try to avoid needing to use a good Mode Of Operation for your block cipher. People would rather expend mental effort and protocol complexity on trying to still do ECB when that's the Wrong Thing™ than just use the Right Thing™ (these days that's an inherently Authenticated Encryption Mode of Operation like GCM or GCM-SIV)

Or look at all the work trying to shore up human memorable shared secrets (passwords) when you almost never actually needed this and could have instead deployed stuff like WebAuthn.


Parametrization is not enough, there are plenty of scenarios where you need to programmatically construct queries based on user input in more complex ways than parameterized queries allow - including subqueries, lists (IN clauses), sort orders, and perhaps others.


> you need to programmatically construct queries based on user input

You still need to build the queries and have control of what parametres are in use. Why not build the parameters list dynamically as well?. There are some databases or drivers that have limitations where you need to sanitize data yourself, but I'm not sure if that applies to the majority of the scenarios.

This is a simple example of code for constructing dynamic queries:

    query = "... WHERE country = :country";
    params.set("country", country);

    if (userId != null) {
        query += " AND user_id = :userId";
        params.set("userId", userId);
    }
It may become more complex as you add sub queries or if the database does not support adding a list as a parameter, but the principle is the same. You can always just build the list programatically using parameters instead of concatenating the values given by the user directly.

If there are drivers or databases that does not support this, you will need to sanitize the data yourself of course. But I don't think that is the case for most users.

(btw, database drivers that support named parameters are much more convenient than using "?")


I guess there may be SQL frameworks or ORMs that make that hard. I worked on something that allowed parameterisation of any user supplied value, no matter how complex the query. It wasn't that hard, but we could make it do whatever we wanted.


But instead of inventing giant piles of workarounds, put your energy into fixing the database software.


Is this some kind of sick joke I'm too old to understand? Ffs stop gluing together SQL strings! PHP had PDO for decades now, use it!

Oh but PDO doesn't support params in "limit" statement. Than make it support it! Improve PDO's support for "in" parameters as well so that I can pass damn array.

Hack add a "strict mode" to PDO that makes any kind of value in SQL invalid statement.

Running some magic AI than randomly blocks valid SQL is not a solution.


When I first learned about escaping to avoid SQL injection, I was suspicious the whole concept was demented — why can’t I just tell the database that this thing isn’t SQL code?

Then I learned about parameterization, and had my suspicions justified but I didn’t understand why it wasn’t the standard answer.

Finally it turned out PHP mysql lib apparently didn’t support it for ages, and thus young minds have been poisoned by the ridiculous notion of string escaping for decades.

It's still a thing even today with shit like html escaping; I don’t understand why we’re so explicit on so many things in programming, and then as soon as we try talking to another system we just drop everything back to stringly typed nonsense


The concept of escaping is perfectly fine; the demented part is having it turn strings into strings, which allows it to be applied any number of times (including zero).

That can easily be avoided by producing a different type of thing to the input, e.g.

    escapeHTML  : String -> HTMLChunk
    escapeSQL   : String -> SQL
    escapeShell : String -> Shell
    etc.
and having our "execute" API requires that thing as input, e.g.

    returnResponse : HTML -> ()
    runQuery       : SQL -> ResultSet
    runCmd         : Shell -> (ExitCode, Stdout, Stderr)
This way, we're forced to escape (to get a valid argument for the API), and we can't double-escape (since escaping functions require a string as input)


The answer to your question is that PHP's mysql library was pretty much alone, even at the time, in not supporting parametrization. Also your idea for automatic html escaping had been the norm in the Rails community for like 10 years and the JS community is also pretty good, but here I am unclear why not all language communities followed suit.


I don’t think I’m even in favor of automatic string escaping; string escaping generally is a ridiculous concept, for any place where we’re not intentionally mixing code & data e.g. a user-supplied regex — if it’s a user-supplied string that we intend to stuff into the middle of a regex as simple text… we should be capable of simply saying that’s what we want to do. Instead of smearing backslashes everywhere.

Keep the inputs out-of-band and this whole problem doesn’t even exist as a concept


Which is what the PostgreSQL protocol does. It supports sending the query and its parameters separately. HTML on the other hand does not support that.


> The answer to your question is that PHP's mysql library was pretty much alone, even at the time, in not supporting parametrization.

And at the same time everyone and their grandmother was making websites using PHP and MySQL.

There should have been enough incentive to improve this early on.


Especially since everyone else had solved it since before PHP even existed. This problem was invented by PHP.


> Sometimes in 2017, while blotus, bui and I were brainstorming what mitigations we wanted in Snuffleupagus, we really wanted to kill SQL Injections, but it's a hard problem.

Killing SQL injection is not a hard problem. Parsing this statement, however, is.


> It's still a thing even today with shit like html escaping; I don’t understand why we’re so explicit on so many things in programming, and then as soon as we try talking to another system we just drop everything back to stringly typed nonsense

To be fair, most web-browsers today supports Content Security Policy[1], where you can explicitly declare that NO, nothing inside this HTML is ever allowed to be executed as code. The only code allowed comes from script tags, from these domains, which you may optionally require to match a specific hash.

It's not that hard.

Ofcourse the application has to be built to deliver the front-end payload that way for this to work, but it's not exactly rocket-science.

[1] https://developer.mozilla.org/en-US/docs/Web/HTTP/CSP


Also does PDO not fall back to 'emulating' prepared-statements by ultimately sending an interpolated query in certain circumstances with unknown implications?


Killing bug classes like SQL injections is for situations where you can't always do the right thing, like using proper parameterised queries. Maybe you have an old codebase you haven't or won't finish rewriting. Maybe you can't afford people who don't write SQL injection bugs. Maybe you can but they still eventually accidentally add such bugs and you want to be sure.


I like this idea. I thought of something like this a few years ago. A search on "sql injection ast" will show that others have considered this also.

One thing I thought of just now was the unlimited power of the database connection. When a programmer sets up a connection, it is usually for fairly simple things. And then injection happens and the connection happily processes that as well.

Perhaps it would be nice to allow the programmer to define the limited scope when opening the connection. A parameter could be added to specify a list of tables and the expected operations. So the programmer could say that the connection is going to access PRODUCTS and the only expected operation will be SELECT.

Closing the connection could also be modified to return the set of tables accessed during the connection along with the operations observed. Another thing that could be returned is a set of AST's detected with the AST in string form. These AST's could then be incorporated into the source code and also passed to the database when the connection is opened.

The database could then reject SQL that attempts to access tables outside of the specified set of allowed tables and operations. Likewise, unexpected queries could be failed.

The beauty of this is that the management of the whitelisted items is in the domain and control of the programmer. Not specifying whitelisted items is possible still and gives the status quo if that is what is wanted. It also empowers and rewards the programmer would really, really does not want to be responsible if things go crazy.


I sort of wonder if there's an alternate syntax that would give everything SQL needs but would not be as vulnerable to injection. Either that or a binary mapping that would trivially allow a text front end but have length-delimited parts.

Little Bobby Tables deserves to have his name accurately represented in the database. This could involve proper escaping (difficult) or parameterized queries/prepared statements (easy). But many sqli mitigations just reject his name out of hand.

I will say that sometimes I'm a little thankful for SQL injection, or at least a little bit of leeway for abuse. Many applications implement partial search using "like '${str}%'" and if they don't escape % I can also search for my string in the middle. There have been times when I'm able to inject additional columns or rename things to exfiltrate data that should have been available but the app developer didn't make it accessible. Unfortunately, in the wrong hands it is able to break things. Even accidentally.


The problem is not really syntax. Almost any syntax which embeds user supplied strings is vulnerable to syntax hacks.

The only way I could see to avoid an embedded string altering syntactic meaning is if the syntax either gives the length of the parameter before it, or surrounds the parameter with some unguessable values. For example:

SELECT "10:xxxxxxxxxx" FROM USERS; SELECT "s63hfe&:xxxxxxxxxxs63hfe&" FROM USERS;

Where xxxxxxxxxx is the user supplied value.

Both of these are totally impractical of course.


"just" use concat or similar (depending on your database flavour) with "?":

    #!/usr/bin/env perl
    use 5.020_000;
    use warnings;
    use DBI qw<>;
    my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:');
    my $sth = $dbh->prepare('SELECT ("quux-" || ?) AS foo');
    $sth->execute('bar');
    say $sth->fetchrow_hashref->{foo};
... gives:

    $ perl c.pl
    quux-bar


I was thinking like an application of JSON that would be more likely to be produced by stringify than by concatenation. Then escaping would come "for free". It wouldn't guarantee the right thing but it would make it much more likely.


This is a neat DB-side last line of defense! I agree that parameterization is the Right Way to ensure safety, but if we're being realistic you'll never actually get everyone to do it the right way, so it's worth putting in safety mechanisms like this that you can implement inside the library where it'll benefit everyone.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: