Same Query, Different Hash: Snowflake's query hashing blind spots

ryanwaldorf · 2024-12-18T17:07:02 1734541622

Really interesting! Could you clarify what difference having two different query_parameterized_hash for similar queries is? Is there a performance hit?

dangoldin · 2024-12-18T17:11:45 1734541905

Yea - the idea is that Snowflake will generate these after a query runs in order to help you look at multiple runs of the same query. So imagine you run a query that's "select a from b where c = 1" and you want to find all examples of that query running. That's where "query_hash" comes in. But Snowflake also says well what if we let you be generic about the parameters - so "where c=1" and "where c=2" and "where c=300000" all have the same query_parameterized_hash.

That's the intent but turns out it's only doing a very simple hashing and not actually looking at the canonical version of the query. For example it won't treat aliases/renames as the same even though it should. This makes it harder to look at all queries that are in essence doing the same thing.

ryanwaldorf · 2024-12-18T17:40:49 1734543649

Oh that's really interesting! I imagine there could be a reason for it, for instance the data is distributed differently in the micropartitions so different where values could result in different data lookup patterns as you may skip more/less blocks. But overall this makes a lot of sense!