I was at a company that switched from Redshift to Snowflake. It was a night-and-day difference: faster (orders of magnitude!), cheaper, and significantly easier to work with, since everyone had their own personal view of the data to mutate and experiment with.
As far as I can tell, it is a unique product in the database space. Extremely well executed ideas and design.
Snowflake seems like a unique product and I can only imagine the complex math they're doing under the hood to achieve these incredible query times. memsql is the only real competitor I know of. Redshift is a lot less user friendly (constant need to run vacuum queries). Parquet lakes / Delta lakes don't have anything close to the performance.
Predicate pushdown filtering enabled by the Snowflake Spark connector seems really promising. Lots of companies are currently running big data analyses on Parquet files in S3. Snowflake has the opportunity to grab a huge slice of the big data market.
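Roughly what that looks like from PySpark: with pushdown enabled (the connector's default), the filter and column selection below get compiled into SQL that runs inside Snowflake, so only the matching rows ever reach Spark. Connection options and table/column names here are placeholders.

```python
# Sketch of predicate pushdown through the Snowflake Spark connector.
# All connection options and table/column names below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snowflake-pushdown-demo").getOrCreate()

sf_options = {
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "analyst",
    "sfPassword": "********",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "COMPUTE_WH",
}

events = (
    spark.read.format("net.snowflake.spark.snowflake")  # the connector's source name
    .options(**sf_options)
    .option("dbtable", "EVENTS")
    .load()
)

# With pushdown on, this filter and projection are compiled into a SQL query
# that runs inside Snowflake, so only the matching rows and columns are
# shipped back to Spark.
recent = events.filter("EVENT_DATE >= '2020-01-01'").select("USER_ID", "EVENT_TYPE")
recent.explain()  # the plan should show the generated Snowflake query
```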
Not at all. I'd highly recommend CMU's 15-445/645 Intro to Database Systems course (sponsored by Snowflake lol) because they put all their lectures online on YouTube [1]! Here's what's involved in making fast databases from the syllabus [2]:
This course is on the design and implementation of database management systems. Topics include data models (relational, document, key/value), storage models (n-ary, decomposition), query languages (SQL, stored procedures), storage architectures (heaps, log-structured), indexing (order preserving trees, hash tables), transaction processing (ACID, concurrency control), recovery (logging, checkpoints), query processing (joins, sorting, aggregation, optimization), and parallel architectures (multi-core, distributed). Case studies on open-source and commercial database systems are used to illustrate these techniques and trade-offs. The course is appropriate for students that are prepared to flex their strong systems programming skills.
I'm not qualified to evaluate this particular course. But any time a course has a corporate sponsor, it gives the professor a strong incentive, at a minimum, not to harm that sponsor. If there's a methodology the professor would like to teach, but that sidesteps or calls into question the sponsor's main offering, then that content is in jeopardy. Corruption will always take root given enough time, which is why editorial and advertising, or academic content and corporate sponsors, etc. should always be at arm's length. Snowflake should give money to CMU to fund "database-related research and teaching" and the university should decide what to do with it. There's still a possibility of improper influence, but it's harder to achieve. This is particularly bad because it's CMU and not University of Phoenix... CMU is in the highest echelon of computer science universities, so it's sad to see it so debased.
What if Kodak sponsored an imaging class in 1990... what do you think they would have said about film vs. digital photography?
A lot of ML classes at CMU (and probably other prestigious campuses) are sponsored by AWS or GCP through cloud credit donations, including the popular Cloud Computing class. Is that any different?
Not really. Cloud computing has a lot of benefits, but a lot of risks and drawbacks. Who is sponsoring a class to teach about those? About keeping users’ data private by building your own infrastructure? CMU is actively tilting their students, who are the top CS students in the world, towards cloud computing, based on the choices of these sponsors.
>I can only imagine the complex math they're doing under the hood to achieve these incredible query times
Maybe it's cynical/paranoid, but in this age of Theranos I must ask: is it possible their algorithm excels at showing you a reasonable-looking number, rather than an accurate one?
It's not terribly difficult to load-test Snowflake to get a sense of scaling. JMeter does the job well. Heck, I can pass along some sample projects I've done against them if you really want.
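The core idea also fits in a few lines of Python with the snowflake-connector-python package, if you'd rather not set up JMeter. A rough sketch (credentials and the query are placeholders):

```python
# Rough concurrent load test against Snowflake: run the same query from many
# threads and look at the latency distribution. Credentials and the query are
# placeholders.
import time
from concurrent.futures import ThreadPoolExecutor

import snowflake.connector

QUERY = "SELECT COUNT(*) FROM EVENTS WHERE EVENT_DATE >= '2020-01-01'"

def run_once(_):
    conn = snowflake.connector.connect(
        account="myaccount",
        user="analyst",
        password="********",
        warehouse="COMPUTE_WH",
        database="ANALYTICS",
        schema="PUBLIC",
    )
    try:
        start = time.time()
        conn.cursor().execute(QUERY).fetchall()
        return time.time() - start
    finally:
        conn.close()

# 64 query executions, 16 at a time.
with ThreadPoolExecutor(max_workers=16) as pool:
    latencies = sorted(pool.map(run_once, range(64)))

print(f"p50={latencies[len(latencies) // 2]:.2f}s  p95={latencies[int(len(latencies) * 0.95)]:.2f}s")
```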
Yeah, Redshift is not at all comparable to Snowflake. BigQuery is much closer: it's ahead in some areas, and in the last year it has closed some of the gaps where it wasn't. BigQuery's biggest problem is that it's tied to GCP, which is a distant third in cloud market share. They have BigQuery Omni coming, which is multi-cloud, but it'll probably be a while before it's comparable to BigQuery on GCP.
The other problem with BigQuery is that you can very easily write a query that's going to cost you a lot of money to run. With Snowflake, you can let a query run for an hour or so, then realise it was a bad idea, and you're only out a few credits, a handful of dollars.
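To be fair, you can estimate the damage up front on BigQuery with a dry run before actually submitting the query. A minimal sketch with the google-cloud-bigquery client (project/table names are placeholders, and the $5/TB on-demand rate is an assumption; check current pricing):

```python
# Dry-run a BigQuery query to see how many bytes it would scan (and roughly
# what it would cost) before actually running it. Project/table names are
# placeholders, and the $5/TB on-demand rate is an assumption.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

sql = "SELECT user_id, COUNT(*) FROM `my-project.analytics.events` GROUP BY user_id"

job = client.query(
    sql,
    job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False),
)

tb = job.total_bytes_processed / 1e12
print(f"would scan ~{tb:.3f} TB, roughly ${tb * 5:.2f} at on-demand pricing")
```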
The killer feature for me was the query profiler: you can see WHY a query is taking a long time and optimise it. BigQuery just felt like Google were brute-forcing the performance and then charging you accordingly.
When the project I was on switched, the micro-partitions (and the ability to recluster a table) as well as the MERGE semantics beat BigQuery hands down, although those features may be out of beta now (I've since moved on to a new gig).
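The MERGE I mean is the usual upsert shape, roughly like this when issued through the Snowflake Python connector (table and column names are placeholders):

```python
# The usual upsert-style MERGE, issued through the Snowflake Python connector.
# Table and column names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="myaccount", user="analyst", password="********",
    warehouse="COMPUTE_WH", database="ANALYTICS", schema="PUBLIC",
)

conn.cursor().execute("""
    MERGE INTO events AS t
    USING events_staging AS s
      ON t.event_id = s.event_id
    WHEN MATCHED THEN
      UPDATE SET t.event_type = s.event_type, t.updated_at = s.updated_at
    WHEN NOT MATCHED THEN
      INSERT (event_id, event_type, updated_at)
      VALUES (s.event_id, s.event_type, s.updated_at)
""")
conn.close()
```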
That's also a problem that it'd be fairly straightforward for Google to solve by automatically spinning up smaller, entirely separate serving clusters for customers who are worried about such a blowout (for a fee, obvs). It's just the serving tree (+ whatever in-memory storage service they use to do distributed joins nowadays), no need to duplicate the rest of the service. The caveat is, a smaller cluster will favor query optimizations specific to that smaller cluster. Some of those "small cluster" optimizations could hurt query performance when deployed against BQ proper with its tens of thousands of workers.
Also, BQ does explain the query plan to some extent: https://cloud.google.com/bigquery/query-plan-explanation. Not quite at the level of a "regular" SQL DB, but it does give you some info to work with when optimizing queries. If you haven't used it in a while I'd give it another try.
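You can also pull the same stage-level stats programmatically once a job finishes, via the job's query_plan property. A minimal sketch (project and table names are placeholders):

```python
# Pull the per-stage execution stats for a finished BigQuery job.
# Project and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

job = client.query(
    "SELECT event_type, COUNT(*) FROM `my-project.analytics.events` GROUP BY event_type"
)
job.result()  # wait for the query to finish

# Each stage reports timing and record counts, which is usually enough to see
# where the time is going.
for stage in job.query_plan:
    print(stage.name, stage.status, stage.records_read, stage.records_written)
```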
I believe this is exactly what slot reservations in BigQuery achieve. Instead of paying on-demand pricing that is determined by data read, you purchase a fixed number of “slots” that are shared by queries running within that particular project.
Ah OK, after reading their docs I see they've changed what "slots" used to mean in Dremel (the internal version of BQ). It used to be that slots _guaranteed_ capacity but did not limit it, meaning you could rely on having a certain number of workers in the cluster when you issued a query, but if Dremel had more, it would give you everything it had. Obviously that isn't viable when people have to pay per terabyte read, because a ton can be read.
What they have now strikes me as an even better solution to the problem of bankrupting someone with a query. Not sure how the pricing compares to Redshift et al., but pricing is the easiest thing for Google to change.
I was hitting some rough edges / complexity with BigQuery's MERGE recently, but wasn't able to ascertain any significant difference from Snowflake by briefly scanning their docs. What aspects of the MERGE semantics are better in Snowflake, in your opinion?