
One unique thing about Google Cloud is that most managed services like Load Balancer, PubSub, Datastore, BigQuery, etc. do not charge you extra for elasticity and high availability. AWS and Azure WILL charge you 10x to scale up and another 3x for redundancy. Because Google's managed services are often based on Google's internal stack, they just scale. Good luck scaling Kafka to millions of messages per second; with PubSub you get it out of the box. PubSub, BigQuery, and others are geographically highly available out of the box. These things are difficult to replicate on EC2, and nearly impossible on players like Digital Ocean.

Edit: BigQuery, for example, allows you to rent 10,000 cores for 5 seconds at a time. This type of stuff is impossible to do with VMs at all.




Not really related to your bigger point, which I have no opinion on, but Kafka & PubSub have different delivery contracts, and Kafka's are generally stricter. Comparing the scalability of the two is therefore somewhat problematic.


Can you elaborate on that? PubSub is a fully-managed service, which means that Google SREs are on call making sure things are up. In addition, Pubsub has "guaranteed at-least-once message delivery". In a sense, Google's SREs guarantee delivery.

PubSub is also a GLOBAL service. Not only are you protected from zone downtime, you are protected from regional downtime. Is there an equivalent to this level of service anywhere in the world?

I'm not too familiar with Kafka's fully managed service, but Kafka-on-VM is a whole other ball game. YOU manage the service. YOU guarantee delivery, not Kafka.


Kafka promises strictly ordered delivery; PubSub promises mostly ordered delivery. The difference between those promises is what drives PubSub's ability to scale throughput and provide global availability.

From an availability standpoint, I don't disagree with anything you mention, but the difference between the consistency models means that PubSub is solving a different set of problems than Kafka, thus my opinion that comparing them is problematic.


That's a fair point. But remember, Kafka promises this as long as the underlying VM infrastructure is alive and well. PubSub completely removes this worry, or even the concept of VMs.

There are several ways to look at it, but I'd opine that a "mostly ordered" fully-managed truly-global service that's easy to unscramble on the receiving end is more "guaranteed" than something that is single-zone and relies on the health of underlying VMs that YOU have to manage.

edit: Kafka and PubSub have a lot of overlap, but they each have qualities the other one doesn't. I suppose you gotta choose which qualities are more important for you.


If you can design your protocol such that it can work in a mostly ordered fashion, I'd highly recommend that you do. It opens up your choices for technology stack tremendously. But, if you require ordered delivery, your choices start shrinking dramatically.
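To make the "mostly ordered" design concrete: if messages carry a publisher-assigned sequence number, the receiving end can unscramble them with a small reorder buffer. This is a minimal sketch, not PubSub's actual API; the `(seq, payload)` pairs are an assumption about what your publisher would attach:

```python
import heapq

def reorder(messages, start_seq=0):
    """Re-sequence messages that arrive mostly in order.

    Each message is a (seq, payload) pair, where the publisher assigns
    a monotonically increasing seq. Out-of-order arrivals are buffered
    on a min-heap until the gap fills, then released in order.
    """
    heap, next_seq, out = [], start_seq, []
    for seq, payload in messages:
        heapq.heappush(heap, (seq, payload))
        # Drain every message whose turn has come.
        while heap and heap[0][0] == next_seq:
            out.append(heapq.heappop(heap)[1])
            next_seq += 1
    return out

# Messages 0 and 2 arrive late, but come out in publish order.
print(reorder([(1, "b"), (0, "a"), (3, "d"), (2, "c")]))
```

The trade-off is buffering: a deep gap holds back everything behind it, which is exactly the latency cost strictly-ordered systems like Kafka pay internally.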

Also, just so we are on the same page. Kafka is a software product that can be run on hardware or VMs, not a managed service. Possibly, you are thinking of the Amazon Kinesis product which does offer a managed service with strict ordering.


Agree on first point.

No confusion on second point. My argument was that Kafka adds significant complexity and delivery risk because it's software that you must run on hardware/VMs, rather than a fully-managed service. You have to pay a whole lot of eng time to make Kafka truly "guaranteed delivery" because there's always risk of underlying hardware/VM/LB dying.

Pubsub guarantees delivery regardless of what happens with underlying infrastructure. In a sense, the bar has been raised dramatically.


> PubSub is also a GLOBAL service. Not only are you protected from zone downtime, you are protected from regional downtime. Is there an equivalent to this level of service anywhere in the world?

Could you point to some of the documentation that describes more about its reliability model and SLA? I glanced through the documentation and couldn't find any information about this.

It seems like a service with this kind of global availability would have to trade off latency for writes and potentially reads. If it's a multi-region service, then all writes need to block until they're acknowledged by at least a second region, right? That would add latency to every request and may not necessarily be a good thing. Similarly, at read time, latency could fluctuate depending on which region you query and whether your usual region has the data yet. I'm just speculating, though, not having read any more about the service. It does sound nice to have the choice to fall back to another region and take the latency hit instead of an outage. On the other hand, regions are already highly available at existing cloud providers (zones being the more common failure point).

Is PubSub mature? The FAQ suggests that you should authenticate that Google made the requests to your HTTPS endpoint by adding a secret parameter, rather than relying on any form of HTTP-level authentication.

> If you additionally would like to verify that the messages originated from Google Cloud Pub/Sub, you could configure your endpoint to only accept messages that are accompanied by a secret token argument, for example, https://myapp.mydomain.com/myhandler?token=application-secre....

This feels rather haphazard. If I'm exposing an HTTPS endpoint in my application that will trigger actual behavior upon the receipt of an HTTP request, then of course I "would like to verify that the messages originated from Google Cloud Pub/Sub", so that they're not coming from some random bot or deliberate attacker who happened to learn my URL.
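For what it's worth, if you do rely on a secret query-string token, the endpoint should at least compare it in constant time so the check doesn't leak timing information to an attacker probing the URL. A minimal sketch in Python (the token value is a placeholder, not anything from the docs):

```python
import hmac

# Hypothetical shared secret, configured out of band per deployment.
EXPECTED_TOKEN = "application-secret"

def is_authentic(request_token: str) -> bool:
    """Check the push request's token against the shared secret.

    hmac.compare_digest runs in time independent of where the strings
    first differ, unlike a plain == comparison.
    """
    return hmac.compare_digest(request_token, EXPECTED_TOKEN)
```

Even with a constant-time check, this is bearer-token security: anyone who sees the URL (logs, proxies, referrers) can replay it, which is the weakness the parent comment is pointing at.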


PubSub is not a product I work on, so I apologize for the lack of detail:

- PubSub is used by Google internally to power everything from Android notifications to Hangouts messages. So it's certainly proven.

- A lot of your questions are answered in docs:

https://cloud.google.com/pubsub/

https://cloud.google.com/pubsub/docs

You can always reach out to me, and I can get you in touch with a PubSub SME.


I didn't see anything in the docs that touches on those subjects in detail (I did skim the docs for sections and pages that might answer my questions before I posted), but please point me to the page that does if you know of one; I'd be interested to read it. I trust that your perceptions and information are accurate, but citeable, referenceable information is also valuable.


I see this: https://cloud.google.com/pubsub/subscriber

In the "Delivery contract" section:

"For the most part Pub/Sub delivers each message once, and in the order in which it was published. However, once-only and in-order delivery are not guaranteed: it may happen that a message is delivered more than once, and out of order."

So it is at-least-once delivery as far as I see.
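In practice, at-least-once delivery means the receiving side should be idempotent, typically by deduplicating on a message ID. A minimal sketch (names and the in-memory `seen` set are illustrative; a real consumer would persist dedup state):

```python
def process_once(messages, handler, seen=None):
    """Apply handler to each message at most once, keyed by message id.

    'messages' is an iterable of (msg_id, payload) pairs; at-least-once
    delivery means the same msg_id may appear more than once.
    """
    seen = set() if seen is None else seen
    results = []
    for msg_id, payload in messages:
        if msg_id in seen:
            continue  # duplicate redelivery; already handled
        seen.add(msg_id)
        results.append(handler(payload))
    return results

# The redelivered "m1" is processed only once.
print(process_once([("m1", "a"), ("m2", "b"), ("m1", "a")], str.upper))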


If you need 10,000 cores for 5 seconds, that's awesome. My post was aimed at the 95% who don't.


If you have a SQL query that takes 50,000 core-seconds, it's probably more useful to execute that query using 10,000 cores in 5 seconds rather than 10 cores for 5,000 seconds, especially if the cost is the same. Even better if you never have to spin up a VM or worry about scale. This benefit is tangible and applicable to anyone who runs SQL. The reason this isn't prevalent is that it's economically and technologically prohibitive. BigQuery tips that scale in the other direction.
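The arithmetic behind that trade is simple. A toy sketch, assuming perfectly linear scaling (real query engines only approximate this):

```python
def wall_clock_seconds(core_seconds: float, cores: int) -> float:
    """Wall-clock time for a job of fixed total work spread across cores.

    Assumes perfect linear scaling; with per-second (or finer) billing,
    total cost core_seconds * price_per_core_second is the same either way.
    """
    return core_seconds / cores

# Same 50,000 core-second query, two ways:
print(wall_clock_seconds(50_000, 10_000))  # wide and fast
print(wall_clock_seconds(50_000, 10))      # narrow and slow
```

The point is that when billing tracks core-seconds rather than provisioned instances, the wide-and-fast execution costs the same as the narrow-and-slow one, so there's no economic reason to wait.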

Point is, higher-level cloud-native services unlock very interesting use cases that are applicable for both small-scale startups and large companies, use cases that are impossible with just VMs.


I'm not really disagreeing (much), but very few workloads fit that description. More common are simple problems so overengineered that they sprawl across two Amazon availability zones when a straightforward implementation could serve the whole customer base off a $20-a-month VPS. This is more depressingly common than you think. Also depressingly common is a 50,000 CPU-second operation that could be a 1 CPU-second operation with a few indexes and a smarter algorithm. AWS adds a lot of carbon to the atmosphere cranking through crap code. Trust me, I've seen it.

What Amazon and kin have done is offer developers a sexy new way of overengineering. The AWS stack is the new Java OOP design patterns book. Yes, there is occasionally a time when an AbstractSingletonFactory is a good thing, but I guarantee you most of those you see in the wild are not those times.

The real genius was to build a jungle gym for sophomore programmers to indulge their need to develop carpal tunnel syndrome where everything bills by the instance, hour, and transaction. If Sun had found a way to bill for every interface implemented and every use of the singleton pattern they would have been the ones buying Oracle.


Likewise, but I think you're getting into the philosophical, not the practical. You may choose to live in a single-CPU world for your database, but you're simply disqualifying yourself from a whole lot of interesting use cases. Index+algo only solves a sliver of analytic use cases. And, ultimately, I'm afraid you're creating a world where you cannot effectively understand the shape of your data and you cannot effectively test your hypotheses, so you go with gut feel. And, perhaps more importantly, you cannot create software that learns from its data.

Your argument can be summarized as this: do not give people incredible computing capacity at never-before-seen economic efficiency, because they will use it inefficiently. I'm afraid this argument gets made every time the world gets disrupted technologically (horse vs. car, anyone?).

Edit: I may argue that if carbon footprint is your concern, then economies of scale plus power efficiency should tilt the scale towards cloud, no? AWS is certainly on the dirtier side, but there are other, greener clouds.


I'm not saying what you think I am saying. The thread was about how the cloud is immensely profitable, and I'm saying that a good chunk of that is built on waste and monetization of programmers' naive tendencies to overcomplicate problems.

I am not arguing that there are no great use cases for these systems. But I would be willing to bet that those are less than half the total load.

It's like big trucks. How many people who drive big trucks actually need big trucks? Personally I like my company's Prius of an infrastructure. :) And of course we've architected it so it can be a fleet or an armada of Priuses if need be, with maybe just a bit of work but if we get there I will be happy to have that problem.


If availability and scale are not important, and you can tolerate having to engage a human in the event of a hardware failure, then sure a $20 VPS might suffice. You could also run a single virtual machine in one zone in the cloud.

But I think you might underestimate the amount of use-cases that do legitimately benefit from and desire a greater degree of reliability and automation. When one of my machines dies, I don't want to be notified, and I don't want to have to do anything about it. I want a new virtual machine to come online with the same software and pick up the slack. Similarly, as my system's traffic grows over time, I want to be able to gradually add machines to a fleet, to handle my scaling problem, or even instruct the system to do that for me.

Plenty of use-cases may not require this, but I'm not convinced that the majority of systems in the cloud do not. Every system benefits from reliability, and it's great to get it cheaply and in a hands-off way. In the cloud, I can build a system where my virtual machine runs on a virtual disk, and if there's a hardware failure, my VM gets restarted on another physical machine and keeps on trucking without my involvement. As an engineer and scientist, I can accomplish a lot more with a foundation like this. I can build systems that require nearly zero maintenance and management to keep running, even over long time scales.

I don't think I disagree with you that some people overengineer systems, but I think I disagree with you about how much effort it requires to achieve solid availability and a high level of automation. It's not a lot of effort or cost, and it's a huge advantage. Once I build a system I never want to touch it again.

A certain segment of users are adopting these technologies because they want to be prepared to scale. One of the advantages of "big data" products even for small use-cases is: all successful use-cases grow over time. If you plan for success and growth, then you may exceed the capabilities of a traditional technology. If you use a "big" technology from the beginning, then you can be confident that you'll be able to solve increases in demand by scaling up, rather than by rearchitecting.

As these platforms mature and become easier to use, the scales begin to tip, and they no longer require more engineering time than the alternatives; a strong hosted platform actually requires less time in total, especially when you consider setup and maintenance. Many of these technologies do an excellent job "scaling down" for simple use-cases too. While they have been difficult to use, they're getting easier. For example, MapReduce-paradigm technologies are becoming fairly easy with Apache Hive, and fast with Spark. They're becoming easier to set up due to hosted variants like AWS's ElasticMapReduce or Google Cloud Dataproc, etc.


I'm not saying the capability shouldn't be available, but I wish more people would stop to ask, "do I need this?"

Since I do data analysis and machine learning (sometimes), a common one I see is people using "big-data analytics" stacks when they don't have anything remotely in the range of a big-data problem. Everyone really seems to want to have a big-data problem, but then it turns out they have like, single-digit gigabytes of data (or less). And they want Hadoop on top of some infrastructure to scale a fleet of AWS VMs, so they can plot some basic analytics charts on a few gigs of data? They would be better served by the revolutionary new big-data solution "R on a laptop". But somehow many people have convinced themselves they really need Hadoop on AWS.

Though I haven't used it yet, BigQuery does seem interesting in comparison, because it at least seems like it doesn't hurt you much. The Hadoop-on-VMs thing is objectionable rather than merely unnecessary, because you get this complex, over-architected system for what is not a complex problem. BigQuery at least seems like, at worst you end up with basically a cloud-hosted RDBMS with scaling features you don't need, which isn't the end of the world as long as the pricing works for you.

edit: Just to clarify, I'm not the person you were replying to, just someone who also has opinions on this. :)


I agree with you :)



