100 million reviews/year is only about 3 reviews per second on average. Sure, they seem to do more than just that, like voting, comments, etc. But it still seems like something an old-school stack could handle on a single large instance.
Reading between the lines, it seems the problem wasn't scaling but programmer productivity. Smaller code bases are often easier to work with, so I guess they solved that by dividing it up into many small services. The blog could use a more detailed description of the problem they are actually solving.
Whenever I see headlines like these, "X billion messages per day/hour/second in some web service Y", I wonder what the hell that service is doing that it generates so many messages. I.e., I could understand Facebook, with its billion+ users and built-in messaging platform, generating a billion "messages" a day. But a mostly read-based service like Yelp?
But I finally realized - those messages are probably mostly tracking, ads, more tracking, some infrastructure work and even more ads & tracking. The sausage machine that turns people into money.
Indeed. I'm working in adtech and we get four billion requests for bids per day (more on Black Friday and the holiday season). Add in the ad serving and processing of all the results and it blows up fast.
This. A lot of people look at apps like Instagram and Facebook, and think that they're simple to build and manage, when they don't realize that the part of the app the consumer directly interacts with on their screen is the tip of the iceberg when it comes to the entire business.
I've come to realise that no matter what your data engine choice is (Storm, Spark, Flink, DataFlow, Ruby or Bash scripts, whatever), it is extremely beneficial to persist incoming data first to a distributed log.
Even if all you want to do is accept events and immediately persist them to some data store (Cassandra, MySQL, text files, etc.), you're far better off first publishing them to a distributed log, and then consuming from it and persisting to, say, a MySQL or Cassandra cluster.
You decouple the data flow from processing, which means you don't necessarily need to attach your firehose directly to the processing systems. You can just accept events as they come and deal with them later, if ever. In fact, many distinct processing engines can each, asynchronously and independently, access and scan those previously collected event streams, and they can do so at whatever pace makes sense (i.e., depending on how fast they can process messages/events).
Using Kafka as a core infrastructure technology and persisting any incoming and generated messages/events to it should be the default strategy. It's also extremely unlikely you will hit service capacity limits: with Kafka, or a Kafka-like service, publishing is mostly appending data to a file and consuming is mostly streaming (sequential scans) from files, all very fast, low-overhead operations.
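For anyone who hasn't built this before, here is a minimal sketch of that log-first pattern. It assumes the kafka-python client, a broker on localhost, and a made-up "events" topic, with sqlite standing in for whatever store you actually use; none of these names come from the article:

    import json
    import sqlite3

    from kafka import KafkaConsumer, KafkaProducer

    # Ingest side: accept events and append them to the log. Nothing
    # downstream needs to be running for this to succeed.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("events", {"type": "review_created", "review_id": 123})
    producer.flush()

    # Persistence side: an independent consumer group drains the same topic
    # at its own pace and writes to whatever store you like.
    db = sqlite3.connect("events.db")
    db.execute("CREATE TABLE IF NOT EXISTS events (payload TEXT)")

    consumer = KafkaConsumer(
        "events",
        bootstrap_servers="localhost:9092",
        group_id="persist-to-db",
        auto_offset_reset="earliest",
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
        consumer_timeout_ms=5000,  # stop iterating once the topic is drained
    )
    for msg in consumer:
        db.execute("INSERT INTO events (payload) VALUES (?)", (json.dumps(msg.value),))
    db.commit()

Because each consumer group tracks its own offsets, you can later point a second group (search indexer, analytics job, whatever) at the same topic and replay history without disturbing the first one.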
Shameless plug: if you want a Kafka-like service with, currently, fewer features but better performance, far fewer requirements, and a standalone operation mode (no need for ZooKeeper), you may want to check out https://github.com/phaistos-networks/TANK
Anyway, I can't recommend Kafka and investing in logs enough. Also, Kafka Streams is very elegant, and likely based on Google's DataFlow design and semantics. If you are using Kafka or plan to use it, you may want to evaluate it and adopt it over heavier-footprint, more complicated systems (e.g. Storm or Spark).
Most of the stream processing in the Data Pipeline happens inside of an internal project called PaaStorm, which is Storm-like. It was built to take advantage of our platform as a service (http://engineeringblog.yelp.com/2015/11/introducing-paasta-a...), which handles process scheduling really well. Architecturally, it's pretty similar to Samza, with distributed processes communicating using Kafka.
We do use Spark streaming, and are starting to use Kafka Streams and Data Flow, where they're a better fit. I'm personally most excited about Beam/Flink. We'll probably end up replacing the PaaStorm internals with some other tool, when one with good python support matures. Beam's event-time handling and windowing seem really promising at this point. https://www.oreilly.com/ideas/the-world-beyond-batch-streami... is a great overview of the different concerns for stream processing.
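To make the "distributed processes communicating using Kafka" shape above concrete, here is a rough consume-transform-produce loop. This is not PaaStorm code; the topic names and the toy transform are hypothetical, and it assumes the kafka-python client:

    import json

    from kafka import KafkaConsumer, KafkaProducer

    consumer = KafkaConsumer(
        "reviews.raw",
        bootstrap_servers="localhost:9092",
        group_id="review-enricher",  # run more copies of this process to scale out
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    )
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    for msg in consumer:
        review = msg.value
        review["word_count"] = len(review.get("text", "").split())  # toy transform
        producer.send("reviews.enriched", review)

Scaling this is just running more copies of the process, since Kafka's consumer groups split partitions among them, which is presumably why a platform that handles process scheduling well is such a natural fit.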
Hi Justin! Thanks for sharing, very interesting stuff.
How do you scale Kafka to handle the massive amount of traffic (and storage) that you seem to generate daily?
With services talking among themselves via HTTP, there is a lot of resilience built in. Do you have anything in place to avoid Kafka becoming a single point of failure? It must have become the most critical piece of your infra.
Scaling Kafka is pretty simple; the operations document contains most of what you'll need to get started [1].
We push 500k documents a second through over ten 6-core/24GB RAM hosts pretty uneventfully. The only real pointer is to size ZK appropriately and make sure you leave lots of memory for the filesystem cache.
I'm not actually a good resource on scaling Kafka. Our distributed systems teams do a great job of providing reliable infrastructure and scaling it up, so on the application side we are mostly able to treat it like a black box that just works.
In general, I do think poooogles covered it well. Kafka is designed to scale. The one thing we do that you might not expect is splitting data across clusters, depending on what guarantees we want to provide.
We also tend to make sure all data is replicated using geographic distribution to avoid SPOF issues. We do use the min ISR settings and different required ack levels, depending on how we want to trade off durability and availability for an application.
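For anyone unfamiliar with those knobs, here is a hedged sketch of what the min ISR / acks trade-off looks like with the kafka-python client; the broker address, topic name, and exact replication numbers are only illustrative, not what Yelp actually runs:

    from kafka import KafkaProducer
    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
    admin.create_topics([
        NewTopic(
            name="critical-events",
            num_partitions=12,
            replication_factor=3,
            # Writes are rejected if fewer than 2 replicas are in sync:
            # trades availability for durability.
            topic_configs={"min.insync.replicas": "2"},
        )
    ])

    # acks="all": the broker acknowledges only once all in-sync replicas have
    # the write. acks=1 acknowledges after the leader alone, trading some
    # durability for latency and availability.
    durable_producer = KafkaProducer(bootstrap_servers="localhost:9092", acks="all")
    fast_producer = KafkaProducer(bootstrap_servers="localhost:9092", acks=1)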
"Yelp passed 100 million reviews in March 2016. Imagine asking two questions. First, “Can I pull the review information from your service every day?” Now rephrase it, “I want to make more than 1,000 requests per second to your service, every second, forever. Can I do that?” At scale, with more than 86 million objects, these are the same thing."
Who is making 1000 requests per second to retrieve all of the 100 million reviews? Other services within Yelp? Why would they pull all reviews every time, and not just the reviews they haven't already processed?
How is 1000 requests per second the same thing as pulling all review information every day?
This section is really confusing and needs some clearer explanation.
I think it's saying that a single iteration over the entire set would translate to 1000 requests per second for a day (if done naively as one request per object). It's really talking about the N+1 problem.
I think the unstated assumption is that there's some sort of processing that occurs on each review every day. So with 100 million reviews every 24 hours, that's just over 1157 requests per second.
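For what it's worth, that figure falls straight out of the quote's numbers; back-of-envelope, assuming one request per review and a full pass every 24 hours:

    reviews = 100_000_000
    seconds_per_day = 24 * 60 * 60      # 86,400
    print(reviews / seconds_per_day)    # ~1157 requests per second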
It has to be internal, because they are incredibly protective of their API (hell, they sold their firehose to just one darn company). My guess is it's NLP-type processing. Things like review highlights, recommendations, etc. are probably 99% of their load.
The user-centric stuff like submitting reviews and comments is trivial; as another user said, a large single instance is enough.
This is a nice blog. It would also be nice to read one where you explain how you "judge" whether a review is fake or not; I have heard so many times from small business owners how legitimate reviews get hidden/deleted from their page. I wonder if it's an algorithm or a "humanized" process with lots of Mechanical Turks :) (Not in detail, of course; we don't want people to game your algos.)
It's quite easy to get to that number of lines once you have code for integrations with other products, monitoring, devops, patched versions of broken or unmaintained libraries, a growing number of tests etc. We run a relatively small product and we already have 30k lines with just two people.
Displaying a list and showing a map is one feature of Yelp, but you could also say showing "hello world" in 3D text with OpenGL is a 3D graphics engine. In the real world they are both complicated things.
You realize that just because a tutorial purports to build a Yelp clone doesn't mean it will actually teach you to build the entirety of the Yelp app, right?
One of my big questions is how you arranged security. Kerberos? What about maintaining service users and permissions to send/receive messages, or some form of web of trust? With message-based comms, does encryption for the intended users become possible?
As far as I can tell, those accusations have not been substantiated. I'm ready to believe it if you have evidence; a recording of a Yelp salesperson using extortionate language would do.