The Case for Building Scalable Stateful Services (highscalability.com)
139 points by aarkay on Oct 13, 2015 | 16 comments



From a talk by Caitie McCaffrey of Twitter, at the Strange Loop conference.

Video: https://www.youtube.com/watch?v=H0i_bXKwujQ

Slides: https://speakerdeck.com/caitiem20/building-scalable-stateful...


Stateful: you have one web server!

Stateless: you grow to require tens of servers or more. Horizontal scalability is much cheaper than vertical, so you look to software solutions to help slow expenses and move to NoSQL clustered DBs like Riak, Cassandra, Hadoop, etc. 1-2 engineers can still run the whole show; cloud services, SaaS, and PaaS are employed.

Stateful: you run thousands of servers, having since brought many services back in-house. Many if not most run on your own metal, with dedicated staff. To rein in power bills and space requirements, you look once again to software solutions.

If you stay at the same growing company long enough, what's old will be new again.


I think this would depend on your specific use case.

If your app is basically managing data (managing bookmarks, note taking, etc.), then a stateless architecture makes sense.

If your app is reacting to data in real time to do some work (a monitoring system, a chat system, sending alerts on conditions), then using a stateful service makes total sense.


For some use cases it does make sense, but even realtime/push systems can be made relatively stateless by isolating the stateful parts (e.g. into a message broker).

The way I see it, nothing we design is truly stateless. There are stateful pieces everywhere (such as TCP, beneath HTTP). What we can do is minimize the effects that the stateful pieces may have on other parts of the system.
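
To make that concrete, here's a minimal TypeScript sketch of the isolation: the queue is the only stateful piece (standing in for a real broker), and the worker is a pure function of each message, so any number of identical instances could drain it. All names are illustrative, not any particular library's API.

    // A trivial "broker": the only stateful piece in this sketch.
    const queue: string[] = [];

    function publish(msg: string): void {
      queue.push(msg);
    }

    // A stateless worker: it holds nothing between messages, so any
    // number of identical instances can drain the same queue.
    function handle(msg: string): string {
      return `processed: ${msg}`; // pure function of its input
    }

    function drain(): string[] {
      const results: string[] = [];
      while (queue.length > 0) {
        results.push(handle(queue.shift()!));
      }
      return results;
    }

    publish("event-1");
    publish("event-2");
    console.log(drain()); // ["processed: event-1", "processed: event-2"]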


Quite frankly, I never understood the craze most companies had for stateless services. For the last 7 years we moved away from stateful to stateless without any afterthought, just because it was what everyone was doing, even though sometimes it just didn't make sense to design something stateless. In terms of pure design this was always awkward. I understand why it makes sense in some cases; I just don't understand why it became the new black. If someone has an answer to that, I'll be happy to hear it.


Stateless services have some huge advantages, analogous to pure functional programming, especially when you truly commit to purity. The surface area for verification and optimization is far more manageable without having to worry about an internal state space. There's a dead-simple horizontal scalability story, and reliability isn't much of a concern due to the fungibility of individual instances.

Any real service must, of course, have state somewhere. I've had a lot of success being very intentional about factoring state into the smallest possible interfaces, ideally contained in well understood products like databases.
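
As a sketch of what that factoring can look like (illustrative TypeScript, with a hypothetical KeyValueStore interface): the handler is pure apart from calls through one narrow, swappable interface, which in production would be backed by a well-understood database.

    // The only stateful surface in the service: one narrow interface,
    // in practice backed by a well-understood database.
    interface KeyValueStore {
      get(key: string): Promise<string | undefined>;
      put(key: string, value: string): Promise<void>;
    }

    // The handler is pure apart from calls through that interface, so
    // instances are fungible and can be scaled horizontally.
    async function handleRequest(
      store: KeyValueStore,
      userId: string,
      note: string
    ): Promise<string> {
      await store.put(`note:${userId}`, note);
      return `saved note for ${userId}`;
    }

    // In-memory stand-in for tests; production would use a real database.
    const memStore: KeyValueStore = (() => {
      const m = new Map<string, string>();
      return {
        get: async (k) => m.get(k),
        put: async (k, v) => { m.set(k, v); },
      };
    })();

    handleRequest(memStore, "u42", "buy milk").then(console.log);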

Much like pure functional programming, it's not necessarily easy to go truly stateless. I've seen plenty of "RESTful" architectures get bogged down in the complications of seemingly harmless stateful optimizations like caching layers, often added to cover up the poor performance of runtimes like Ruby and Python. It's very tempting to take a couple of shortcuts and end up with the worst of both worlds.

Of course, stateless isn't always the best way to go, especially once you start getting to the exotic territory inhabited by the examples in the article. I think for 99% of us, those are problems we'll never experience.

I'd be interested in seeing a case where statefulness really let someone down, who wasn't at Facebook/Twitter/Google scale.


I think it's tied to the rise of RESTful services. RESTful APIs lead to people getting used to thinking and working in stateless architectures, and it became the thing to do. If your API is stateless, why make the rest stateful? And stateful APIs can be... difficult to deal with, having to handle a lot more in your HTTP client than you do with a simple REST endpoint.

I also think open source has had an impact here. A lot of open source stuff is designed to have a simplistic architecture, since the users tend to be unwilling to start up 4 servers to run an application (whether it's a business that doesn't like the complexity, or a user who lacks the resources). That drives open source toward stateless services, and open source often defines what's "popular".


Rule of thumb: if you design a service with many servers, you have the following options:

1. Have a stateless service. You can update it frequently with no downtime... Relatively easy.

2. Use some off-the-shelf service that manages state for you and doesn't need frequent updates (e.g. ElastiCache, Cassandra, ...). Relatively easy.

3. Write your own stateful service. For some applications it is a must (e.g. your own search service, data processing, a game collision engine). You need to take care of state transitions during restarts/upgrades, and client routing is also tricky (see the sketch after this list). Hard, but sometimes there is no way around it if you want to build efficient infrastructure.

4. Don't think about state at all, and you may end up crying after your code hits prod.
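
To illustrate why client routing in option 3 is tricky, here's a minimal consistent-hash ring in TypeScript. The FNV-1a hash and node names are just stand-ins, not taken from any real system; the point is that every client can independently compute which node owns a key, and adding a node only moves a small fraction of keys.

    // FNV-1a stands in for a production hash function.
    function fnv1a(s: string): number {
      let h = 0x811c9dc5;
      for (let i = 0; i < s.length; i++) {
        h ^= s.charCodeAt(i);
        h = Math.imul(h, 0x01000193) >>> 0;
      }
      return h >>> 0;
    }

    class HashRing {
      private points: { hash: number; node: string }[] = [];

      constructor(nodes: string[], replicas = 100) {
        // Each node gets many virtual points to spread load evenly.
        for (const node of nodes) {
          for (let i = 0; i < replicas; i++) {
            this.points.push({ hash: fnv1a(`${node}#${i}`), node });
          }
        }
        this.points.sort((a, b) => a.hash - b.hash);
      }

      // Route a key to the first virtual point at or after its hash.
      route(key: string): string {
        const h = fnv1a(key);
        const p = this.points.find((pt) => pt.hash >= h) ?? this.points[0];
        return p.node;
      }
    }

    const ring = new HashRing(["node-a", "node-b", "node-c"]);
    console.log(ring.route("session-1234")); // same node on every call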



Hasn't Erlang & OTP already solved this?


The Virtual Actor model implemented in Orleans differs from what OTP and Akka offer in that Virtual Actors have managed lifecycles and are never explicitly created or destroyed. Instead, they are activated on demand (fetching state from storage if necessary) and deactivated when idle.

This helps you reason about your system. You can never know whether a node will fail at any point in time (in any system), but with Orleans you can be sure that an actor will be reactivated on a surviving node in the event of a failure. In other words, you can continually message actor X without worrying about ever having to reinstantiate it on another node due to failure.

By default, Orleans maintains an eventually consistent mapping of actors to nodes (called silos) and relies on the storage layer to give strong consistency. Most of the default storage providers offer strong consistency.
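
To make the activate-on-demand idea concrete, here's a toy TypeScript sketch (Orleans itself is C#/.NET, and these names are hypothetical, not its API): callers only ever address an actor ID, and the runtime activates an instance, loading persisted state, whenever a message arrives.

    // Sketch of the virtual-actor lifecycle. Hypothetical names;
    // the real Orleans API is C#/.NET.
    interface ActorState { count: number }

    class VirtualActor {
      constructor(public id: string, public state: ActorState) {}
      increment(): number { return ++this.state.count; }
    }

    const storage = new Map<string, ActorState>(); // durable store stand-in
    const active = new Map<string, VirtualActor>(); // current activations

    function getActor(id: string): VirtualActor {
      let actor = active.get(id);
      if (!actor) {
        // Activate on first message, fetching persisted state if any.
        const state = storage.get(id) ?? { count: 0 };
        actor = new VirtualActor(id, state);
        active.set(id, actor);
      }
      return actor;
    }

    function deactivate(id: string): void {
      const actor = active.get(id);
      if (actor) {
        storage.set(id, actor.state); // persist, then drop the activation
        active.delete(id);
      }
    }

    getActor("player-7").increment();
    deactivate("player-7"); // idle: state persisted, instance dropped
    console.log(getActor("player-7").increment()); // 2 — "never destroyed"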

By the way, Service Fabric includes an implementation of Virtual Actors which differs in a couple of ways:

1. The actors are placed using consistent hashing. Actor IDs are mapped to a range of partitions, the replicas of which can be moved between nodes in case of failure.

2. Actor state is physically stored on the actor's node by default. Because Service Fabric uses distributed consensus to implement stateful services, it can persist each actor's state in the partition to which it belongs. The state is replicated to a quorum of replicas on each write.


Akka has added something along those lines with Akka Persistence and the new cluster sharding features. Actors can passivate (gracefully stop), and the cluster manager will route messages that would've gone to that actor to another actor in the cluster. Because the actors are persistent, they can be spun back up in their previous state on any cluster member.


Akka Persistence is different. It doesn't give you managed lifecycles as in the Virtual Actor model, so you aren't insulated from failures and you still have to consider when to create/recreate an actor. It gives you Event Sourcing / Command Sourcing (depending on how you use it).

ES is a work-in-progress for Orleans, but many of us are using our own ES systems.
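
For anyone unfamiliar with ES, here's the idea in miniature (generic TypeScript, not Akka's or Orleans' API): you persist events rather than state, and recovery is a fold over the journal, which is what lets a persistent actor be spun back up anywhere.

    // Event sourcing in miniature: persist events, not state, and
    // rebuild state by replaying the log.
    type Event = { type: "deposited" | "withdrawn"; amount: number };

    const journal: Event[] = []; // durable event log stand-in

    function apply(balance: number, e: Event): number {
      return e.type === "deposited" ? balance + e.amount : balance - e.amount;
    }

    function persist(e: Event): void {
      journal.push(e);
    }

    // Recovery: fold the journal to reconstruct current state, which is
    // how a persistent actor can be "spun back up" on another node.
    function recover(): number {
      return journal.reduce(apply, 0);
    }

    persist({ type: "deposited", amount: 100 });
    persist({ type: "withdrawn", amount: 30 });
    console.log(recover()); // 70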

For an implementation of Virtual Actors on the JVM, check out Orbit from Electronic Arts / BioWare: http://orbit.bioware.com/


So you reinvented the vnode from riak_core?


The first Orleans paper is from 2009 or 2010, so I don't think they are reinventing riak_core.


I think that, in general, anything that has no persistence can be shared-nothing. State in shared-nothing consists basically of a cache that is kept up to date by subscribing to changes in the data store, with only a slight lag.
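
In sketch form (illustrative TypeScript, with the change feed standing in for whatever pub/sub mechanism the data store provides): the node never writes to its cache directly; it only applies changes streamed from the store.

    // Shared-nothing node: its only "state" is a local cache refreshed
    // by a change feed from the data store. Illustrative names only.
    type Change = { key: string; value: string };

    const localCache = new Map<string, string>();

    // Stand-in for a data-store change feed (e.g. a pub/sub channel).
    function onChange(change: Change): void {
      localCache.set(change.key, change.value); // slight lag, no local writes
    }

    function read(key: string): string | undefined {
      return localCache.get(key); // served entirely from local memory
    }

    onChange({ key: "user:1", value: "alice" });
    console.log(read("user:1")); // "alice"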

Shared-nothing can include environments like user agents, proxies and web servers.

As for the persistence layer / data store, it should support horizontal partitioning. Especially useful is range-based partitioning on a primary key whose prefix contains a geohash, because then you can route requests to the closest region on AWS or some other host.
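
Roughly like this (illustrative TypeScript; the prefix-to-region table is made up, though the geohash cells are real): because nearby locations share a key prefix, a range-partitioned store keeps them on the same shard, and the prefix alone is enough to pick a region.

    // Routing by geohash prefix: keys are "<geohash>:<id>", and a range
    // of geohash prefixes maps to the nearest region. Illustrative table.
    const regionByPrefix: Record<string, string> = {
      "9q": "us-west-1",   // roughly California
      "dr": "us-east-1",   // roughly the US northeast
      "u1": "eu-central-1" // roughly northwestern Europe
    };

    function makeKey(geohash: string, id: string): string {
      return `${geohash}:${id}`; // prefix keeps nearby rows in one shard
    }

    function routeRegion(key: string): string {
      const prefix = key.slice(0, 2);
      return regionByPrefix[prefix] ?? "us-east-1"; // fallback region
    }

    const key = makeKey("9q8yy", "user-42"); // 9q8yy ≈ San Francisco
    console.log(routeRegion(key)); // "us-west-1"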

If one of your shards gets too large, you can split it into two or more shards. All the monitoring and splitting can be automated with devops tooling in the cloud to provision machines and so on, so you don't need to wake up at 3am.

With this setup you can reliably grow your data store to an arbitrary size, and literally have only O(log n) growth in latency for any request. However there is one more issue to solve:

When you need to perform database queries that return a cross product or join, do you compute the result on the fly per request (e.g. with MapReduce), or do you precompute it whenever a row is inserted into one of the joined tables? The second approach can run in the background and trades memory for time, making the queries O(1). This can be really useful for queries that need answers in realtime.
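
A toy version of the second approach in TypeScript (table shapes are made up): the "join" is materialized at write time, so the realtime query is a single O(1) lookup.

    // The memory-time tradeoff: maintain the join result at write time
    // so reads are O(1) lookups. Tables and shapes are illustrative.
    type User = { id: string; name: string };
    type Order = { id: string; userId: string; total: number };

    const users = new Map<string, User>();
    const ordersByUser = new Map<string, Order[]>(); // precomputed "join"

    function insertUser(u: User): void {
      users.set(u.id, u);
      if (!ordersByUser.has(u.id)) ordersByUser.set(u.id, []);
    }

    function insertOrder(o: Order): void {
      // Update the materialized view as part of the write path
      // (requires the user row to exist first in this sketch).
      ordersByUser.get(o.userId)?.push(o);
    }

    // The realtime query: no scan, no join, just one O(1) lookup.
    function ordersFor(userId: string): Order[] {
      return ordersByUser.get(userId) ?? [];
    }

    insertUser({ id: "u1", name: "alice" });
    insertOrder({ id: "o1", userId: "u1", total: 40 });
    console.log(ordersFor("u1")); // [{ id: "o1", userId: "u1", total: 40 }]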

I would recommend using evented servers (e.g. Node.js) for queries that hit multiple shards at the same time, or MapReduce-type things. Evented I/O lets you wait only as long as the longest query.
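
For example, with promises in TypeScript/Node.js (the shard latency here is simulated): Promise.all resolves once the slowest shard answers, so the total wait is roughly the longest single query rather than the sum.

    // Fan out to several shards concurrently; Promise.all resolves when
    // the slowest shard answers, so total wait ≈ the longest query.
    async function queryShard(shard: string, q: string): Promise<string> {
      const delay = 50 + Math.random() * 100; // simulated shard latency
      await new Promise((r) => setTimeout(r, delay));
      return `${shard}: results for "${q}"`;
    }

    async function scatterGather(q: string): Promise<string[]> {
      const shards = ["shard-1", "shard-2", "shard-3"];
      return Promise.all(shards.map((s) => queryShard(s, q)));
    }

    scatterGather("top posts").then(console.log);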

Finally, I don't think things like socket.io will be easy to partition horizontally, e.g. across a node cluster, so you'll probably want server affinity on a per-room basis.
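
A bare-bones version of that affinity in TypeScript (plain sketch, not the socket.io API): hash the room name to a server so every member of a room lands on the same process, and room state never has to span the cluster.

    // Per-room server affinity: every member of a room hashes to the
    // same server. Server names are illustrative.
    const servers = ["ws-1.example.com", "ws-2.example.com", "ws-3.example.com"];

    function hash(s: string): number {
      let h = 0;
      for (let i = 0; i < s.length; i++) {
        h = (h * 31 + s.charCodeAt(i)) >>> 0;
      }
      return h;
    }

    function serverForRoom(room: string): string {
      return servers[hash(room) % servers.length];
    }

    console.log(serverForRoom("lobby")); // all "lobby" clients go here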



