Real World Microservices: When Services Stop Playing Well and Start Getting Real (buoyant.io)
193 points by adamnemecek on May 7, 2016 | 37 comments



From my own experience building a platform of products on top of microservices, I find that the most difficult part of this architecture (and the one that nobody ever talks about) is how to share data between microservices via a message bus instead of direct requests. If you can get this down, the rest, in my opinion, is pure bliss and a piece of cake.


Message buses and queues are far, far easier to manage than API endpoints across a bunch of loci of control, so this may be why it's not discussed as much as, say, how the lifecycle of REST services works. Topics and queues with different segmentation features are much more powerful and fine-grained in control compared to a service that's tied to HTTP transport than, say, a REST or SOAP API. For object serialization / marshaling, you suddenly get to use a lot more options, like Protocol Buffers, Cap'n Proto, Thrift, or Avro. The right advice comes down to what your bus / queue is (each one offers vastly different features and best practices) as well as how you want to scale out and up. But perhaps a shared set of terminology / jargon for the problem is worth pursuing as a community.
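
For example, here's a rough sketch of publishing a serialized event to Kafka with the Sarama Go client; the topic name and event type are made up, and JSON stands in for whichever format (Protobuf, Avro, etc.) you settle on:

    package main

    import (
        "encoding/json"
        "log"

        "github.com/Shopify/sarama"
    )

    // UserUpdated is a hypothetical event type shared between services.
    type UserUpdated struct {
        ID    string `json:"id"`
        Email string `json:"email"`
    }

    func main() {
        cfg := sarama.NewConfig()
        cfg.Producer.Return.Successes = true // required for SyncProducer

        producer, err := sarama.NewSyncProducer([]string{"localhost:9092"}, cfg)
        if err != nil {
            log.Fatal(err)
        }
        defer producer.Close()

        payload, _ := json.Marshal(UserUpdated{ID: "42", Email: "a@example.com"})

        // Publish to a topic; consumers in other services subscribe independently.
        _, _, err = producer.SendMessage(&sarama.ProducerMessage{
            Topic: "user-events",
            Value: sarama.ByteEncoder(payload),
        })
        if err != nil {
            log.Fatal(err)
        }
    }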

The question I really have is how people have managed to upgrade their message bus while keeping availability high, with rollback contingencies, when upgrading, say, RabbitMQ or Kafka across all their connected services. I don't hear about that nearly as much as about rolling restarts / upgrades of nodes for a service endpoint using feedback from monitoring and metrics.


Can't answer for RabbitMQ or Kafka, but if you're using a managed service like AWS SQS it's a non-issue. And if you need to test or make changes to a queue, it usually just means creating another queue in parallel while the old one keeps working, and switching over once ready. The guys from Shazam gave a great talk about this at the AWS re:Invent conference back in 2013: https://www.youtube.com/watch?v=k_54Jcmi4zM
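
Roughly what the switchover looks like with the aws-sdk-go SQS client (the queue name is made up): stand up the new queue next to the old one, point consumers at both, then flip producers and drain the old queue.

    package main

    import (
        "log"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/sqs"
    )

    func main() {
        svc := sqs.New(session.Must(session.NewSession()))

        // 1. Create the replacement queue alongside the existing one.
        out, err := svc.CreateQueue(&sqs.CreateQueueInput{
            QueueName: aws.String("orders-v2"),
        })
        if err != nil {
            log.Fatal(err)
        }
        log.Println("new queue:", aws.StringValue(out.QueueUrl))

        // 2. Deploy consumers that poll both queues, flip producers to the new
        //    URL via config, then drain and delete the old queue.
    }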


> Topics and queues with different segmentation features are much more powerful and fine-grained in control compared to a service that's tied to HTTP transport than, say, a REST or SOAP API.

"than" and beyond is an illogical statement. Like Y is Z than X.


Funky parse as I wrote it out, but you're correct. Despite the problem, I think the point being made is understandable. You can replace "than" with "like."


Event sourcing is the thing you're looking for, and it will probably be the focus of many teams building microservice platforms.
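
A toy sketch of the idea in Go (all names are made up): state is never stored directly, only derived by replaying an append-only log of events, which other services can also consume.

    package main

    import "fmt"

    type Event struct {
        Kind   string
        Amount int
    }

    type Account struct{ Balance int }

    // Apply folds a single event into the current state.
    func (a *Account) Apply(e Event) {
        switch e.Kind {
        case "deposited":
            a.Balance += e.Amount
        case "withdrawn":
            a.Balance -= e.Amount
        }
    }

    func main() {
        // The log is the source of truth; other services can subscribe to it.
        events := []Event{
            {"deposited", 100},
            {"withdrawn", 30},
        }

        var acct Account
        for _, e := range events {
            acct.Apply(e) // replay to rebuild state at any point in time
        }
        fmt.Println(acct.Balance) // 70
    }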


This. About 2 years ago, I started to notice the incredible usefulness of event sourcing and its applicability at various layers of the stack.

From the React/Flux frontend paradigm, to the "everything is a stream" philosophy of Unix systems, to Erlang actors, to event sourcing for application backends, to databases like Datomic and CouchDB.

This is why I'm so excited about Elm right now - it explicitly puts this at the core of every application, and gives you a really nice language and type system for encoding your domain logic and manipulating data.

Time to write a blog post, I guess.


I think the real network-level win comes from rigorous events, period.

Event sourcing, on the other hand, is an internal implementation detail of a particular service... Which happens to get you a good stream of events as a side-effect.


Have you tried Erlang? It's basically designed around solving that problem. Many people use RabbitMQ to solve that problem and it's no coincidence that RabbitMQ is written in Erlang. Elixir is possibly a similar option (don't know, haven't tried it).


Erlang's raison d'etre is not message passing but fault tolerance. From Joe Armstrong[1]:

> The work described in this thesis is the result of a research program started in 1981 to find better ways of programming Telecom applications. These applications are large programs which despite careful testing will probably contain many errors when the program is put into service. We assume that such programs do contain errors, and investigate methods for building reliable systems despite such errors. The research has resulted in the development of a new programming language (called Erlang), together with a design methodology, and set of libraries for building robust systems (called OTP).

[1]: http://ftp.nsysu.edu.tw/FreeBSD/ports/distfiles/erlang/armst...


I don't think you can talk about fault tolerance and message passing as different in Erlang. Sure, one is the how, the other is the why, but that's just semantics.


OpenStack actually does this really well, using RabbitMQ as the messaging tool.

https://wiki.openstack.org/wiki/Oslo/Messaging


You need to further define 'share data'. Usually it can be done by connecting multiple different services through a central data store, be it Redis/Cassandra/an RDBMS/HBase...

If what you mean by sharing data is dependency between calls to different services, e.g. A -> B -> C..., I would go with Storm and other stream processing frameworks, because that is what those frameworks are built for...


I'm simply talking about scenarios where one API needs to talk to another to get a resource it needs to act upon in some way. This can get very hairy very quickly, especially when each microservice ends up with copies of resources owned by other microservices that it now has to update asynchronously via pub/sub any time there is a change. This was the biggest challenge. And I think any team willing to switch to this type of architecture needs to look at it from this angle, because monitoring and troubleshooting such systems is quite an undertaking.


Redis. Redis all the things. The peace of mind of knowing your operations are somewhat atomic can't be overstated.


Wait, peace of mind knowing your operations are "somewhat" atomic? Am I misunderstanding something, or isn't that exactly the kind of thing that would normally be disturbing when it comes to distributed transactions (which microservices oftentimes end up forming)?


I think Redis definitely has a place as a pub/sub channel in a microservices architecture, and that in this capacity it solves the issue of sharing information among services. I'm not really seeing your point about atomic operations though. Would be curious to hear more about it.
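
Here's roughly how I'd picture the pub/sub part with the go-redis client (channel name and payload are made up): the service that owns a resource publishes a change event, and every service holding a copy subscribes and refreshes.

    package main

    import (
        "context"
        "fmt"

        "github.com/go-redis/redis/v8"
    )

    func main() {
        ctx := context.Background()
        rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

        // The service holding a copy subscribes to the owner's change channel.
        sub := rdb.Subscribe(ctx, "user.updated")
        defer sub.Close()
        if _, err := sub.Receive(ctx); err != nil { // wait until subscribed
            panic(err)
        }

        // The owning service publishes after every write to the resource.
        rdb.Publish(ctx, "user.updated", `{"id":"42","email":"a@example.com"}`)

        // The subscriber refreshes its local copy from the event payload.
        msg, err := sub.ReceiveMessage(ctx)
        if err != nil {
            panic(err)
        }
        fmt.Println("refresh local copy from:", msg.Payload)
    }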


Interesting article. It was initially confusing to me because I wasn't able to find what's unique about this approach, and how it's different from the typical Ambassador pattern that I'm more familiar with. That is, what does the word "Routing" even mean in this context?

I think the answer is that he's doing essentially layer-7 routing, using the HTTP path (and verb?) to decide what backends to route to, rather than doing it on a per-port basis (which is necessary to support non-HTTP services.)

Implementations I've run into before seem to fall into a few categories:

- A service declares that it needs to talk to, say, 3 other services. Each of these upstream services is assigned a unique client-side port, and an ambassador proxy is launched alongside each instance, which exposes those 3 ports, routing each to the corresponding backend service. To keep the ports out of the source code, environment variables are typically assigned automatically, so you just talk to e.g. "my-ambassador:${UPSTREAM_PORT_NAME}"

- A service is responsible for using a service discovery layer like zookeeper or etcd to find backends on its own. This is important for things that use raft/paxos/gossip, where blind routing isn't enough, you actually need to keep track of peer instances (although in that case, the service discovery layer is only used for initial discovery).

- Service discovery is done with just plain DNS on well-known ports (or even SRV records if you're lucky enough to have client software that can tolerate them), and you just hope the right thing happens. This can be accomplished with things like SkyDNS on top of etcd... a surprising amount of flexibility can be accomplished by putting logical things right in a hostname.
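
For the plain-DNS case, the lookup side is about as small as it gets in Go; the service and domain names below are placeholders:

    package main

    import (
        "fmt"
        "log"
        "net"
    )

    func main() {
        // Resolves _api._tcp.example.internal and returns host:port pairs.
        _, addrs, err := net.LookupSRV("api", "tcp", "example.internal")
        if err != nil {
            log.Fatal(err)
        }
        for _, a := range addrs {
            fmt.Printf("backend %s:%d (priority %d, weight %d)\n",
                a.Target, a.Port, a.Priority, a.Weight)
        }
    }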

This approach seems like a more opinionated version of the first approach, but instead of ports, it uses HTTP routes, which is definitely more flexible, but only works with HTTP. The routes can contain enough information to route more intelligently than just enumerating your dependency services ahead of time and getting static port assignments.

It's almost like an API gateway distributed as an ambassador to each service. At least, it looks a lot like an API gateway, in that it presents the whole mesh of services underneath in a single URI namespace, using HTTP to route requests accordingly and handling things like TLS and potentially authentication. It's definitely an interesting approach, and something I'm curious to try.


You'll want to read the whole pitch about Linkerd [1], which the article seems to assume the reader has read. In short, it's a sidecar proxy that performs all the functions needed to glue apps together. This allows apps to be simpler; they can implement a minimal interface and rely on simple protocols such as HTTP, with no knowledge of how to reach their remote partners. Linkerd implements a bunch of techniques such as health checks, load balancing and circuit breakers.

[1] https://linkerd.io


Well, linkerd was what the article was about, so it's what I was referring to. A "sidecar" sounds like the more established term for what I was calling an ambassador proxy, so it's nice to learn that that's what people call this pattern.

Your explanation sounds a bit "no duh" though, since I can't imagine any way of implementing this pattern that doesn't abstract away knowledge of how to reach remote partners, perform health checks, load balancing, retry logic, etc. It's generally what the term "proxy" implies. Or am I missing something?

Edit: I think my confusion is caused by my experience with running vanilla Mesos (not Marathon or DCOS, but our own custom frameworks), where it's expected that you just have to implement this stuff yourself. I never even thought of open sourcing or productizing what I've written, because I kind of thought that everybody else must be rolling their own quick solutions for these problems too? I admit I'm probably the one with a messed up worldview though.

I think the Mesos community is still in the phase of picking the winners for best practices, and hopefully one day it'll be obvious that you use services like linkerd and don't roll your own solution.


Apologies, I didn't mean to sound condescending. I think you got it.

I just wanted to point you to the official explanation, since you seemed to be asking for clarification as to Linkerd's purpose.

A proxy can be a lot of things, of course. Linkerd is explicitly designed to route RPC, not just any web traffic, and to run close to your app and offload all the routing intelligence.

For example, it seems quite common these days to build many types of service discovery and so on into the app itself: App wants to find another microservice, so it looks up the target host in Etcd or ZooKeeper or whatever, then talks to it, handling retrying and load-balancing and so on. If you're using a DNS solution like SkyDNS or Consul, then the app is isolated from the lookup mechanism, but you're still talking directly to your peer. The opposite, older trend is to use something like HAProxy to handle the routing, but HAProxy was designed for a fairly static set of routing targets.
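
For concreteness, the "discovery baked into the app" pattern looks roughly like this with the etcd clientv3 API (the key layout and endpoints are made up), and it's exactly the retry/balancing logic around it that a sidecar like linkerd takes off the app's hands:

    package main

    import (
        "context"
        "fmt"
        "log"
        "net/http"
        "time"

        "github.com/coreos/etcd/clientv3"
    )

    func main() {
        cli, err := clientv3.New(clientv3.Config{
            Endpoints:   []string{"localhost:2379"},
            DialTimeout: 5 * time.Second,
        })
        if err != nil {
            log.Fatal(err)
        }
        defer cli.Close()

        // Look up the address of the upstream service by key.
        ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
        resp, err := cli.Get(ctx, "/services/users")
        cancel()
        if err != nil || len(resp.Kvs) == 0 {
            log.Fatal("no backend registered")
        }

        // The app still owns retries, load balancing, failover, etc.
        addr := string(resp.Kvs[0].Value)
        res, err := http.Get("http://" + addr + "/users/42")
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(res.Status)
    }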

Linkerd is a bit of a middle way: Make the app stupid and put all the operational intelligence in an external process that isn't actually your app, but still sort of behaves like it is. Linkerd is designed to be dynamically configured to support all the kinds of glue you can think of (ZooKeeper, Kubernetes and so on) so that no app changes are needed to support different routing schemes.


Yup, makes total sense.

I think I'm just jealous that they're getting attention for solving something that I solved too, but didn't even think to release it because I thought everybody else was just like me.

My approach involves running a sidecar that registers a set of etcd watchers, and when upstream services move around, it passes the list of backends through a config file template (using Go's templating syntax), and runs a configurable command.

Meanwhile, other services' sidecars are health checking them, and keeping them in etcd only as long as they are healthy.

We wire that up to have the watcher process in the downstream sidecar, which rewrites an haproxy config with the list of upstream backends, and triggers an haproxy restart when it changes the config file.

And then, that whole thing is wrapped in a declarative syntax that lets you say "here's the things I want to talk to", and it knows how to find them in etcd, how to construct the haproxy template, and how to restart haproxy, and puts the whole thing in a docker container that links to your container.
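
A stripped-down sketch of that watcher loop, assuming etcd clientv3 and Go's text/template; the key prefix, template, and reload command are made up:

    package main

    import (
        "context"
        "log"
        "os"
        "os/exec"
        "text/template"
        "time"

        "github.com/coreos/etcd/clientv3"
    )

    const haproxyTmpl = `backend upstream
    {{range .}}  server {{.}} {{.}} check
    {{end}}`

    // render writes the backend list into the haproxy config and reloads it.
    func render(backends []string) {
        f, err := os.Create("/etc/haproxy/haproxy.cfg")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()
        template.Must(template.New("haproxy").Parse(haproxyTmpl)).Execute(f, backends)

        // Trigger a reload so haproxy picks up the new backend list.
        exec.Command("systemctl", "reload", "haproxy").Run()
    }

    func main() {
        cli, err := clientv3.New(clientv3.Config{
            Endpoints: []string{"localhost:2379"}, DialTimeout: 5 * time.Second,
        })
        if err != nil {
            log.Fatal(err)
        }
        defer cli.Close()

        // Watch the prefix where healthy instances of the upstream register.
        for range cli.Watch(context.Background(), "/services/users/", clientv3.WithPrefix()) {
            // On any change, re-read the full list and re-render the config.
            resp, err := cli.Get(context.Background(), "/services/users/", clientv3.WithPrefix())
            if err != nil {
                continue
            }
            var backends []string
            for _, kv := range resp.Kvs {
                backends = append(backends, string(kv.Value))
            }
            render(backends)
        }
    }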

And then we wrap all that in a web UI (and CLI) that lets you say "I want my service to talk to that service" and all the things happen for you.

Looking back, it shouldn't surprise me that this level of effort is something people would want an off-the-shelf solution for. But to be fair, I started doing this a few years back so things like DCOS and Marathon and confd and linkerd and namerd didn't exist then. :-D


To expound on the second approach listed, what I've seen is not just a service discovery layer, but client libraries (as opposed to a sidecar) that connect to your service discovery (e.g., Netflix Ribbon, which is EOL).

Curious what people think of a sidecar approach versus the client library approach? Any preference?


At times it seems the only thing we are doing with microservices is shifting complexity into different areas.


Well yes. That's exactly what we're doing.

We're breaking down complex code into simple code with complex interactions, moving the complexity into a few universal platforms that everyone uses, largely in the hope that people smarter than us do the work on the complexity and we can focus on just the little simple bits.

Of course, there's lots that is difficult whilst we collectively learn about what has previously been locked up in organisations like Twitter, Google, etc. How do you do a transaction distributed over many microservices? Do RDBMSs still make sense with microservices? (The querying changes from one SELECT of everything you need to many SELECTs of just this bit, so centralised databases see a dramatic increase in simple queries... better suited to key:value stores.)

But yes, exactly what you say... we are.


Microservices is about tradeoffs. Many people will reiterate this over the coming years and attempt to make it clear it's not the be-all and end-all of software engineering. Before it had this name it was known by many others. It is, in fact, simply distributed systems.

The logical evolution of one's architecture when scaling from zero to orders of magnitude beyond is to eventually split out functionality so it can be worked on and scaled independently. That's it; that's all it's about. And over time you find this becomes the pattern that helps stabilise and speed up an organisation as the number of people increases.

Most of the time it's an organic process. I would never tell anyone to start with microservices but to merely keep these ideas in the back of their mind and pick tools that simplify the process of moving to distributed systems later.


True. No silver bullet this time either...

But right now it's a cool thing, since Google is doing it. And Netflix, or whatever. Organisations are about to waste crazy amounts of money on microservification of "monolithic" applications that are perfectly fine, because they think it will solve all of their problems.

Microservices don't solve complexity; they add complexity in order to solve problems related to scalability and size, and you need to be good to pull it off. If you are working in a half-arsed in-house development department in a medium-sized company in an uninteresting business, chances are that the organisation (as a whole) is not good enough even if there are some smart people around, and you should be grateful that the RDBMS is there and rolls back problems caused by the shitty code that keeps piling up in your git repository.

I started working with distributed systems about 20 years ago in the telecom industry, where we had to scale telephony services over a lot of machines, both vertically and horizontally. We had key-value datastores, services, application logic, and interfaces as small programs, each running on its own machine. And it was bloody hard to handle transactions, routing, fail-overs and generally getting the right balance between calling a service or doing it locally. We did web applications like that too, actually, using CGI, a custom HTML template format and the crappiest script language ever imaginable, and it kind of worked too.

It was rather ahead of its time, thanks largely to a few smart guys who were trying to solve the rather hard problem of using cheap hardware to provide very reliable telecom services. It never was "very" reliable, but it was OK. At least on sunny days.

It scaled up to a point. The networks weren't that fast, and the sheer complexity of it all was a limiting factor. We had to create tools for generating the configuration, and at one point I found out that we had more than 100,000 lines of configuration in a moderately large system, which explained why it took a while to assemble these systems by hand.

But as time went by, the hardware was catching up and you could more or less put everything in one box. And we did. And we started to use RDBMSs because we really needed the power of a flexible and transactional database manager, and it was bliss to write a large monolithic application for user provisioning and administration and not need to handle every shitty little detail yourself.

I am very reluctant to go back to key-value stores and really small services unless we really need the speed. It comes with a heavy cost. (Even though you get more for free now, as more people are doing it.)

The best trade-off, in my opinion, is to build systems as large as possible while they still handle the load and can still be worked on efficiently by a few teams. When it's hard to keep up with what is happening and how everything works, it might be better to split the system into smaller systems and (semi-)independent teams. The key is the independent team, real devops, and a hard focus on making deployments fast and with no or almost no downtime.

Unfortunately, where I work now, the major limiting factor on our velocity is our dependencies on other teams and systems. If we can build something on our own, we can typically do a reasonable feature in between 3 days and 3 weeks. If we depend on another team's changes in other systems, it takes 3 weeks at a minimum.

And the problem grows exponentially. If 3 systems are involved, it will take us 3 months or more, and failure is always an option.

I don't see how more services will solve that rather hard organisational problem, but that is apparently what some people think will happen, effortlessly, just because technology.


Divide and conquer. As much as we would like it to not be the case, complexity is inherent in medium/large scale systems. Microservices are one way to split and isolate a system in more easily manageable chunks thus potentially lowering the risk that comes with complexity. It does have its pitfalls, though.


That's much of technology in a nutshell. Starting with banging two rocks together.


Love the use of the header for per-request overrides to the routing table. I've rolled my own solutions using haproxy as a k8s service a couple of times now. Really looking to move to something more packaged that handles some of these use cases. Will be giving linkerd and namerd a try.
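
For anyone curious, the per-request override looks roughly like this from the client side, assuming linkerd's l5d-dtab header and its default HTTP port 4140; the dtab fragment below is a made-up example:

    package main

    import (
        "log"
        "net/http"
    )

    func main() {
        req, _ := http.NewRequest("GET", "http://localhost:4140/users/42", nil)
        // Route just this request's /svc/users traffic to a staging instance.
        req.Header.Set("l5d-dtab", "/svc/users => /svc/users-staging")

        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            log.Fatal(err)
        }
        log.Println(resp.Status)
    }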


I am new to microservices; any good place to learn about major patterns/anti-patterns?


This is just a 40 minute lecture, but you might find the ideas inside helpful.

It talks about anti-patterns that lead to the construction of a "distributed monolith" and how to avoid them.

http://www.microservices.com/ben-christensen-do-not-build-a-...


I would recommend reading Sam Newman's 'Building Microservices'


I wrote a small post that might be partially relevant - https://medium.com/@kashifrazzaqui/will-the-real-micro-servi...


> When Twitter moved to microservices, it had to expend hundreds (thousands?) of staff-years just to reclaim operability.

That's both impressive and not surprising. Orchestration is just really hard. Big systems that are supposed to behave rationally and be easy to upgrade and debug... it seems like such a meager request at first.


Interesting project. It would be great if something more out of the box was available in AWS, or a project that could package AWS services for a similar pattern (Route 53/ELB, etc.).

It's cool to run your own proxy; however, doing that at production level is far from simple, at least with smaller teams.


I don't understand. You either deploy this as a sidecar proxy, in which case you need to reconfigure a lot of them, or you deploy it centrally, introducing a SPOF?



