Safely Rewriting Mixpanel’s Highest-Throughput Service (mixpanel.com)
94 points by dcu on July 26, 2019 | hide | past | favorite | 17 comments



The author mentions finding the GCP Python PubSub API unreliable. That's definitely not been my experience.

Could someone from Mixpanel elaborate maybe?


Hey, I'm the author. That migration was a while ago, so I had to look back at what exactly we did. The gist was that our old Python servers used a custom fork of eventlet (https://eventlet.net/) which didn't work well with the Cloud Pub/Sub library. The core library itself is stable, but when we introduced it to the existing code it became unreliable.


Thanks for clarifying! Great write-up!


I haven't followed Mixpanel's development. I was surprised that a company which went as far as writing their own time series database apparently decided to go all in on Google Cloud.


Disclaimer: Not the author, but I was on the team that migrated our infrastructure to GCP.

As a startup with limited resources, it's important for us to invest all our engineering strength into the things that create direct value for our business. We'd rather pay Google to manage machines and run services like Kubernetes, Spanner, Pub/Sub and others and free up the engineers to work on our core analytics platform.


I think the parent may have meant the opposite of how you interpreted it... That writing a tsd db from scratch doesn't match what you just stated about investing engineering time where it makes the most sense.


We don't run a TSDB. A TSDB doesn't work for the kinds of queries we run - specifically, TSDBs don't work if you want to

  * analyze every datapoint you receive
  * when the dimensional cardinality is high
  * you want to analyze behaviors over time (e.g. the output depends on the order of events followed - like creating a funnel report)
There's no off-the-shelf solution that does this at the scale at which we operate - hence the need to write our own custom solution.
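To make the last point concrete, here's a toy sketch (not Mixpanel's actual engine; the Event type and field names are made up for illustration) of a funnel computation, where the result depends on the order in which each user's events arrived - something a plain time-bucketed TSDB aggregation can't express:

```go
package main

import "fmt"

// Event is a hypothetical per-user datapoint; the schema is illustrative.
type Event struct {
	User string
	Name string
	Time int64
}

// funnelDepth returns how many consecutive funnel steps each user
// completed, in order. An event only counts if it matches the user's
// next expected step, so the output depends on event ordering.
func funnelDepth(events []Event, steps []string) map[string]int {
	depth := make(map[string]int)
	for _, e := range events { // assumed sorted by Time
		d := depth[e.User]
		if d < len(steps) && e.Name == steps[d] {
			depth[e.User] = d + 1
		}
	}
	return depth
}

func main() {
	events := []Event{
		{"alice", "signup", 1},
		{"alice", "purchase", 2},
		{"bob", "purchase", 1}, // out of funnel order: doesn't count
		{"bob", "signup", 2},
	}
	fmt.Println(funnelDepth(events, []string{"signup", "purchase"}))
	// alice completed both steps; bob only reached "signup"
}
```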


Very cool! Any stats you can share about increased performance switching to go?


Hey! I'm the author of the post. I just updated it with a chart showing the gains. Roughly, I'd say about 40% decrease in CPU usage. One note I'd like to add is that the original Python code was highly optimized, and we haven't yet done a performance pass of the new Golang code. We're expecting even better gains after that's complete.


I would not be surprised if moving from Python to anything like Java, C#, or Go ended up at least 10x faster.


It depends on how much of the “Python” implementation was actually C.


I faced a similar challenge. The technique I used was to carefully read the code of the program to be rewritten, and then I wrote a spec for it. The format of the spec was that everything that was feasibly testable was written as a unit test, and everything else was written as comments between the tests and the text explaining them. Naturally the original code satisfied the test suite/specification.

Then I used that test/spec to do a TDD type development of the new service. It was the easiest rollout I've ever done. Everything just worked when it went into production. I even ended up giving some internal presentations on the process.

I also tested with logged input from the source program. It's neat to see this technique is common.
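The logged-input technique can be sketched like this (a minimal, hypothetical harness - handleOld/handleNew stand in for the legacy and rewritten services, which in practice would be real network calls):

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// Placeholder implementations; in a real harness these would invoke
// the old and new services with each logged request.
func handleOld(req string) string { return strings.ToUpper(req) }
func handleNew(req string) string { return strings.ToUpper(req) }

// replay feeds logged requests to both implementations and counts
// divergences, printing each one for inspection.
func replay(log string) (mismatches int) {
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		req := sc.Text()
		if old, new := handleOld(req), handleNew(req); old != new {
			mismatches++
			fmt.Printf("divergence on %q: old=%q new=%q\n", req, old, new)
		}
	}
	return mismatches
}

func main() {
	logged := "track event_a\ntrack event_b\n"
	fmt.Println("mismatches:", replay(logged))
}
```

Running the replay as part of CI gives you a regression signal on real production traffic shapes before the cutover.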


Wow cool article. I learned about Envoy which looks really awesome.

Just out of curiosity, what were some of the bugs you found? Were they related to semantics of python not carrying over to go? or was it that you tried using new go features like goroutines and they didnt work as expected?


The bugs were typically related to general correctness or translating python semantics. The correctness-type bugs were due to the complex nature of the API. The python translation bugs were just about getting the correct behavior of statements like

  if val: # ...

in Go.
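For readers unfamiliar with the pitfall: Python's `if val:` treats empty strings, empty collections, zero, and None as falsy, while Go has no implicit truthiness, so each case has to be spelled out. A hedged sketch of one way to mirror the semantics (pyTruthy is an illustrative helper, not something from the post):

```go
package main

import "fmt"

// pyTruthy mirrors Python's `if val:` semantics for a few common
// types. Go forces an explicit check per type, which is exactly
// where translation bugs creep in.
func pyTruthy(val interface{}) bool {
	switch v := val.(type) {
	case nil:
		return false // None
	case string:
		return v != "" // empty string is falsy
	case int:
		return v != 0 // zero is falsy
	case []int:
		return len(v) > 0 // empty list is falsy
	case map[string]int:
		return len(v) > 0 // empty dict is falsy
	default:
		return true
	}
}

func main() {
	fmt.Println(pyTruthy(""), pyTruthy("x")) // false true
	fmt.Println(pyTruthy(0), pyTruthy(7))    // false true
	fmt.Println(pyTruthy([]int{}), pyTruthy(nil))
}
```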


Did p95 latency change due to the migration?


Yes! You can see the p99 results here (we don't have an aggregated p95 measurement, so we used p99): https://imgur.com/a/oQRbyBF

Both max and avg p99 latency became much more stable. Max appears to have gone down a little too.


s/golang/go



