Hey, I'm the author. That migration was a while ago, so I had to look back at what exactly we did. The gist is that our old Python servers used a custom fork of eventlet (https://eventlet.net/) which didn't work well with the Cloud Pub/Sub library. The core library itself isn't unstable, but it became unreliable once we introduced it into our existing code.
I haven't followed Mixpanel's development. I was surprised that a company that went as far as writing its own time series database apparently decided to go all in on Google Cloud.
Disclaimer: Not the author, but I was on the team that migrated our infrastructure to GCP.
As a startup with limited resources, it's important for us to invest all our engineering strength into the things that create direct value for our business. We'd rather pay Google to manage machines and run services like Kubernetes, Spanner, Pub/Sub and others and free up the engineers to work on our core analytics platform.
I think the parent may have meant the opposite of how you interpreted it... That writing a TSDB from scratch doesn't match what you just stated about investing engineering time where it makes the most sense.
We don't run a TSDB. A TSDB doesn't work for the kinds of queries we run - specifically, TSDBs don't work when you want to
* analyze every datapoint you receive
* handle high dimensional cardinality
* analyze behaviors over time (e.g. where the output depends on the order in which events occurred - like creating a funnel report; see the sketch below)
There's no off-the-shelf solution that does this at the scale at which we operate - hence the need to write our own custom solution.
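To make that last point concrete, here's a rough sketch (made up for illustration - the Event type, the step names, and the funnelCompletions function are not Mixpanel's actual code) of why a funnel query is order-dependent: each user's events have to be walked in sequence, which isn't something a per-metric TSDB rollup expresses.

    package main

    import (
        "fmt"
        "sort"
        "time"
    )

    // Event is a made-up analytics event with a name and a timestamp.
    type Event struct {
        Name string
        Time time.Time
    }

    // funnelCompletions counts users who performed the given steps in order.
    // The answer depends on the sequence of each user's events, not on any
    // per-metric aggregate, which is all a TSDB rollup would give you.
    func funnelCompletions(eventsByUser map[string][]Event, steps []string) int {
        completed := 0
        for _, events := range eventsByUser {
            // Scan each user's events chronologically, advancing through the steps.
            sort.Slice(events, func(i, j int) bool { return events[i].Time.Before(events[j].Time) })
            next := 0
            for _, e := range events {
                if next < len(steps) && e.Name == steps[next] {
                    next++
                }
            }
            if next == len(steps) {
                completed++
            }
        }
        return completed
    }

    func main() {
        t := time.Now()
        users := map[string][]Event{
            "u1": {{"signup", t}, {"create_report", t.Add(time.Minute)}, {"share", t.Add(2 * time.Minute)}},
            "u2": {{"share", t}, {"signup", t.Add(time.Minute)}}, // steps out of order: not counted
        }
        fmt.Println(funnelCompletions(users, []string{"signup", "create_report", "share"})) // 1
    }

A TSDB can tell you how many signup events happened per hour; it can't easily tell you how many users did signup, then create_report, then share, because that depends on every user's individual event sequence.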
Hey! I'm the author of the post. I just updated it with a chart showing the gains. Roughly, I'd say about a 40% decrease in CPU usage. One note I'd like to add is that the original Python code was highly optimized, and we haven't yet done a performance pass on the new Go code. We're expecting even better gains after that's complete.
I faced a similar challenge. The technique I used was to carefully read the code of the program to be rewritten and then write a spec for it. The format of the spec was: everything that was feasibly testable became a unit test, and everything else became comments interleaved between the tests, explaining them in prose. Naturally, the original code satisfied the test suite/specification.
Then I used that test/spec to do TDD-style development of the new service. It was the easiest rollout I've ever done. Everything just worked when it went into production. I even ended up giving some internal presentations on the process.
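For the curious, here's a minimal sketch of what that spec-as-tests format can look like (the Tokenize function and its behavior are invented for illustration, not taken from any real spec):

    package spec

    import (
        "reflect"
        "strings"
        "testing"
    )

    // Tokenize stands in for one small behavior of the legacy program; the
    // real spec covered the actual service's functions, not this invented one.
    func Tokenize(s string) []string {
        var out []string
        for _, f := range strings.Split(s, ",") {
            if f = strings.TrimSpace(f); f != "" {
                out = append(out, f)
            }
        }
        return out
    }

    // Spec: the legacy code splits on commas, trims surrounding whitespace, and
    // silently drops empty fields. Whether dropping empties was ever intended is
    // unknown - it's what the old code did, so the spec preserves it.
    func TestTokenizeMatchesLegacySpec(t *testing.T) {
        cases := []struct {
            in   string
            want []string
        }{
            {"a,b,c", []string{"a", "b", "c"}},
            {" a , b ", []string{"a", "b"}},
            {"a,,b", []string{"a", "b"}}, // empty field dropped, as the old code did
        }
        for _, c := range cases {
            if got := Tokenize(c.in); !reflect.DeepEqual(got, c.want) {
                t.Errorf("Tokenize(%q) = %v, want %v", c.in, got, c.want)
            }
        }
    }

The test cases pin down whatever the old program actually did, intended or not, and the comments carry the parts of the spec that aren't feasibly testable.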
I also tested with logged input from the source program. It's neat to see this technique is common.
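A rough sketch of that replay idea, with an invented log format, file name, and Handle entry point: capture the old service's request/response pairs, then feed the same inputs through the new code and diff the outputs.

    package replay

    import (
        "bufio"
        "encoding/json"
        "os"
        "testing"
    )

    // Handle is a placeholder for the new implementation's entry point.
    func Handle(in json.RawMessage) json.RawMessage { return in }

    // record is one logged request/response pair captured from the old program;
    // the field names and log format here are made up for the example.
    type record struct {
        Input  json.RawMessage `json:"input"`
        Output json.RawMessage `json:"output"`
    }

    // TestReplayLoggedTraffic feeds captured inputs through the new code and
    // checks that the outputs match what the old program produced.
    func TestReplayLoggedTraffic(t *testing.T) {
        f, err := os.Open("testdata/logged_requests.jsonl")
        if err != nil {
            t.Skip("no captured traffic checked in")
        }
        defer f.Close()

        sc := bufio.NewScanner(f)
        for sc.Scan() {
            var r record
            if err := json.Unmarshal(sc.Bytes(), &r); err != nil {
                t.Fatalf("bad log line: %v", err)
            }
            if got := Handle(r.Input); string(got) != string(r.Output) {
                t.Errorf("input %s: got %s, want %s", r.Input, got, r.Output)
            }
        }
        if err := sc.Err(); err != nil {
            t.Fatal(err)
        }
    }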
Wow, cool article. I learned about Envoy, which looks really awesome.
Just out of curiosity, what were some of the bugs you found? Were they related to the semantics of Python not carrying over to Go? Or was it that you tried using new Go features like goroutines and they didn't work as expected?
The bugs were typically related to general correctness or to translating Python semantics. The correctness bugs were due to the complex nature of the API. The Python translation bugs were mostly about preserving the exact behavior of Python statements whose semantics don't carry over directly to Go.
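As a hypothetical example of that kind of mismatch (not one of the actual bugs from this migration): Python's dict lookups and truthiness have no direct Go equivalent, so a too-literal translation can silently change behavior.

    package main

    import "fmt"

    func main() {
        counts := map[string]int{"a": 1}

        // Python: counts["missing"] raises KeyError, and counts.get("missing") returns None.
        // Go: a missing key silently yields the zero value, so "absent" and "zero"
        // look identical unless you check the second return value.
        v, ok := counts["missing"]
        fmt.Println(v, ok) // 0 false

        // Python: `if s:` is false for "", None, 0, and empty containers.
        // Go: there is no truthiness, so each check must be spelled out explicitly.
        var s string
        if s != "" {
            fmt.Println("non-empty")
        }
    }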
Could someone from Mixpanel elaborate maybe?