> Lastly, in order to help monitor the behavior of the hundreds of partner APIs that IFTTT connects to, we collect information about the API requests that our workers make when running Recipes. This includes metrics such as response time and HTTP status codes, and it all gets funneled into our Kafka cluster.
> This way if you query Elasticsearch to find all API errors in the last hour, it can find the answer by looking at a single index, increasing efficiency.
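For illustration, a time-bounded query of the kind the article describes might be built like this. Note the index name, field names, and 5xx-as-error convention are my own guesses for the sketch, not IFTTT's actual schema:

```python
from datetime import datetime, timezone


def error_query(now=None):
    """Build an (index, query body) pair for API errors in the last hour.

    With hourly indices, all of the last hour's events live in a single
    index, so Elasticsearch only has to search one of them.
    """
    now = now or datetime.now(timezone.utc)
    index = now.strftime("api_events-%Y.%m.%d.%H")  # hypothetical naming scheme
    body = {
        "query": {
            "bool": {
                "filter": [
                    # treat HTTP 5xx as an API error (assumption)
                    {"range": {"status_code": {"gte": 500}}},
                    {"range": {"@timestamp": {"gte": "now-1h"}}},
                ]
            }
        }
    }
    return index, body
```

The point of the per-hour index is visible in the first return value: the query never has to fan out across the whole event history.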
This is a really good way to know if third-party APIs are having problems. Staying up to date with all the APIs they support must take a significant amount of engineering effort. Many APIs are second-class citizens for their product owners: bugs are introduced and changes are made without announcements, and even when there are announcements, keeping track of so many different APIs and updating your app to keep it running is hard work, especially when APIs are turned off or schemas change. This seems to be the hard problem IFTTT is solving: integrating with APIs.
I'd shy away from starting a project that involves so many other companies' APIs just because of how hard that problem is to manage, but IFTTT is doing a great job here.
I used to work at IFTTT and this exact set of problems is why I left to start a company. Pretty much everything I've done in the intervening years has stemmed from exactly what you describe. It's great to see the IFTTT team tackle this stuff head on.
As Jonathan mentioned, we made this decision around 9 months ago, and at that time Kinesis wasn't as mature and had less flexibility around retention periods, etc.
Kafka is very reliable (as I had seen it handling billions of events a day at LinkedIn) and has a huge open-source community around it. At IFTTT, we always prefer to use and contribute to open source ( http://engineering.ifttt.com/oss/2015/07/23/open-source/ ).
I'm assuming that you run Kafka within AWS. Most of the hardware requirements/suggestions I've seen for Kafka are for non-virtualized environments. If you can get into it, could you share some details...
- What is the size of your Kafka cluster?
- What instance types do you use?
- Do you use EBS or ephemeral storage?
- How much do you over-provision to deal with instance loss?
Medium's republishing API has already become a problem for original sources on HN.
p.s. Comments in the threads are an unreliable way to reach us. There are too many for us to see them all. The reliable way is to email hn@ycombinator.com.
To hijack your tongue-in-cheek comment...I genuinely would appreciate it if someone, at some point, looked at analytics before and after syndicating to Medium, not just to compare the number of visitors between the original blog and Medium, but to see whether PageRank for the original blog dropped due to duplication penalties applied by Google's algorithm.
I want to take them at their word...but it's more up to Google, isn't it? And doesn't Google still rely (in part) on the canonical meta tag to resolve duplicates? Currently, the posted Medium post has this:
It would be relatively easy for Medium to set that tag appropriately -- I mean, if the "original URL" is captured somewhere in the API, or in the Medium post-creation admin interface (I don't know, I haven't logged into my own Medium account for a while)...that would explicitly resolve the ambiguity, though at an obvious cost to their own SEO.
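Checking whether a syndicated copy points back at the original is easy to script. A minimal sketch using only Python's standard-library HTML parser (the example URL is hypothetical):

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Pull the rel=canonical href out of a page, if one is present."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "canonical":
            self.canonical = a.get("href")


def find_canonical(html):
    parser = CanonicalFinder()
    parser.feed(html)
    return parser.canonical
```

Running this over a fetched Medium page would show whether the tag points at the original post or at Medium's own copy; if it's absent entirely, Google is left to guess which version is authoritative.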
If you're using ECS to manage your clusters on AWS take a look at Convox. We're building an open source platform that makes building, versioning, and deploying code to AWS via ECS incredibly easy.