Data Infrastructure at IFTTT (ifttt.com)
192 points by devinfoley on Oct 15, 2015 | 22 comments



> Lastly, in order to help monitor the behavior of the hundreds of partner APIs that IFTTT connects to, we collect information about the API requests that our workers make when running Recipes. This includes metrics such as response time and HTTP status codes, and it all gets funneled into our Kafka cluster.

> This way if you query Elasticsearch to find all API errors in the last hour, it can find the answer by looking at a single index, increasing efficiency.
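To make the quoted pipeline concrete, here's roughly what emitting one of those per-request events could look like with kafka-python (topic and field names are my guesses, not IFTTT's actual schema):

    import json
    import time

    from kafka import KafkaProducer  # pip install kafka-python

    producer = KafkaProducer(
        bootstrap_servers=["kafka1:9092", "kafka2:9092"],
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    def record_api_call(partner, endpoint, status, started_at):
        # Called by a worker after each partner API request it makes.
        producer.send("api_events", {
            "partner": partner,
            "endpoint": endpoint,
            "http_status": status,
            "response_time_ms": int((time.time() - started_at) * 1000),
            "timestamp": int(time.time()),
        })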

This is a really good way to know if third-party APIs are having problems. Staying up to date with all the APIs they support must take a significant amount of engineering effort. Many APIs are second-class citizens for their product owners: bugs are introduced, changes are made without announcement, and even when there are announcements, with so many different APIs it's hard work keeping track of them all and updating your app to keep it running, especially when APIs are turned off or schemas change. This seems to be the hard problem IFTTT is solving: integrating with APIs.

I'd shy away from starting a project that involves so many other companies' APIs just because of how hard a problem that is to manage, but IFTTT is doing a great job here.
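On the single-index point in the second quote: if events land in one index per day (or hour), the "errors in the last hour" query never has to scan historical data. A minimal sketch with the official Python client, assuming daily indices and the field names from my example above:

    from datetime import datetime, timezone

    from elasticsearch import Elasticsearch  # pip install elasticsearch

    es = Elasticsearch(["http://localhost:9200"])

    # Assumed naming convention: one index per day, e.g. api_events-2015.10.15
    index = "api_events-" + datetime.now(timezone.utc).strftime("%Y.%m.%d")

    resp = es.search(index=index, body={
        "query": {"bool": {"filter": [
            {"range": {"timestamp": {"gte": "now-1h"}}},
            {"range": {"http_status": {"gte": 500}}},
        ]}}
    })
    print(resp["hits"]["total"], "API errors in the last hour")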


I used to work at IFTTT and this exact set of problems is why I left to start a company. Pretty much everything I've done in the intervening years has stemmed from exactly what you describe. It's great to see the IFTTT team tackle this stuff head on.


To the author (or anyone else with experience): Any insight on why you guys chose Kafka + Secor over Kinesis?


I worked with Anuj at LinkedIn, so I'm thinking the most pedestrian answer is that Kafka works, and it's a familiar tool.


Thank you Jonathan!

As Jonathan mentioned, we made this decision around 9 months back, and at that time Kinesis wasn't as mature and had less flexibility around retention periods, etc.

Kafka is very reliable (as I had seen it handling billions of events a day at LinkedIn) and has a huge open-source community around it. At IFTTT, we always prefer to use and contribute to open source ( http://engineering.ifttt.com/oss/2015/07/23/open-source/ ).
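For anyone curious what that retention flexibility looks like in practice: Kafka keeps messages on disk for whatever window you configure, broker-wide or per topic, while Kinesis (at least back then, if I remember right) capped retention at 24 hours. Illustrative settings, not our exact values:

    # server.properties (broker-wide defaults)
    log.retention.hours=168        # keep messages for 7 days
    log.retention.bytes=-1         # no size-based limit

    # or per topic, without touching the broker config:
    # bin/kafka-topics.sh --zookeeper zk:2181 --alter \
    #     --topic api_events --config retention.ms=604800000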


I'm assuming that you run Kafka within AWS. Most of the hardware requirements/suggestions I've seen for Kafka are for non-virtualized environments. If you can get into it, could you share some details...

- What is the size of your Kafka cluster?

- What instance types do you use?

- Do you use EBS or ephemeral storage?

- How much do you over-provision to deal with instance loss?

- Any other gotchas/considerations?

Thanks!


Things might be different now that Kinesis Firehose has been announced. But Kinesis alone is not easier to manage than Kafka.


Amazon Kinesis is a great tool; the main problem is that it's relatively new, while Kafka has been around for a few years already.


Anyone have an idea why they would use MySQL over PostgreSQL?


The usual reason is having more experience with MySQL than PostgreSQL.


MySQL works just as well for most use cases.


> In order to fully trust your data, it is important to have few automatic data verification steps in the flow.

Is this a typo? Did they mean 'a few'?


We have fixed it! Thank you!


How do you feel about the data being delayed, sometimes by a day? Why not stream the data in realtime to the Kafka cluster?


Whatever data we need in realtime, we do stream it to the Kafka cluster.

We don't do it for the production database because we don't need it in realtime.


Original link: http://engineering.ifttt.com/data/2015/10/14/data-infrastruc...

Can a moderator such as dang fix this please?


Yes. Url changed to that from https://medium.com/engineering-at-ifttt/data-infrastructure-....

Medium's republishing API has already become a problem for original sources on HN.

p.s. Comments in the threads are an unreliable way to reach us. There are too many for us to see them all. The reliable way is to email hn@ycombinator.com.


He is probably just syndicating this to Medium with the newly introduced APIs.


To hijack your tongue-in-cheek comment... I genuinely would appreciate it if someone, at some point, looked at analytics before and after syndicating to Medium, not just to compare the number of visitors between the original blog and Medium, but to see whether PageRank for the original blog dropped due to duplicate-content penalties applied by Google's algorithm.


It's worth noting that Medium described this syndication workflow as an expected use case during the announcement.

I don't think they would have done that if there were a known SEO downside. It would hurt everyone.


I want to take them at their word...but it's more up to Google, isn't it? And doesn't Google still rely (in part) on the canonical meta tag to resolve duplicates? Currently, the posted Medium post has this:

      <link rel="canonical" href="https://medium.com/engineering-at-ifttt/data-infrastructure-at-ifttt-35414841f9b5">

It would be relatively easy for Medium to set that tag appropriately -- I mean, if the original URL is captured somewhere in the API, or in Medium's post-creation admin interface (I don't know, I haven't logged into my own Medium in a while)... that would explicitly resolve the ambiguity, though at an obvious cost to their own SEO.
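For instance, the Medium copy could carry this instead (using the original URL posted elsewhere in this thread, truncated as shown there):

      <link rel="canonical" href="http://engineering.ifttt.com/data/2015/10/14/data-infrastruc...">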


If you're using ECS to manage your clusters on AWS, take a look at Convox. We're building an open-source platform that makes building, versioning, and deploying code to AWS via ECS incredibly easy.

https://convox.com



