Warehouses – Load Your Analytics Data into Redshift and Postgres (segment.com)
84 points by schmatz on Dec 10, 2015 | 28 comments



Hey HN — Segment PM on the project here! Happy to answer questions.

Under the hood we're using NSQ as a queuing layer, S3 for storage and batched uploads, Amazon Aurora (for S3 indexing), DynamoDB for billing and metadata storage, and several distinct Go services that handle batching, transformation, schema updating, deduplication and internal consistency checking.
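To make the deduplication-and-batching stage concrete, here is a toy sketch (not Segment's actual code, and in Python rather than their Go services) of how a consumer might drop duplicate deliveries by content hash and group the survivors into fixed-size batches ready for an S3 upload. The function name and batch size are illustrative assumptions:

```python
import hashlib
import json

def dedupe_and_batch(events, batch_size=3):
    """Drop events we've already seen (by content hash), then group
    the remainder into fixed-size batches for a bulk upload."""
    seen = set()
    batches, current = [], []
    for event in events:
        # Deterministic content hash: queues like NSQ deliver
        # at-least-once, so duplicates must be filtered downstream.
        key = hashlib.sha256(
            json.dumps(event, sort_keys=True).encode()).hexdigest()
        if key in seen:
            continue  # duplicate delivery; skip it
        seen.add(key)
        current.append(event)
        if len(current) == batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)  # flush the partial final batch
    return batches
```

In a real pipeline the dedup set would live in a shared store with a TTL rather than in process memory, but the shape of the logic is the same.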

It's been in beta for several months and we're loading about 10,000 events per second into customers' databases today.


How real-time is this stuff? If an order record is inserted can I instantly see on a dashboard the +1 based on a custom report query?


Scrolling down the posted page you can see the details on loading latency: it ranges from daily loading on the free plans to 30 minutes on the business tier.


What about companies that have additional data sources? For example, a CRM. It seems like it's hardly a data warehouse if it's _only_ event data. Or am I missing something?


Can you say a bit more about how you use Aurora for indexing S3? What are these indices used for? (Or what do you mean by indexing)


Sure! We batch together data to load into warehouses by time and a few other properties. Usually, to figure out what objects are in S3, you have to issue an S3 list objects command. That operation tends to be relatively slow, especially if there are many objects.

Instead, when we put a new object, we update a table in Aurora which tracks all of the relevant objects. That way, we can query information like "what objects were uploaded in a certain time range" very quickly.
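The idea can be sketched with a tiny example; here in-memory SQLite stands in for Aurora (the table and column names are made up for illustration). Each S3 PUT also inserts a row into an indexed table, so "what was uploaded in this window?" becomes a fast range query instead of a slow ListObjects call:

```python
import sqlite3

# SQLite stands in for Aurora; the pattern is identical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE s3_objects (
        key         TEXT PRIMARY KEY,
        uploaded_at INTEGER NOT NULL
    )
""")
# Index on the timestamp makes time-range lookups cheap.
conn.execute("CREATE INDEX idx_uploaded_at ON s3_objects (uploaded_at)")

# On each S3 PUT, record the object alongside the upload.
rows = [("batches/2015/12/10/a.gz", 100),
        ("batches/2015/12/10/b.gz", 150),
        ("batches/2015/12/11/c.gz", 300)]
conn.executemany("INSERT INTO s3_objects VALUES (?, ?)", rows)

# "What objects were uploaded in a certain time range?"
keys = [k for (k,) in conn.execute(
    "SELECT key FROM s3_objects "
    "WHERE uploaded_at BETWEEN ? AND ? ORDER BY uploaded_at",
    (100, 200))]
print(keys)  # the two objects from Dec 10
```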


How do you ensure that Aurora and S3 stay in sync?


We have a worker consuming S3 events and updating the index. On a related note, we have experimented with Lambda to do something similar; AWS has done a fantastic job integrating their products :)



Cool, that part makes sense, but how are you ensuring they stay in sync?

That is, what happens if a db write fails? How are you handling concurrent updates? Is there a reconcile process that runs periodically?

(those details would be an awesome engineering blog post :) )

Edited to add:

Any reason that AWS Lambda didn't work out? Was it due to the public endpoint requirements?

Double Edited to add:

I totally geek out on composing AWS Services, and this is fascinating to me.


This is a good idea for a blog post! The way we have set up the system ensures that we requeue upon failure and concurrent updates are not an issue.
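A minimal sketch of what "requeue upon failure" plus safe concurrent updates can look like, assuming (as the thread suggests but doesn't spell out) that the index write is an idempotent upsert keyed by object key, so replaying a message twice leaves the index unchanged. All names here are hypothetical:

```python
def process_with_requeue(messages, write, max_attempts=3):
    """Drain the queue; a failed write is requeued (up to max_attempts)
    rather than lost. Because writes are idempotent upserts, duplicate
    or concurrent deliveries of the same message are harmless."""
    queue = list(messages)
    while queue:
        msg = queue.pop(0)
        try:
            write(msg)
        except IOError:
            msg["attempts"] = msg.get("attempts", 0) + 1
            if msg["attempts"] < max_attempts:
                queue.append(msg)  # transient failure: try again later

# Demo: the first write fails once, then succeeds on the requeue;
# a duplicate message for the same key changes nothing.
index = {}
failures = {"n": 1}

def write(msg):
    if failures["n"] > 0:
        failures["n"] -= 1
        raise IOError("transient db error")
    index[msg["key"]] = msg["ts"]  # upsert: same key, same result

process_with_requeue([{"key": "a.gz", "ts": 1},
                      {"key": "a.gz", "ts": 1}], write)
print(index)
```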

We are eagerly waiting for Lambda VPC support!

I'm happy you're interested in this sort of stuff; shoot me an email, let's chat :) michael@segment.com


I hesitate to ask this, but will anyway: can you access the data with Tableau or QlikView?


Yep! You can use those BI tools on top of your Redshift or Postgres instance :)


I think this is the direction analytics is heading, and we'll see products similar to this one in the next few years. Analytics companies have realized they can't answer every question their customers ask, so they've started adding these kinds of features to their products: just look at Mixpanel's custom applications, Amplitude's Redshift integration, or Keen.io's S3 integration.

The main reason these companies add such features to their infrastructure is to provide an alternative way to analyze data within their product, so they don't lose existing customers who need more advanced analytics (usually their biggest paying customers). The funny thing is that once you have an analytical database combined with a stream-processing application, you can ask almost any question you want and get answers quickly enough, so the core product becomes less valuable once that alternative exists.

I think BI tools such as Periscope and Mode Analytics realized this and started to promote their products as analytics products rather than applications that create charts from your data.

[Shameless plug] I'm also working on an open-source analytics platform (https://github.com/buremba/rakam) that collects data from clients (web, mobile, or a smartwatch, it doesn't matter), transforms it (IP-to-geolocation, referrer extraction, etc.), and stores it in a database you specify. (Currently there are two alternatives: Postgres and an in-house big-data solution that uses PrestoDB as the query engine.)

Then you can execute SQL queries, pre-aggregate your data for fast reports with continuous queries, and cache query results with materialized views. Once you have these features, you can perform all the usual analytical queries, such as funnels, retention, and segmentation, and easily build your own custom analytics service.
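The pre-aggregation idea above can be sketched in a few lines: maintain a running rollup as events stream in, so report queries read a tiny aggregate instead of scanning the raw event table. This is a toy illustration, not rakam's actual API; all names are invented:

```python
from collections import defaultdict

class ContinuousCount:
    """Incrementally maintained rollup: event count per (day, name).
    Each incoming event updates the aggregate in place, so a report
    is a lookup rather than a scan over raw events."""
    def __init__(self):
        self.rollup = defaultdict(int)

    def consume(self, event):
        self.rollup[(event["day"], event["name"])] += 1

    def report(self, day):
        return {name: n for (d, name), n in self.rollup.items() if d == day}

cq = ContinuousCount()
for e in [{"day": "2015-12-10", "name": "pageview"},
          {"day": "2015-12-10", "name": "pageview"},
          {"day": "2015-12-10", "name": "signup"}]:
    cq.consume(e)
print(cq.report("2015-12-10"))  # {'pageview': 2, 'signup': 1}
```

A continuous query in a real system does the same thing inside the database, and a materialized view caches the result for repeated reads.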


This is a smart move by Segment, since the industry has been moving in this direction. Looks like mParticle launched support for redshift a few months ago:

http://blog.mparticle.com/mparticle-launches-next-generation...


This design looks very similar to Stripe's. Even the dropdown in the header behaves the same way when clicked.

https://stripe.com/relay https://stripe.com/subscriptions


I guess the colors were picked based on psychology, and yes, the design is similar. Maybe it's actually the same person/people? Who knows! :)


It may well be the same designer; who knows. It's just pretty strange that so many of Stripe's design elements are reflected on this company's page.


Stripe's design is beautiful, and I'd be happy if more people imitated or copied it :)


I definitely agree that Stripe's design is beautiful, but this is just too close for me. I think it would be awesome if they built upon Stripe's beautiful interface and made something nicely different!


I thought they offered this service for quite some time. What's changed?


Hey there! You’re right, we launched Redshift to our enterprise customers last November. There are three big changes today…

1) We're lowering the price and opening it up to all our customers to make it more accessible.

2) You can bring your own database. This is helpful for customers who already have a data warehouse and want to load Segment data into it.

3) We now support Postgres, in addition to Redshift.


Thanks for the reply! Any plans on supporting other databases? Namely the Microsoft alternatives (SQL Server/SQL Datawarehouse)?


No problem! They're not yet on our near-term roadmap. Would you mind submitting a request for your specific database here so we can keep you updated? https://segment.com/contact/integrations


It wasn't immediately clear to me that I had to bring my own database. That makes this product less appealing. I would love it if I could instrument Segment and get a redshift endpoint to query my data, but didn't have to set it up myself.


Hey, Ilya here from Segment. We actually allow for both:

1. Self Hosted - bring your own Postgres/Redshift

2. Segment Managed - we give you a Redshift endpoint (on our business tier)

The two options are described here: https://segment.com/blog/introducing-segment-warehouses-reds... and here's where to contact us about hosted Redshift: https://segment.com/contact/redshift


Before, it was part of their high-end "business" plan. Now you can use it with any tiered plan (I think).


I love how the pricing is clearly value-based rather than cost-based. I can't imagine this is massively more complex to operate, but the price is significantly higher (and that's fine); basically, it's more enterprise-y. I love it!

I'm wondering if/how that will impact their bigger integration plans, which include a feature to replay data for new integrations you add after the fact.


Interesting... You compete with Alooma [1].

Your pricing is a bit out of range for most startups, IMHO. You go from $0 to $400. I would love $20, $99, and $400 tiers.

If their pricing is right, I would go with Alooma and replace most of my other analytics stack.

Even Amplitude does this in their paid tier, in that you don't need your own Redshift cluster (you get query access to your tables in their DB).

[1]. https://news.ycombinator.com/item?id=10651425



