I'd be interested in hearing from anybody using Redshift in production. Currently shopping around for a database to store 1.5 billion interesting things, and Redshift's at the top of my list.
Specifically - how many (small/large) nodes are you using, how big is your data, how many simultaneous users does your setup support, and what kind of performance do you see?
Not in production yet, but my company's in the process of moving from Infobright to Redshift, mostly because Infobright, while extremely fast, is a single-server solution and we think that within a year we'll have outgrown its capacity.
Our data is 3 denormalized tables sharing the same ~1,000-column schema, the largest of which contains 7B rows. Most of our queries are simple aggregations over a few columns, with simple constraints like date range + customer ID.
So far we're really impressed with Redshift's load performance and ease of getting up and running, but we're still an order of magnitude away from Infobright's query performance. Next steps are experimenting with distribution and sort keys, as well as trying different cluster configurations; to be fair to Redshift, we've not run it on more than a 6 XL cluster yet.
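For anyone unfamiliar with Redshift's distribution and sort keys, a sketch of the kind of DDL and query shape involved might look like this (table and column names are hypothetical, and the key choices are a plausible starting point for this workload, not a tested recommendation):

```sql
-- Hypothetical schema: distribute on customer_id so per-customer
-- aggregations stay node-local, and sort on event_date so date-range
-- predicates can skip blocks via zone maps.
CREATE TABLE events (
    customer_id BIGINT,
    event_date  DATE,
    revenue     DECIMAL(12, 2)
)
DISTKEY (customer_id)
SORTKEY (event_date);

-- Typical query shape: aggregate a few columns with a
-- date range + customer ID constraint.
SELECT event_date, SUM(revenue)
FROM events
WHERE customer_id = 42
  AND event_date BETWEEN '2013-01-01' AND '2013-03-31'
GROUP BY event_date;
```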
We use Redshift in production to power a customer-facing app, but don't have web apps hit it directly (that really couldn't work with the concurrent query limits). Workers query RS and cache results in another database and the web app hits that database. We handle writes in the low 10s of billions per day. Our data format is very simple and we get great compression, so we currently fit everything on two clusters — one of 3 XLs and one of 6 XLs.
Performance is great — except when it's not. Simple aggregations on tables that are < 1B rows and two-table joins on tables that are < 100M rows are blazingly fast (maybe 1-30 seconds depending on the query). On larger tables than those, queries can start to crawl.
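The worker/cache pattern described above can be sketched roughly like this (a simplified illustration, not our actual code; the names run_redshift_query, refresh_worker, and the dict standing in for the cache database are all made up):

```python
import time

# Stand-in for the "other database" the web app reads from
# (in production this would be e.g. a key-value store).
cache = {}

def run_redshift_query(customer_id):
    # Placeholder for a slow Redshift aggregation; in production this
    # would be a real query against the cluster over a DB connection.
    return {"customer_id": customer_id, "events": 12345, "as_of": time.time()}

def refresh_worker(customer_ids):
    """Background worker: precompute expensive results so web
    requests never hit Redshift and never queue behind its
    concurrent-query limits."""
    for cid in customer_ids:
        cache[f"agg:{cid}"] = run_redshift_query(cid)

def handle_web_request(customer_id):
    """Web tier reads only from the cache, never from Redshift."""
    return cache.get(f"agg:{customer_id}")

# Worker runs periodically; web requests are then cache lookups.
refresh_worker([1, 2, 3])
```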
We're using it as a replacement for our enterprise data warehouse. Total size so far is ~1 TB compressed, with a daily delta in the 10s of GB compressed. We're using a two-node XL cluster. The best part is that you can increase the cluster size without downtime (though you can't switch from XL to 8XL this way). Right now we have a single-digit number of simultaneous users (analysts). We haven't had to tweak the workload management configuration much so far.
I'd be interested to hear from others with bigger workloads than ours; for example, someone from Netflix (they switched their data warehouse from Vertica to Redshift this year).
I haven't used Redshift in production, but it's been getting pretty good reviews thus far. I did some testing using a pretty small dataset, which you can read more about here: http://bicortex.com/amazon-redshift-review-data-warehouse-in...
I don't think it's the answer to all data storage/warehousing needs but it does look promising and it should only get better.
Only tangentially related to your comment, but last year there was a series of posts called "Damn Cool Algorithms", the last of which, on cardinality estimation, covered HyperLogLog [0]. There's some discussion in the Hacker News submission [1].
One of the biggest benefits of HLL is that if you're only interested in set cardinalities, you don't have to store the original items at all, just a small HLL data structure that can be used to compute set unions and, with some caveats, intersections.
It would be really great if Redshift supported HLL as a proper data type, and not just an optimization for COUNT(). A proper data type would allow us to store pre-aggregated data instead of billions of unnecessarily granular rows.
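To illustrate what makes that pre-aggregation work, here's a minimal HyperLogLog sketch in Python (a simplified textbook illustration of the algorithm, not Redshift's internal implementation; the register count and bias constant are standard defaults). The key property is that two sketches over different sets merge losslessly via an element-wise max of their registers:

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog: estimate distinct counts without
    storing the items themselves."""

    def __init__(self, p=14):
        self.p = p                       # use 2**p registers
        self.m = 1 << p
        self.registers = [0] * self.m
        # standard bias-correction constant for m >= 128
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # 64-bit hash of the item
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)         # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        inv_sum = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / inv_sum
        # small-range correction: fall back to linear counting
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            estimate = self.m * math.log(self.m / zeros)
        return int(estimate)

    def union(self, other):
        """Lossless union: element-wise max of registers."""
        out = HyperLogLog(self.p)
        out.registers = [max(a, b) for a, b in
                         zip(self.registers, other.registers)]
        return out
```

With p=14 the whole sketch is 16,384 tiny registers, regardless of how many billions of items were added — which is exactly why storing pre-aggregated sketches instead of granular rows would be so attractive.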