The title should say 'Amazon Redshift'.
At first I thought it was going to be about redshift vs f.lux:
http://jonls.dk/redshift/
Edit: Why the downvote?
redshift (and f.lux) have existed since before 2010, whereas Amazon Redshift was only introduced in 2012.
I think it is reasonable to assume that someone who has never heard of Amazon Redshift would think of the open source project first (that exists in various distributions as packages), and not the Amazon service.
If we took a poll I suspect the majority would be thinking of the Amazon service - I know I was. The date the projects were introduced isn't necessarily relevant.
I'm much more tired of comments on ambiguous naming. There are at least two other people in my city who have my name, and many more who share either my first or last name. Somehow life goes on and this is not a topic of major controversy. But when two pieces of software have similar names, people just can't resist commenting endlessly and upvoting this content-free bikeshedding at the expense of actual discussion.
> There are at least two other people in my city who have my name
I'm presented with information to parse about technology topics daily. Sometimes, I have to search for them and have all sorts of name collisions.
I never, ever search for information about you.
> people just can't resist commenting endlessly and upvoting this content-free bikeshedding at the expense of actual discussion
One man's bikeshedding is another thousand men's irritating trend. Personally, I very rarely see anyone called out for the trendy names and awful, buzzword-laden non-descriptions that infest projects.
Yes! Especially when people use an existing word like 'redshift' as their product name. (It seems to be a popular choice! I remember there also was this astronomy software for the Mac called Redshift.)
Of course, it gets even more ridiculous as the words get more common, e.g. see the recent discussions about 'Paper' or 'Layout'.
It's not 'just' personal projection, nor irrelevant, if accurate. Obviously you'd have to run the experiment to find out for sure, but it's not meaningless to contribute "I would expect anyone I know to think of Amazon's Redshift first". For someone who doesn't know that others feel the opposite definition is the 'default', it might be useful to find out.
Far weirder things have been blogged about before.
Not to mention that if you are not already familiar with the company, you probably would not know what sort of blog it was just by looking at what was presented on the HN front page. It could have been some random Joe Schmoe's blog.
FWIW, I didn't downvote you but you're also kind of hijacking the discussion (especially since this comment is the top comment and it really shouldn't be).
I wish he would talk about how they protect against one customer running a query that brings down the full stack. When we permitted Tableau to start talking to Redshift, we frequently encountered "Oh crap, Peter is running that query and that's why everything is at a standstill..."
You can set up Workload Management[1] to restrict the amount of compute / query_slots each query/user can use.
It splits the memory/compute into slices, and queries can use multiple slices, so you can get some fine-grained control, but it takes a bunch of work.
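To make the Workload Management idea concrete, here is a minimal sketch of a queue layout, written as Python that builds the JSON. The key names (user_group, query_concurrency, memory_percent_to_use) are the ones I recall from the WLM JSON configuration; the group name bi_tools and all the numbers are made up for illustration, so check them against the current docs before using anything like this.

```python
import json

def build_wlm_config(queues):
    """Validate a list of WLM queue definitions and return them as a JSON string."""
    total_memory = sum(q.get("memory_percent_to_use", 0) for q in queues)
    if total_memory > 100:
        raise ValueError(f"queues claim {total_memory}% of memory, more than 100%")
    for q in queues:
        if q["query_concurrency"] < 1:
            raise ValueError("each queue needs at least one slot")
    return json.dumps(queues)

# Hypothetical layout: BI-tool users get a small, separate queue so one heavy
# dashboard query cannot occupy every slot; everything else uses the default queue.
queues = [
    {"user_group": ["bi_tools"], "query_concurrency": 3, "memory_percent_to_use": 30},
    {"query_concurrency": 5, "memory_percent_to_use": 70},
]
wlm_json = build_wlm_config(queues)
print(wlm_json)
```

The resulting string is the kind of value that would go into the wlm_json_configuration parameter of the cluster's parameter group. A session that needs more memory for one big query can claim several slots of its queue with "set wlm_query_slot_count to N".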
Amplitude here - Most of our dashboards are powered separately from Redshift. We offer Redshift access as a way for our customers to answer more complex questions not offered by the dashboards.
Yeah, but a data warehouse isn't supposed to have great response times. Data warehouses are for large, low-value sets of historical data that you don't always know how you want to use.
If you want to use data in real-time, you should be driving it from your transactional systems. Redshift and other data warehouse solutions are for doing reporting and dashboards, not triggering real-time reactions.
Well, used to be true, but now those systems are converging. -- Full disclosure, I work for a company working on exactly that problem called Treasure Data.
Most companies are generally more concerned about reducing their data warehouse costs than they are about improving the performance of their data warehouses. Many companies implement a multi-tiered DW structure to get a mix of the two, but the core driver is managing the cost of storing petabytes of data while keeping performance acceptable.
We update the table schema as we run into new fields, up to a limit. We also store the unstructured part of the data in a column that can be queried via json_extract_path_text.
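For anyone curious what that approach might look like, here is a rough sketch in Python. It is only an illustration under my own assumptions, not the actual pipeline: the cap of 4 columns, the function name route_fields, and the field names are all hypothetical.

```python
import json

MAX_COLUMNS = 4  # hypothetical cap; the comment above does not state the real limit

def route_fields(event, columns, max_columns=MAX_COLUMNS):
    """Split one event into (typed column values, JSON text for the overflow column).

    `columns` is the list of field names already promoted to real columns; it is
    extended in place while there is room under `max_columns`.
    """
    row, overflow = {}, {}
    for key, value in event.items():
        if key not in columns and len(columns) < max_columns:
            columns.append(key)  # would correspond to an ALTER TABLE ... ADD COLUMN
        if key in columns:
            row[key] = value
        else:
            overflow[key] = value
    return row, json.dumps(overflow, sort_keys=True)

columns = ["user_id", "event_type"]
row, extra = route_fields(
    {"user_id": 7, "event_type": "click", "plan": "pro", "ab_bucket": "b", "referrer": "hn"},
    columns,
)
print(columns)  # the two new fields that fit under the cap were promoted
print(row)      # values destined for real columns
print(extra)    # JSON text for the catch-all column
```

If the overflow text were stored in a column called extra (a made-up name), a field in it could then be read with something like json_extract_path_text(extra, 'referrer').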
You have a complex product/service, with very diverse application scenarios.
=> Customer adoption is hindered by this complexity.
=> Customer is not getting value
=> Customer is angry and stops paying
You hire a person who is familiar with the applications of your technology. He talks to customers to figure out what they want to do, how they plan to achieve it, and what the hurdles are. He helps them. He writes best practices and implementation plans, and helps marketing to position and sales to close.
It turns out so well that you hire many such people, who specialize in particular customer segments. Those people need management.
You need a Director of Customer Success.
It's something between account manager/service/marketing.
No, clusters are multi-tenant. We have a cap on the number of customers per cluster and we monitor usage to make sure no one customer is hammering the cluster.
Put your data in S3 in csv/tsv/json format. If you want to switch to another provider, just figure out how to import it and you are all set. Figuring out the limitations of the different platforms, and tuning and optimizing for them, is the difficult part.
Data migration is almost always painful and time-consuming. When choosing your data provider, you have to be careful because it is very likely to be a long-term commitment. In that sense, in the DW world there is always vendor lock-in. It is just driven largely by the nature of the application itself, and less by the intention of the provider.
Compared to the lock-in of the AWS ecosystem in general, Redshift honestly isn't that bad. You can unload all of your data into S3 and then do whatever you want with it. I'd be surprised if most data warehousing solutions had such an easy way of exporting the data.
In addition, if you store your data in S3 and have Redshift load it from there then you don't even need to do an export - just leave your source data in S3 after Redshift's loaded it, and you're all ready to switch to another platform.
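As a rough illustration of the export route, here is a small Python helper that only builds the text of an UNLOAD statement; it does not connect to anything. The bucket, table, and role ARN are placeholders I made up, and I am assuming IAM_ROLE authorization (a CREDENTIALS clause is the other option).

```python
def build_unload(query, s3_prefix, iam_role, delimiter="|"):
    """Return an UNLOAD statement that writes the result of `query` under `s3_prefix`."""
    # The query is passed to UNLOAD as a quoted string, so single quotes
    # inside it have to be doubled.
    escaped_query = query.replace("'", "''")
    return (
        f"UNLOAD ('{escaped_query}') "
        f"TO '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        f"DELIMITER '{delimiter}' GZIP;"
    )

# Placeholder names throughout; none of these come from the thread.
stmt = build_unload(
    "select * from events where event_type = 'click'",
    "s3://example-bucket/export/events_",
    "arn:aws:iam::123456789012:role/ExampleUnloadRole",
)
print(stmt)
```

The output is a set of gzipped, delimited files under the given prefix, which is about as portable a format as you can ask for. It does nothing for the query and schema differences that come up when moving to another platform.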
If you want to move to another DW platform, it's probably not going to be Postgres-based. As every vendor has a slightly different flavor of SQL with different behaviors, this will require redesigning your queries, schemas, and most if not all of your stored procedures. Depending on the company and age of the platform, this could be many thousands of hours of work.
Really, vendor lock-in is pretty much a given with data warehousing platforms. Though these days, it's not uncommon for large companies to have multiple DW platforms all pulling data from each other. When one platform falls out of favor, the users just migrate themselves to another since most reporting systems not made by SAP or Oracle are compatible with pretty much everything.
In contrast, Vertica, Greenplum, Netezza, Teradata Aster, and CitusDB are all based on PostgreSQL forks. In many cases, the client libraries behave like psql, and ease conversions at that level.
As to SQL language differences, no DW platform uses "standard SQL", just as no two RDBMS use the exact same SQL dialect.
I dare say database platform lock-in is a universal issue. Any migration will involve effort.
To be fair, that kind of comes with the territory when talking about data warehousing. The data volumes are so large that migrating them is usually out of the question, and query languages vary between vendors pretty significantly.