Amazon Timestream – Fast, scalable, fully managed time series database (amazon.com)
285 points by irs on Nov 28, 2018 | 124 comments



At my day job, I build a lot of machine learning systems that require data to be fed in a time series manner[1].

Often this means building systems to analyze terabytes of logs [semi]-realtime. All I have to say is - thank god! This is going to make my job a lot easier, and likely empower us to remove our current infrastructure setup.

I know at one point we actually considered building our own time series database. Instead, we ended up utilizing a Kafka queue with an SQL-based backend after we parsed and pared down the data, because it was the only one quick enough to do the queries.

Should make a lot of the modeling I've worked on a bit easier[1].

[1] https://medium.com/capital-one-tech/batch-and-streaming-in-t...


I wouldn't jump the gun on this. I've been working within Amazon cloud for years and every year they make massive claims about new services at re:invent. Not saying this isn't going to be a good product, just saying it will probably take a while to be as useful as you're hoping.


I agree w/ this.

They make insane promises, but the promises don't live up to expectations.

Kinesis Analytics, for example, can aggregate data across a time window (sliding window) from a stream (Kinesis). A huge issue that isn't documented or stated is that when Kinesis Analytics restarts because the process dies (being migrated, binpacked, etc.), the ENTIRE time window has to get re-aggregated. So your count drops to 0.

Really unacceptable if you're using it to generate KPIs which you alert on. We ended up switching to a system which pushes the stream data into Influx and runs the aggregations there via queries.

Dealing w/ AWS during this entire process was a huge pain.


Similar examples have been Fargate and Lambda, which have surprisingly long cold start times depending on runtimes and VPC configurations. IMHO it is a big part of AWS expertise to know these things and be able to choose the right services not just according to the marketing brochures, but according to how they really work for each use case. Having said that, I'm glad to now have learned about that Kinesis Analytics restart issue.


Have you looked at Clickhouse for timeseries data? It's the one database I've found that can scale and can query in near-realtime.

I've loaded 100 billion rows into a 5-shard database and can do full queries across the whole dataset in under 10 seconds. It also natively consumes multiple Kafka topics.


Specialized tools are specialized: just remember the limitations!

- Bad with heterogeneous hardware (Cloudflare experience)

- Non-throttled recovery (source replicas flooded with replication load)

- No real delete/update support, and no transactions

- No secondary keys

- Own protocol (no MySQL protocol support)

- Limited SQL support, and the joins implementation is different. If you are migrating from MySQL or Spark, you will probably have to re-write all queries with joins.

- No window functions


ClickHouse isn't a general-purpose DBMS. It is the best tool for collecting and near-online analysis of huge amounts of events with many properties. We were successfully using a ClickHouse cluster with 10 shards for collecting up to 3M events per second with 50 properties each (properties translate to columns). Each shard was running on an n1-highmem-16 instance in Google Cloud. The cluster was able to scan tens of billions of rows per second for our queries. The scan performance was 100x better than on the previous highly tuned system built on PostgreSQL.

ClickHouse may be used as a timeseries backend, but currently it has a few drawbacks compared to specialized solutions:

- It has no efficient inverted index for fast metrics lookup by a set of label matchers.

- It doesn't support delta coding yet - https://github.com/yandex/ClickHouse/issues/838 .

Learn how we created a startup - VictoriaMetrics - that builds on performance ideas from ClickHouse and solves the issues mentioned above - https://medium.com/devopslinks/victoriametrics-creating-the-... . Currently it has the highest performance/cost ratio compared to its competitors.


> I've loaded 100 billion rows

Have you done any load tests that would more closely mirror a production environment such as performing queries while clickhouse is handling a heavy insert load?


I'm working on developing benchmarking tools for internal testing, but both Yandex and Cloudflare use ClickHouse for realtime querying. I'm still in the development phase for my product, but I'll make sure to post information & results here when we launch.

https://blog.cloudflare.com/http-analytics-for-6m-requests-p...

But I've spent a long time looking at the various solutions out there, and while ClickHouse is not perfect, I think it's the best multi-purpose database out there for large volumes of data. TimescaleDB is another one, but until they get sharding it's dead on arrival.


I'd say kdb (from Kx Systems) is the best database for this problem space, but it is prohibitively expensive for the HTTP analytics use case. It is also a pain to query, but unbelievable what it can do.


Very cool, I'll check this out!


It's a quirky piece of software and has limitations that need to be considered when standing up a production cluster - such as the fact that you cannot currently reshard it. If you have a 3-node cluster, it's messy and requires downtime to add another node.


Still a bit messy, but the clickhouse-copier utility helps a bit: https://github.com/yandex/ClickHouse/issues/2579


I have open issues about it on GitHub. It does not work correctly at this time. If you dig through the issues, there are statements by the devs saying the tool has been neglected.


We load millions of rows per second and use a handful (or more?) of materialized views to build appropriate summaries. Various clients make queries to the "raw data" and the views and it all works fine, basically.


> it all works fine, basically

The "basically" is what intrigues me :D


Do you have a schema available publicly? I would like to build a similar system using custom software + S3 + Parquet + Athena for this task and see if it works.


The schema is

    CREATE TABLE points (
        Timestamp DateTime,
        Client String,
        Path String,
        Value Float32,
        Tags Nested(Key String, Value String)
    ) ENGINE = MergeTree()
    ORDER BY (Client, Timestamp, Path)
    PARTITION BY toStartOfDay(Timestamp)
And this is a query like the one I was using:

    SELECT
        (intDiv(toUInt32(Timestamp), 15) * 15) * 1000 AS t,
        Path,
        Value AS c
    FROM points_dist
    WHERE Path LIKE 'tst_val1'
        AND Tags.Value[indexOf(Tags.Key, 'server')] = 'node'
        AND Timestamp >= toDateTime(1543421708)
    GROUP BY t, Path, Value
    ORDER BY t, Path
This table was made on 5 servers via a distributed table partitioned on the timestamp, so distribution was even.


Thanks! How big is the 100B rows in your system?


Takes up 200 GB or so across 5 servers (this is according to ClickHouse's query stats). Actual disk usage might be a bit higher.


Thanks!


Have you considered columnar databases like 1010data?


I recall why this looks familiar. Their chief scientist put up a job posting on the subreddit for array languages. It sounds like they know how to use some pretty powerful tools.

https://www.reddit.com/r/apljk/comments/42uf2f/not_exactly_a...


Keira, nice article. It will be a long time before we are allowed to use this new service in C1


This is not cheap for the "DevOps" use case.

Imagine you have 1000 servers submitting data to 100 timeseries each minute. That's 100,000 writes a minute (unless they support batch writes across series). At $0.50 per million writes, that's $72 a day or ~$26k a year.

Now imagine you want to alert on that data. Say you have 100 monitors that each evaluate 1GB of data once a minute. At $10 per TB of data scanned, that's $1,440 a day or $525k a year!
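
A quick back-of-the-envelope sketch of that arithmetic (the $0.50/million-writes and $10/TB-scanned rates are the ones quoted above; the server, series, and monitor counts are the hypothetical scenario, and gb_per_eval is a parameter precisely because the estimate hinges on how much data each alert evaluation actually scans):

    # Back-of-the-envelope Timestream cost for the hypothetical "DevOps" scenario above.
    # Rates as quoted in the comment: $0.50 per million writes, $10 per TB scanned.
    SERVERS = 1000
    SERIES_PER_SERVER = 100
    MINUTES_PER_DAY = 24 * 60
    WRITE_PRICE_PER_MILLION = 0.50   # USD
    SCAN_PRICE_PER_TB = 10.0         # USD

    # One write per series per minute (assuming no batching across series).
    writes_per_day = SERVERS * SERIES_PER_SERVER * MINUTES_PER_DAY
    write_cost_per_day = writes_per_day / 1e6 * WRITE_PRICE_PER_MILLION

    # 100 monitors, each scanning gb_per_eval of data once a minute.
    def alert_cost_per_day(monitors=100, gb_per_eval=1.0):
        tb_scanned = monitors * gb_per_eval * MINUTES_PER_DAY / 1000  # decimal TB
        return tb_scanned * SCAN_PRICE_PER_TB

    print(f"writes: ${write_cost_per_day:,.0f}/day, ${write_cost_per_day * 365:,.0f}/yr")
    print(f"alerts: ${alert_cost_per_day():,.0f}/day, ${alert_cost_per_day() * 365:,.0f}/yr")

Dropping gb_per_eval to 0.001 (i.e. 1 MB per evaluation, as discussed further down the thread) brings the query side down to about $1.44/day.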


Between RDS and these proprietary database products, AWS is now its own biggest competitor. That's potentially fine, but there is an inherent conflict of interest there and that needs to be properly managed, and while that may be happening it didn't come across in the keynote.

The only way I can have trust in Amazon's proprietary products is if RDS continues to get less expensive every year, since that is effectively the BATNA to these new products. It's been a while now since the last RDS cost reductions, and unless we continue to see more of those it's hard to have confidence that Amazon will continue to treat their customers of these new proprietary services fairly over the long term.


That's convoluted... From your example, there are 100K writes/minute, while you assume 1GB of data evaluated per minute per alarm. That is, you're assuming ~10KB per item per time series for each alarm, while reality is going to be closer to 10-100 bytes per item per time series, which cuts down the expense by two or three orders of magnitude.


That's a really good point, I was guesstimating very quickly when I wrote that. Depending on what you're doing for the alarm and how they measure reads, it may well be a lot less.

Let's say the read is 1MB instead of 1GB, that's now $1.44 a day and $525 a year. Query pricing becomes not so bad.

From my own experience, the 100 metrics per server estimate I was giving above is pretty low, though. Once you factor in different combinations of tags closer to 1000 is more realistic. That potentially brings up the write pricing quite a bit.


Well, that depends on what you consider cheap. Hiring someone to manage a time series system like graphite or prometheus is going to cost you a whole lot more than $26k a year


You're ignoring half of my example scenario. $26k writing the data, $525k for querying it just for alerting, plus whatever it costs to store and to query ad-hoc. That's over half a million dollars. Even if you hire someone for $250k, you can self-host your time series system for cheaper than that.

Self-hosting isn't the only option though. For example, that hypothetical 1000 server scenario would cost $180k a year at list pricing on Datadog or SignalFX.


It's priced for government contracts.


I'm actually impressed at how incredibly expensive they made this. $0.50 per million 1KB writes, which is 20x what aurora charges, since aurora allows 8KB writes. And Aurora is already expensive if you actually read/write to it.


> $0.50 per million 1KB writes, which is 20x what aurora charges, since aurora allows 8KB writes.

That's a weird comparison. 20x is only true if you write 8KB with every entry, and you haven't included the storage and instance savings.

It's not hard to come up with suboptimal scenarios where this is more expensive, but that's missing the point. It's optimized for a specific kind of usage pattern.


The pricing lines up with CloudWatch Logs, 50 cents/GB, 3 cents/GB/month.

Curious to see what the query language is for this, wonder if they're just exposing the backing store for CloudWatch as a service now.


I really don't think it's 20x the cost of Aurora in general, considering that Aurora costs are not that simple, but I don't have time to run the numbers, so let's go with that for a moment. Do you really think Amazon would introduce a product that's 20x as expensive if they didn't know there was a market for it?


I get the feeling this is for important data (banking etc) so I have a feeling this is 200x cheaper than whatever else is available.


That's not what it says in the release:

"With Timestream, you can easily store and analyze log data for DevOps, sensor data for IoT applications, and industrial telemetry data for equipment maintenance."


Quite excited for this! We are currently experimenting with DynamoDB and managing our own rollups of our incoming data (previously on RDS, which is not a good choice for this kind of data).

---

I've seen a lot of people complain about pricing, so I thought I'd share a little why we are excited about this:

We have approximately 280 devices out, monitoring production lines, sending aggregated data every 5 seconds via MQTT to AWS IoT. The average number of messages published that we see is around ~2 million a day (equipment is often turned off when not producing). The packet size is very small, and highly compressible, each below 1KB, but let's just make it 1KB.

We then currently funnel this data into Lambda, which processes it, and puts it into DynamoDB and handles rollups. The costs of that whole thing is approximately $20 a day (IoT, DynamoDB, Lambda and X-Ray), with Lambda+DynamoDB making up $17 of that cost.

Finally, our users look at this data, live, on dashboards, usually looking at the last 8 hours of data for a specific device. Let's throw around that there will be 10,000 queries each day, looking at the data of the day (2GB/day / 280devices = 0.007142857 GB/device/day).

---

Now, running the same numbers on the AWS Timestream pricing[0] (daily cost):

- Writes: 2million * $0.5/million = $1

- Memory store: 2 GB * $0.036 = $0.072

- SSD store: (2GB * 7days) * $0.01 (GB/day) * 7days = $0.98

- Magnetic store: (2 GB * 30 days) * $0.03 (GB/month) = $1.8

- Query: 10,000 queries * 0.007142857GB/device/day --> 71GB = free until day 14, where it'll cost $10, so $20 a month.

Giving us: $1 + $0.072 + $0.98 + $1.8 + ($20/30) = $4.5/day.

From these (very) quick calculations, this means we could lower our cost from ~$20/day to ~$4.5/day. And that's not even taking into account that it removes our need to create/maintain our own custom solution.

I am probably missing some details, but it does look bright!

[0] https://aws.amazon.com/timestream/pricing/


It's got to be a rough day for the team at https://www.influxdata.com/ . This could become serious competition for their InfluxCloud hosted offering.


We've been expecting this for two years. It was just a matter of when. It validates our space. AWS did this to Elastic, they have competing products with NewRelic, Splunk, SumoLogic, and countless others. All of whom still have healthy businesses.

Our goal remains the same: build the best possible product that optimizes for developer productivity and happiness. And open source as much as we possibly can while maintaining a healthy business.


Can't wait to see your new cloud offering :)


Amazon is the best partner until you prove worth cannibalizing.


This space is growing like mad. They would be remiss if they did not expect something like this.


and for timescaledb, streamlio, two sigma, kdb+ , quasardb ( and perhaps pipelinedb)


QuasarDB employee chiming in - we’re not actually worried, our clients are typically operating at a scale that would make costs very prohibitive for this AWS product.

Having said that, I can definitely see this be an interesting product for people doing less than 10k inserts per second.


> make costs very prohibitive for this AWS product.

Competing with AWS on just cost sounds worrying to me.


Actually, not really. We have migrated several customers to AWS and our experience with AWS services is kind of all over the place. S3 is super cheap and reliable and there is almost nothing that beats it, but CloudWatch, for example, is extremely expensive for a large-scale operation and is easy to beat with custom software (like Prometheus or something similar). I guess even Datadog would beat them (and they have much more advanced features and integrations). This means that multiple software vendors can exist in the same space even when Amazon has an offering in that space.


At what point are open source projects going to change their licensing to prevent the major cloud providers from just stealing their products? I highly doubt AWS built this from scratch. Amazon, Google, and Microsoft are going to choke the life out of these projects

Redis and MongoDB at least seem to have woken up

https://www.geekwire.com/2018/open-source-companies-consider...


> At what point are open source projects going to change their licensing to prevent the major cloud providers from just stealing their products?

There's a lot to unravel in there.

I prefer 'free software' to 'open source' as it has a clearer meaning, especially in this context. Even so, no one can steal free / open source software (or, as you say, product -- though that term strongly implies a commercial offering).

By definition you can't really stop anyone from using your free software, unless perhaps you start naming companies explicitly, but I can't imagine it'd be an easy process, or have a happy outcome, if you started targeting 'major cloud providers' for special conditions.

Note that I am not an apologist for AWS, Google, Microsoft, etc - but it feels like the fundamental problem here is not massive corporations charging other people to access free software.


Free software and open source aren't always the same thing though. Free Software is software that through the license enforces a philosophy.

Open source is software that through the license enforces the source code to remain open.

I'm not a fan of RMS or his attitudes on most things, but am a strong OSS fan as it is the best way to develop and maintain software.


> Free software and open source aren't always the same thing though.

Entirely agree, hence I drew the distinction. I eschew 'open source' as it's highly ambiguous, and mostly misses the point.

> Free Software is software that through the license enforces a philosophy.

I would disagree. Free software ensures the user has certain freedoms.

> Open source is software that through the license enforces the source code to remain open.

This is a very circular definition -- open source is open.

> I'm not a fan of RMS or his attitudes on most things, but am a strong OSS fan as it is the best way to develop and maintain software.

As it happens, rms is no fan of OSS.


RMS has said publicly that proprietary software should be illegal, and that free software guarantees user freedoms. This is a philosophy, and one that assumes every user wants to become a developer. For things like emacs, and much of GNU, this is true; for most of the rest of the world, it is not.

I eschew free software because I'm not about forcing my views on others (which is literally the mission of GNU). I'm about developing software to be the best it can be, and maybe meeting some friends along the way. Open source, being the best software development model overall, allows me to meet that goal. You could almost say some of RMS's more extreme quirks border on authoritarian (see the example with the abortion joke in the libc manual he FORBADE removal of and demanded be re-added when a dev simply overruled him). He's not acting in a manner that encourages "freedom", but as a simple and obvious dictator of all things GNU or claiming to be GNU. He has frequently tried to shape the path of GNOME (of which I am a former foundation member and was on the sysadmin team) in areas he literally has no business weighing in on. Then there are some more gross personality problems, like his sexism, his tendency to actually eat his toenail gunk[1], or his refusal to ever be wrong on anything, even when an entire community disagrees with him.

Dr Stallman has done a great deal of good for the world with Free Software; however, like the VAX and PDP-11, his time has passed. Open source won just like Linux won over GNU/Hurd. It is ok that he won, by losing.

[1] https://www.youtube.com/watch?v=I25UeVXrEHQ


rms's toenails aside, I'd suggest that every licence is about enforcing a set of views on others. There are plenty of licences to choose from, so it's fairly easy to find one compatible with your own views.

In the context of GP's (beginningguava) comment about 'open source projects' needing to change their licensing to prevent corporations making money by SaaSing various tools, my point was twofold - first, by definition you can't have free software with restrictions like that, and second you'd be merely fighting the symptoms (with little chance of success).

Aside - I'm curious what you mean by the 'open source software development model', as I don't think that's actually a thing.


Fair enough, the open source software development model is no different in reality from the free software development model generally speaking.

It goes back to ESR's The Cathedral and the Bazaar and is what he deems "the bazaar model" or "bazaar style", before he, Larry Augustin, and Bruce Perens (if memory serves) went on to coin the phrase "open source". Even if you don't necessarily agree with ESR (I see him in a similar vein as RMS, fwiw), his thoughts on software development models have, generally speaking, been proven true.


> Free Software is software that through the license enforces a philosophy.

This is incorrect; licenses like MIT or BSD are also free software licenses because they afford the 4 freedoms to the user (even though they don't enforce them on derivative works).

Licenses like the Redis one are open-source but not free software because they place limitations on the user (can't sell hosted service, IIRC)


Redis is BSD licensed. You mean some Redis module developed independently by Redis Labs that are not part of Redis itself.


Why not? InfluxDB does not have these capabilities.


> I highly doubt AWS built this from scratch

Unfounded doubt.


Nice to see, this has felt like a gap in cloud offerings for a while... and the open source options have difficulties.

From the little that was said, going to guess this uses something like Beringei (https://code.fb.com/core-data/beringei-a-high-performance-ti...) under the hood


The financial read cost of this database makes it practically unusable for customer facing dashboards, disappointing.


A place to put the timestamped data they download from yesterday's Amazon Ground Station.


Been searching for years for a good alternative to Postgres for storing gobs of weather timeseries data. So far we have been running a Postgres system for many years in production and have hired multiple contractors to implement a 'real timeseries solution'. All of which have been utter shit and complete failures. The AWS services are expensive as all hell. With a little bit of imagination we created a unique schema for timeseries data that doesn't require terabytes of space, processes billions of data points a day, and has blazing fast queries into said data.


I moved a decade's worth of weather time series data from well indexed Sqlite to InfluxDB and was nothing but pleased. It ended up taking an order of magnitude less storage and so much faster to query that I didn't even bother to benchmark it. You can probably write a simple query to your Postgres database to cough out the text file to load InfluxDB to see how it works for you. Then it comes down to how easy it is to replace your query and insert functions… So it's all easy except for the hard part.
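
For anyone wanting to try that route, here is a minimal sketch of the kind of export described above. The weather table and column names (station, observed_at, temperature_c) are invented for illustration; it writes InfluxDB line protocol to a text file that can then be loaded via InfluxDB's /write endpoint:

    # Minimal sketch: dump a hypothetical Postgres table of weather observations
    # to InfluxDB line protocol. Table and column names are illustrative only.
    import psycopg2

    conn = psycopg2.connect("dbname=weather_db")   # adjust the connection string
    cur = conn.cursor()
    cur.execute("SELECT station, observed_at, temperature_c "
                "FROM weather ORDER BY observed_at")

    with open("weather.lp", "w") as out:
        # Line protocol: measurement,tag=value field=value timestamp_in_ns
        for station, observed_at, temperature_c in cur:
            ts_ns = int(observed_at.timestamp() * 1e9)
            # Assumes station names contain no spaces or commas; escape them otherwise.
            out.write(f"weather,station={station} temperature_c={temperature_c} {ts_ns}\n")

    cur.close()
    conn.close()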



I once handled gobs of web tracking data in Cassandra, using Hadoop over Cassandra to build queryable rollups in MySQL, and Pig for on-the-fly analysis.

Neither tech works on its own, but together (substituting Cassandra columns for HDFS) it was magic for the specific data configuration & use case.


So how will this compare with Boundary, SignalFx, Stackdriver, and similar earlier services...

I'll have to go look into this, because AWS's historic pricing for any large-volume stream quickly becomes untenable.

It's very easy to have gobs and gobs of time series points... AWS might make using this way too expensive for anything at relative scale for a small startup?


It appears that ingestion alone is more expensive than the commercial metric services. Might not matter for small scale.


I wonder how this compares to KDB


Since kdb is SO good with time series data, I would need a lot of tech details to even know if AWS is worth testing. It's a LOT of work to test a time series db at scale.


The core of Amazon's timeseries db likely doesn't fit into a CPU's L1 cache. It does with KDB :)


>The core of Amazon's timeseries db likely doesn't fit into a CPU's L1 cache. It does with KDB :)

Does it?

https://kx.com/discover/in-memory-computing/ seems to indicate that it takes up ~600 Kb (I'm not sure if this is bits or bytes, but even if it's bits, that turns into 75KB)

L1 cache is per core. Skylake Xeons have a 64KB cache per core, 32KB for data and 32KB for instructions. Even with an even split there, you're not fitting 75KB (or 600KB) into the L1 cache.

Bits would be a weird measurement to use when talking about memory utilization, so I'm pretty sure that it's 600 kilobytes. You're not anywhere close to fitting that into the L1 cache. L2 cache, sure. But you get the relatively spacious 1 megabyte for L2.

I'm also not sure that the "core" fitting into the CPU cache is particularly meaningful for performance anyway. It doesn't say anything about how much outside of the core gets used, how big the working set size is for your workload, how much meaningful work is done on that working set of data, etc. If you're frequently using parts of the software that don't fit in the cache, or getting evicted from it for other code, or your working set of data doesn't fit in the cache and you're constantly going to main memory for the data you're working on, the "core" fitting in L1 cache (or L2 cache, which looks more realistic) is going to be basically meaningless.


Gah, I meant L2 cache, but was being entirely too smug. I remember a presentation a KX rep gave at our office a few jobs ago where this was one of their bullet points. I found it amusing, and a bit odd.


Gotcha! Definitely an interesting marketing point. I probably would have had the same reaction :)


Same for me:)


Seems positioned to compete with Azure Data Explorer (MSFT's log/time-series-optimized service). I know Azure runs a lot of services on top of Data Explorer (previously called Kusto). I wonder if this is a true internal battle-tested product or a me-too offering.


I use Kusto daily and it is by far my favorite dev tool. It's incredibly fast and actually fun to use. I'm interested to compare it to Timeseries.

Kusto is still the name of the query language and the desktop application (Kusto.Explorer), the service was just renamed to Azure Data Explorer.


Kusto is architecturally closer to Dremel (or BigQuery). It's a columnar compressed datastore with a nice query language. Not the most efficient way to store and query timeseries data though.

Back then (internally) we actually had a lot of issues with ingesting and querying time series data at scale.


I might be mistaken, but wouldn't Data Explorer be more similar to AWS CloudWatch, which has been around for a long time?


Azure Data Explorer / Kusto is more of a database that is optimized for the log use case than a service. There is a front-end tool and a lot of the use cases are around log management, but it is a database you can do general SQL or KQL things with. Time series is one of the core use cases for it also, but it has less marketing around it.


Sounds more like AWS CloudWatch Log Insights launched yesterday.


Seems like this could be a great remote storage backend for Prometheus.


Oh yeah! Would that need a custom storage provider in Prometheus?


Yes [0]. I haven't had time to fully look into the details. But from looking at the existing integration -- it would be fantastically convenient to have something cloud native [1].

[0] https://prometheus.io/docs/prometheus/latest/storage/#remote...

[1] https://prometheus.io/docs/operating/integrations/#remote-en...


Honest question: when dealing with time-series data, do you actually need every data point? Is that level of granularity really necessary?

IMO, it makes way more sense to decide the aggregations you want ahead of time (e.g. "SELECT customer, sum(value) FROM purchases GROUP BY customer"). That way, you deal with substantially less data and everything becomes a whole lot simpler.
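
A minimal sketch of what deciding the aggregations ahead of time can look like on the ingest path (the names and the one-minute bucket are made up for illustration): raw points update a running rollup and are never stored individually.

    # Streaming pre-aggregation sketch: keep per-customer, per-minute sums
    # instead of every raw point. Names and bucket size are illustrative only.
    from collections import defaultdict

    BUCKET_SECONDS = 60
    rollups = defaultdict(float)   # (customer, bucket_start_epoch) -> running sum

    def ingest(customer: str, ts_epoch: float, value: float) -> None:
        bucket = int(ts_epoch // BUCKET_SECONDS) * BUCKET_SECONDS
        rollups[(customer, bucket)] += value   # the raw point itself is discarded

    # Three raw points collapse into two stored rows.
    ingest("acme", 1543421708, 12.5)
    ingest("acme", 1543421712, 3.0)
    ingest("acme", 1543421799, 7.0)
    print(rollups)

The obvious trade-off, raised in the replies below, is that you can never go back and group by a dimension or granularity you didn't keep.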


Really depends on the use case. Working in healthcare, vital signs can be modeled as time series points, but are lower frequency than, say, metrics from servers. However we want to store every point so a spike is not missed. One could argue an unsustained spike is noise, but in the healthcare domain there may be a correlation with some external event (the person is surprised and their heart rate spikes).


The clever thing to do in this scenario would be to keep every spike but delete all the data between similar data points after storing. So you get low granularity for identical/nearly-the-same data points and high granularity when something interesting happens. I don't have any experience with time-series data so maybe this is commonplace.


That would be impossible to run any new analyses on.

What some would do is record in blocks where every point after the earliest is stored as a delta. Then each block is more compressible as it contains a lot of 0s.
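
A rough sketch of that block/delta idea (function names are illustrative): the first value in a block is stored as-is and every later value as the difference from its predecessor, so flat stretches turn into runs of zeros that compress well.

    # Delta-encode a block of samples; flat stretches become runs of zeros.
    from typing import List

    def delta_encode(block: List[int]) -> List[int]:
        if not block:
            return []
        return [block[0]] + [b - a for a, b in zip(block, block[1:])]

    def delta_decode(deltas: List[int]) -> List[int]:
        out, acc = [], 0
        for i, d in enumerate(deltas):
            acc = d if i == 0 else acc + d
            out.append(acc)
        return out

    samples = [70, 70, 70, 71, 71, 140, 72, 72]   # heart-rate-like series with a spike
    encoded = delta_encode(samples)                # [70, 0, 0, 1, 0, 69, -68, 0]
    assert delta_decode(encoded) == samples

Production time series formats (e.g. Facebook's Gorilla/Beringei, linked elsewhere in this thread) push the same idea further with delta-of-delta timestamps and XOR-compressed float values.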


In finance it can be critical.

Some tasks actually require absolute granularity, up to 6 decimal places of precision and thereafter reliance on atomic order of arrival, for deterministic results on data from high frequency trading.

Without absolute knowledge of the order or if there's aggregation the best you can do is approximate, which often is considered suboptimal when the real solution is available.


Sure, you can do that if you're really sure that you won't need to group by something else later. You wouldn't want to store more granularity than necessary, but you can't go back in time to get a data point you didn't store.


In that case, can you just store each data point in a data lake somewhere and do a batch-job? Apache Flink supports this use case as well as real-time.


Yes, but I guess the point of using a time series DB rather than just a lake that doesn't necessarily have a time structure is that if you know time is going to be important, then you probably want to organize and query it that way.

What I am doing with one program could almost be called a data lake because it is just a bunch of JSONL files that have really varied data in them. But it's organized by date and hour per day as well as predefined keys, since I know I will need to query it that way.


We had applications where we were tracking guests in a venue through various means. We tried a number of queuing systems to manage the flood of events, but they'd all fall over. I'd love to run my old "venue simulator" through this and see if it can stand up to actual guest load as they walk around, ride, and purchase things.


I'm wondering if this shares any technology with the CloudWatch metrics backend. They've been making improvements there all year, and most of them generally align with what's announced here.

CloudWatch metrics are also very expensive for what you get, so that's another similarity to Timestream ;)


I couldn't tell from the page: is this SQL-based, similar to TimescaleDB, or more similar to InfluxDB?


This is looking like a managed Druid... that would be very nice to have.


Anyone know if this is what CloudWatch Insights uses? If so, it doesn't even come close to competing with Elasticsearch performance (with a tiny cluster), it seemed quite slow.


There have been a lot of amazon links this week


AWS re:Invent is happening. Kind of like how there's a lot of Google links during I/O or a lot of Apple links during WWDC.


Re:invent week. Expect a lot more.


Where are the docs?


There are at least 7 Amazon-related stories on the HN front page right now, what’s going on?



AWS re:Invent


Merry christmas?


[flagged]


Most big tech companies bundle their announcements on their annual conference day. It happens several times a year and these same complaints show up every time, but spamming the threads with them is excessive.

https://hn.algolia.com/?sort=byDate&dateRange=all&type=comme...


A quick look at the Hacker News frontpage shows a bit of a problem,

    1.  Amazon Timestream (amazon.com)
    3.  Amazon Quantum Ledger Database (amazon.com)
    8.  Amazon FSx for Lustre (amazon.com)
    13. AWS DynamoDB On-Demand (amazon.com)
    14. Amazon's homegrown Graviton processor was very nearly an AMD Arm CPU (theregister.co.uk)
    21. Building an Alexa-Powered Electric Blanket (shkspr.mobi)
    30. Amazon FSx for Windows File Server (amazon.com)



Ah, sorry. I should've known you'd already be on it. Thanks.


Whether we are fans or not, keeping up with (and discussing) new AWS offerings is extremely relevant for practitioners.


AWS Re:Invent is happening in Las Vegas this week. AWS is launching new services all this week. People here are interested in cloud services, so it makes sense that people share and upvote those announcements. I don't see the problem that you're talking about.


jesus christ six amazon articles in a day? AWS is undeniably the body of christ for HN but am i missing something? FSX, blockchain, timestream, Graviton, ground station, and cloudwatch... all of these articles are advertisements for mundane shit.


It's AWS re:Invent day/week. I've never been, but I get the impression that it's like Apple's keynote, or Google I/O, in which big product announcements are made. On those days, you'll see multiple submissions about the respective conferences too.


This is exactly what re:Invent is. Most teams dream of launching a new AWS product at re:Invent (and not missing their date and launching at a later time)


It'd be an interesting blog post topic to look at how AWS re:Invent (or I/O, or F8, etc.) product threads on HN compare to actual product impact. I remember Rekognition getting decent discussion 2 years ago [0], but not along the angles or magnitude of how Rekognition is usually discussed in recent months. OTOH, other things I've been interested in as a data geek, I've barely heard of since reading about them on HN -- e.g. Athena [1]

[0] https://news.ycombinator.com/item?id=13072956

[1] https://news.ycombinator.com/item?id=13072245


https://reinvent.awsevents.com/

It happens every year during Google, Apple, Amazon, and Facebook events.


thanks. Just glanced at the shop calendar and i guess the holidays are right around the corner too.


There will be a lot more. The keynote today still has an hour left, and there are two more hours of keynotes tomorrow.

(Written from keynote floor)


Where can I see your keynote-notes/updates?


I don’t usually do that, I just share my thoughts as comments on HN. There’s enough people doing that other stuff. :)


So.... whats your biggest takeaway thus far then? ;-)

EDIT: Is mobot a dead project?


AI AI AI AI.


> all of these articles are advertisements for mundane shit

What do you think hacker news is supposed to be? Anything that isn't mundane to a lot of people isn't interesting enough for those in the target audience. Amazon is having its annual AWS conference and thus has a lot of announcements, of course they have a lot of new niche products.


I count 7 separate Amazon posts on the front of HN. Is this some conspiracy? #NotAmused #ShouldBeBundled


The conspiracy is called re:Invent



