Monitoring your own infrastructure using Grafana, InfluxDB, and CollectD

avthar · on July 21, 2020

Echoing the sentiment expressed by others here, for a scalable time-series database that continues to invest in its community and plays well with others, please check out TimescaleDB.

We (I work at TimescaleDB) recently announced that multi-node TimescaleDB will be available for free, specifically as a way to keep investing in our community: https://blog.timescale.com/blog/multi-node-petabyte-scale-ti...

Today TimescaleDB outperforms InfluxDB across almost all dimensions (credit goes to our database team!), especially for high-cardinality workloads: https://blog.timescale.com/blog/timescaledb-vs-influxdb-for-...

TimescaleDB also works with Grafana, Prometheus, Telegraf, Kafka, Apache Spark, Tableau, Django, Rails, anything that speaks SQL...

gshulegaard · on July 21, 2020

Just wanted to say I am super impressed with the work TimescaleDB has been doing.

Previously at NGINX I was part of a team that built out a sharded timeseries database using Postgres 9.4. When I left it was ingesting ~2 TB worth of monitoring data a day (so not super large, but not trivial either).

Currently I have built out a data warehouse using Postgres 11 and Citus. Only reason I didn't use TimescaleDB was lack of multi-node support in October of last year.

I sort of view TimescaleDB as the next evolution of this style of Postgres scaling. I think in a year or so I will be very seriously looking at migrating to TimescaleDB, but for now Citus is adequate (with some rough edges) for our needs.

roskilli · on July 21, 2020

If you're looking at scaling monitoring timeseries data you may also wanter to consider more Availability leaning architecture (in the CAP theory sense) with respect to replication (i.e. quorum write/read replication, strictly not leader/follower - active/passive architecture) then you might also want to check out the Apache 2 project M3 and M3DB at m3db.io.

I am biased obviously as a contributor. Having said that I think it's always worth understanding active/passive type replication and the implications and see how other solutions handle this scaling and reliability problem to better understand the underlying challenges that will be faced with instance upgrades, failover and failures in a cluster.

gshulegaard · on July 21, 2020

Neat! Hadn't heard of M3DB before, but cursory poke around the docs seems like it's a pretty solid solution/approach.

My current use case isn't monitoring, or even time series anymore, but will keep M3DB in mind next time I have to seriously push a time series/monitoring solution.

valyala · on July 22, 2020

Did you try ClickHouse? [1]

We were successfully ingesting hundreds of billions of ad serving events per day to it. It is much faster at query speed than any Postgres-based database (for instance, it may scan tens of billions of rows per second on a single node). And it scales to many nodes.

While it is possible to store monitoring data to ClickHouse, it may be non-trivial to set up. So we decided creating VictoriaMetrics [2]. It is built on design ideas from ClickHouse, so it features high performance additionally to ease of setup and operation. This is proved by publicly available case studies [3].

[1] https://clickhouse.tech/

[2] https://github.com/VictoriaMetrics/VictoriaMetrics/

[3] https://victoriametrics.github.io/CaseStudies.html

gshulegaard · on July 22, 2020

ClickHouse's intial release was circa 2016 IIRC. The work I was doing at NGINX predates ClickHouse's initial release by 1-2 years.

ClickHouse was certainly something we evaluated later on when we were looking at moving to a true columnar storage approach, but like most columnar systems there are trade-offs.

* Partial SQL support.

* No transactions (not ACID).

* Certain workloads are less efficient like updates and deletes, or single key look ups.

None of these are unique to ClickHouse, they are fairly well known trade-offs most columnar stores make to improve write throughput and prioritize high scaling sequential read performance. As I mentioned before, the amount of data we were ingesting never really reached the limits of even Postgres 9.4, so we didn't feel like we had to make those trade-offs...yet.

I would imagine that servicing ad events is several factors larger scale than we were dealing with.

douglasheriot · on July 22, 2020

I've been using InfluxDB, but not satisifed with limited InfluxQL, or over-complicated Flux query languages. I love Postgres so TimescaleDB looks awesome.

The main issue I've got is how to actually get data into TimescaleDB. We use telegraf right now, but the telegraf Postgres output pull request still hasn't been merged: https://github.com/influxdata/telegraf/pull/3428

Any progress on this?

akulkarni · on July 22, 2020

There is a telegraf binary available here that connects to TimescaleDB: https://docs.timescale.com/latest/tutorials/telegraf-output-...

If you are looking to migrate data, then you might also want to explore this tool: https://www.outfluxdata.com/

valyala · on July 22, 2020

I believe PromQL [1] and MetricsQL [2] are much better suited for typical queries over time series data than SQL, Flux or InfluxQL.

[1] https://medium.com/@valyala/promql-tutorial-for-beginners-9a...

[2] https://victoriametrics.github.io/MetricsQL.html

jimaek · on July 21, 2020

While influx is pretty bad overall it's super simple to deploy and configure unlike timescale.

It's the main reason we decided to use influx in our small team with simple enough timeseries needs

avthar · on July 21, 2020

Curious what you found difficult to deploy/ configure? Is this in a self-managed context?

themgt · on July 21, 2020

I was testing out both TimescaleDB/InfluxDB recently, including maybe using with Prometheus/Grafana. I was leaning towards Timescale, but InfluxDB was indeed a lot easier to quickly boot a "batteries included" setup and start working with live data.

I eventually spent a while reading about Timescale 1 vs 2, and testing the pg_prometheus[1] adapter and started thinking through integrating its schema to our other needs then realizing it's "sunsetted" and then reading about the new timescale-prometheus[2] adapter and reading through its ongoing design doc[3] with updated schema that I'm less a fan of.

I finally wound up mostly-settling on Timescale although I've put the Prometheus extension question on hold, just pulling in metrics data and outputting with ChartJS and some basic queries got me a lot closer to done for now. Our use case may be a little odd regardless, but I think a timescale-prometheus extension with a some ability to customize how the data is persisted would be quite useful.

[1] https://github.com/timescale/pg_prometheus

[2] https://github.com/timescale/timescale-prometheus

[3] https://docs.google.com/document/d/1e3mAN3eHUpQ2JHDvnmkmn_9r...

cevian · on July 21, 2020

(TimescaleDB engineer) Really curious about finding out more about your reservations about the updated schema. All criticisms welcome.

themgt · on July 21, 2020

Thanks, as I say our use case may be too odd to be worth supporting, but effectively we're trying to add a basic metrics (prom/ad-hoc) feature to an existing product (using Postgres) with an existing sort of opinionated "ORM"/toolkit for inserting/querying data.

Because of that and the small scale required, the choice of table-per-metric would be a tough fit and I think a single table with JSONB and maybe some partial indexes is going to work a lot better for us. It would just be nice if we could somehow code in our schema mapping and use the supported extension, but I get it may be too baked-into the implementation.

Anyway, overall we're quite happy with TimescaleDB!

cevian · on July 22, 2020

I see, that makes a whole lot of sense. This is an interesting use-case. It actually may be possible to use some of the aggregates provided by the extension even with a different schema. If you are interested in exploring further, my username on slack is `mat` and my email is the same @timescale.com

samstave · on July 21, 2020

I found this to be a funny, subtle insult. :-)

"Why is it difficult, because you're self managed?"

znpy · on July 21, 2020

I think it's a honest question. Maybe they want to find out if setting up needs a bit of us work or simplifying.

avthar · on July 21, 2020

No intention to insult - just trying to get more context :)

samstave · on July 21, 2020

I know - I just read it as such and chuckled.

souldeux · on July 21, 2020

What's wrong with influx? I use it and like it, albeit for hobby-level projects.

chucky_z · on July 21, 2020

Influx is so good for hobby-level projects. So good!!

For serious applications, it doesn't cut it. Trying to do more than a few TB a day is a waste of time outside of enterprise, which ain't cheap.

I plopped VictoriaMetrics in place of Influx for my cases and haven't even had a single hiccup.

mhall119 · on July 21, 2020

As with any technology, the right tool for the job is highly dependent on what the job is.

valyala · on July 22, 2020

InfluxDB may have slightly big RAM requirements when working with high number of time series [1].

[1] https://medium.com/@valyala/insert-benchmarks-with-inch-infl...

valyala · on July 22, 2020

If you want the best of both worlds, then try VictoriaMetrics. It is simple to deploy and operate and it is more resource-efficient comparing to InfluxDB. More on this, it supports data ingestion over Influx line protocol [1].

[1] https://victoriametrics.github.io/#how-to-send-data-from-inf...

heliodor · on July 21, 2020

For small and medium needs, InfluxDB is a delight to use. On top of that, the data exploration tool within Chronograf is one-of-a-kind when it comes to exploring your metrics.

If you are looking for a vendor to host, manage, and provide attentive engineering support, check out my company: https://HostedMetrics.com

a10c · on July 22, 2020

One of the biggest quirks that I had bumped up against with TimescaleDB is that it's backed by a relational database.

We are a company that ingests around 90M datapoints per minute across an engineering org of around 4,000 developers. How do we scale a timeseries solution that requires an upfront schema to be defined? What if a developer wants to add a new dimension to their metrics, would that require us to perform an online table migration? Does using JSONB as a field type allow for all the desirable properties that a first-class column would?

valyala · on July 22, 2020

90M datapoints per minute means 90M/60=1.5M datapoints per second. Such amounts of data may be easily handled by specialized time series databases even in a single-node setup [1].

> What if a developer wants to add a new dimension to their metrics, would that require us to perform an online table migration?

Specialized time series databases usually don't need defining any schema upfront - just ingest metrics with new dimensions (labels) whenever you wish. I'm unsure whether this works with TimescaleDB.

[1] https://victoriametrics.github.io/CaseStudies.html

mekster · on July 22, 2020

Can you refute the claim that TimescaleDB uses such a huge disk space that is 50 times larger than other time series database?

I see no reason to use it over VictoriaMetrics.

https://medium.com/@valyala/high-cardinality-tsdb-benchmarks...

PaulWaldman · on July 22, 2020

Yes, I can refute that claim. TimescaleDB now provides built in compression.

https://docs.timescale.com/latest/using-timescaledb/compress...

mekster · on July 22, 2020

You didn't refute it technically.

Are you saying the compression shrinks the data down to 2% on average?

If the compression only makes the data 10 times smaller (I think I'm being generous with that ratio), it's still 5 times larger than the others.

PaulWaldman · on July 22, 2020

Completely anecdotal, but I went from 50B to 2B per point using the compression feature. Mind you this is for slow moving sensor data collected at regular 1 second intervals.

Prior to the compression feature, I had the same complaint. Timescale strongly advocated for using ZFS disk compression if compression was really required. Requiring ZFS disk compression wasn't feasible for me.

aeyes · on July 21, 2020

I don't consider TimescaleDB to be a serious contender as long as I need a 2000 line script to install functions and views to have something essential for time-series data like dimensions:

https://github.com/timescale/timescale-prometheus/blob/maste...

mfreed · on July 21, 2020

Because TimescaleDB is a relational database for general-purpose time-series data, it can be used in a wide variety of settings.

That configuration you cite isn’t for the core TimescaleDB time-series database or internals, but to add a specific configuration and setup optimized for Prometheus data, so that users don’t need to think about it and TimescaleDB can work with the specific Prometheus data model right out of the box.

Those “scripts” automatically setup a deployment that is optimized specifically to have TimescaleDB serve as a remote backend for Prometheus without making users think about anything. Some databases would need to write a lot of specific internal code for this; TimescaleDB can enable this flexibility through configuring proper schemas, indexes, and views instead (in addition to some specific optimizations we've done to support PromQL pushdown optimizations in-database written in Rust).

This setup also isn't something that users think about. Users install our Prometheus stack with a simple helm command.

$ helm install timescale/timescale-observability

The Timescale-Prometheus connector (written in Go) sets this all up automatically and transparently to users. All you are seeing is the result of the system being open-source rather than having various close-source configuration details :)

- https://github.com/timescale/timescale-observability

- https://github.com/timescale/timescale-prometheus

cevian · on July 21, 2020

(TimescaleDB engineer here). This is a really unfair critique.

The project you cite is super-optimized for the Prometheus use-case and data model. TimescaleDB beats InfluxDB on performance even without these optimization. It's also not possible to optimize in this way in most other time-series databases.

These scripts also work hard to give users a UIUX experience that mimicks PromQL in a lot of ways. This isn't necessary for most projects and is a very specialized use-case. That takes up a lot of the 2000 lines you are talking about.

aeyes · on July 21, 2020

> TimescaleDB beats InfluxDB on performance even without these optimization

Would you mind sharing the schema used for this comparison? Maybe I missed it in your documentation of use-cases. When implementing dynamic tags in my own model, my tests showed that your approach is very necessary.

cevian · on July 21, 2020

You can look at what we use in our benchmarking tool https://github.com/timescale/tsbs (results described here https://blog.timescale.com/blog/timescaledb-vs-influxdb-for-...).

Pretty much it's a table with time, value, tags_id. Where the tags table is id, jsonb

R0b0t1 · on July 21, 2020

>The project you cite is super-optimized for the Prometheus use-case and data model.

That may be true, but: Instead of figuring out how to meet the user's needs you're going to say the user is wrong?

NicoJuicy · on July 21, 2020

That's not what he said and is kinda rude to be honest.

What I consider to be said is that they optimize for extendability ( not just one use-case) and that those 2000 lines do a lot of things.

Including replicating optimizations that are are not easy to achieve out of the box in their and other solutions. But they mostly found a way ;)

R0b0t1 · on July 23, 2020

It's about as rude as telling someone their criticism is misplaced in that way. You've added more words to why you believe that, but that doesn't change the original sentiment.

You can say you don't want to support that usecase but someone inquired about whether you would. You said no, they're wrong to compare the two. Sounds a lot like "you're holding it wrong."

gen220 · on July 21, 2020

(Not affiliated with TimescaleDB, just trying to understand your critique)

TimescaleDB is a Postgres extension, which means they have to play PG's rules, which rather implies a complicated mesh of functions, views, and "internal-use-only" tables.

But once they're there, you can pretty much pretend they don't exist (until they break, of course, but this is true for everything in your software stack).

Is your complaint targeting the size of these scripts, or their contents? Because everything in there appears pretty reasonable to me.

aeyes · on July 21, 2020

My problem is what these scripts do. When working with timeseries you want to be able to filter on dimensions/tags/labels.

TimescaleDB doesn't support this out of the box, the reason being Postgres' limitations on multi column indexes when you have data types like jsonb. What they have to do to work around this is to have a mapping table from dimensions/tags/labels which are stored in a jsonb field to integers which they then use to look the data up in the metrics table.

I would expect that a timeseries database is optimized for this type of lookup without hacks involving two tables.

cevian · on July 21, 2020

(TimescaleDB engineer) Actually, the reason we break out the Jsonb like this is because json blobs are big and take up a lot of space and so we normalize it out. This is the same thing that InfluxDB or Prometheus do under the hood. We also do this under the hood for Timescale-Prometheus.

If you have a data model that is slightly more fixed than Prometheus, you'd create a column per tag and have a foreign-key to the labels table. That's a pretty simple schema which would avoid a whole lot of complexity in here. So if you are developing most apps this isn't a problem and filtering on dimensions/tags/labels isn't a problem. Again, this is only an issue of you have very dynamic user-defined tags, as Prometheus does. In which case you can just use Timescale-Prometheus.

gen220 · on July 21, 2020

I'll echo the sibling comment by the TDB engineer, having read and contributed to similar systems myself.

Most of these custom databases (Influx) use techniques similar to TDB's under the hood, in order to persist the information in an read/write optimized way; they just don't "expose" it as clearly to the end-user, because they aren't offering a relational database.

A critical difference between the two, is that custom databases don't benefit from the fine-tuned performance characteristics of Postgres. So even though they're doing the same thing, they'll generally be doing it slower.

klohto · on July 21, 2020

Which kinda confirms the parent’s issues... InfluxDB is really easy to deploy and forget. With TimescaleDB you should be ready to know ins-and-outs of PG to secure and maintain correctly. Sure, for scaling and high loads TDB might be good but InfluxDB is easier and suitable for most loads and maintainability.

gen220 · on July 21, 2020

> With TimescaleDB you should be ready to know ins-and-outs of PG to secure and maintain correctly.

Gotcha, this makes sense. To this, I'd pose the question: is the same not true for Influx? (i.e. With IFDB, you should be ready to know its ins and outs to secure and maintain it correctly). I guess I think about choosing a database like buying a house. I want it to be as good in X years as it is today, maybe better.

From this perspective, PG has been around for a long time, and will continue to be around for a long time. When it comes time to make tweaks to your 5 year old database system, it will be easier to find engineers with PG experience than Influx experience. Not to mention all of the security flaws that have been uncovered and solved by the PG community, that will need to be retrodden by InfluxDB etc.

Anyways, it's just an opinion, and it's good to have a diversity of them. FWIW, I think your perspective is totally valid. It's always interesting to find points where reasonable people differ.

Legogris · on July 21, 2020

Conversely, because of its larger scope and longer history, there are a lot more ins and outs for PG.

mhall119 · on July 21, 2020

Think if InfluxDB is "move-in ready" when buying a house. Sure you won't know the electrical and plumbing as well as you will a "fixer-uppper" in the long run, but you'll have a place to sleep a lot faster and easier.

gen220 · on July 21, 2020

I think it's a good analogy; very reasonable people come to different conclusions on this problem IRL too.

I prefer older houses (at least pre-80s). They have problems that are very predictable and minor. If they had major problems, they would show them quite clearly in 2020.

My friends prefer newer properties, but this is an act of faith, that the builders today have built something that won't have severe structural/electrical/plumbing problems in 20 years.

Of course, if you only plan on living some place for 3-5 years, none of this matters; you want to live where you're comfortable.

kermatt · on July 21, 2020

There may be some extra install steps with Timescale, but that is balanced by having data in a platform accessible by a very well known API.

There are always tradeoffs, but in cases like mine, I find the basis of the platform - PostgreSQL - to provide a very useful set of features.

doublesCs · on July 21, 2020

[flagged]

aeyes · on July 21, 2020

The linked scripts are not part of the base installation yet TimescaleDB claims, for example, Prometheus support.

If you design your own schema and want to filter on such a dynamic tag field you have to replicate what they have implemented in these scripts. If you look at the code you'll see that this isn't exactly easy and other products do this out of the box.

I use TimescaleDB and have suffered from this, otherwise I wouldn't be talking here. Another problem is that these scripts are not compatible with all versions of TimescaleDB so you have to be careful when updating either one.

thibauts · on July 21, 2020

I’m really sad to see this kind of tone more and more in HN comments...

doublesCs · on July 21, 2020

I find it sadder that people talk such glaring nonsense and are taken seriously. Extremely sad.

gen220 · on July 21, 2020

IMO, the HN guidelines are pretty clear and reasonable regarding these situations.

> Be kind. Don't be snarky. Have curious conversation; don't cross-examine. Comments should get more thoughtful and substantive, not less, as a topic gets more divisive.

> Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith.

The original poster was making an assertion that TDB is too complex. As the guidelines suggest, it doesn't help the conversation to assume that they arrived upon this conclusion unreasonably, and that they're irrational or otherwise unwilling to have an open conversation about it.

We have to assume that they are open to a discussion, despite whatever we might interpret as evidence to the contrary.

say_it_as_it_is · on July 21, 2020

This thread is about infra monitoring, where any of the potential performance differences probably don't matter at all. If it does, show us.

I don't work for any time series database provider.

darkwater · on July 21, 2020

Why performances shouldn't matter in this case? The more instances, the more timeseries and the more datapoints you have, the more you care for speed when you need to query it to visualize for example WoW changes in, say, memory usage.

say_it_as_it_is · on July 21, 2020

Consider the case at hand, involving streaming charts and alerts. There will be zero perceptible difference in the streaming charts regardless of what database is used. Alerts won't trigger for whatever millisecond difference there may be, and I don't think that this matters to any developer or manager awaiting the 3am call.

brown9-2 · on July 22, 2020

Time series database performance is not only about the reads.

pritambaral · on July 21, 2020

Obligatory heads-up: parts of TimescaleDB (including the multi-node feature) come with no right-to-repair. See https://news.ycombinator.com/item?id=23274509 for more details.

akulkarni · on July 21, 2020

We are currently working on revising this to make the license more open. Stay tuned :-).

(If you want to provide any early feedback, please email me at ajay (at) timescale.com)

ngrilly · on July 21, 2020

TimescaleDB and my team is using it, but one significant drawback compared to solutions like Prometheus are the limitations of continuous aggregations (basically no joins, no order by, no window functions). That’s a problem when you want to consolidate old data.

cevian · on July 21, 2020

(TimescaleDB engineer) we hear you and are working on making continuous aggregations easier to use.

For now, the recommended approach is to perform continuous_aggregates on single tables and perform joins, order by, and window when querying the materialized aggregate rather than when materializing.

This often has the added benefit of often making the materialization more general so that a wider range of queries can use it.

ngrilly · on July 22, 2020

Thanks for the advice! Makes sense. We are doing something similar (aggregate on single table and join later). But still looking a solution to compute aggregated increments when there are counter resets.

ngrilly · on July 22, 2020

Meant "TimescaleDB is great and my team is using it"...

didip · on July 21, 2020

How does multi node works with Postgres? Does TimescaleDB create its own Raft layer on top and just treat Postgres as dumb storage?

k-rus · on July 21, 2020

A TimescaleDB engineer here. Current implementation of database distribution in TimescaleDB is centralised where all traffics go through an access node, which distributes the load into data nodes. The implementation uses 2PC. Abilities of PostgreSQL to generate distributed query plans are utilised together with TimescaleDB optimisations. So PostgreSQL is used not just a dumb storage :)

didip · on July 21, 2020

Won't that access node be a single point of failure then? Just trying to learn more.

cevian · on July 21, 2020

The access node is replicated using streaming replication and is thus not a SPOF.

Coryodaniel · on July 21, 2020

> high-cardinality workloads

Does this mean if you were using it with Prometheus you could get around issues with high cardinality labels?

_wka6 · on July 21, 2020

We're using TimescaleDB + Grafana for visualising sensor data for a Health Tech product. No complaints so far.

znpy · on July 21, 2020

After a few years in the industry of systems engineering and administration I think that "no complaints so far" is one of the best compliments a software can receive.

linsomniac · on July 21, 2020

I've been doing monitoring of our ~120ish machines for 3-4 years now using Influx+Telegraf+Grafana, and have been really happy with it. Prior to that we were using collectd+graphite and with 1 minute stats it was adding some double-digits %age utilization on our infrastructure (I don't remember exactly how much, but I want to say 30% CPU+disk).

Influxdb has been a real workhorse. We suffered through some of their early issues, but since then it's been extremely solid. It just runs, is space efficient, and very robust.

We almost went with Prometheus instead of Influx (as I said, early growing pains), but I had just struggled through managing a central inventory and hated it, so I really wanted a push rather than pull architecture. But from my limited playing with it, Prometheus seemed solid.

rehevkor5 · on July 22, 2020

It's so much easier to write incorrect/misleading queries in influxql than in promql. And you can't perform operations between two different series names in influxdb, last I looked. That makes it impossible to do things like ratios or percentages unless you have control over your metrics, and structure them the way influx likes. Also, no support for calculating percentiles from Prometheus histogram buckets.

mhall119 · on July 22, 2020

You can do these things (and much, much more) with the new Flux language

drez · on July 21, 2020

I just wanted to echo this sentiment. Influx had its fair share of issues a few years ago (v0.8 migration, changing storage engines, tag cardinality issues), but the latest v1.x releases have been solid. I have been using the TIK stack (I use Grafana instead of Chronograf) for monitoring several dozen production-facing machines for 2 years now without a single issue, which I would very much count as a win.

I just hope they learned their lesson for the v2.0 release...

valyala · on July 22, 2020

> I really wanted a push rather than pull architecture

Then try VictoriaMetrics - it supports both pull and push (including Influx line protocol) [1], it works out of the box and it requires lower amounts of CPU and RAM when working with big number of time series (aka high cardinality) [2].

[1] https://victoriametrics.github.io/#how-to-import-time-series...

[2] https://medium.com/@valyala/insert-benchmarks-with-inch-infl...

carlosdp · on July 21, 2020

Gotta say though, having rolled grafana and prometheus and such on my own plenty of times before, if you are a startup and can afford Datadog, use Datadog.

dijit · on July 21, 2020

I just costed a datadog deployment (based on your comment) and it would cost me my yearly salary every month.

No thanks. :/

secondcoming · on July 21, 2020

We're moving from Datadog to Prometheus for this reason

cs-szazz · on July 21, 2020

Wait, you're saying it costs 12x as much as your salary? That doesn't seem right... the company I'm at uses it pretty heavily and we're at about 1x of a FTE salary (and it's still super worth it)

klohto · on July 21, 2020

World’s salaries aren’t only US based :) Datadog for a smaller Western Europe startup is going to cost more than their devs salaries.

user5994461 · on July 21, 2020

Western salaries are not that low and true costs to the employer is typically double the perceived salary. That means we're talking tens of thousands of euros, so thousands of hosts (list price is certainly negotiable at this scale).

If they've got a thousand of hosts, the costs of the infrastructure itself must dwarf the salary of any developer by orders of magnitude, the salary of a developer is simply irrelevant when it comes to acquiring software/hardware.

dijit · on July 21, 2020

> true costs to the employer is typically double the perceived salary.

This is true, but my comment was an offhanded way to say that my "salary" (as in, the one on my contracts and the one I "see") is less than a month of Datadog for our number of hosts.

As for the rest of your comment, I wish it was true.

Developer salaries outside of the capitals is quite low in Europe, and even inside the capitals only go to "near double"

So, instead of 12x it becomes 6x developer costs per annum, which is a fair whack of money.

For me to justify spending "3-6" peoples worth of money it had better save "3-6" peoples worth of time.

user5994461 · on July 21, 2020

> For me to justify spending "3-6" peoples worth of money it had better save "3-6" peoples worth of time.

Well, it does in my experience, especially if you have to handle 2000+ hosts, that's some serious infra there, need serious tooling.

May I ask which country is it?

aprdm · on July 22, 2020

I think another thing to keep in mind is that not necessarily hardware infrastructure size equals to profitability.

Some industries need a lot of hardware because they crunch a lot of data but they aren't software companies. Think Computer graphics rendering.

Paying a FTE salary for software is crazy for them. I would love to see a ration of developers/infrastructure per industry/company.

dijit · on July 21, 2020

Sweden! :D

cs-szazz · on July 21, 2020

You're right, I was narrow-minded in my thinking there, thanks for pointing it out :)

rattray · on July 21, 2020

Can you say something vague about your deployment scale?

dijit · on July 21, 2020

2,500~ compute instances.

Looking at the pricing page the cost $15/instance so $37,500

Middle-of-the-road developer salary is like $35k in most of Europe, outside of the capitals.

Although it does say: "Volume discounts available (500+ hosts/mo). Contact us." at the bottom, so I guess 500 is a lot.

user5994461 · on July 21, 2020

2500 instances could be millions per months in AWS costs. The smallest instances with some disks and bandwidth fees can push 100k a month.

Spending a fraction of that to monitor that sort of infrastructure is absolutely justified. I can tell you from experience that datadog gives discount even for 100+ hosts, I don't know what they can do for 2500, but if it were me I wouldn't accept anything less than 50% off.

Honestly you need to forgot about your salary, it's irrelevant when it comes to running a company. Imagine a driver in a shipping company deciding to deliver on a scooter rather than a truck because the truck is worth more than his yearly salary.

sciurus · on July 21, 2020

It could also be less than $100k a month (e.g. 2500 c5a.large with a 1-year reservation). At that point you'd wonder why your monitoring bill was 40% of your compute bill.

Also, of course their salary is relevant. The cost of an engineer's time is an important factor to consider when making build vs buy decisions. Usually it's one that argues in favor of "buy", but not always.

user5994461 · on July 22, 2020

2500 c5a.large is $172k per month, or $102k with one year reservation full upfront.

Add 10% support fees, EBS storage for the OS, bandwidth fees and it's quite a bit more.

dijit · on July 21, 2020

It's not millions, but it's the many multiples of hundreds of thousands. (and it's mainly GCP/bare metal).

I guess the point I am driving at here is that there's such a thing as "business critical costs" (IE: can we ship our product or not) which is the majority of infra costs we have today, and then there's "optimisation costs".

Usually when we discuss things like optimisation costs its along the lines of: "Will this product save us enough time to justify it's expense". Often, sadly, the answer is no.

Terraform Enterprise is an example of a time where we said: Yes. -- because the API allows us to deploy CI/CD jobs which provision little versions of our infrastructure, saving us many man-days of time in provisioning and testing every year.

As eluded to in the sibling thread, there's almost no way that we can save 3 or more peoples worth of time every year, we're 3 people right now and we have metrics collection, log tracing and alerting already. -- so it's a hard sell to the business types.

user5994461 · on July 21, 2020

Great choice on GCP! It's probably half of the price as AWS for the same thing.

Monitoring and logging are business critical. It's an integral part of infrastructure and it is very normal to spend 10% there. It's really not possible to operate stably and efficiently at a large scale like that without a trove of tooling.

Tools usually justify their costs by allowing to optimize the infra and helping to prevent/fix outages, though not all companies care about stability or hardware costs.

And it's not a choice of free vs paid. open source software costs a lot of money too, pairs of large instances to run it don't come cheap, they're probably more than a salary too if the company wants to have any sort of redundancy or geographic distribution.

May I ask what do you have for logging? I guess you must be screaming in horror at the price of elasticsearch/kibana/splunk :D

dijit · on July 21, 2020

The community versions of ELK, Zabbix, InfluxDB and grafana.

Zabbix is the weak link here for sure, but the monitoring is quite comprehensive.

aprdm · on July 22, 2020

Or at that point you can have a datacenter/colo and not spend anywhere near what you're thinking :)

Axsuul · on July 21, 2020

As a counterpoint: I have been rolling my own Grafana/Prometheus for many years for my startup and it's been pretty trivial

risyachka · on July 21, 2020

Same here. DataDog would cost me at least 10x.

yellow_mixer · on July 21, 2020

How are you running it? On AWS? What’s your infra look like.

Axsuul · on July 21, 2020

GCP with Docker Swarm, scheduled on the same node (n2-standard-4). Let me know if any other questions

carlosdp · on July 21, 2020

Does "scheduled on the same node" imply you are running everything on one node?

Axsuul · on July 21, 2020

Nope, just that Prometheus and Grafana are running together on the same node. My entire infrastructure is multi-node however.

yellow_mixer · on July 21, 2020

How much data are you sending and ingesting?

Axsuul · on July 22, 2020

I don't have any hard numbers but quite a bit. Everyday we process millions of background jobs, thousands of database queries per second, etc. and stats are collected and sent to Prometheus for all those.

ciguy · on July 21, 2020

This is good advice, however you will also want to make sure you have a plan to get off datadog when you grow. Datadog is one of the easiest to use and most comprehensive out of the box. But it gets really expensive as you begin to scale up and add servers, cloud accounts, services etc....

At a certain scale, rolling your own monitoring and alerting becomes cost effective again as Datadog begins to charge an arm and a leg. I've seen Datadog bills that could easily pay for 2 full time engineers.

user5994461 · on July 21, 2020

> I've seen Datadog bills that could easily pay for 2 full time engineers.

Which means the companies are running thousands of hosts. So they definitely need both datadog and a full team of sysadmin/devops/SRE to handle that infrastructure.

carlosdp · on July 21, 2020

On the contrary, the bigger you get, the more you need Datadog to scale. Once you grow big enough, you can make volume deals with them, you don't pay list price. I'd much rather pay them than have a single engineer have to spend any time managing infrastructure that isn't core to our product, especially since Datadog will always do it better.

ciguy · on July 22, 2020

I respectfully strongly disagree. At a certain scale when you can afford to have a full-time engineer working on it, open source tooling will give you a better, cheaper and more flexible solution since it's easy to customize. Datadog is good for the mainstream cases but it's not all that flexible for unusual or edge cases.

lima · on July 21, 2020

We migrated customers off DataDog for this exact reason.

Rolling a custom Prometheus / Grafana / Alertmanager setup is not hard at all, more powerful, and it's much easier to do it right from the start.

jakozaur · on July 21, 2020

DataDog per host pricing can be very expensive. Metrics are provided by many platforms. If you need logs too, you may look at Sumo Logic which got way cheaper metrics in typical use case.

Disclaimer: I work at Sumo Logic.

carlosdp · on July 21, 2020

It's expensive for good reason, it's the best out there.

deepakhj · on July 21, 2020

That's debatable. I prefer Grafana.

bpodgursky · on July 21, 2020

Like any SaaS dev tool, when at scale, you negotiate and pay a fraction of the list price.

It's meaningless to look at the price of Datadog@5 hosts -- at 500 or 5000, you're paying a completely detached number from the website list price, likely a small fraction.

sciurus · on July 21, 2020

Based on my experience at two companies with thousands of hosts on Datadog, this is not true. You'll only be able to negotiate a small discount.

You'll also experience their habit of launching new features, waiting a while for customers to adopt them, then starting to charge extra for them.

Don't get me wrong, Datadog has great products. But they're also great at extracting money from their customers.

freehunter · on July 21, 2020

Which blows away the reason a lot of people/teams/companies like SaaS. Because there is no negotiation, no sales requisitions, no long lead time while they come up with a quote. You see the price you pay the price you get the service, same as anyone else.

cloudpundit · on July 23, 2020

Every SaaS provider offers discounts based on length of commitment and volume of spend. Most will say look at AWS, but even there you have list prices and private/bulk pricing. Throw in EDPs, negotiated credits, savings plans, reservations, etc and you're nowhere near list prices.

scaryclam · on July 22, 2020

This is not my experience at all. We've got it setup and monitoring all sorts of things. After the initial setup, the only reason we've needed to touch it is when we've introduced new things and wanted to update the config. In fact, data dog was far more of a pain, and far less useful than prometheus has been.

Hates_ · on July 21, 2020

I'd be interested to know a bit of detail as we're looking into Grafana/Promethus a bit.

dharmab · on July 21, 2020

Prometheus is a very needy child in terms of data volume and hardware resources. Running it is at least one engineers' full time job- if you're a startup, you can outsource monitoring for a tiny fraction of the price, then move to Prometheus later if you are successful.

lima · on July 21, 2020

Please elaborate - how is it one engineer's full time job?

We run Prometheus in production and this hasn't been our experience at all.

A single machine can easily handle hundreds of thousands of time series, performance is good, and maintaining the alerting rules is a shared responsibility for the entire team (as it should be).

gen220 · on July 21, 2020

I think the parent's complaint is a function of how your engineering org "uses" prometheus.

If you use it as a store for all time series data generated by your business, and you want to have indefinite or very-long-term storage, managing prometheus does become a challenge. (hence m3, chronosphere, endless other companies and tech built to scale the backend of prometheus).

IMO, this is a misuse of the technology, but a lot of unicorn startups have invested a lot of engineering resources into using it this way. And a lot of new companies are using it this way; hence the "one engineer's FT job".

dharmab · on July 21, 2020

I'd agree; I'm at a large corp that has a need to store our data for a very long term. If we were using Prometheus as an ephemeral/short term TSDB to drive alerting only, it would be really easy.

user5994461 · on July 21, 2020

For reference: A single machine can push 5k metrics so you're saying a single prometheus instance can easily serve 20s of hosts. lol

sevagh · on July 21, 2020

False in my experience. Full-time job? After the initial learning curve, a simple 2x redundant Prometheus poller setup on can last for a long time. Ours lasted for 30,000,000 timeseries until encountering performance issues.

After that, we needed some more effort to scale out horizontally with Thanos, but again, once it's set up, it maintains itself.

edoceo · on July 21, 2020

My company's Prometheus setup was super easy, one $10/mo box. About 1 week of fiddling all the exporters and configs but now it just runs and has for months.

You can be small with Prometheus and grow into needing an FTE for it - w/o having the migration hurdle of moving out-source to in-source

pachico · on July 21, 2020

I approached InfluxDB since it looked promising. It did actually served its purpose when it was simple and Telegraf was indeed handy. Now that I have more mature requirements I can't wait to move away from it. It gets frozen frequently, it's UI Chronograph is really rubbish, functions are very limited and managing continuous queries is tiresome.

I'm now having better results and experience storing data in ClickHouse (yes, not a timeseries dB).

From time to time I also follow what's coming in InfluxDB 2.0 but I must confess that 16 betas in 8 months are not very promising.

It might just be me.

willvarfar · on July 21, 2020

I have also had scalability and reliability issues with influx. And full of silly limitations like tagset cardinality and not being able to delete points in a specific retention policy etc. Am moving to classic rdbms and timescaledb.

pachico · on July 21, 2020

Indeed, cardinality limit is a very painful aspect which blocked us since day one for certain metrics. I confess, at the time I didn't know any better. Now I wouldn't recommend it under any circumstances.

mhall119 · on July 21, 2020

> From time to time I also follow what's coming in InfluxDB 2.0 but I must confess that 16 betas in 8 months are not very promising.

Don't read too much into that, it's more a result of wanting to get testable releases out early and temporarily redirecting engineering resources to take advantage of opportunities that arose in that timeframe than anything having to do with the code of 2.0 itself.

Our Cloud 2 SaaS offering is already running the 2.0 code in production (albeit with changes to support it being deployed as a massively multi-tenant service)

pachico · on July 21, 2020

Fair enough. Good luck then with 2.0. I just wish the value proposition is strong enough to inspire users like me to try it again although much has to change for that to happen. I just wish I didn't have the feeling that the free version is just a bait for the SaaS. Anyhow, as said, good luck.

mhall119 · on July 21, 2020

The OSS version is going to fill a vital role that SaaS just can't do, and we view it as a critical component in what InfluxDB provides. I don't think we're ever going to get away from needing on-prem and on-device deployments when dealing with time-series data, especially for emerging IoT/Edge use cases.

No, the company recognizes that the success of the SaaS is inevitably tied to the success of the OSS product. The only reason we were able to give the SaaS more focus recently is because we already had a working OSS product in the 1.x line that was meeting the needs of existing customers that we were and are continuing to invest in.

pachico · on July 21, 2020

I must say that if scalability is not part of OSS then it inevitably smells fishy to me. I am willing to pay to avoid the pain of maintaining or designing clusters but not because it's the only way to scale up. There are too many options that allow to scale with OSS version now it's hard to justify it. It's still a no-go for me.

Again, just take it as one person's opinion. I can't even grasp how complicated and challenging is to run what you offer. This is strictly user's opinion.

gregwebs · on July 21, 2020

Have you tried Victoria Metrics? It originallyl started as a ClickHouse fork for time series but is now a re-write keeping some of the same principals.

pachico · on July 21, 2020

Yes, we use it as long term retention for Prometheus. I was also tempted to use it as stand-alone dB but never got to it. Do you use it as such?

overcast · on July 21, 2020

Really, REALLY tried to love InfluxDB. But its systems requirements, performance, and features are poor compared to things like TimescaleDB.

candiddevmike · on July 21, 2020

Influx is really shitting the bed with how they're handling the InfluxDB 2.0 release. The docs are a mess, and the migration tool seems like the result of a weekend hackathon. They're leaving a lot of customers with long term metrics in a tough spot.

If you're thinking about using Influx for long term data storage, look elsewhere. The company continuously burns customer goodwill by going against the grain, and bucking the ecosystem by trying to control the entire stack.

rbetts · on July 21, 2020

(I'm the VPE at Influxdata).

I appreciate this sentiment. We've been focused on a building a SaaS version of Influxdata and are committed to a paired open source version of that. The open source version has been lagging as we work on the SaaS side.

However, we are committed to shipping a GA version of the OSS 2.0 stack around the end of Q3 that offers an in-place data migration capability from 1.x OSS.

We've spoken about this publicly in other forums. You can google "influxdays London talks" to hear Paul Dix (CTO/Founder) talk more about our OSS plans.

edoceo · on July 21, 2020

Classic "two masters" problem.

santafen · on July 21, 2020

This was from the CTO last week: "This work won't be landing anywhere until sometime next year and it'll be landing in our Cloud 2 offering first." So the OSS is definitely a second-class citizen. And now that they've dropped all DevRel activity, don't expect much attention for OSS Developers and users.

mhall119 · on July 21, 2020

Quite the contrary, we're finishing up packaging of the OSS version now to make it as easy as possible for you to get it and we're assembled a new team focused on getting OSS 2.0 released.

We continue to support all users, including OSS users, in our public Slack and Discourse, as well as Github. We have not "dropped all DevRel" activity.

santafen · on July 21, 2020

¯\_(ツ)_/¯ Influx laid off pretty much the entire DevRel team in the last 6 months, but ok.

mhall119 · on July 21, 2020

This is true, and has certainly been painful for us and puts more work on those of us who are still working on community and devrel, but we're still doing it.

mhall119 · on July 21, 2020

In fact we've got one of our Developer Advocates doing a live broadcast right now over at https://www.influxdata.com/time-series-meetup/virtual-2020

rattray · on July 21, 2020

Well, guess I know to never use Influx as a startup

mhall119 · on July 21, 2020

Why is that? It takes minimal effort to get it up and running and you can either self-host or use the SaaS offering on any of the major clouds. There's even a free tier on the SaaS your startup can use that won't cost you a dime until your usage becomes significant.

rattray · on July 21, 2020

Because the VPE sounds like the corporate sort of person who would not prioritize the things I'd want them to as a customer.

rbetts · on July 21, 2020

That kinda hurts.

There are 100k's of happy InfluxDB users -- untold millions of completely open source Telegraf deployments doing meaningful work for people -- integrated into our products and our competitors' products. We make the vast majority of our codebase public under liberal OSS licenses.

And we'll keep listening and learning how to do more.

... off to see if I can change my internal slack handle to @corporatetool. Unless there's a policy against that ;-)

rattray · on July 22, 2020

Hey, yeah, sorry, I've been meaning all day to go back and fix that. Really wasn't right.

Just struck a nerve that reminded me of some leaders at a previous gig who made things a little less sensible.

I apologise, and you definitely don't deserve it :-(

mhall119 · on July 21, 2020

I don't think anybody who knows Ryan would describe him as "the corporate sort of person" :)

He was giving you an honest assessment of what was going on, not sugar-coated or wrapped in corporate speak. The work he described on our SaaS offering was directly tied to what our current and potential customers wanted.

threeseed · on July 21, 2020

Your issue is that they are prioritizing paying customers over giving you a free product.

I don't think that's fair given they are still a startup.

mhall119 · on July 21, 2020

Just to add to what Ryan already said, we don't plan on dropping support for InfluxDB 1.x anytime soon, and will continue to improve the migration path from 1.x to 2.0, so you won't have to upgrade until it's right for you.

Now of course our goal is to help everyone upgrade to 2.0 and beyond, but we know that we made a lot of changes and improvements in 2.0, this isn't a minor upgrade. We will focus first on what we need to do to help 80% of our users upgraded, then the next 80%, and so on, until we've got you all covered.

Meanwhile new users can benefit from starting off on 2.0 as soon as it's available (or get the beta which is already out) and not have to wait.

(Source: I'm the Community Manager for InfluxData)

goliatone · on July 21, 2020

Im on the same boat, I tried very hard. Went through a few version upgrades with data incompatibility and other issues. Used it for personal projects as well as for different projects at work. I truly regret ever recommending it to different teams at work.

I ended up moving away from it to TimescaleDB and I’m pretty happy so far.

majkinetor · on July 21, 2020

I use InfluxDb on number of big gov production services, and never had a single issue with it. I use it for all metrics (applicative and infrastructure).

Its easy to deploy and its x-platform. Not sure what are all the comments here about - maybe for really huge loads which I don't have experience at, I usually use separate influxdb per service.

mbell · on July 21, 2020

We had major issues with scaling InfluxDB. We use clickhouse (graphite table engine) now and it is more than order of magnitude more resource efficient.

julioo · on July 21, 2020

What about storage? We are running influxdb and we are looking for alternative. But a point where Influx is good is storage.

k-rus · on July 21, 2020

TimescaleDB provides such community features as compression, which allows to save space a lot, and continuous aggregates, which gives performance and save space if used together with retentions.

mekster · on July 22, 2020

Tried VictoriaMetrics?

The author compares with other storage engines comprehensively from time to time.

https://medium.com/@valyala

tgtweak · on July 21, 2020

You mean storage efficiency? Seems that unless the aggregation is happening before being stored, it would be unfair to compare influxdb with other databases that are tasked with storing per-record granularity.

RedShift1 · on July 21, 2020

InfluxDB is pretty good if you don't need to do any advanced querying like grouping by month or formulas. Its strong points are low diskspace footprint and very fast queries even over long periods of time.

mhall119 · on July 21, 2020

Have you looked at Flux? You can do some really incredible things with it, both inside of queries and also in tasks/alerts. Check out https://www.influxdata.com/blog/anomaly-detection-with-media... for an example

RedShift1 · on July 22, 2020

Yes I have looked at it and it's unusable for my application because it is orders of magnitude slower than InfluxQL. Fairly logical if you think about how Flux works, it's very hard to optimize a query engine with a language where every step is supposed to be discrete.

mhall119 · on July 22, 2020

There are actually multiple optimizations to Flux already underway. The language runtime can "push down" some of these operations to the storage layer where they can be performed more efficiently.

svd4anything · on July 21, 2020

Is that really true? Do you have any reference for that, I’m curious how different it is. FWIW I use InfluxDB and haven’t really thought it was a huge resource hog.

valyala · on July 26, 2020

InfluxDB may require big amounts of RAM when working with high cardinality data. It works OK with up to a million of time series. Then it starts eating RAM as crazy [1]. How many time series does your InfluxDB setup contain?

[1] https://medium.com/@valyala/insert-benchmarks-with-inch-infl... .

viraptor · on July 21, 2020

Does any of those matter for small-scale monitoring (say <= 10 hosts)? I've got influx sitting pretty much idle in that kind of environments and any data preprocessing / collation in grafana works just fine.

dahfizz · on July 21, 2020

Arguably system requirements matter more for small scale deployments. I shouldn't need a server with multiple SSD volumes and 8GB+ ram just to monitor a couple raspberry pi's.

EDIT: okay, I get it. You don't need 8GB ram. I was just going by the hardware requirements in their docs: https://docs.influxdata.com/influxdb/v1.8/introduction/insta...

viraptor · on July 21, 2020

... and it doesn't need that. I'm running telegraf+influx+grafana along with other services on a small rpi without issues.

alyandon · on July 21, 2020

Running TICK stack here collecting metrics from around a dozen different hosts (more if you count network devices like switches) to a central server with a default retention policy on spinning rust. InfluxDB process is currently sitting at 600MB RSS.

cbzbc · on July 21, 2020

You don't need 8GB ram to run a small scale deployment of influxdb.

dspillett · on July 21, 2020

I'm sure this is true, but the documentation does explicitly say "Each machine should have a minimum of 8GB RAM" with absolutely no caveats about scale, so I think people can be forgiven for taking a quick look and moving on because they don't want to provision that.

mhall119 · on July 21, 2020

That's a valid point, I've brought it up internally to improve the docs to explain what type of load this requirement is expecting. If you'd like you can create an issue for this in Github (https://github.com/influxdata/docs.influxdata.com/issues/new) to be involved in the discussion and change.

mhall119 · on July 21, 2020

While we're working on this page, also check out https://docs.influxdata.com/influxdb/v1.8/guides/hardware_si... which gives a better breakdown of resources requirements based on expected load

overcast · on July 21, 2020

Sure, if you're really not doing anything with it, then any database would work fine for you.

jakobdabo · on July 21, 2020

I still can't find any alternative to the old, RRD-based Munin. It is so simple. You want to add a new server to monitor? Just install there the node part, enable any required additional plugins (just by creating a couple of soft-links), add one-line configuration to the main server with the new node's IP address, and you are done.

Also, the aesthetics of the UX, you see all the graphs in one single page[1], no additional clicks are required - a quick glance with a slow scroll and you can see if there were any unusual things during the last day/week.

[1] - publicly available example, found by googling - https://ansible.fr/munin/agate/agate/index.html

mhall119 · on July 21, 2020

That's more steps than using InfluxDB and Telegraf.

Check out: https://github.com/influxdata/community-templates/tree/maste...

Papric0re · on July 21, 2020

[Offtopic (a bit)] Lots of you are talking about metric monitoring. But do you have recommendations when it comes to (basic) security Monitoring? I would usually go for the Elastic-Stack for that purpose, especially because Kibana offers lots of features for security monitoring. But I feel like these stacks are so big and bloated. I basically need something to monitor network traffic (Flows and off-Database retention of PCAPs) and save some security logs (I'm not intending on alerting based on logs, just for retention). But being able to have a network overview, insight into current connections (including history) is a very useful thing. Can anybody recommend something, that's maybe a bit lighter than an entire Elastic-Stack?

floren · on July 21, 2020

I think Gravwell (https://gravwell.io) might be what you're looking for--but I work for Gravwell so I may be biased! If I can be forgiven a short sales pitch, we've built a format-agnostic storage & querying system that easily handles raw packets, Netflow records (v5, v9, and IPFIX), collectd data, Windows event logs, and more. You can see some screenshots at https://www.gravwell.io/technology

We have a free tier which allows 2GB of data ingest per day (paid licenses are unlimited) which should be more than enough for capturing logs and flows. The resources needed to run Gravwell basically scale with how much data you put into it, but it's a lot quicker to install and set up than something like Elastic, in our opinion (https://www.gravwell.io/blog/gravwell-installed-in-2-minutes)

Edit: it's currently a bit roll-your-own, but we're really close to releasing Gravwell 4.0 which enables pre-packaged "kits" containing dashboards, queries, etc. for a variety of data types (Netflow, CoreDNS logs, and so on)

iudqnolq · on July 21, 2020

When you say

> Gravwell is developed and maintained by engineers expert in security and obsessed with high performance. Therefore our codebase is 100% proprietary and does not rely on open source software. We love open source, but we love our customers and their peace of mind a lot more!

does that mean you've even rolled your own webserver? Programming language?

floren · on July 21, 2020

That's... not good copy. I think it must have been written long ago. We use open-source libraries (with compatible licenses, of course) and even maintain our own set of open-source code (https://github.com/gravwell). I'll talk to the guys who maintain the website and get that fixed. Thanks for pointing it out!

Edit: We've had lots of people assume we use Elastic under the hood, so I wonder if that was just a (poorly-worded) attempt to indicate that our core storage and querying code is custom rather than some existing open-source solution.

Papric0re · on July 22, 2020

Maybe you should just wipe that paragraph completely. I get that investors like to see that you are using proprietary code, but I wouldn't expect you to be faster with that. Especially when running against Elastic, which has over 1.400 contributors currently. But you don't necessarily need to. You can get me with being focused on the right thing and not bloating your software. Lot of big projects start to loose focus and start doing everything, hence become worse doing their main job.

Especially when it comes to security, I'd like to see the lowest complexity possible. Harden your software instead of feature-fu around. That would be a good USP (I've got the feeling that no vendor has realized this so far - but customers neither did).

Papric0re · on July 21, 2020

I don't mind you giving a small sales pitch...and maybe your product is indeed what I'm searching for. But your pricing model is instantly putting me off. Same as with Splunk, you end up with not being able to predict your cost and paying way too much. Tell me when you fix that and I might be interested ;)

Edit: Sorry, I was misreading your comment. Premium is unlimited...I will look into it, thanks. :)

floren · on July 21, 2020

Yep, paying customers are licensed by the node rather than by the gigabyte (as Splunk does it), and you're really only limited by your hardware at that point. You might be surprised at how much you can accomplish on the free license, though--there are several small businesses using it to monitor their networks because 2GB/day will hold a pretty hefty amount of Netflow, collectd, Zeek, and syslog records.

mavam · on July 23, 2020

(Disclaimer: CEO & founder of Tenzir)

We at Tenzir are developing VAST for this purpose: https://github.com/tenzir/vast. It's still very early stage, but if you're up for trying something new, a lean and modern C++ architecture, BSD-license open-source style, you may want to give it a spin. The docs are over at https://docs.tenzir.com/vast.

It supports full PCAP, NetFlow, and logs from major security tools. There is CLI and Python bindings. The Apache Arrow bridge offer a high-bandwidth output path into other downstream analytics tools.

retzkek · on July 21, 2020

Maybe Loki [1] meets your needs? It lacks the analytics abilities of Elastic (e.g. what's the average response time [2]) but is much simpler to setup and use for jog aggregations, and has a pretty powerful query language for digging through and graphing log statistics (e.g. how many errors have been logged per hour). It's mainly being developed by Grafana Labs, so there's great integration in Grafana.

[1] https://grafana.com/oss/loki/

[2] I'd argue this sort of thing should be published as a metric anyways so you don't have to pull it out of the logs

Papric0re · on July 21, 2020

Tanks for the tip. But I would still need some storage and Data Shipper, right? Or is Loki also taking care of storage?

tilolebo · on July 21, 2020

Promtail is the official log shipper for Loki, but you can also use others. See https://github.com/grafana/loki/blob/master/docs/sources/cli...

As for storage, the default is BoltDB for indexes and local file system for the data, but you can also use popular cloud solutions like DynamoDB, etc. AFAIK BoltDB is automatically installed when you install Loki.

The only possible pain point I see for you is that Loki is tailored for Kubernetes. It is totally possible to use it without running a K8 cluster, but you lose some features.

user5994461 · on July 21, 2020

If you want to do logs, you can use graylog or kibana, both using elasticsearch for storage. This allows to find what was connecting where at some point in time (HTTP request logs and database connection logs).

If you want to graph connections from service to service in real time. I've actually never found anything that was capable of doing that, not even paid software.

tilolebo · on July 21, 2020

Grafana together with Loki would be a good match for you

hnarn · on July 21, 2020

Some people are probably going to throw some shade on me for saying this since it's so out of fashion but in my mind, when it comes to some types of basic monitoring (SNMP monitoring of switches/linux servers, disk space usage, backups running and handling them when they don't) then Nagios does get the job done. It's definitely olives and not candy[1] but it's stable, modular, relatively easy to configure (when you get it) and it just keeps chugging away.

If anyone has any Nagios questions, I'd be happy to answer them. I'm a Nagios masochist.

(Also, I can recommend Nagflux[2] to export performance data, metrics, to InfluxDB because noone should have to touch RRD files)

[1]: https://lambdaisland.com/blog/2019-08-07-advice-to-younger-s...

[2]: https://github.com/Griesbacher/nagflux

BurritoAlPastor · on July 21, 2020

Nagios is the Jenkins of monitoring. It's popular because you can get it running in an afternoon, and it's easy to configure by hand.

It then rots within your infrastructure, because it resists being configured any way _except_ by hand. I've built two systems for configuration-management of Nagios (at different companies), and it's an unpleasant problem to solve.

Prometheus's metric format and query syntax are cool, but the real star of the design is simply this: you don't have to restart it, or even change files on your Prometheus server, when you add or remove servers from your environment.

znpy · on July 21, 2020

I have to use an icinga instance from time to time (icinga is a nagios fork). I really can't see the value, beyond seeing if a service is up or down.

I'm surprised no one has named Zabbix. Zabbix is way better. I hadn't the chance to use Zabbix past 4.something but it's worth it.

I've been using Prometheus/grafana and frankly the value I see is it's out of the box adaptability at capturing a mutating data source (example: metrics about ephemeral pods Una kubernetes cluster).

hnarn · on July 22, 2020

> I really can't see the value, beyond seeing if a service is up or down.

This is an extreme oversimplification. The value is not in "seeing" if something is "up or down", the value is in the modularity of what a "service" can mean in the first place (anything you can script -- and the eco-system of plugins is huge), the fact that you don't have to "see" it (because notifications are extremely modular), the fact that escalations of issues can happen automatically if they are not resolved, and the fact that event-handlers in many cases can help you resolve the issue automatically without even having to raise an alert in the first place.

Nagios is a monitoring tool built with the UNIX philosophy in mind, and it's ingenious in its simplicity: decide state based on script or binary exit codes, relate dependencies between objects to avoid unnecessary troubleshooting, notify if necessary (again, with scripts/binaries) and/or try to resolve if configured. It hooks into a server frame of mind very well if you're a sysadmin.

Sure, if you main use case is "mutating data sources" and collecting metrics, any Nagios flavor won't be for you, because it's not what Nagios is made to do. There's a reason it's extremely popular in large enterprises, because it was created for them. No monitoring solution is for everyone and solves every problem.

mekster · on July 22, 2020

> you can get it running in an afternoon, and it's easy to configure by hand.

This read like a joke. Nagios looks like it's from stone age having files in cgi-bin folder with unnecessary complication to installation and management, unless they made it any better at some point.

hnarn · on July 22, 2020

While many people conflate "Nagios" with the corporate offering from the company Nagios, I personally mean the core monitoring component. There's no web interface to it (many are available, they're all ugly, but they're also not strictly necessary).

hnarn · on July 21, 2020

Yeah, I was in no way insinuating that Nagios is superior in general, or even to Prometheus, just that it does the job well for some use cases. Monitoring is tricky and you definitely need a tool box because each problem has a different optimal solution.

BurritoAlPastor · on July 21, 2020

Nagios and its forks for sure have a place in the monitoring ecosystem. They’re just not tools that tend to stick around once you’re big enough to have a dedicated DevOps or SRE team.

hnarn · on July 22, 2020

That depends completely on what type of business you're running. Anyone that has an environment that is unlikely to massively change (expansion excluded) within a few years benefits from the stability of Nagios. I know many large companies that use it, or derivatives of it, and even many state/military organizations.

egwor · on July 21, 2020

We run this at work, and I have (to put it politely) severe reservations. What does these functions for you:

- Realtime GUI which works with windows 10 (we have a web site and nagastamon) - aggregation of alerts / alert roll up - sharing filters - summary + description - temporary downtime of alerting - message rate suppression (to stop floods) - filtering of columns/ordering etc. - bulk actions for closing alerts

I've come from using: - HPOV (great but can't handle bursts of alerts) - email (everyone has to have filters, can't handle bursts of alerts that well (runs everyone out of email space!), prone to failure/delays due to email - home grown solution

hnarn · on July 22, 2020

I'm not sure what you mean by "this" but I suspect you mean Nagios XI or some other corporate offering. I was referring to the core monitoring component and its forks/derivatives.

Other than that I don't quite understand the point of your comment, you say you have "severe reservations" but many of the points you list are available even with Nagios core, and most of them are available in other Nagios variants.

thethethethe · on July 21, 2020

Possibly stupid question:

Can you use Nagios to stream metrics exported from your applications binary in real time?

For example, can you use Nagios record each http request processed by your application webserver, tagged with http method, code, latency etc?