IOx: InfluxData’s New Storage Engine

PaulWaldman · on Oct 26, 2022

> Unbounded cardinality

This has been the largest criticism of InfluxDB in the past. Kudos to the team for acknowledging and solving it!
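For context, series cardinality is the number of distinct series, i.e. unique combinations of measurement and tag values. A minimal sketch of why it grows so quickly; the tag names and counts are invented for illustration, and the product is only an upper bound that assumes every combination of tag values can occur:

```python
from math import prod

def max_series_cardinality(tag_value_counts):
    """Upper bound on series for one measurement:
    the product of the number of distinct values of each tag."""
    return prod(tag_value_counts.values())

# Hypothetical tags and counts, for illustration only.
tags = {"host": 1000, "region": 10, "container_id": 50000}
print(max_series_cardinality(tags))  # 1000 * 10 * 50000
```

A single high-churn tag such as a container ID multiplies the bound, which is why an engine that indexes every series struggles with this kind of data.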

> IOx supports SQL natively and our cloud customers can connect using Postgres-compatible clients like psql, Grafana’s Postgres data source, and BI tools like PowerBI and Tableau.

Initially InfluxDB had InfluxQL, a SQL-like language for querying data. Then they transitioned to Flux, indicating it was superior to writing complex SQL queries over time series data. Now they are highlighting native SQL support. Since this was only announced today, hopefully there will be clear messaging on which query languages will be supported going forward.

It's also worth noting that queries can be executed over an HTTP API that platforms like PowerBI can consume today.

> First introduced in 2020 as the open source project InfluxDB IOx, the new storage engine is the product of sustained development by InfluxData and considerable contribution from the InfluxDB open source developer community. Today, the new engine based on IOx arrives first in InfluxData’s multi-tenant InfluxDB Cloud service, available to developers worldwide.

Will this later be available in an OSS package for self-hosting?

pauldix · on Oct 26, 2022

Hi, post author and founder of InfluxDB here. We're supporting Flux (our scripting and query language), InfluxQL (our original SQL-like language), and SQL (specifically the Postgres dialect, as that's what DataFusion supports). The query engine is DataFusion, which is part of the Apache Arrow project. We contribute to it significantly. So that's what's built in natively. We support Flux and InfluxQL through separate Go processes that use an API to connect to the core DB. That said, we're working on native InfluxQL support: a Rust-based InfluxQL parser that will yield DataFusion logical query plans.

Right now we're focused on our cloud offering. We'll have official open source releases and documentation in the future.

minhazm · on Oct 26, 2022

The SQL support is likely because they're using DataFusion which already has pretty good SQL support, so it's sort of "free".

https://arrow.apache.org/datafusion/user-guide/sql/sql_statu...

alamb · on Oct 26, 2022

Author here -- it is "free" in the sense that all the effort we put into DataFusion flows directly into IOx. But we do put a lot of effort into DataFusion.

minhazm · on Oct 26, 2022

I didn't mean to imply it's free as in no effort goes into it. Just that the underlying library provides it, so it's less effort on top of the already significant effort going into DataFusion itself.

alamb · on Oct 26, 2022

Ah -- got it! This is the beauty of aligning ourselves with technologies like Arrow, Parquet and DataFusion. We can share as well as benefit from the efforts of the broader community.

mildbyte · on Oct 26, 2022

Just wanted to also give a shout out to Apache DataFusion[0] that IOx relies on a lot (and contributes to as well!).

It's a framework for writing query engines in Rust that takes care of a lot of heavy lifting around parsing SQL, type casting, constructing and transforming query plans and optimizing them. It's pluggable, making it easy to write custom data sources, optimizer rules, query nodes etc.
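To make "pluggable optimizer rules" a bit more concrete, here is a toy sketch in Python. It is not DataFusion's API (which is Rust); the `Scan`, `Filter`, `push_filter_into_scan`, and `optimize` names are all invented for illustration. It only shows the general shape: a query plan is a tree of nodes, and an optimizer is a list of rewrite rules that a user can extend.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass(frozen=True)
class Scan:
    """Leaf node: read a table, optionally filtering while reading."""
    table: str
    predicate: Optional[str] = None

@dataclass(frozen=True)
class Filter:
    """Keep only the rows of `input` that match `predicate`."""
    predicate: str
    input: object  # child plan node

# A rule takes a plan and returns a (possibly rewritten) plan.
Rule = Callable[[object], object]

def push_filter_into_scan(plan):
    """Rewrite Filter(Scan) into a Scan that filters at read time."""
    if (isinstance(plan, Filter)
            and isinstance(plan.input, Scan)
            and plan.input.predicate is None):
        return Scan(plan.input.table, plan.predicate)
    return plan

def optimize(plan, rules: List[Rule]):
    """Apply each registered rule once, in order."""
    for rule in rules:
        plan = rule(plan)
    return plan

plan = Filter("cpu > 90", Scan("metrics"))
optimized = optimize(plan, [push_filter_into_scan])
print(optimized)  # Scan(table='metrics', predicate='cpu > 90')
```

A real engine walks the whole plan tree and runs many such rules repeatedly; the point here is only that a custom rule is a function added to a list.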

It has very good single-node performance (there's even a way to compile it with SIMD support) and Ballista [1] extends that to build it into a distributed query engine.

Plenty of other projects use it besides IOx, including VegaFusion, ROAPI, and Cube.js's preaggregation store. We're heavily using it to build Seafowl [2], an analytical database that's optimized for running SQL queries directly from the user's browser (caching, CDNs, low latency, some WASM support, all that fun stuff).

[0] https://github.com/apache/arrow-datafusion

[1] https://github.com/apache/arrow-ballista

[2] https://github.com/splitgraph/seafowl

pauldix · on Oct 26, 2022

DataFusion is great, we're happy to be contributing to it. Also excited to see so many people around the world picking it up and contributing as well. With our development efforts on IOx, it's like a strong tailwind. But we put a ton of effort into helping manage community efforts (thanks, alamb, our developer on IOx who is also on the Arrow PMC!).

andygrove · on Oct 26, 2022

Original author of DataFusion/Ballista here. Having alamb and others from InfluxData involved has been a huge help in driving the project forward and helping build an active community behind the project. It is genuinely hard to keep up with the momentum these days!

menaerus · on Oct 26, 2022

Hi, I just had a glance over the DataFusion project. Very interesting work out there, which I will definitely be keeping track of, but I've got a genuine question. Do you sometimes find development in Rust a little bit challenging for large-scale and performance-sensitive work?

I say this because I've noticed several PRs fixing (large) performance regressions which, to my understanding, were mostly introduced by unforeseen or unexpected Rust compiler subtleties that led to less than optimal code generation. One example was a naive, simple-looking abstraction that brought performance down by something like 50% in TPC-H benchmarks. This really struck me, especially because it seems quite hard to identify the root cause, and I would like to hear about your experiences first-hand. Thanks a bunch!

nevi-me · on Oct 26, 2022

Your initial experiments and decision to build on arrow-rs have been great for the project. Thank you and everyone involved.

ignoramous · on Oct 26, 2022

> We're heavily using it to build Seafowl, an analytical database that's optimized for running SQL queries directly from the user's browser...

Interesting. Where does Seafowl fit in when I compare it with, say, a data-stack-in-a-box approach, e.g. meltano + dbt + duckdb + superset [0]? Is my thinking right that Seafowl possibly replaces both duckdb (with IOx) and superset (if there's a web front-end)?

Incidentally, dagster had an article up just yesterday making a case for poor-man's datalake with dbt + dagster + duckdb [1]. What does splitgraph replace if I were to use it in a similar setup?

Thanks.

[0] https://archive.is/DxU1e

[1] https://archive.is/5ikU4

mildbyte · on Oct 26, 2022

Great question! With Seafowl, the idea is different from what the modern data stack addresses. It's trying to simplify public-facing Web-based visualizations: apps that need to run analytical queries on large datasets and can be accessed by users all around the world. This is why we made the query API easily cacheable by CDNs and Seafowl itself easy to deploy at the edge, e.g. with Fly.io.
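As a rough illustration of what "easily cacheable by CDNs" can mean, here is a minimal Python sketch of one way to do it: derive the URL of a GET request from a hash of the query text, so identical queries always hit the same URL and a CDN can cache the response. The `cacheable_query_url` helper, the host, and the `/q/<hash>` path layout are assumptions made for this example, not a description of any product's exact API.

```python
import hashlib

def cacheable_query_url(base_url: str, sql: str) -> str:
    """Build a deterministic GET URL for a read-only query.

    Hypothetical scheme: the path contains the SHA-256 of the query text,
    so the same query always maps to the same cache key.
    """
    # Naive normalization: collapse runs of whitespace. Note this would also
    # alter whitespace inside string literals, so a real client might skip it.
    normalized = " ".join(sql.split())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"{base_url.rstrip('/')}/q/{digest}"

url_a = cacheable_query_url("https://example.com", "SELECT 1")
url_b = cacheable_query_url("https://example.com/", "SELECT   1\n")
print(url_a == url_b)  # True: same query text after normalization
```

The query text itself would still have to reach the server (for example in a header or the request body on a cache miss); the sketch only covers the cache-key side.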

It's a fairly different use case from DuckDB (query execution for Web applications vs fast embedded analytical database for notebooks) and the rest of the modern data stack (which mostly is about analytics internal to a company). Just to clarify, we're not related to IOx directly (only via us both using Apache DataFusion).

If we had to place Seafowl _inside_ of the modern data stack, it'd be mostly a warehouse, but one that is optimized for being queried from the Internet, rather than by a limited set of internal users. Or, a potential use case could be extracting internal data from your warehouse to Seafowl in order to build public applications that use it.

We don't currently ship a Web front-end and so can't serve as a replacement to Superset: it's exposed to the developer as an HTTP API that can be queried directly from the end user's Web browser. But we have some ideas around a frontend component: some kind of a middleware, where the Web app can pre-declare the queries it will need to run at build time and we can compute some pre-aggregations to speed those up at runtime. Currently we recommend querying it with Observable [0] for an end-to-end query + visualization experience (or use a different viz library like d3/Vega).

Re: the second question about Splitgraph for a data lake, the intention behind Splitgraph is to orchestrate all those tools and there the use case is indeed the modern data stack in a box. It's kind of similar to dbt Labs's Sinter [1] which was supposed to be the end-to-end data platform before they focused on dbt and dbt Cloud instead: being able to run Airbyte ingestion, dbt transformations, be a data warehouse (using PostgreSQL and a columnar store extension), let users organize and discover data at the same time. There's a lot of baggage in Splitgraph though, as we moved through a few iterations of the product (first Git/Docker for data, then a platform for the modern data stack). Currently we're thinking about how to best integrate Splitgraph and Seafowl in order to build a managed pay-as-you-go Seafowl, kind of like Fauna [2] for analytics.

Hope this helps!

[0] https://observablehq.com/@seafowl/interactive-visualization-...

[1] https://www.getdbt.com/blog/whats-in-a-name/

[2] https://fauna.com/

michael_j_ward · on Oct 26, 2022

Just want to say congratulations to the team!

2 years and 9,500+ commits is a hell of a feat.

https://github.com/influxdata/influxdb_iox

toinbis · on Oct 26, 2022

Happy longtime InfluxDB user here. I wanted to congratulate Paul and the team on reaching this milestone. Followed IOx development a bit - can't wait to finally test it out!

okay_dude_q · on Oct 26, 2022

I evaluated InfluxDB with the Prometheus Kube Stack chart.

It didn’t really work.

I’m not stupid and I can read docs.

My feeling was it’s like Elastic. Default configuration is so flawed and inscrutable, on purpose, you can forget about using it yourself.

I use Thanos now. At least it fucking works.

I suppose if I need fast queries, I’ll use Postgres.

You guys need to focus on making stuff that works. It’s competitive out there and you don’t have the insights into people who try and wind up hating your guts for being annoying.

candrewlee14 · on Oct 27, 2022

Congrats! Was always a pleasure to hear about IOx when I interned there last summer! They’re an awesome company to work for.

otoolep · on Oct 26, 2022

Congrats to the team at InfluxDB - great to see this released.

mrsun · on Oct 26, 2022

Will InfluxDB IOx eventually replace InfluxDB v2?

mhall119 · on Oct 26, 2022

IOx is the data storage layer. It will replace the current TSM data storage system in InfluxDB, but it won't replace InfluxDB as a whole.

digerata · on Oct 26, 2022

Personally, very excited to see this happening. Huge congrats!

Some constructive criticism around naming... You don't have to have Flux in every single damn thing you create!

InfluxDB IOx is not replacing InfluxDB v2 because... It's just a new storage engine.

For querying we have Flux or InfluxQL...

eskaytwo · on Oct 27, 2022

Will be very interesting to see this compared to ClickHouse.

zX41ZdbW · on Oct 27, 2022

Here is a recent comparison: https://arxiv.org/pdf/2204.09795.pdf Although I'm not sure if it is using the new IOx engine or not.

eskaytwo · on Oct 27, 2022

Thanks. It’s odd they do such in-depth analysis but don’t mention the versions used. I think that is on the old engine. It would be very interesting to see how the new engine performs - I imagine the performance gets closer to something like Vaex, which was using similar Apache tooling.

_peter_ · on Oct 26, 2022

Isn't InfluxDB rewriting their storage engine for the nth time? It makes me have a little less faith in their project to be honest.

dgnorton · on Oct 26, 2022

Member of the engineering team here - I would break the history into 3 phases:

1) Alpha / Beta phase where we experimented with several off-the-shelf key-value stores (RocksDB, LevelDB, & BoltDB). During this early phase, we learned from observing a wide variety of workloads / use-cases that we needed a custom built engine to achieve our early performance goals. But, using these off-the-shelf key-value stores allowed our (at the time) very small team to focus on developing a useful beta product and gathering user feedback.

2) TSM storage engine for 1.0 - Developed from scratch based on our learnings from phase 1, this was the first production storage engine that shipped with 1.0 in 2016 and carried us through 2.0. It served as the workhorse for 3-4 years as both the number of users and the size of their workloads skyrocketed, eventually bumping into architectural limits of TSM.

3) IOx - Equipped with a larger engineering team and years of experience with a wide variety of workloads and use-cases, we developed IOx to handle users' rapidly growing time series workloads.

c4wrd · on Oct 26, 2022

I would argue the other way and praise them for the storage engine changes. Each iteration has had drawbacks, but based on the real-world reported usage they've made decisions to better support what customers are asking for and actually running into, as opposed to iterating on the same engine over and over and making assumptions about real-world usage. Sure, there are drawbacks, but at the end of the day they're continuing to make good improvements for their customers.

mhall119 · on Oct 26, 2022

The original TSM engine is still used by InfluxDB v2 OSS.

The InfluxDB Cloud platform uses a variation of TSM that's tailored for a distributed SaaS rather than stand-alone nodes (this was originally intended to be used in InfluxDB v2 OSS as well, but alpha-testing showed that the old engine performed better there so it ultimately was reverted for the beta release).

So IOx is really the first major new storage engine in InfluxDB.

linsomniac · on Oct 27, 2022

IOW: The first iteration has to be perfect?

Why that vs. Fred Brooks' "Plan to throw the first one away" idea?