We Built a 19 PiB Logging Platform with ClickHouse and Saved Millions (clickhouse.com)
54 points by samber 9 months ago | 12 comments



Hey! I'm the original author of this post. I'm so excited to share our journey with ClickHouse and the open-source observability world. I'll be happy to answer any questions you may have!


Very cool write-up. I'm curious about any challenges you had using Grafana. Also, do you think this sort of system would work as an alternative to Splunk as well?


Grafana works really well out of the box for most use cases. Initially the LogHouse UI was built using the out-of-the-box Grafana tools (along with the ClickHouse data source plugin, which we maintain). You can get a really long way with the zero-code dashboarding tools, especially with the latest plugin release (4.0), which comes with a completely rebuilt query builder that has an opinionated mode specifically for OpenTelemetry data.

LogHouse exhausted the limits of the zero-code UI in a few places, and it was necessary to evolve the UI from a dashboard into a Grafana plugin. Doing so gives you much more control to build a full application using the Grafana primitives. I really like that as an SRE I can build a web app without spending any time building UI components, instead just controlling the layout on the page and declaring “this panel has the following `SELECT…` query”, which is generated by some TypeScript function. Examples of things we can do on top of Scenes are:

- Pick a different schema based on the query parameters. For instance, we have different schemas for different applications (Keeper/Server/Generic K8s app) and the app picks the necessary schema.

- Always show the full generated SQL query on the page (we like to use the Grafana UI to start off and then jump into fully manual SQL for deeper analysis).

- Take one filter value (for instance, k8s namespace) and look up all of the other filters required (pod names which were live during the time period, region, cell, etc.).

- Some small gadgets, like enabling users to import the time range from another application URL, e.g. a DataDog link. Oftentimes we start by looking at metrics in another source and then want to jump into the logs. A rough sketch of that last gadget is below.
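
To give a flavour of that last gadget, here is a minimal sketch (not our actual plugin code, and it assumes Datadog-style `from_ts`/`to_ts` epoch-millisecond query parameters): parse the pasted URL and push the values into the scene's time range object.

    import { SceneTimeRange } from '@grafana/scenes';

    // Hypothetical helper: pull an absolute time range out of a pasted
    // Datadog-style URL (from_ts/to_ts in epoch milliseconds) and apply it.
    function importTimeRange(pastedUrl: string, timeRange: SceneTimeRange): void {
        const params = new URL(pastedUrl).searchParams;
        const from = params.get('from_ts');
        const to = params.get('to_ts');
        if (from && to) {
            // Grafana treats epoch-millisecond strings as absolute range values
            timeRange.setState({ from, to });
        }
    }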


Can you share some details of how you implemented the cross-region routing in Grafana? I think the article mentioned that you created your own plugin; is that plugin open-sourced?

We would like to do something similar, but not sure where to start.


Sure, the plugin is built on top of Scenes: https://grafana.com/developers/scenes

Add all of your required data sources with well-known UIDs: https://grafana.com/docs/grafana/latest/administration/provi... Then you can create a query panel and update the state with the desired query SQL and data source ID, like so:

    const logsQuery = new SceneQueryRunner({ queries: [] });
    logsQuery.setState({
        datasource: { uid: dataSourceID },
        queries: [
            {
                datasource: {
                    type: 'grafana-clickhouse-datasource',
                    uid: dataSourceID,
                },
                queryType: 'sql',
                rawSql: logsQuerySQL,
                // ...
            },
        ],
    });
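
The cross-region routing itself then mostly comes down to choosing which well-known data source UID goes into that state. A hand-wavy sketch (the region names and UIDs here are made up; the real ones come from the provisioning config):

    // Hypothetical map from region to the provisioned data source UID
    const dataSourceByRegion: Record<string, string> = {
        'us-east-1': 'clickhouse-logs-us-east-1',
        'eu-west-1': 'clickhouse-logs-eu-west-1',
    };

    function dataSourceForRegion(region: string): string {
        const uid = dataSourceByRegion[region];
        if (!uid) {
            throw new Error(`no LogHouse data source provisioned for ${region}`);
        }
        return uid; // becomes the dataSourceID used in setState() above
    }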


Great write-up.

>The recent efforts to move the JSON type to production-ready status will be highly applicable to our logging use case. This feature is currently being rearchitected, with the development of the Variant type providing the foundation for a more robust implementation. When ready, we expect this to replace our map with more strongly typed (i.e. not uniformly typed) metadata structures that are also possibly hierarchical.

Very happy to see ClickHouse dogfooding itself for storing logs - hope this will help to hasten the work on making the JSON type more suitable for dynamic documents.


Yes, we are working on it! :) Taking some of the learnings from the current experimental JSON Object datatype, we are now building what will become the production-ready implementation. Details here: https://github.com/ClickHouse/ClickHouse/issues/54864

The Variant datatype is already available as experimental in 24.1, the Dynamic datatype is WIP (PR almost ready), and the JSON datatype is next up. Check out the latest comment on that issue for how the Dynamic datatype will work: https://github.com/ClickHouse/ClickHouse/issues/54864#issuec...
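
If you want to kick the tires on Variant in the meantime, here is a rough sketch using the official @clickhouse/client package for Node.js (the table name and values are made up for illustration, and the experimental flag is required on 24.1):

    import { createClient } from '@clickhouse/client';

    const client = createClient({ url: 'http://localhost:8123' });

    // Variant is experimental in 24.1, so it must be enabled explicitly
    await client.command({
        query: `CREATE TABLE IF NOT EXISTS variant_demo
                (v Variant(UInt64, String, Array(UInt64)))
                ENGINE = MergeTree ORDER BY tuple()`,
        clickhouse_settings: { allow_experimental_variant_type: 1 },
    });

    // Each row can hold any one of the declared types
    await client.command({
        query: `INSERT INTO variant_demo VALUES (42), ('hello'), ([1, 2, 3])`,
    });

    // variantType() reports which type each row actually holds
    const result = await client.query({
        query: 'SELECT v, variantType(v) AS t FROM variant_demo',
        format: 'JSONEachRow',
    });
    console.log(await result.json());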


Interesting post.

How do you apply restrictions on your queries? Otherwise, a few concurrent queries scanning huge amounts of data, or running slowly due to GROUP BYs etc., can slow down the system.

Also, I see a sorting key of `ORDER BY (PodName, Timestamp)`. While debugging, filtering by service_name, deployment_name, env, region, etc. is probably going to be slow?


It's a log. When a log is that big, is it still useful?


The use case of 19 PiB of logging data feels very contrived to me. I've worked for smaller and bigger companies and never faced logging in the petabyte range. I'm not saying it's not a thing; FAANG-level companies certainly have such needs, but they have their own large-scale solutions already. The question remains: besides bragging, who is the average Joe with 19 PiB of logging data you might want to address as a potential customer?

What would be useful from my perspective are benchmarks in the more common terabyte range. How much faster is it to query compared to existing cloud offerings, and what features do e.g. Datadog vs. ClickHouse have to analyze the data? In the end, the raw data is not much use if you cannot easily find and extract meaningful data from it.


We explored a scale similar to what you described in another blog: https://clickhouse.com/blog/cost-predictable-logging-with-cl...

Feature differentiation is actually a pretty interesting topic for o11y. You can do many things with an OLAP store, but you need to be aware of the differences with off-the-shelf solutions. I try to summarize it here: https://clickhouse.com/blog/the-state-of-sql-based-observabi...

I hope this helps! I'd love to hear your opinion about it.


Thanks, this indeed looks much more interesting. I'm definitely looking into this.



