I wonder if anyone can answer some questions on distributed tracing for me.
As I understand it, the difference between the old days of APM and tracing comes down to two things.
1. Originally APM was single-process and language-aware, usually doing stacktrace sampling to find where time is being spent, plus instrumenting a few very well-known places for exact timings, say response time or query time.
Tracers work more by instrumenting methods of the framework/server/runtime at well-known points and recording the timing. In many ways it's a lot more coarse, as it might not know about a hot loop that I have in my own code. But it can trace very well, with exact timing, at framework boundaries like web, cache, db etc.
2. APM tools were primarily single-process and couldn't really show a different service/process, which doesn't work in a micro-service/distributed world.
The way I understand it is that tracers would allow me to narrow down to the service/component very easily. Whether I can find out why that component is slow might not be as easy (not sure at what granularity tracing happens inside a component).
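To make that concrete, this is roughly the mental model I have (just a toy Python sketch with made-up names, not any particular vendor's agent): the tracer only sees the spans created at well-known boundaries, so a hot loop inside my own handler would only show up as an unexplained gap between child spans, whereas a stacktrace-sampling profiler would have caught it.

    import time
    from opentelemetry import trace  # API only; spans are no-ops unless an SDK is wired up

    tracer = trace.get_tracer("checkout-service")  # made-up service name

    def load_cart(user_id):
        time.sleep(0.02)              # stand-in for a real DB round trip
        return [1.0, 2.5, 3.0]

    def handle_request(user_id):
        with tracer.start_as_current_span("GET /checkout"):   # web framework boundary
            with tracer.start_as_current_span("db.query"):    # db client boundary
                cart = load_cart(user_id)
            total = 0.0
            for price in cart * 100_000:  # hot loop: invisible to the tracer,
                total += price            # obvious to a sampling profiler
            return total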
I wonder if this understanding of mine is correct.
The second thing I am really unsure about is sampling and overhead. What's the usual overhead of a single trace (I know it's variable), but generally, are they more expensive at the single-request level? Also, do they usually sample, and is there a good/recommended way to sample this? I forget exactly who (probably NewRelic), but someone was saying they collect all traces (like every request?) and discard them if they are not anomalous (to save on storage). But does that mean taking a trace is very cheap? And is that end-of-request sampling decision something that's common, or is that a totally unique capability some have?
My understanding is that APM either became, or always was, a marketing term that is used rather freely. For that reason I try to avoid it, but search engines love it and I don't know a better alternative.
>Whether I can find out why that component is slow might not be as easy (not sure at what granularity tracing happens inside a component).
It is true that you can't always guess what operation is going to be slow and instrument it, but it is almost always a network or a database call. There is still no way to tell *why* it is slow, but the more data you have, the more hints you get.
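For example, attaching a bit of context to the database span doesn't tell you *why* it was slow, but it narrows the search a lot. A minimal sketch (attribute names loosely follow the OpenTelemetry semantic conventions; the row-count one is made up, so adjust to whatever your tracer uses):

    from opentelemetry import trace

    tracer = trace.get_tracer("orders-service")  # made-up name

    def fetch_orders(customer_id):
        with tracer.start_as_current_span("db.query") as span:
            span.set_attribute("db.system", "postgresql")
            span.set_attribute("db.statement", "SELECT * FROM orders WHERE customer_id = $1")
            rows = []  # ... run the real query here ...
            span.set_attribute("db.rows_returned", len(rows))  # made-up attribute, but any hint helps
            return rows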
>What's the usual overhead of a single trace
Depending on what your baseline for comparison is, the answer can be very different.
Usually, you trace or instrument network/filesystem/database calls, and in those cases the overhead is negligible (a few percent at most).
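You can convince yourself with a toy measurement (plain Python, no tracing SDK; the exact numbers will differ on your machine): per-span bookkeeping is a couple of clock reads and an append, which is noise next to a call that spends milliseconds on the network.

    import time
    from contextlib import contextmanager

    @contextmanager
    def toy_span(sink, name):
        start = time.perf_counter_ns()   # roughly what a tracer records per span
        try:
            yield
        finally:
            sink.append((name, time.perf_counter_ns() - start))

    def fake_db_call():
        time.sleep(0.002)                # 2 ms stand-in for a network/DB round trip

    sink = []

    t0 = time.perf_counter()
    for _ in range(200):
        fake_db_call()
    bare = time.perf_counter() - t0

    t0 = time.perf_counter()
    for _ in range(200):
        with toy_span(sink, "db.query"):
            fake_db_call()
    traced = time.perf_counter() - t0

    print(f"bare: {bare:.3f}s  traced: {traced:.3f}s  overhead: {(traced - bare) / bare:.1%}")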
>But does that mean taking a trace is very cheap?
What you've described is tail-based sampling and it only helps with reducing storage requirements. It does not reduce the sampling overhead. Check https://uptrace.dev/opentelemetry/sampling.html
But is taking a trace cheap? Definitely. Billions of traces? Not so much.
>request sampling decision something that's common or that's a totally unique capability some have.
It is a common practice for reducing costs when you have billions of traces, but it is an advanced feature because it requires the backend to buffer incoming spans in memory so it can decide whether the trace is anomalous or not.
Besides, you can't store only anomalous traces, because you will lose a lot of useful details, and you can't really detect an anomaly without knowing what the norm is.
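If it helps, this is the shape of the idea in a few lines of Python (thresholds and structure are made up, not how any particular backend implements it): every request is traced, spans are buffered per trace id in memory, and only once the trace completes do you decide whether to store it, keeping a small random fraction of normal traces so you still know what the norm looks like.

    import random

    BUFFER = {}   # trace_id -> finished spans, held in memory until the trace completes
    STORED = []   # what actually goes to long-term storage

    def record_span(trace_id, span):
        BUFFER.setdefault(trace_id, []).append(span)

    def finish_trace(trace_id, keep_fraction=0.05, slow_ms=500):
        spans = BUFFER.pop(trace_id, [])
        anomalous = any(s["error"] or s["duration_ms"] > slow_ms for s in spans)
        baseline = random.random() < keep_fraction   # keep some normal traces too
        if anomalous or baseline:
            STORED.extend(spans)   # only storage is saved; the tracing cost was already paid

    record_span("t1", {"name": "GET /checkout", "duration_ms": 820, "error": False})
    record_span("t1", {"name": "db.query", "duration_ms": 790, "error": False})
    finish_trace("t1")    # kept, because it was slow
    print(len(STORED), "spans stored")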
By traditional APM I primarily meant stacktrace sampling based monitoring of applications.
As for the overhead of tracing, I wanted to roughly compare (obviously it depends a lot on the application) stacktrace sampling vs. a tracing-based approach. Are they usually of similar overhead, or is tracing lighter, say?
I was thinking tail-based sampling could be a lot more expensive, because a head-based sampler is, say, tracing 10% of requests, whereas, regardless of how many samples are kept, a tail-based one is tracing 100%. So the tracing overhead would be much higher, right?
I'm not sure why head-based sampling is being called accurate in your doc? Isn't it the least accurate, in the sense that it's purely statistical and a rare outlier like a latency spike or an error could be missed?
And yes, obviously tail-based sampling has to be something like (trace 5% of requests at random, or 1 in every five, + any outlier that gets identified based on the captured trace).
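For reference, what I mean by head-based is something like this (OpenTelemetry's Python SDK, assuming I'm reading its docs right): the keep/drop decision is made once at the root span, so requests that lose the coin flip produce non-recording spans and skip the record/export cost entirely.

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

    # sample ~10% of traces, decided up front from the trace id;
    # downstream services follow the parent's decision via context propagation
    provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.10)))
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer("checkout-service")  # made-up name
    with tracer.start_as_current_span("GET /checkout") as span:
        print("recorded this time?", span.is_recording())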