I wonder if anyone can answer some questions on distributed tracing for me.

As I understand it, the difference between the old days of APM and tracing comes down to two things.

1. Originally, APM was single-process and language-aware. It would usually sample stack traces to find where time was being spent, and instrument a few well-known places for exact timings, say response time or query time.

Tracers work more by instrumenting methods of the framework/server/runtime at well-known points and recording the timing. In many ways that is a lot coarser, as the tracer might not know about a hot loop that I have in my code. But it can trace very well, with exact timing, at framework boundaries like web, cache, db, etc.

2. APMs were primarily single-process and couldn't really show a different service/process, which doesn't work in a microservice/distributed world.

The way I understand it is that Tracers would allow me to narrow down to the service/component very easily. Whether I can find out why that component is slow might not be as easy (not sure what granularity tracing happens inside a component).
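
To make the granularity point concrete, here is a minimal Python sketch of boundary instrumentation. It is not any real tracer's API: `traced`, `SPANS`, and `run_query` are names made up for illustration. The span records how long the whole call took, but nothing about the loop inside it.

```python
import functools
import time

SPANS = []  # hypothetical in-memory sink; a real tracer would export spans


def traced(name):
    """Wrap a boundary call (web handler, cache, db) and record its wall time."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # Recorded even if fn raises, so failed calls are timed too.
                SPANS.append({
                    "name": name,
                    "duration_s": time.perf_counter() - start,
                })
        return wrapper
    return decorator


@traced("db.query")
def run_query(n):
    # This loop is invisible to the tracer: only the total time is recorded.
    return sum(i * i for i in range(n))


result = run_query(1000)
```

After the call, `SPANS` holds one entry named `db.query` with a single duration. Seeing inside `run_query` would need either more spans added by hand or a profiler.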

I wonder if this understanding of mine is correct.

The second thing I am really unsure about is sampling and overhead. What's the usual overhead of a single trace? I know it's variable, but are traces generally more expensive at the level of a single request? Also, do tracers usually sample, and is there a good/recommended way to do it? I forget exactly who (probably NewRelic) was saying they collect all traces (like every request?) and discard them if they are not anomalous (to save on storage). But does that mean taking a trace is very cheap? And is that end of the request sampling decision something that's common or that's a totally unique capability some have?




My understanding is that APM became or always was a marketing term which is used rather freely. For that reason I try to avoid it, but search engines love it and I don't know a better alternative.

>Whether I can find out why that component is slow might not be as easy (not sure what granularity tracing happens inside a component).

It is true that you can't always guess which operation is going to be slow and instrument it, but it is almost always a network or a database call. There is still no way to tell *why* it is slow, but the more data you have, the more hints you get.

>What's the usual overhead of a single trace

The answer can be very different depending on what your baseline for comparison is.

Usually, you trace or instrument network/filesystem/database calls, and in those cases the overhead is negligible (a few percent at most).

>But does that mean taking a trace is very cheap?

What you've described is tail-based sampling, and it only helps with reducing storage requirements. It does not reduce the overhead of collecting the traces in the first place. Check https://uptrace.dev/opentelemetry/sampling.html

But is taking a trace cheap? Definitely. Billions of traces? Not so.

>request sampling decision something that's common or that's a totally unique capability some have.

It is a common practice to reduce the cost of sampling when you have billions of traces, but it is an advanced feature because it requires backends to buffer incoming spans in memory so they can decide whether the trace is anomalous or not.
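
A rough Python sketch of why that buffering is needed. Everything here is hypothetical (`TailBuffer`, the span dict shape, the `has_error` predicate); a real backend also has to handle timeouts, late spans, and memory limits, which this ignores.

```python
from collections import defaultdict


class TailBuffer:
    """Holds every span in memory until its trace ends, then keeps or drops it."""

    def __init__(self, is_anomalous):
        self.is_anomalous = is_anomalous  # predicate over a trace's full span list
        self.pending = defaultdict(list)  # trace_id -> spans still buffered
        self.stored = {}                  # trace_id -> spans that were kept

    def on_span(self, trace_id, span):
        self.pending[trace_id].append(span)

    def on_trace_end(self, trace_id):
        # The decision is only possible here, once all spans have arrived.
        spans = self.pending.pop(trace_id, [])
        if self.is_anomalous(spans):
            self.stored[trace_id] = spans
            return True
        return False


def has_error(spans):
    # Assumed anomaly rule for the sketch: any span flagged as an error.
    return any(span.get("error", False) for span in spans)


buf = TailBuffer(has_error)
buf.on_span("t1", {"name": "GET /cart", "duration_ms": 12})
buf.on_span("t2", {"name": "GET /cart", "duration_ms": 15})
buf.on_span("t2", {"name": "db.query", "duration_ms": 9, "error": True})
in_memory = len(buf.pending)       # both traces are buffered at this point
kept_t1 = buf.on_trace_end("t1")   # no error span, so it is dropped
kept_t2 = buf.on_trace_end("t2")   # has an error span, so it is stored
```

Note that both traces were fully recorded and shipped before one of them was thrown away, which is why this saves storage but not collection cost.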

Besides, you can't store only anomalous traces, because you will lose a lot of useful details, and you can't really detect an anomaly without knowing what the norm is.

Hopefully that helps.


By traditional APM I primarily meant stacktrace sampling based monitoring of applications.

As for the overhead of tracing, I wanted to roughly compare (obviously it depends a lot on the application) stack trace sampling vs. tracing. Are they usually of similar overhead, or is tracing, say, lighter?
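
For reference, this is roughly what I mean by stack trace sampling, as a toy Python sketch. It relies on `sys._current_frames()`, which is CPython-specific, and `sample_stacks` and `hot_loop` are names I made up. The cost here scales with the sampling interval rather than with the number of requests.

```python
import collections
import sys
import threading
import time


def sample_stacks(target_thread_id, interval_s, duration_s):
    """Periodically record which function the target thread is executing."""
    counts = collections.Counter()
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        frame = sys._current_frames().get(target_thread_id)
        if frame is not None:
            # Only the innermost frame is counted; a real profiler keeps the full stack.
            counts[frame.f_code.co_name] += 1
        time.sleep(interval_s)
    return counts


def hot_loop(stop):
    # Uninstrumented code: no tracer span would ever cover this.
    while not stop.is_set():
        sum(i * i for i in range(1000))


stop = threading.Event()
worker = threading.Thread(target=hot_loop, args=(stop,))
worker.start()
counts = sample_stacks(worker.ident, interval_s=0.005, duration_s=0.2)
stop.set()
worker.join()
```

Most samples should land in `hot_loop` or the generator expression inside it, without the code having been instrumented at all.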

I was thinking tail-based sampling could be a lot more expensive because, say, head-based sampling traces 10% of requests, whereas a tail-based one traces 100% of them regardless of how many samples are kept. So the tracing overhead would be much higher, right?
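
To put numbers on that reasoning, a small Python sketch, assuming a hash-of-trace-ID head sampler (one common way to make the decision consistently across services; `head_sampled` is my own helper, not a library function).

```python
import hashlib


def head_sampled(trace_id, rate):
    """Deterministic head decision: map the trace ID to a 64-bit bucket and compare."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")
    # Integer comparison avoids float rounding at the edges (rate 0.0 and 1.0).
    return bucket < int(rate * 2**64)


trace_ids = [f"trace-{i}" for i in range(10_000)]

# Head-based: only the selected requests ever record spans.
head_traced = sum(head_sampled(t, 0.10) for t in trace_ids)

# Tail-based: every request records spans; the keep/drop decision comes later.
tail_traced = len(trace_ids)
```

Here `head_traced` comes out at roughly a tenth of `tail_traced`, so if per-request tracing cost is what matters, tail-based pays it on every request no matter how few traces are stored.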

I'm not sure why head-based sampling is being called accurate in your doc. Isn't it the least accurate, in the sense that it's purely statistical and rare outliers like a latency spike or an error could be missed?

And yes, obviously tail-based sampling has to be something like (trace 5% of random requests, or 1 in every five, plus any outlier that gets calculated based on the captured traces).
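
Something like this, as a Python sketch. The definition of "outlier" (slower than the 99th percentile of recent durations) and the names `slow_threshold` and `keep_trace` are my assumptions, not how any particular vendor does it.

```python
import random
import statistics


def slow_threshold(recent_durations_ms):
    """Assumed norm: the 99th percentile of recently observed durations."""
    # quantiles(..., n=100) returns 99 cut points; index 98 is the 99th percentile.
    return statistics.quantiles(recent_durations_ms, n=100)[98]


def keep_trace(trace, threshold_ms, baseline_rate, rng=random):
    """Return the reason a finished trace is kept, or None if it is dropped."""
    if trace.get("error"):
        return "error"
    if trace["duration_ms"] > threshold_ms:
        return "outlier"
    # Random baseline so normal traffic is still represented in storage.
    if rng.random() < baseline_rate:
        return "baseline"
    return None


recent = list(range(1, 101))  # made-up durations: 1..100 ms
threshold = slow_threshold(recent)

slow = keep_trace({"duration_ms": 5000}, threshold, baseline_rate=0.05)
failed = keep_trace({"duration_ms": 10, "error": True}, threshold, baseline_rate=0.05)
```

The threshold has to come from traffic that was actually observed, which is why the baseline share can't be zero.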



