Tracy: A real time, nanosecond resolution frame profiler

vblanco · 2024-09-24T08:39:19 1727167159

A truly incredible profiler for the great price of free. There is nothing coming at this level of features and performance even on paid software. Tracy could cost thousands of dollars a year and would still be the best profiler.

Tracy requires you to add macros to your codebase to log functions/scopes, so its not an automatic sampling profiler like superluminal, verysleepy, VS profiler, or others. Each of those macros has around 50 nanoseconds of overhead, so you can liberally use them in the millions. On the UI, it has a stats window that will record average, deviation, min/max of those profiler zones, which can be used to profile functions at the level of single nanoseconds.

Its the main thing i use for all my profiling and optimization work. I combine it with superluminal (sampling profiler) to get a high level overview of the program, then i put tracy zones on the important places to get the detailed information.

eagle2com · 2024-09-24T10:03:35 1727172215

Doesn't Tracy have the capability to do sampling as well? I remember using it at some point, even if it was finicky to setup because windows.

vblanco · 2024-09-24T10:27:51 1727173671

it does, but i dont use it much due to it being too slow and heavy on memory on my ryzen 5950x (32 threads) on windows. a couple seconds of tracing goes into tens of gigabytes of ram.

forrestthewoods · 2024-09-24T15:02:26 1727190146

Yeah I had issues with the Tracy sampler. It didn’t “just work” the way Superluminal did.

My only issue with Superluminal is I can’t get proper callstacks for interpreted languages like Python. It treats all the CPP callstacks as the same. Not sure if Tracy can handle that nicely or not…

forrestthewoods · 2024-09-24T12:49:22 1727182162

Tracy and Superluminal are the way. Both are so good.

Flex247A · 2024-09-24T08:42:27 1727167347

Hello! Going through your tutorial and it's been a great ride!

Thanks for the good work.

Flex247A · 2024-09-24T07:58:09 1727164689

I am a beginner in graphics programming, and I came across this amazing frame profiler.

Web demo of Tracy: https://tracy.nereid.pl/

This blows my mind. It's so fast and responsive I never expected a WebAssembly application to be!

gcr · 2024-09-24T13:26:55 1727184415

My favorite FOSS video game, Dr. Robotnik’s Ring Racers (http://kartkrew.org), has Tracy support! It’s not compiled into the default build. I’ve learned a lot reading the code.

mastax · 2024-09-24T14:54:05 1727189645

This article is a good quick introduction to Tracy: https://luxeengine.com/integrating-tracy-profiler-in-cpp/

mastax · 2024-09-24T13:59:13 1727186353

I just started using this yesterday, it looks really good. Haven’t really dug into it.

Is the latest windows build broken for anyone else? It doesn’t start. In WinDbg it looks like it dereferences a null pointer. I built it myself and it works fine.

mastax · 2024-09-24T14:56:49 1727189809

Of course I come into work the next day and now I can't run my custom built one either...

mastax · 2024-09-24T16:04:08 1727193848

https://github.com/wolfpld/tracy/issues/887

MSVC changed the mutex constructor to constexpr, breaking binary backward compatibility. They say WONTFIX, you must use the latest MSVCRT with the latest MSVC. But I have the latest MSVCRT installed? Whatever - a workaround was pushed to master yesterday.

drpossum · 2024-09-24T09:28:34 1727170114

Can someone explain how this achieves nanosecond resolution? That's an extremely difficult target to reach on computing hardware due to inherent clock resolutions and interrupt timing.

simonask · 2024-09-24T09:52:32 1727171552

There are several sources of timing information, and I think in this context "nanosecond precision" just means that Tracy is able to accurately represent and handle input in nanoseconds.

The resolution of the actual measurements depends on the kind of measurement:

1. If the measurement is based on high resolution timers on the CPU, the resolution depends on the hardware and the OS. On Windows, `QueryPerformanceFrequency()` returns the resolution, and I believe it is often in the order of 10s or 100s of nanoseconds.

2. If the measurement is based on GPU-side performance counters, it depends on the driver and the hardware. Graphics APIs allow you to query the "time-per-tick" value to translate from performance counters to nanoseconds. Performance counters can be down to "number of instructions executed", and since a single instruction can be on the order of 1-2 nanoseconds in some cases, translating a performance counter value to a time period requires nanosecond precision.

3. Modern GPUs also include their own high-precision timers for profiling things that are not necessarily easy to capture with performance counters (like barriers, contention, and complicated cache interactions).

drpossum · 2024-09-24T10:03:31 1727172211

Yes, that's my understanding and why I asked. I disagree about "in this context", though, which is a pitch. If I was going to buy hardware that claimed ns resolution for something I was building I would expect 1ns resolution, not "something around a few ns" and not qualified "only on particular hardware". If such a product were presenting itself in a straightforward way to be compared to similar products and respecting the potential user it would say "resolutions down to a few ns" or something more specific but accurate.

There was even a discussion on this not long ago on how to market to technical folks and things to not do (this is one of the things not to do)

https://www.bly.com/Pages/documents/STIKFS.html

https://news.ycombinator.com/item?id=41368583

vardump · 2024-09-24T09:40:29 1727170829

On x86/AMD64 it uses CPU's TSC clock.

https://github.com/wolfpld/tracy/blob/master/public/client/T...

Galanwe · 2024-09-24T10:05:24 1727172324

It does reach nanosecond only in the sense that its sampling profiler can report nanosecond resolution. I've tried the event profiler for microsecond sensitive projets though, and it blows up the timings and latency even at low event frequency.

vardump · 2024-09-24T13:17:02 1727183822

I think it’s mostly due to Tracy’s poor timing calibration code. TSC is good accuracy and latency wise.

Galanwe · 2024-09-24T16:45:10 1727196310

The problem is not the timestamping, it's the queue used to push profiling events which is not fast enough

vardump · 2024-09-24T16:57:27 1727197047

That queue is about as fast as it gets, <10 ns. Timestamp is taken before queueing.

Again, due to bad calibration code the measured timestamps have quite a bit jitter.

Edit: TSC might not be synchronized in multi-socket systems. (Multiple physical CPU sockets). That can generate a large error.

cwbaker400 · 2024-09-24T09:27:38 1727170058

Tracy is brilliant. @wolfpld I hope you're enjoying reading this and all of the other great comments in this thread. Great work and thank you very very much!

Green-Man · 2024-09-24T15:16:51 1727191011

Does anybody have an opinion or comparison with respect to easy_profiler?

https://github.com/yse/easy_profiler

Especially interesting if based on real practical experience.

boywitharupee · 2024-09-24T11:54:50 1727178890

can someone explain how is profiling tools like this written for GPU applications? wouldn't you need access to internal runtime api?

for ex. Apple wraps Metal buffers as "Debug" buffers to record allocations/deallocations.

MindSpunk · 2024-09-24T12:31:29 1727181089

Some graphics APIs support commands that tell the GPU to record a timestamp when it gets to processing the command. This is oversimplified, but is essentially what you ask the GPU to do. There’s lots of gotchas in hardware that makes this more difficult in practice as a GPU won’t always execute and complete work exactly as you specify at the API level if it’s safe to. And the timestamp domain isn’t always the same as the CPU.

But in principle it’s not that different to how you just grab timestamps on the CPU. On Vulkan the API used is called “timestamp queries”

It’s quite tricky on tiled renderers like Arm/Qualcomm/Apple as they can’t provide meaningful timestamps at much tighter granularity than a whole renderpass. I believe Metal only allows you to query timestamps at the encoder level, which roughly maps to a render pass in Vulkan (at the hardware level anyway)

ossobuco · 2024-09-24T11:56:49 1727179009

I don't know about Tracy, but I've seen a couple WebGPU JS debugging tools simply intercepting calls to the various WebGPU functions like writeBuffer, draw, etc, by modifying the prototypes of Device, Queue and so on[0].

- 0: https://github.com/brendan-duncan/webgpu_inspector/blob/main...

e-dant · 2024-09-25T01:08:32 1727226512

I always squint a little bit when folks mention nanoseconds.

The overheard of observing the time is around 60 nanoseconds.

Light travels about a foot in a nanosecond.

Below ~500 nanoseconds, things get fuzzy.

throwawaymaths · 2024-09-24T16:42:26 1727196146

Can anyone with experience suggest pointers on arguing Tracy vs perfetto? My team uses perfetto and I highly suspect we are running into artefacts due to that.