Hacker News | averne_'s comments

Do you mind going in some detail as to why they suck? Not a dig, just genuinely curious.

95% GPU usage but only 2x faster than the reference SIMD encoder/decoder

What I wonder is, how do you get the video frames to be compressed from the video card into the encoder?

The only frame capture APIs I know of take the image from the GPU to CPU RAM; then you can put it back into the GPU for encoding.

Are there APIs which can sidestep the "load to CPU RAM" part?

Or is it implied that a game streaming codec has to be implemented with custom GPU drivers?


Some capture cards (Blackmagic comes to mind) have worked with NVIDIA to expose DMA access. This way, video frames are transferred directly from the card to GPU memory, bypassing system RAM and the CPU. I think all GPU manufacturers expose APIs to do this, but it's not that common in consumer products.

> Are there APIs which can sidestep the "load to CPU RAM" part?

On Windows, that API is Desktop Duplication. It delivers D3D11 textures, usually in BGRA8_UNORM format. When HDR is enabled, you need a slightly different API method which can deliver HDR frames in RGBA16_FLOAT pixel format.
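Roughly, the capture loop looks like this. This is just a sketch in C using the COM C macros; `device` (ID3D11Device) and `output1` (IDXGIOutput1) are assumed to already exist, and error handling is omitted:

    /* Sketch of a Desktop Duplication capture step: frames never leave VRAM. */
    #define COBJMACROS
    #include <d3d11.h>
    #include <dxgi1_2.h>

    void capture_one_frame(ID3D11Device *device, IDXGIOutput1 *output1)
    {
        IDXGIOutputDuplication *dup = NULL;
        DXGI_OUTDUPL_FRAME_INFO info;
        IDXGIResource *resource = NULL;
        ID3D11Texture2D *frame = NULL;

        /* Start duplicating this output */
        IDXGIOutput1_DuplicateOutput(output1, (IUnknown *)device, &dup);

        /* Wait up to 16 ms for a new desktop frame */
        if (IDXGIOutputDuplication_AcquireNextFrame(dup, 16, &info, &resource) == S_OK) {
            /* The texture is typically BGRA8_UNORM (RGBA16_FLOAT when HDR is on);
               hand it straight to the hardware encoder without a CPU round-trip. */
            IDXGIResource_QueryInterface(resource, &IID_ID3D11Texture2D, (void **)&frame);

            /* ... feed `frame` to the encoder here ... */

            ID3D11Texture2D_Release(frame);
            IDXGIResource_Release(resource);
            IDXGIOutputDuplication_ReleaseFrame(dup);
        }
        IDXGIOutputDuplication_Release(dup);
    }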


There's also Windows.Graphics.Capture. It lets you get a texture not only for the whole desktop, but also for individual windows.

On Linux you should look into GStreamer and dmabuf.

Hardware GPU encoders refer to dedicated ASIC engines, separate from the main shader cores. So they run in parallel and there is no performance penalty for using both simultaneously, besides increased power consumption.

Generally, you're right that these hardware blocks favor latency. One example of this is motion estimation (one of the most expensive operations during encoding). The NVENC engine on NVidia GPUs will only use fairly basic detection loops, but can optionally be fed motion hints from an external source. I know that NVidia has a CUDA-based motion estimator (called CEA) for this purpose. On recent GPUs there is also the optical flow engine (another separate block) which might be able to do higher quality detection.


I'm pretty sure they aren't dedicated ASIC engines anymore. That's why hacks like nvidia-patch are a thing, where you can scale NVENC usage up to the full GPU's compute rather than the arbitrary limitation NVIDIA adds. The penalty for using them within those limitations tends to be negligible, however.

And on a similar note, NvFBC helps a ton with latency, but it's disabled at the driver level for consumer cards.


> I'm pretty sure they aren't dedicated ASIC engines anymore.

They are. That patch doesn't do what you think it does.


Matrix instructions do of course have uses in graphics. One example of this is DLSS.


This feels backwards to me when GPUs were created largely because graphics needed lots of parallel floating point operations, a big chunk of which are matrix multiplications.

When I think of matrix multiplication in graphics I primarily think of transforms between spaces: moving vertices from object space to camera space, transforming from camera space to screen space, ... This is a big part of the math done in regular rendering and needs to be done for every visible vertex in the scene - typically in the millions in modern games.
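As a toy illustration of that per-vertex work, here is the same multiply in plain C on the CPU (real engines do this in vertex shaders, and matrix conventions differ between APIs):

    #include <stdio.h>

    typedef struct { float m[4][4]; } mat4;          /* row-major for this toy example */
    typedef struct { float x, y, z, w; } vec4;

    /* One model/view/projection-style transform: a 4x4 matrix times a position */
    static vec4 mat4_mul_vec4(mat4 a, vec4 v)
    {
        vec4 r;
        r.x = a.m[0][0]*v.x + a.m[0][1]*v.y + a.m[0][2]*v.z + a.m[0][3]*v.w;
        r.y = a.m[1][0]*v.x + a.m[1][1]*v.y + a.m[1][2]*v.z + a.m[1][3]*v.w;
        r.z = a.m[2][0]*v.x + a.m[2][1]*v.y + a.m[2][2]*v.z + a.m[2][3]*v.w;
        r.w = a.m[3][0]*v.x + a.m[3][1]*v.y + a.m[3][2]*v.z + a.m[3][3]*v.w;
        return r;
    }

    int main(void)
    {
        /* Translate an object-space vertex 5 units along +x (a stand-in for the
           full object->camera->clip chain, which is usually fused into one matrix);
           the GPU repeats this for every visible vertex, every frame. */
        mat4 translate = {{{1,0,0,5}, {0,1,0,0}, {0,0,1,0}, {0,0,0,1}}};
        vec4 vertex = {1, 2, 3, 1};
        vec4 out = mat4_mul_vec4(translate, vertex);
        printf("%.1f %.1f %.1f %.1f\n", out.x, out.y, out.z, out.w);  /* 6.0 2.0 3.0 1.0 */
        return 0;
    }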

I suppose the difference here is that DLSS is a case where you primarily do large numbers of consecutive matrix multiplications with little other logic, since it's more ANN code than graphics code.


Self-plug, but I wrote an open-source NVDEC driver for the Tegra X1, working on both the Switch OS and NVidia's Linux distro (L4T): https://github.com/averne/FFmpeg.

It currently integrates all the low-level bits into FFmpeg directly, though I am looking at moving those to a separate library. Eventually, I hope to support desktop cards as well with minimal code changes.


The mushrooms are imported from China or Poland as mycelium, and the harvest is done in France. Since the law distinguishes between mycelium and mushroom, the mushrooms were technically produced in France.

https://web.archive.org/web/20240121180131/https://www.reddi...


It's not so clear cut. The author of the original PR had serious gripes about jart's handling of the situation, especially how hard they pushed their PR, practically forcing the merge before legitimate concerns were addressed.

See this post https://news.ycombinator.com/item?id=35418066


This isn't true anymore. It was their first approach, but since then they have switched to their own JIT recompiler. You can read their rationale here: https://github.com/Ryujinx/Ryujinx/pull/693

For the macOS port, they also added an ARM-to-ARM JIT in case the hypervisor runs into issues.


> they have switched to their own JIT recompiler ... they also added an ARM-to-ARM JIT in case the hypervisor runs into issues

Having worked in industry for some multiple of decades, I can say with some confidence that if a small team were to successfully build and deploy their own JIT recompiler to solve an actual customer problem, they would be considered gods by management and given bonuses and promotions. Realistically, the project would never get off the ground because their management would be pressuring them to hurry up and add some random new API within the next quarter. Most devs are just looking at the existing code and more or less doing a copy-and-paste with some minor tweaks for their new "feature." They're trying to slap together quick demos that are little more than, "And now this thing makes an RPC call to that thing." The level of performance I see from people who pull down well into 6 figures of income typically falls well below the level that I'm seeing in this Switch emulator project.

Usually for something like that to actually happen it takes a VP mobilizing an org of size 20+, with multiple layers of management taking a year or more to hire or steal talent from other orgs. Some companies are built differently (e.g., Apple) and can pull cross-org talent together for something like "get Intel binaries running pretty well on M1." But I find that tends to be the exception rather than the norm for larger tech companies.

Maybe I'm just working in the wrong places.


That's the difference between people working for a paycheck and people working for a passion project. Even if 90% of the people working at a company are doing it for the passion and not the paycheck, they still have to contend with the other 10% who do not have the passion (but may be good at hiding this fact).

With a passion project, you start and stop whenever you want, and if you're not interested, you're not working on it anymore. So only the people who are truly intrinsically motivated will continue.

A paycheck is a form of compulsion. "You could be doing anything, but you're doing this for me specifically because I pay you." You might also be very interested, but the paycheck is the only thing that specifically binds you to the company, as opposed to binding you to the work itself.


I don't get it. For me the biggest problem isn't the "ulterior" motive of the paycheck but rather that you simply don't get to work on exciting problems to begin with.

If you told me I have eight hours a day to work on this, and I have to spend those eight hours, you will get a far better emulator than if I had to do this only on the weekends and could quit at any point. That is how most personal projects end up: in some half-baked state that nobody uses.


I think that first approach (part of the name RyujinX comes from RyuJIT?), while not optimal, did get them off the ground quickly. Now, with a bit of traction (people are invested in the project since it actually functions), they could easily find takers (or the time) to write the improved JIT.

I'm actually tinkering on a WASM runtime, and emitting MSIL code from WASM trees is quite straightforward so far (though I'm running into some more complicated cases now that I'm integrating the test suite). Compared to the non-trivial (code-stamping) pure native JITs I've done in the past, it's quite a big time-saver (and those still only targeted one CPU platform).


There are OpenGL extensions which can import a provided GPU buffer as a texture; using those, you can achieve zero-copy.

For instance, with VAAPI->OpenGL you would use vaExportSurfaceHandle in conjunction with glEGLImageTargetTexture2DOES.
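A rough C sketch of that path, assuming a valid VADisplay/EGLDisplay and the EGL_EXT_image_dma_buf_import and GL_OES_EGL_image extensions. Only plane 0 (luma for NV12) is imported here to keep it short; a full implementation maps every plane:

    /* Sketch: zero-copy import of a VAAPI surface into OpenGL via dma-buf. */
    #include <va/va.h>
    #include <va/va_drmcommon.h>
    #include <EGL/egl.h>
    #include <EGL/eglext.h>
    #include <GLES2/gl2.h>
    #include <GLES2/gl2ext.h>
    #include <unistd.h>

    GLuint import_luma_plane(VADisplay va_dpy, VASurfaceID surface, EGLDisplay egl_dpy)
    {
        /* Export the decoded surface as dma-buf fds, one layer per plane */
        VADRMPRIMESurfaceDescriptor desc;
        vaExportSurfaceHandle(va_dpy, surface,
                              VA_SURFACE_ATTRIB_MEM_TYPE_DRM_PRIME_2,
                              VA_EXPORT_SURFACE_READ_ONLY | VA_EXPORT_SURFACE_SEPARATE_LAYERS,
                              &desc);

        const EGLint attribs[] = {
            EGL_WIDTH,                     (EGLint)desc.width,
            EGL_HEIGHT,                    (EGLint)desc.height,
            EGL_LINUX_DRM_FOURCC_EXT,      (EGLint)desc.layers[0].drm_format,
            EGL_DMA_BUF_PLANE0_FD_EXT,     desc.objects[desc.layers[0].object_index[0]].fd,
            EGL_DMA_BUF_PLANE0_OFFSET_EXT, (EGLint)desc.layers[0].offset[0],
            EGL_DMA_BUF_PLANE0_PITCH_EXT,  (EGLint)desc.layers[0].pitch[0],
            EGL_NONE,
        };

        PFNEGLCREATEIMAGEKHRPROC eglCreateImageKHR =
            (PFNEGLCREATEIMAGEKHRPROC)eglGetProcAddress("eglCreateImageKHR");
        PFNGLEGLIMAGETARGETTEXTURE2DOESPROC glEGLImageTargetTexture2DOES =
            (PFNGLEGLIMAGETARGETTEXTURE2DOESPROC)eglGetProcAddress("glEGLImageTargetTexture2DOES");

        EGLImageKHR image = eglCreateImageKHR(egl_dpy, EGL_NO_CONTEXT,
                                              EGL_LINUX_DMA_BUF_EXT, NULL, attribs);

        GLuint tex;
        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_2D, tex);
        glEGLImageTargetTexture2DOES(GL_TEXTURE_2D, image);  /* no copy: texture aliases VRAM */

        /* The exported fds must be closed once the image has been created */
        for (unsigned i = 0; i < desc.num_objects; i++)
            close(desc.objects[i].fd);
        return tex;
    }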

Check out the "hwdec" mechanism in MPV:

https://github.com/mpv-player/mpv/blob/master/video/out/hwde...

https://github.com/mpv-player/mpv/blob/master/video/out/hwde...


Sure. I think the part that’s missing is that FFMPEG runs out of process and doesn’t deal in GPU buffers/textures.


If you use ffmpeg’s libavcodec interface then you can get it to give you the decoded framebuffers as exported DRM-prime descriptors which you can turn into textures. This is how Firefox does video decode with VAAPI, using libavcodec as a wrapper.

Edit: missed the part about JS ecosystem. You can move DRM prime descriptors between processes, but I assume you can’t do this from the ffmpeg CLI and would need to write your own little C wrapper around libavcodec
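For what it's worth, a minimal sketch of what such a wrapper could do with libavcodec, assuming `frame` came out of avcodec_receive_frame() with format AV_PIX_FMT_VAAPI (error handling omitted):

    /* Sketch: map a decoded VAAPI frame to DRM-prime so its dma-buf fds can be
     * turned into textures, or passed to another process over a unix socket. */
    #include <libavcodec/avcodec.h>
    #include <libavutil/hwcontext.h>
    #include <libavutil/hwcontext_drm.h>
    #include <unistd.h>

    int export_drm_prime(const AVFrame *frame)
    {
        AVFrame *drm = av_frame_alloc();
        drm->format = AV_PIX_FMT_DRM_PRIME;

        /* No copy: this just wraps the underlying VA surface as dma-buf objects */
        av_hwframe_map(drm, frame, AV_HWFRAME_MAP_READ);

        const AVDRMFrameDescriptor *desc = (const AVDRMFrameDescriptor *)drm->data[0];

        /* Keep our own reference to the first dma-buf; desc->layers[] describes
           the plane formats, offsets and pitches needed to import it later. */
        int fd = dup(desc->objects[0].fd);

        av_frame_free(&drm);  /* releases the mapping and its original fds */
        return fd;
    }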


You can just use __builtin_popcount or equivalent, which maps to a single instruction on most platforms.
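For example (GCC/Clang builtin; other compilers have equivalents such as MSVC's __popcnt):

    #include <stdio.h>

    int main(void)
    {
        unsigned v = 0xF00Fu;
        /* Typically compiles to a single POPCNT (x86) or CNT (ARM) instruction */
        printf("%d bits set\n", __builtin_popcount(v));  /* prints "8 bits set" */
        return 0;
    }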


The nouveau project used a kernel module to intercept mmio accesses: https://nouveau.freedesktop.org/MmioTrace.html. Generally speaking hooking onto driver code is one of the preferred ways of doing dynamic reverse engineering. For userspace components, you can build an LD_PRELOAD stub that logs ioctls, and so on.
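A minimal sketch of such an LD_PRELOAD stub (glibc-style prototype; the ioctl argument is logged as an opaque pointer, since decoding it requires knowing each request's struct layout):

    /* Minimal LD_PRELOAD shim that logs every ioctl before forwarding it.
     * Build: gcc -shared -fPIC -o ioctl_spy.so ioctl_spy.c -ldl
     * Use:   LD_PRELOAD=./ioctl_spy.so some_program */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdarg.h>
    #include <stdio.h>

    int ioctl(int fd, unsigned long request, ...)
    {
        static int (*real_ioctl)(int, unsigned long, ...);
        if (!real_ioctl)
            real_ioctl = (int (*)(int, unsigned long, ...))dlsym(RTLD_NEXT, "ioctl");

        /* Most driver ioctls pass a single pointer argument */
        va_list ap;
        va_start(ap, request);
        void *arg = va_arg(ap, void *);
        va_end(ap);

        fprintf(stderr, "ioctl(fd=%d, request=0x%lx, arg=%p)\n", fd, request, arg);
        return real_ioctl(fd, request, arg);
    }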



