
The current code with the function pointers and PODs works OK for me; it's somewhere in the million-operations-per-second range, and I didn't profile any alternative to it. How would you implement such an alternative with JIT code generation in this POD/function C code setup without major rewrites? It seems to me like a completely different system, which is bound to LLVM (I use gcc)?

edit: since you were mentioning JIT optimisation and code generation - this is a field in which I don't have any experience, but my impression so far is that this runtime rearrangement of code can yield quite a big performance increase for higher-level languages (the JVM does this AFAIK). Is the link you provided the same idea, but for statically compiled C/C++ code?




If the chain of operations changes rarely, you might even get away with calling clang (or whatever compiler) at runtime and actually compiling the filter chain's source code mashed together into one unit.
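
Roughly like this (a minimal sketch; the chain_fn signature, the /tmp paths, and the exported process_chain symbol are all made up for illustration; link with -ldl on glibc):

    /* Sketch: write out the mashed-together filter source, compile it as a
       shared object with the system compiler, and load it with dlopen(). */
    #include <dlfcn.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef void (*chain_fn)(float *buf, int n);

    chain_fn build_chain(const char *chain_src) {
        FILE *f = fopen("/tmp/chain.c", "w");
        if (!f) return NULL;
        fputs(chain_src, f);   /* generated filter-chain source */
        fclose(f);

        /* -shared -fPIC so the result can be dlopen()ed */
        if (system("cc -O2 -shared -fPIC -o /tmp/chain.so /tmp/chain.c") != 0)
            return NULL;

        void *h = dlopen("/tmp/chain.so", RTLD_NOW);
        if (!h) return NULL;
        /* the generated source is assumed to export process_chain() */
        return (chain_fn)dlsym(h, "process_chain");
    }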


So how would you structure, at the raw level, a performance-critical (audio) program that needs some flexible path for the data processing - say, a 128-sample buffer and about 3-10 choices along the processing path, where the path should be switchable at per-buffer frequency? Function pointers were the best solution I could come up with (a sketch follows below), but maybe that doesn't really eliminate branching, cache misses, and stalling and their impact on performance - it just hides them in plain sight.
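
Roughly, my current setup looks like this (simplified sketch; the stage names and parameter fields are made up for illustration):

    /* Each processing stage is a function pointer plus a POD parameter
       block, applied once per 128-sample buffer. */
    #define BUF 128

    typedef struct { float cutoff, q, state[2]; } params_t;  /* POD per stage */
    typedef void (*stage_fn)(float *buf, int n, params_t *p);

    typedef struct {
        stage_fn fn;
        params_t params;
    } stage_t;

    static stage_t chain[10];   /* 3-10 entries, chosen per buffer */
    static int     nstages;

    static void process_buffer(float *buf) {
        for (int i = 0; i < nstages; i++)
            chain[i].fn(buf, BUF, &chain[i].params);  /* indirect call per stage */
    }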


How many permutations? If not too many, maybe just compile them all.
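
Something like this, as a sketch (assuming two slots filled from stages A/B/C; the fused chain_* functions are illustrative and would be written or generated ahead of time, so each variant gets fully inlined and optimised):

    typedef void (*chain_fn)(float *buf, int n);

    /* One statically compiled, fully inlined function per permutation. */
    static void chain_a_b(float *buf, int n) { /* stage A, then B */ }
    static void chain_a_c(float *buf, int n) { /* stage A, then C */ }
    static void chain_b_c(float *buf, int n) { /* stage B, then C */ }

    static chain_fn pick_chain(int config) {
        switch (config) {  /* one well-predicted branch per buffer, not per sample */
        case 0:  return chain_a_b;
        case 1:  return chain_a_c;
        default: return chain_b_c;
        }
    }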

Sometimes you do need to use function pointers. It's just that if they're in a performance-critical path, you might want to consider alternatives.


The number of permutations in my case is pretty low; it's just a replacement for a handful of switches, and otherwise the flow is pretty straightforward. But I see that if one wants to restructure the whole program flow on the fly, it might lead to problems. What are the alternatives? The link you provided is for graphics and does not seem to be generalised. Do you suggest a complete switch to a JIT-compiled codebase?


Graphics and audio are usually pretty similar, except that the volume of data is usually quite a bit larger for graphics than for audio.

There's no silver bullet for structuring such a flexible, dynamically configurable, high-performance system.


I disagree with 'pretty similar' and would suggest 'challenging in completely different ways'. Graphics in a minimum 60 fps context is about 700 times less time sensitive. That's almost three orders of magnitude of difference. On top of that, it isn't a major failure if you don't meet the 60 fps on time, but in audio it is. Furthermore, audio tends to be more single-thread focused, while graphics is much more multi-thread friendly (see your link to llvmpipe and general GPU architectures). Graphics needs to shovel around some hundreds of megabytes per second and is less time critical, whereas audio is ultimately time critical and only needs to produce about 350 KB per second for 32-bit stereo (the data difference you already mentioned). That's again some three orders of magnitude of difference. I wouldn't call that similar. Since you can't provide any advice for an alternative to function pointers, I must challenge your expertise in audio programming, since you seem to project your view from a graphics programming standpoint.


Unless your hardware has to be fed sample-at-a-time, your deadline isn't 700 times tighter. A fairer comparison would point out that for HD video you have mere nanoseconds per pixel, vs. microseconds per audio sample.

That is, my laptop has ~43 cycles to get out all channels of each pixel, vs. more than 60,000 cycles to get out all channels of each 44.1kHz audio sample.

But the pixels only have to be delivered by screen refresh, and the samples only have to be delivered before the sound card tries to play them - probably 64-1024 samples (or more) in the future.
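
Rough numbers, assuming something like a 2.8 GHz core, a 1366x768 panel at 60 Hz, and a 128-sample buffer (all illustrative figures):

    2.8e9 / (1366 * 768 * 60)  ~  44 cycles per pixel
    2.8e9 / 44100              ~  63,000 cycles per sample
    128 / 44100                ~  2.9 ms of real deadline per buffer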


Well, I haven't done audio programming for quite a while. I did get audio hardware to behave pretty nicely in 1998, though. I believe low-latency audio was much harder back then.

I'm pretty surprised if your deadline is really that 23 microseconds (16.6 ms / 700). You can't even reliably dispatch an IRQ in that time!

> I wouldn't call that similar

If you have a long audio filter chain it's not all that different from a long fragment shader. You can think of a fragment shader as a filter chain.

> Since you can't provide any advice for an alternative to function pointers

Uh. That's pretty hard without seeing the source code and a full problem description. Especially because you seem to be expecting some magical solution. Well, those don't exist; it's just hard work.

Besides, like I said earlier, sometimes function pointers are appropriate for the task. Just understand the cost. Do keep in mind audio filter chains can be pretty complicated. If your filter chain has many concurrent channels with hundreds of dynamic filter steps for each sample to pass through, yeah, you might have some performance problems.

> I must challenge your expertise in audio programming, since you seem to project your view from a graphics programming standpoint.

Be my guest. Just wanted to share something I know about and give new ideas. Btw, I'm not doing graphics, but mostly I/O with micro/millisecond level deadlines.


Getting audio hardware to behave nicely for audio is pretty nice :) The calculation for the time factor difference is simply 44100/60, which is about 735; I don't know how the 23 microseconds you got there relates to this.

Your comparison to graphics seems to relate to long FIR filters, which I don't use for cache reasons (big kernel/lookup table). FIR filters can, however, sound pretty nice with static parameters, but they get extremely expensive for realtime modulation. For realtime modulation I suggest IIR filters (FIR = Finite Impulse Response, IIR = Infinite Impulse Response filter architecture).
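
A biquad is the typical IIR building block, roughly like this (direct form I sketch; computing the coefficients from cutoff/Q via the usual cookbook formulas is omitted):

    /* One biquad IIR stage: 5 multiplies and 4 floats of state per sample,
       instead of a long FIR kernel. Coefficients are assumed normalized
       so that a0 == 1. */
    typedef struct {
        float b0, b1, b2, a1, a2;  /* filter coefficients */
        float x1, x2, y1, y2;      /* input/output history */
    } biquad_t;

    static float biquad_tick(biquad_t *f, float x) {
        float y = f->b0 * x + f->b1 * f->x1 + f->b2 * f->x2
                            - f->a1 * f->y1 - f->a2 * f->y2;
        f->x2 = f->x1;  f->x1 = x;
        f->y2 = f->y1;  f->y1 = y;
        return y;
    }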

Of course it's hard to give advice into the blue without any source, but I at least expected some general hint one level above 'avoid branches' or 'use function pointers', since you were so opposed to function pointers to begin with.

I am also not completely sure what the best approach is here, and I surely haven't done enough profiling in this regard.

That said, I remain sceptical regarding your critique of function pointers in a single-threaded audio application, because of the lack of an alternative.

I/O on a microsecond-critical deadline certainly seems to be in the neighbourhood of these problems. Do you use something arcane like JavaScript/asm.js with some JIT auto-reconfiguring optimisation for this, or how would you describe your approach?


> I don't know how the 23 microseconds you got there relates to this

1 / (60 Hz * 700) = 23.8 microseconds. (Yeah, should have rounded up to 24).

What worked for me in the past was just concatenating (well, memcpying) executable filter blocks of x86 code while ensuring parameters were in the right registers.

Loop start block -> filter 1 -> filter 2 -> filter 3 ... -> loop end block.

I made these blocks by processing specifically compiled (not at runtime, but ahead of time) functions, stripping prologues and epilogues. There's more to it, but it's been a while.

Crude, but effective and fast even when each filter step was pretty small on its own.
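
A toy version of the scaffolding, just to show the shape of it (heavily simplified sketch, x86-64 System V on Linux only; the real blocks were stripped-down AOT-compiled filters, while the byte sequences here are a made-up add-one "filter"):

    #include <string.h>
    #include <sys/mman.h>

    /* Hand-written toy blocks; the real ones came from ahead-of-time
       compiled functions with prologues/epilogues stripped. */
    static const unsigned char blk_add1[] = { 0x83, 0xC7, 0x01 };  /* add edi, 1 */
    static const unsigned char blk_end[]  = { 0x89, 0xF8, 0xC3 };  /* mov eax, edi; ret */

    typedef int (*jit_fn)(int);

    int main(void) {
        unsigned char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        size_t off = 0;
        memcpy(p + off, blk_add1, sizeof blk_add1); off += sizeof blk_add1; /* filter 1 */
        memcpy(p + off, blk_add1, sizeof blk_add1); off += sizeof blk_add1; /* filter 2 */
        memcpy(p + off, blk_end,  sizeof blk_end);                          /* loop end */
        mprotect(p, 4096, PROT_READ | PROT_EXEC);  /* flip the page to executable */
        jit_fn f = (jit_fn)p;
        return f(40);  /* two concatenated "filters": returns 42 */
    }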


So you put the parameters into registers 'by hand' in assembly, and then memcpyed some function to the right address (using a function pointer?) to process these registers? Wow. That sounds pretty complicated. For now I just use gcc and function pointers and let gcc do the rest.

About the 1/(60 * 700): what does that number even mean to you? 1/60 is the usual refresh interval in seconds, about 16 ms; why do you multiply this by 700? That makes no sense to me.


You said earlier:

> Graphics in a minimum 60 fps context is about 700 times less time sensitive.

16.6 ms (60 fps) / 700 = ~24 microseconds.

> So you put the parameters into registers 'by hand' in assembly, and then memcpyed some function to the right address (using a function pointer?) to process these registers? Wow. That sounds pretty complicated. For now I just use gcc and function pointers and let gcc do the rest.

I concatenated blocks of compiled code together (compiled as position independent), avoiding branches and indirect calls.

Loop setup block took care of putting right parameters to right registers.

Hardest part was getting the filter parameter passing right. Dynamic configuration was trivial after that.

There were no function pointers except one to trigger the top level loop.


My original calculation was that in audio you need to deliver about 44,000 values per second, while in graphics you need 60 frames per second - 44000/60 roughly equals 700. That's the difference factor :) (Of course the data volume per delivery and its techniques differ a lot; that's why I suggested these are very different problems.) It's very simple; maybe we got caught up in some misunderstanding.

For your concatenated blocks of precompiled code, maybe you should have used simple function pointers and one monolithic precompiled block to achieve the same result, as I originally suggested.


> For your concatenated blocks of precompiled code, maybe you should have used simple function pointers and one monolithic precompiled block to achieve the same result, as I originally suggested.

That's what the original version was like, but it was way too slow. I eventually achieved about 10x faster performance for typical (dynamic) configurations.


Unless you provide a code example and/or a description of your dev chain, I'm unable to follow you.



