I've spent quite a lot of time working with the Web Audio API, and I strongly agree with the author.
I got pretty deep into building a modular synthesis environment using it (https://github.com/rsimmons/plinth) before deciding that working within the constraints of the built-in nodes was ultimately futile.
Even building a well-behaved envelope generator (e.g. that handles retriggering correctly) is extremely tricky with what the API provides. How could such a basic use case have been overlooked? I made a library (https://github.com/rsimmons/fastidious-envelope-generator) to solve that problem, but it's silly to have to work around the API for basic use cases.
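To give a flavor of the problem, here's roughly the cancel-and-reschedule dance that retriggering forces on you (a minimal sketch, not how fastidious-envelope-generator actually does it):

```javascript
// Minimal sketch of retriggering an attack/release envelope on a GainNode's
// gain AudioParam. The awkward part: cancelScheduledValues doesn't "hold"
// the current value, and reading .value mid-automation isn't reliable.
function gateOn(ctx, gainParam, attackTime) {
  const now = ctx.currentTime;
  gainParam.cancelScheduledValues(now);            // drop automation from the previous trigger
  gainParam.setValueAtTime(gainParam.value, now);  // re-anchor the curve (unreliably)
  gainParam.linearRampToValueAtTime(1.0, now + attackTime);
}

function gateOff(ctx, gainParam, releaseTime) {
  const now = ctx.currentTime;
  gainParam.cancelScheduledValues(now);
  gainParam.setValueAtTime(gainParam.value, now);
  gainParam.linearRampToValueAtTime(0.0, now + releaseTime);
}
```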
Ultimately we have to hold out for the AudioWorklet API (which itself seems potentially over-complicated) to finally get the ability to do "raw" output.
I will second this. I wanted to make a live streaming playback feature using the API so I could remotely monitor an audio matrix/routing system that I have in the office.
The API has _zero_ provision for streaming MP3. You either load and playback a complete MP3 file or you get corrupted playback because the API simply won't maintain state between decoding calls.
What I ended up having to do was write a port of libMAD to JavaScript and then use that to produce a PCM stream, which I _could_ then convert into an AudioBuffer, attach a timer, and then send into the audio API for correct playback.
Which is an insane amount of work for a gaping oversight in a common use case of the API; a simple flag in the browser's native decoder would've sufficed.
Did you look into Media Source Extensions[0,1]? Fetching and playing the various audio formats is a bit outside the purview of Web Audio. But you can feed streaming MSE into Web Audio. If I recall, you use Web Audio's `AudioContext.createMediaElementSource()` to use a (potentially chunked) MSE source with web audio, but it's been a while since I did this.
That said, Media Source Extensions (MSE) is only supported on relatively modern browsers (IE11+) but you should be able to use it to stream mp3 to the Web Audio API on supported browsers.
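From memory, the rough shape of the MSE route is something like this (a sketch; whether addSourceBuffer accepts 'audio/mpeg' varies by browser, and the endpoint here is hypothetical):

```javascript
const audioEl = new Audio();
const mediaSource = new MediaSource();
audioEl.src = URL.createObjectURL(mediaSource);

mediaSource.addEventListener('sourceopen', async () => {
  const sourceBuffer = mediaSource.addSourceBuffer('audio/mpeg');
  const resp = await fetch('/stream.mp3');          // hypothetical streaming endpoint
  const reader = resp.body.getReader();
  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    sourceBuffer.appendBuffer(value);               // append the next chunk
    await new Promise(r =>                          // wait until the append settles
      sourceBuffer.addEventListener('updateend', r, { once: true }));
  }
});

// Hand the element to Web Audio for any further processing.
const ctx = new AudioContext();
ctx.createMediaElementSource(audioEl).connect(ctx.destination);
audioEl.play();
```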
There's also a way to do this without using MSE for older browsers. See the 72lions repo below for an example[2]. It's a bit convoluted, but not as much work as your workaround. As described in the README of the 72lions proof-of-concept:
"The moment the first part is loaded then the playback starts immediately and it loads the second part. When the second part is loaded then then I create a new AudioBuffer by combining the old and the new, and I change the buffer of the AudioSourceNode with the new one. At that point I start playing again from the new AudioBuffer."
I think it's more like "fetching and displaying various image formats is outside the purview of HTML5 canvas".
If you want to just show an image, you use an <img> tag, or just play an audio file you use <audio>. Canvas and the Web Audio APIs are for pages that want to make or mix their own images/audio. Though to be fair, html/javascript do make it easy to load image data from an image tag directly into a canvas; maybe there's a missing parallel for audio.
If I recall correctly (we did that project a year and a half ago), MSE either wasn't available at the time or the latency was entirely unacceptable. I should have noted that with the setup I described above we are able to achieve <150ms of latency in most cases; and as the system also allows remote control of matrix sources and mixers, the low latency was required in order to accurately manipulate the system under certain working conditions.
The MP3 issues don't end there, which is something the article touches on obliquely: you can't reuse many of the important constructs you might want to.
Here's my use case. I have a couple of games (https://arcade.ly/games/starcastle, https://arcade.ly/games/asteroids), each of which has three pieces of music: title screen, in game, and game over. If you play the game a couple of times you're going to hear the title screen audio probably once, in game twice or more (because it loops from the beginning after every playthrough), and game over twice. To put it simply: I need to play the same MP3s multiple times each.
To play an MP3 you have to decode it, which is an expensive operation. Firstly it takes time to decode - enough time that the user will notice the lag even on a fast machine. However the main problem is the amount of memory use: decoding takes you from a couple of MB of compressed MP3 to potentially hundreds of MB of uncompressed audio. The problem worsens for multiple tracks.
I discovered the memory issues via Chrome Task Manager, when I noticed my page using hundreds of MB of native memory, and traced this usage back to the music. You can often get away with this when running on a desktop browser, but not so much on mobile.
You can mitigate the memory issue to some extent by dropping the sample rate of your uncompressed PCM audio to 22.05KHz, which obviously halves its uncompressed size. Quality starts to suffer too much for music if you go much below this though. (Note here that I'm talking about the uncompressed sample rate, and NOT the MP3 bitrate. A 44.1KHz MP3 encoded at 64Kbps and one encoded at 128Kbps will decompress to the same size, although the 64Kbps version will obviously sound worse because more information will have been lost.)
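One way to get the lower rate (hedged: decodeAudioData behaviour on offline contexts has varied between browsers over the years) is to decode inside an OfflineAudioContext constructed at 22050Hz, since decoding resamples to the context's rate:

```javascript
// Decode compressed audio into a 22.05kHz AudioBuffer. The length argument
// only sizes the offline context, so a rough duration estimate is fine.
async function decodeAt22k(encodedArrayBuffer, approxSeconds) {
  const offline = new OfflineAudioContext(2, Math.ceil(22050 * approxSeconds), 22050);
  return offline.decodeAudioData(encodedArrayBuffer);
}
```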
But the inability to reuse a source buffer, which holds compressed audio, is absolutely aggravating, and something I've posted at length about here: https://github.com/WebAudio/web-audio-api/issues/1175. The reason you might want to do this is because it means you're only using as much memory as the compressed audio takes up and (hopefully) the rest will have been freed by the browser's runtime (no guarantees, obviously).
The downside of this approach is that you can't start a piece of music at a defined instant, which is extremely frustrating when you might want to synchronise it with events happening on screen.
Also, due to the re-decoding every time, and the asynchronous nature of such, I've now introduced a weird bug where it's possible to end up with both title and in game music playing at the same time if the user starts the game before decoding the title music is complete. It's fixable (although I haven't had time yet), but it's just one more irritation with a poorly designed API.
I'm actually thinking of going back to using the good old HTML5 AUDIO element just for playing music, since it seems a bit more reliable, but I need to do some experimentation to see what the memory impact is. I also had issues with AUDIO misbehaving quite badly in Firefox with multiple sounds playing simultaneously.
Sound effects are less of an issue because they're obviously quite short and therefore don't take an excessive amount of memory even when uncompressed, so I can at least keep buffer sources around for them. Nonetheless the API's excessive complexity shows through even here: why is it such a drama just to play a sound? Why do I need to create and connect a bunch of objects together just to play a single sound at a given volume? Ridiculous. Asinine.
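For the record, this is the ceremony I'm talking about just to play one already-decoded sound at a chosen volume (sketch):

```javascript
function playSound(ctx, audioBuffer, volume) {
  const source = ctx.createBufferSource();   // one-shot, throwaway node
  source.buffer = audioBuffer;
  const gain = ctx.createGain();
  gain.gain.value = volume;
  source.connect(gain);                      // source -> gain -> speakers
  gain.connect(ctx.destination);
  source.start();
}
```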
I tend to think of the Web Audio API as the answer to the question: "how much of an audio API can you have if you stipulate that all user-specified code must run in the UI thread?".
Within that constraint I don't think it's a terrible API, but it's a big constraint and naturally raw access would be far preferable.
Yes.. after I wrote my comment I was feeling a bit bad for sounding like I was just trashing the API. In a world where JS is slow and there is no worker thread machinery, yet you need low latency and flexible processing, the design makes more sense.
That being said, the AudioParam "automation" methods still make me want to cry.
Yeah, AudioParam's refusal to interpolate anything makes it really hairy to work with.
Comments on the spec suggest that there was something really complicated about the "cancelAndHold" method (which I guess is still in NYI limbo), but I can't for the life of me figure out what it was.
The API is frustrating because it is meant to hide the fact that Android audio sucks giant hairy donkey balls.
If you give Web developers access to raw samples, they are going to expect it to work. When it doesn't on Chrome on Android, lots of people are going to start complaining and filing bugs.
So, instead of fixing the audio path, they decided to bury its crappiness under a "higher-level" API which has fuzzier latency and can be built with hacks in the audio driver stacks themselves.
Android audio is truly terrible for instrument apps. I don't understand how it suffices for things like games. I also don't understand why people even bother to make things like pianos and drum set… The latency is so extreme and inconsistent that even on recent phones they are useless. In contrast, iOS has had excellently playable instruments at least as far back as the iPod Touch 4.
But Chrome for Android didn't come out until 2012 and Chris Rogers started the Web Audio work in 2009. I think someone would have had to be exceptionally farsighted to think "Android's audio stack is going to suck for several years so we need to design around that now".
At that point in time, though, you could be forgiven with "Sheesh. Javascript is so painfully slow that nobody will ever pass PCM samples around in it."
So, at every point in time up to and including now, you've always got something resisting low-latency PCM. Android is just the latest reason.
Side note: it looks like Chris Rogers bowed out of Web Audio about 2012/2013 timeframe.
AAudio is a new C API. It is designed for high-performance audio applications that require low latency. It is currently in the Android O developer preview and not ready for production use. (Jun 2017)
Until this changes, the media APIs will lag as G attempts to maintain parity with other orgs. Google makes products for Google devs and incidentally for the world to use.
This demo doesn't keep a straight 120 BPM on my machine; it's incapable of holding the rhythm after 10 seconds of playback (I tried the first patch on the left, Edge browser).
That's unfortunate. I don't have a machine running Edge to test it on. It uses less than 20% of the CPU in Chrome on my 2012 Macbook Air. It's possible that it's a problem with my code, but in general the Web Audio API does not have very good cross-browser support.
Thanks! After a certain point it felt like a dead end to me, so I dropped it in favor of exploring something more along the lines of a JS-based Max/MSP. But it is surprising how much fun can be had with the small number of modules available in Plinth.
My first lesson in this was the Roland MPU-401 MIDI interface. It had a "smart mode" which accepted timestamped buffers. It was great... if you wanted a sequencer with exactly the features it supported, like say only 8 tracks. It was well-intentioned, because PCs of that era were slow.
The MPU-401 also had a "dumb" a.k.a. "UART" mode. You had to do everything yourself... and therefore could do anything. It turned out that early PCs were fast enough -- especially because you could install raw interrupt service routines and DOS didn't get in the way. :)
As a sequencer/DAW creator, you really want the system to give you raw hardware buffers and zero latency -- or as close to that as it can -- and let you build what you need on top.
If a system is far from that, it's understandable and well-meaning to try to compensate with some pre-baked engine/framework. It might even meet some folks' needs. But....
IIRC games made pretty good use of MPU-401 intelligent mode to drive the MT-32 module. The first really elaborate game scoring work on the IBM platform came through the MT-32 (Sierra picked it up and everyone else followed - it was a good target for composers, but in practice most people heard the music on Adlib/SB), so I would consider it successful in that niche.
And on that note, what I think Web Audio tried to be was a drop-in kit for game engines. Getting the full functionality of Unreal into the browser motivated the requirement for audio processing. But the actual implementation was muddled from the start: basic audio playback remains challenging (try to stream a BGM loop instead of load+uncompress and discover to your woe that it's not going to loop gaplessly, even when the codec is designed to allow that), and my hobby stab at an independent implementation ran out of gas when I tried to get their envelope model working. The spec has a lot of features but not enough detail, and my morale sank further when I looked at how Chrome did it (stateful pasta code). I got something half-working, put it aside and never came back.
OTOH I had also tried Mozilla's system. That was very simple, and I got a synth working in no time at all with decent performance and latency. Optimizing from that point would have been the way to do it, but something in browser vendor politics at that time led to it being dropped.
How do you feel about MIDI still being with us today, and still prone to misuse!? I mean, in the context of the original article, it's quite clear that there is a deeper lesson to be learned .. especially when compared with a near-40-year-old technology which is still in use today.
(If I could return to Cakewalk, I would. Wrote some of my best tracks with that little ISR of yours!)
Here's my take on the history here:
http://robert.ocallahan.org/2017/09/some-opinions-on-history...
From the beginning it was obvious that JS sample processing was important, and I tried hard in the WG to make the Web Audio API focus on that, but I failed.
Back then I followed the discussion and your alternative spec a bit, which was really interesting. If I remember well, it would take a lot of work and you were the only one working to get it implemented in FF. Are there any plans to get back to that? Maybe as an independent API for JS sample processing by workers only, in parallel with WA? Congratulations on your past efforts and thanks in advance for your answers.
I've been heavily into procedural audio for a year or two, and have had no big issues with using Web Audio. There are solid libraries that abstract it away (Tone.js and Tuna, e.g.), and since I outgrew them working directly with audio nodes and params has been fine too.
The big caveat is, when I first started I set myself the rule that I would not use script processor nodes. Obviously it would be nice to do everything manually, but for all the reasons in the article they're not good enough, so I set them aside, and everything's been smooth since.
So I feel like the answer to the article's headline is: today, as of this moment, the Web Audio API is made for anyone who doesn't need script nodes. If you can live within that constraint it'll suit you fine; if not it won't.
(Hopefully audio worklets will change this and it'll be for everyone, but I haven't followed them and don't know how they're shaping up.)
The problem is not that Web Audio doesn't do useful things. The problem is that it's a terrible foundation to build applications on top of, because it only solves a tiny set of use cases. This results in way too many people needing script nodes.
Other proposals for audio APIs solved a wider set of use cases, while also making it possible to do procedural audio without depending on browser vendors to implement key features for you.
For music software you would need script nodes right away.
There isn't one single way to implement filters, compressors, etc. Perhaps that's not the focus of the API.
If you're a creative mind and you constrain yourself to the effects available in Web Audio, I'm sure you'll feel right at home.
The effects are useful in one setting: hobbyist and toy usage, where you really don't have that many constraints and can play with whatever cool things are around. That said, I'm sure you'd actually get a lot more mileage out of a library of user-made script nodes, rather than whatever the browsers have built for you.
If you're trying to build something production-ready, or port an existing system to the web, most of the fun toys seem like just that: toys.
AudioWorklets don't look like they would improve things for me, but that's a topic for another blog post.
I didn't say I was making a trivial toy that isn't production ready. :| I just said I'm not using script nodes, and I think that's what TFA boils down to - half of it is about script nodes not being usable and the other half is about sample buffers not being suitable replacements for script nodes.
And obviously not having raw script access isn't a good thing. Nonetheless, the other nodes mostly work as advertised, in my limited experience so far, so the stuff that you'd expect to be able to do with them (e.g. FM/AM synthesis) seems to work pretty well.
> AudioWorklets don't look like they would improve things for me, but that's a topic for another blog post.
AFAIK worklets are supposed to be script processor nodes that work performantly. They wouldn't solve the sample rate problems mentioned in TFA but apart from that I'd think they should be pretty usable if they someday work as advertised.
Agreed that you can do much more than "trivial toys" with the current WAAPI! But you can only do a small part of FM without feedback (unless you're just talking about vibrato as opposed to canonical FM synthesis). Look at the modulation paths (aka algorithms) of original Yamaha FM synths...
Stopped reading at: "Something like the DynamicsCompressorNode is practically a joke: basic features from a real compressor are basically missing, and the behavior that is there is underspecified such that I can’t even trust it to sound correct between browsers. "
> Which are indeed the basics that you need and totally enough for most use cases.
However, I can take your "simple" compressor and swap it out of my audio chain for a more complex one if I need to.
I can't do that for the Web Audio API. That's really what everybody is complaining about.
The problem is that if your use case only covers 95% and I use 10 pieces, I am practically guaranteed to have a mismatch for multiple pieces--and I can't escape.
Regardless of the other valid reply to this question, the implication that sidechain is a fundamentally basic thing for a compressor is questionable. Sidechains are extremely useful for many cases, but there's clearly tons of applications of compressors that don't use sidechains. It's not like a sidechainless compressor is unusable.
I always thought some browser vendors who own mobile app stores wouldn't appreciate gamers having access to a distribution channel for great games on their platform that they didn't control. You can't have great games without great sound, so their mucking up the Sound API would be a nice way to stall the emergence.
It's a conspiracy theory, I know. Reality is probably far more boring and depressing. :/
Like the blog poster, I cut my teeth on the Mozilla API, and I was able to get passable sound out of an OPL3 emulator in a week's time. Perhaps Mozilla could convince other browser vendors to adopt their API in addition to the Web Audio API?
My theory is that Google used its influence to hinder the API so they could work around the problems with Android's audio stack. They pushed for an API they knew they could get to work on Chrome for Android, rather than fixing Android (which is supposedly improved in 8.0).
Some person working at Google didn't know or care about Android? It doesn't seem all too unlikely that while he personally didn't care, his corporate overlords told him to work within the constraints of Android.
Chris Rogers started implementing his API in 2009, about three years before Chrome for Android was first released.
And do you seriously think in 2009 some corporate overlord said to Chris Rogers, "Android is going to be big, and so is Chrome for Android, but we've decided Android will have a crappy audio stack for several years, so you need to design around that"?
I tried making a simple Morse code trainer using the Web Audio API, which seemed perfectly suited to the task, but I ran into two major problems:
1. Firefox always clicks when starting and stopping each tone. I think that's due to a longstanding Firefox bug and not the Web Audio API. I could mostly eliminate the clicks by ramping the gain, but the threshold was different for each computer.
2. This was the deal-breaker. Every mobile device I tested had such terrible timing in JavaScript (off by tens of milliseconds) that it was impossible to produce reasonably correct-sounding Morse code faster than about 5-8 WPM.
I found these implementation problems more frustrating than the API itself. At this point I'm pretty sure the only way to reliably generate Morse code is to record and play audio samples of each character, which wastes bandwidth and can be done more easily without using the Web Audio API at all.
> Firefox always clicks when starting and stopping each tone. I think that's due to a longstanding Firefox bug and not the Web Audio API. I could mostly eliminate the clicks by ramping the gain, but the threshold was different for each computer.
You sure it is not due to the sound files you are using not having a normalized start?
There were no sound files. I used an oscillator and changed the gain with `setTargetAtTime` to briefly fade in and out. That should prevent any clicking, but in Firefox it required an excessive amount of time.
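For reference, the setup was essentially this (sketch; the time constant shown is a placeholder, and the problem was that Firefox needed a much larger value than other browsers to stop clicking):

```javascript
const ctx = new AudioContext();
const osc = ctx.createOscillator();
const gain = ctx.createGain();
osc.frequency.value = 600;          // typical Morse sidetone pitch
gain.gain.value = 0;                // start silent
osc.connect(gain);
gain.connect(ctx.destination);
osc.start();

// Key the tone on/off by ramping the gain toward 1 or 0.
function key(on, time) {
  gain.gain.setTargetAtTime(on ? 1 : 0, time, 0.005);  // ~5ms time constant
}
```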
Is that actually a bug? If you start a tone instantly without ramping the volume, the first sample should be relatively loud, which will click. It is reasonable to want and even expect a different behavior, but it might not be what is specced.
Yes, to clarify, I was using `setTargetAtTime` to ramp the volume, but it was still clicking in Firefox without an extra hack. At the time I found a bug report that appeared to point to the cause (something related to internal audio buffers that Firefox uses), but I'm not finding it at the moment. It could have been fixed.
This article focuses on emscripten examples and for good reason! The effort to resolve the differences between OpenAL and Web Audio has been on-going and exacerbated by Web Audio's API churn, deprecations and poor support.
I'm one of the authors of this PR, and yes, WebAudio's baffling lack of proper consecutive buffer queuing has been no small source of frustration. They seem to have put so much effort into adding effects nodes and other such things, but something as simple as scheduling one sound to play gaplessly after another can't be (easily) done. Requests for such support have been batted aside as unnecessary, which is funny considering where all the effort is going instead.
To do it properly would require just giving up on WebAudio's features completely and doing all the mixing in software via WebAssembly. Honestly though, if you're going to do that, you may as well just compile OpenAL-Soft with emscripten and use that, so I opted to just try to get the best out of WebAudio that I could. Hopefully it's good enough.
I put some weekends into trying to build a higher-level abstraction framework of sorts for my own sound art projects on top of Web Audio, and it was full of headaches for similar reasons to those mentioned.
The thing that I put the most work into is mentioned here, the lack of proper native support for tightly (but prospectively dynamically) scripted events, with sample accuracy to prevent glitching.
Through digging and prior work I came to a de facto standard solution using two layers of timers: one in WebAudio (which supports sample accuracy but gives you no hook to e.g. cancel or reschedule events), and one using coarse but flexible JS timers. Fugly, but it worked. But why is this necessary...!?
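The shape of the hack, assuming an existing AudioContext `ctx` and an event queue filled in elsewhere, was roughly:

```javascript
// Coarse JS timer wakes up regularly and hands anything due within the
// lookahead window to the sample-accurate Web Audio clock. Cancelling or
// rescheduling means tracking the created nodes yourself.
const LOOKAHEAD_SEC = 0.2;
const TICK_MS = 50;
let queue = [];   // [{ time, buffer }], populated elsewhere

setInterval(() => {
  const horizon = ctx.currentTime + LOOKAHEAD_SEC;
  queue = queue.filter(ev => {
    if (ev.time >= horizon) return true;   // not due yet: keep it queued
    const src = ctx.createBufferSource();
    src.buffer = ev.buffer;
    src.connect(ctx.destination);
    src.start(ev.time);                    // sample-accurate start time
    return false;                          // handed off to the audio clock
  });
}, TICK_MS);
```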
There's a ton of potential here, and someone like myself looking to implement interactive "art" or play spaces is desperate for a robust cross-platform web solution, it'd truly be a game-changer...
...so far Web Audio isn't there. :/
Other areas I wrestled with:
• buffer management, especially with CORS issues and having to write my own stream support (preloading then freeing buffers in series, to get seamless playback of large resources...)
• lack of direction on memory management, particularly, what the application is obligated to do, to release resources and prevent memory leaks
• the "disposable buffer" model makes perfect sense from an implementation view but could have easily been made a non-issue for clients. This isn't GL; do us some solids yo.
I had a discussion on Twitter recently about a possible use case for WebAudio - and that was sound filters - in pretty much the same way as Instagram popularised image filters for popular consumption.
One thing that really irks me at the moment is the huge variation in sound volume across the increasing plethora of videos in my social media feed. If there were some way we could use real-time WebAudio manipulation in the browser to equalise the volume on all these home-made videos, so much the better. Not just volume up/down, but things like real-time audio compression to make vocals stand out a little.
Add delay and reverb to talk tracks etc. for podcasts.
EQ filters to reduce white noise on outdoor videos etc. also would be better. People with hearing difficulties in particular ranges, or who suffer from tinnitus etc. would be able to reduce certain frequencies via parametric equalisation.
It would be intriguing to see a podcast service or SoundCloud etc. offer real time audio manipulation, or let you add post processing mastering effects on your audio productions before releasing them in the wild.
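The plumbing for that kind of thing is at least easy to sketch (the element lookup and parameter values here are only illustrative):

```javascript
const ctx = new AudioContext();
const videoEl = document.querySelector('video');        // hypothetical feed video
const source = ctx.createMediaElementSource(videoEl);   // audio now routes through us

const comp = ctx.createDynamicsCompressor();
comp.threshold.value = -30;   // dB; squash the loud clips
comp.ratio.value = 6;
const makeup = ctx.createGain();
makeup.gain.value = 1.5;      // bring quiet clips back up

source.connect(comp);
comp.connect(makeup);
makeup.connect(ctx.destination);
```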
The huge set of deficiencies in this API were communicated to the designers from the very beginning, and unfortunately most of them went unresolved for a long time (or indefinitely). It's a real bummer.
For a while there was a huge footgun that made it easy to synchronously decode entire mp3 files on the ui thread by accident. Oops (:
Even better, for a while there was no straightforward way to pause playback of a buffer. It took a while for the spec people to come around on that one, because they insisted it wasn't necessary.
I'm running a SaaS built on the back of the Web Audio + WebRTC apis. While it isn't perfect at all, it is still pretty impressive what progress has been made in the last few years allowing you to do all kinds of audio synthesis and processing right in the browser. It seems to me that it is a pretty general purpose api in intent. The approach seems to be to do the easy low hanging fruit first and then get to the more complicated things. This doesn't satisfy any single use case quickly but progress is steady. No doubt it would be nice if it was totally capable out of the gate but I'm simply happy that even the existing capabilities are there. Be patient, it will improve vastly over time.
EDIT: I should also add that the teams behind the apis are quite responsive. You can make an impact in the direction of development simply by making your needs/desires known.
I worked on a (now abandoned) project a while back using Web Audio API, but it was NOT for Audio at all - in fact, it was to build a cross platform MIDI controller for a guitar effects controller.
As someone mentioned elsewhere on this thread Android suffered from a crappy Audio/MIDI library. iOS's CoreMIDI was great, but not transportable outside of iOS/OSX. Web Audio API's MIDI control seemed a great way to go - just build a cross platform interface using Electron App and use the underlying WebAudio to fire off MIDI messages.
Unfortunately, at the time of developing the project, WebAudio's MIDI SYSEX spec was still too fluid or not completely defined, so I had trouble sending/reading SYSEX messages via the API, and thus shelved the project for another day.
It's more about expressing events in time than audio necessarily... Sometimes it's used just to keep other devices in sync with a tempo, sometimes it's used to control lights, and -sometimes- it tells an actual audio synthesizer when to start and stop making noise.
In 99% of occurrences, yes, Audio and MIDI can be intertwined, but in this particular project, I was using MIDI CC and PC messages to change preset and parameter settings on a rack mounted effects processor.
Oh, and we needed to use SYSEX a LOT in order to intercept clock timing messages, as well as complex data like preset names and multi parameter effect settings (EQ etc.). None of the messages sent/received affected music notes at all - it was all setting configuration only.
Not really, the full range of human hearing is over 120 dB. Getting to 120 dB within 16 bits requires tricks like noise shaping. Otherwise, simple rounding at 16 bits gives about 80 dB and horrible-sounding artifacts around quiet parts.
It's even more complicated in audio production, where 16 bits just doesn't provide enough room for post-production editing.
This is why the API is floating-point. Things like noise shaping need to be encapsulated within the API, or handled at the DAC if it's a high-quality one. (Edit) There's nothing wrong with consumer-grade DACs that are limited to about 80-90 dB of dynamic range; but the API shouldn't force that limitation on the entire world.
In the same vein, sure, the audio production nodes should use floating point, but for simple playback, which I'd argue is the 90% case, it shouldn't require me to use floats. Real-time audio toolkits like FMOD and Wwise all work in fixed-point formats on device, because the cost of floats is too expensive for realtime audio.
The floats are only required if you have a complex audio graph -- with a sample-based API, you can totally do the production in floats, and then have a final mix pass which does the render to an Int16Array. All in JavaScript.
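That final pass is a few lines (sketch):

```javascript
// Clamp the float mix to [-1, 1] and quantize down to signed 16-bit.
function renderToInt16(floatMix /* Float32Array */) {
  const out = new Int16Array(floatMix.length);
  for (let i = 0; i < floatMix.length; i++) {
    const s = Math.max(-1, Math.min(1, floatMix[i]));
    out[i] = Math.round(s * 32767);
  }
  return out;
}
```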
> because the cost of floats is too expensive for realtime audio
Round(Sample * 32767) is really that slow?
If you're doing integer DSP, you still need to deal with 16 -> 24, or 24 -> 16 overhead; and then the DAC still is converting to its own internal resolution. (Granted, 16 <-> 24 can be simple bit shifting if aliasing is acceptable.)
I think the whole point is that Javascript used to be slow, and using the CPU as a DSP to process samples prevents acceleration. Seems to me what is needed is like "audio shaders" equivalent to compute/pixel shaders, that you farm off to OpenAL-like API which can be compiled to run on native HW.
Even if you grant emscripten produces reasonable code, it's still bloated, and less efficient on mobile devices than leveraging OS level DSP capability.
Modern CPUs actually make decent general purpose DSPs for audio, so I'm not sure what kind of hardware acceleration you expect. Parallelization? Audio processing is not as extremely parallelizable as video. It's somewhat parallelizable. In a classic audio filter function, each sample affects the sample that comes immediately after it, so it's not embarrassingly parallel. The best you can do is, for example, parallelize multiple independent filters, so CPU SIMD instructions often turn out to be a decent fit.
As a side note, for some common audio DSP tasks, you could presumably take better advantage of highly parallel processing by doing a Fourier transform and working in the spectral domain. There has been research to do this on GPUs and it works. However, if you do this you'll have high latency, and it's not a hardware problem, it's inherent to the FFT algorithm, so it's kind of a dead end for many applications.
While it sort of looks like that, I don't know of any non-software implementations of the API (they use some SIMD at most). The problem is that 48,000 samples per second really isn't that much, even with very complex processing. When you compare it to a 1080p screen which has to process at least 373,248,000 samples per second, it hardly seems worth it to even spin up a shader.
It is more or less based on OS X audio, yes. The author of the Web Audio spec previously was an architect on Core Audio at Apple. He basically moved over to Google, implemented his chosen subset of Core Audio in webkit, and shipped it prefixed. Then the evangelism group got big players like Rovio to ship apps that depended on the half-baked prefixed API so it was the de-facto standard for game SFX on the web.
Interesting! It's surprising to me that Apple (or Google?) is behind such a poor API, after Apple did such a good job with both Canvas and CSS animation.
Even 'idiomatic' Javascript is plenty fast to generate audio samples, not to mention asm.js or WebAssembly. The latency / non-realtime nature of the 'browser loop' is the main problem (not being able to generate new sample data exactly when it is needed).
"BufferSourceNode" is intended to play back samples like a sampler would. The method the author proposes of creating buffers one after the other is a bizarre solution.
I picked a 440Hz sine wave because I didn't want to write a more complex demo example, knowing full well someone would nitpick this.
Please use your imagination and try to imagine one of infinitely many other streams that I could make at runtime that are not easily made with the built-in toy oscillators.
It's a higher level API and you're deliberately ignoring all of its higher level features and concentrating on the part that clearly is underdeveloped. Maybe you should use your imagination instead of putting a square peg in a round hole?
I think you're missing the point entirely. It's like a modular synthesiser. It's not "serious business" but this is the browser after all.
Plug a few oscillators into each other and you have an FM synth. Feed delays into each other, etc, etc. You can do that in a few lines of code with no dependencies.
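For instance, a minimal two-operator FM patch is just (numbers arbitrary):

```javascript
const ctx = new AudioContext();
const carrier = ctx.createOscillator();
const modulator = ctx.createOscillator();
const modGain = ctx.createGain();

carrier.frequency.value = 220;
modulator.frequency.value = 110;
modGain.gain.value = 100;               // modulation depth in Hz

modulator.connect(modGain);
modGain.connect(carrier.frequency);     // modulate the carrier's frequency
carrier.connect(ctx.destination);

modulator.start();
carrier.start();
```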
To me, that's a huge potential audience.
If you want an array of samples and depend on dozens of JS libs for functionality, well, I'm sure the AudioWorkers will catch up eventually.
The ability to build primitive synths in a few lines of code (w/o library dependencies) is fine and well, but should not have been a priority for becoming a web standard. What's desperately needed is a well-designed low-level API. That could have been done years ago, and then if there was still sufficient demand for built-in nodes, those could have been added later.
As far as potential audience, in the time I've spent lurking in the Web Audio community, it seems like developers fall into one of two camps: 1) building toy projects for their own edification/learning, and happy to have the Web Audio API 2) trying to build a serious product (DAW, game, whatever) and super frustrated with the API. It seems pretty clear to me that end-users would be much better off if camp 2 had a good low-level API to work with.. camp 1 is not making much that gets used by end-users.
> It's not "serious business" but this is the browser after all.
Modern JS performance is actually quite good, and WebAssembly is only going to make it better. I think you underestimate the potential of audio processing in the browser.
> It's like a modular synthesiser.
I own hardware modular synths, and I built a proof-of-concept modular synth environment using the Web Audio API (https://github.com/rsimmons/plinth). The API makes it hard to build even simple things like well-behaved envelope generators or pitch quantizers. So even if you viewed the API as a sort of code-level modular synth environment, it's pretty unsuited to anything beyond trivial use cases.
The browser-based experimental/modular audio stuff that has any traction (e.g. https://github.com/charlieroberts) doesn't use the built-in nodes for these reasons.
> It's not "serious business" but this is the browser after all.
This answer would have made sense to me circa 2003, but I cannot fathom it today. The web started as "let's put academic papers online". It moved to "let's put magazines online" with some modest interactivity via forms.
But we've spent the last 10 or 15 years turning the web into a "you can do anything" platform. There's been huge progress in interactivity and visuals. There's no a priori reason audio should lag so far behind.
There has been huge progress in visuals, but the web is basically still a document and content delivery system decorated with a few animation features, not a full-fat creative multimedia OS.
And that could be because there are institutional forces keeping it at a certain level of clunkiness, which is far short of the requirements of professional media creators who work with video, 3D, and sound.
I suspect the real problem is that the walled-garden corporates don't want the web to compete with their lucrative app farming operations.
Unless that changes, the creative edges of the web will remain dumbed down.
And it probably won't change. Ever.
If that's correct, the original question has a simple answer: Web Audio is designed to meet the corporate requirements of Apple and Google, not the needs of web users or web developers.
If Google really wanted to limit the power of Web Audio in order to boost Android apps, you would think they would have made a usable Android audio API as well...
> Please use your imagination and try to imagine one of infinitely many other streams that I could make at runtime that are not easily made with the built-in toy oscillators.
Somebody already did. Check out Fourier Theory. The oscillators (well actually just sin, the rest will give you some help as well) can be used to make any stream, technically.
Ah yes, just run a square wave at twice the range of human hearing, then use a combination of filters to extract the desired ranges. Unfortunately you'd also need an infinite arrangement of filters.
I propose a challenge then. I'll give you an arbitrary stream in the format of a .wav, and you send me code, in JavaScript, that uses an infinite series of OscillatorNodes to reproduce that stream completely accurately.
I won't even make you write the code that generates the stream, I'll just require that one sample.
If you can do it, I'll give you $1000 out of my own pocket.
If you're not sampling at an infinite sample rate, you don't need an infinite series of nodes.
Given Nyquist, a non-infinite series of [non-trivial power of 2] nodes will do the job just fine.
Audio FFT/iFFT processing typically uses 512 or 1024 bins, although sometimes you can get away with fewer.
In practice, iFFT doesn't use discrete oscillators for resynthesis, because the whole point of iFFTs is to limit the amount of work you have to do.
But there's no reason in theory a limited number of oscillators couldn't do the job. (I've done resynthesis like this in SuperCollider when I wanted special effects that FFTs can't produce.)
I doubt WA would be fast enough on most machines, but it would probably be possible - if rather dumb - on powerful hardware.
I'm well aware of discrete fourier transforms, and that we're dealing with a ~20KHz bandlimited signal. My original "infinite series" reply was a joke poking at the ridiculous idea of using iFFT for sample playback that seemed to get interpreted as a serious response by kruhft.
I mean, this is like saying "I can draw a lighted 3D cube with these 10 lines of OpenGL immediate mode, what do I need this 500 line Vulkan example for". And that's true!
It also misses the point, because, as with Vulkan, you just want a stable, sane, fast low-level API to access the hardware, because OpenGL immediate mode doesn't get you beyond kindergarten in today's computer graphics. In audio, that is a sample-level API. Everything else should be handled by the application!
You can still make a source/sink directed graph system with components like "oscillators". In a fricking library!
> The method the author proposes of creating buffers one after the other is a bizarre solution.
I used the same solution when I tried to perform realtime audio streaming from a daemon on an embedded device to a browser (which is probably an even more realistic use case for a browser audio API than generating sine waves). I basically stumbled over the same issues as the author: a deprecated ScriptProcessorNode and high-level APIs which don't help me (like the oscillator one).
In the end I opted for a very similar solution as the author: whenever I got enough samples through the websocket (I encoded them simply as raw 16-bit samples there) I created a BufferSource, copied all samples in there (with conversion to floating point), and enqueued the buffer for playback at the position where the last buffer finished.
I really didn't expect that to work well due to all the overhead of creating and copying buffers, and due to the uncertainty over whether the browser will switch between 2 buffers without missing samples. But surprisingly it worked and did the job. I included a buffering of 200ms, which means I only started playback after 200ms, in order to be able to receive more data in the background and have a little bit more time to append further buffers. I experimented a little bit with that number but can't remember how low the lower limit was before getting dropouts regularly. It definitely wasn't usable for low-latency playback.
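A stripped-down sketch of that approach (the real code also handled channel count and the initial buffering more carefully):

```javascript
let nextStartTime = 0;

function enqueuePcmChunk(ctx, int16Samples, sampleRate) {
  // Convert the incoming 16-bit PCM chunk into an AudioBuffer.
  const buffer = ctx.createBuffer(1, int16Samples.length, sampleRate);
  const ch = buffer.getChannelData(0);
  for (let i = 0; i < int16Samples.length; i++) {
    ch[i] = int16Samples[i] / 32768;            // int16 -> [-1, 1)
  }

  const src = ctx.createBufferSource();
  src.buffer = buffer;
  src.connect(ctx.destination);

  // Schedule each chunk to start exactly where the previous one ends.
  if (nextStartTime < ctx.currentTime) {
    nextStartTime = ctx.currentTime + 0.2;      // initial / underrun buffering
  }
  src.start(nextStartTime);
  nextStartTime += buffer.duration;
}
```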
Just from skimming the spec, the AudioWorklet interface looks very close to what is needed to build sensible, performant frameworks for audio profs and game designers.
So the most important question is: why isn't this interface implemented in any browser yet?
That a BufferSourceNode cannot be abused to generate precision oscillators isn't very enlightening.
> why isn't this interface implemented in any browser yet
Partially because in addition to the interface itself it relies on a bunch of generic worklet machinery which also doesn't exist in any browser and is not trivial to implement in non-sucky ways.
But also partially because the spec has kept mutating, so no one wants to spend time implementing until there's some indication that maybe that will stop.
For bonus points, prep work done for the eventual rollout of AudioWorklet in Chromium shipped a bug to release channel Chrome that breaks all uses of Web Audio. The bug wasn't caught in beta/canary channels because it only affects some user machines, and they can't revert the bug because of architectural dependencies. A basic way to summarize it is that AudioWorklet required the threading structure of Web Audio to change for safety reasons, and this results in a sort of priority inversion that can cause audio mixing to fall behind forever. Even simple test cases where you play a single looping sound buffer will glitch out repeatedly as a result.
So basically, Web Audio is unusable in release Chrome on a measurable subset of user machines, for multiple releases (until the fix makes it out), all because of AudioWorklet. Which isn't available yet.
I am being a little unfair here, because this bug isn't really the fault of any of the people doing the AudioWorklet work. But it sucks, and the blame for this horrible situation lies largely with the people who designed WebAudio initially. :(
> But it sucks, and the blame for this horrible situation lies largely with the people who designed WebAudio initially
Just skimming it a bit, it seems like they tried to make the same kind of "managed" framework for an audio graph that the SVG spec does for vector graphics. And even if SVGs are janky, the static image still succeeds in serving a purpose. But if you get dropouts or high latency in audio, there isn't much more it can be used for. (Aside from browser fingerprinting :)
>> Can the ridiculous overeagerness of Web Audio be reversed? Can we bring back a simple “play audio” API
To be frank, the graphics world has had some type of standard (OpenGL) for a long time, next to DirectX. So WebGL had a good example. However, in the audio world we haven't seen a cross-platform quasi-standard spec covering Mac, Linux and Windows. So IMHO, non-web audio also lacks common standards for mixing, sound engineering, music-making. That's why web audio appears to lack a use case. IMHO, that smells like opportunity.
I use Web Audio, in canvas-WebGL based games where music making is needed. I understand the issues - we definitely need more than "play" functionality.
We've been through this multiple times. WASAPI, MME, DirectSound on Windows. CoreAudio on Mac. Libraries like SDL_mixer, FMOD, Wwise. We know how to construct a sound API. There's 20 years of prior art.
If you provide a low-level "play" API, others can build stuff on top because it's just numbers. Sure, sometimes there's "expensive numbers" like MP3 decoders, FFTs, etc., but these can be added as needed.
It's fairly easy to get PCM out on any one platform (which means you can build support for Win/Mac/Linux by writing that small C code 3 times), and as Jasper_ noted, the rest is just math on some integers or floats, so there is nothing much platform-specific about it.
I think the bigger issue is that non-experts sometimes get tasked with adding support for things.
The "audio device API that leaves the sample rate completely unspecified" example is, believe it or not, one I've seen before elsewhere. And yet, if you know the first thing about PCM samples, you know this is a mind-numbingly stupid mistake to make. Yet it's a mistake that a few people have made into shipping products, because they can't or won't reason about audio, and this did not stop them from being in charge of an audio API.
Whether the API could be used to play MOD files is a good litmus test of its suitability for a variety of purposes. Covers repeatedly playing samples at differing volumes and pitches, simultaneously.
I'd rather have a comprehensive API that someone can dumb down than one that's so crippled as to be unusable beyond very basic functionality.
Don't let the name fool you. OpenAL is a closed-source library, much like Wwise or FMOD or PortAudio, that just implements playback. Bizarrely enough, it is also the only one of these APIs that uses a similar "play this buffer" approach and suffers from the same issues as Web Audio's memory management, just without a GC.
The actual audio equivalent to OpenGL is OpenSL [0], which I don't think picked up any support from anybody.
PortAudio is MIT-licensed[0] and seems like a decent example of the primitives you need for audio.
Broadly low-level audio APIs are divided into 2 categories:
1. Callback-based - every time the underlying system has a new block of audio available and/or needs to be supplied with a new block, it calls your callback, which reads input data, does whatever processing you want, and writes output data
2. Stream-based - Inputs and Outputs are represented by streams. You can read from the input stream to record and write to the output stream to play back.
Both types of API can be used for low-latency audio, but you generally introduce a buffer of latency when you need to convert between them.
Portaudio lets applications choose which API they want to use.
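For what it's worth, the closest thing Web Audio itself offers to the callback model is the deprecated ScriptProcessorNode, where the browser asks you for each block (sketch):

```javascript
const ctx = new AudioContext();
const node = ctx.createScriptProcessor(1024, 1, 1);   // block size, in/out channels
let phase = 0;

node.onaudioprocess = (e) => {
  const out = e.outputBuffer.getChannelData(0);
  for (let i = 0; i < out.length; i++) {
    out[i] = Math.sin(phase) * 0.2;                   // fill the block on demand
    phase += 2 * Math.PI * 440 / ctx.sampleRate;
  }
};
node.connect(ctx.destination);
```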
OpenAL has multiple implementations, including the popular open source OpenAL Soft. They are not all closed source.
OpenAL does have a recording API so it isn't pure playback only.
But you are right in that the OpenAL scope is fairly limited. It was designed for games, particularly for rapid and frequent playback of simultaneous short sound effects. Because of this, the memory management issues you bring up are not often an issue. You load all the buffers you need at the beginning of the level and you keep reusing them without any more memory allocation/deallocation.
OpenSL ES was adopted by Android in 2.3 (API 9). However, they just recently seemed to invent yet another API, and seem to be leaving OpenSL behind.
So are all official OpenGL implementations (MESA isn't official, last I heard). Doesn't stop them from being a standard and being used, although I agree they would be better if they were open source.
OpenGL has to talk to hardware and is implemented by the hardware vendor. OpenAL does not have multiple implementations, isn't provided by hardware vendors and just wraps the platform audio API.
I'm not sure of your point here. Not sure why multiple implementations is a benefit for users. Audio is generic and hardware is cheap enough that the operating systems just implement and include drivers. It's a cross platform library that meets audio needs, including 3D/spatial audio, much like (and designed like) OpenGL.
I think things will get a lot better when the underlying enabling technology is in good shape. The audio engine needs to be running in a real-time thread, with all communication with the rest of the world in nonblocking IO. There are lots of ways to do this, but one appealing path is to expose threading and atomics in wasm; then the techniques can be used for lots of things, not just audio. Another possibility is to implement Worker.postMessage() in a nonblocking way. None of this is easy, and will take time.
If we had gone with the Audio Data API, it wouldn't have been satisfying, because the web platform's compute engine simply could not meet the requirement of reliably delivering audio samples on schedule. Fortunately, that is in the process of changing.
Given these constraints, the complexity of building a signal processing graph (with the signal path happening entirely in native code) is justified, if those signal processing units are actually useful. I don't think we've seen the evidence for that.
I'd personally be happy with a much simpler approach based on running wasm in a real-time thread, and removing (or at least deprecating) the in-built behavior. It's very hard to specify the behavior of something like DynamicsCompressorNode precisely enough that people can count on consistent behavior across browsers. To me, that's a sign perhaps it shouldn't be in the spec.
Disclaimer: I've worked on some of this stuff, and have been playing with a port of my DX7 emulator to emscripten. Opinions are my own and not that of my employer.
> If we had gone with the Audio Data API, it wouldn't have been satisfying, because the web platform's compute engine simply could not meet the requirement of reliably delivering audio samples on schedule.
1. I'm not convinced this is the case. From what I see, GC pauses constitute the big blockers, rather than event processing and repaints. Introducing an API that's friendlier to GC would be a huge win here.
2. We have WebWorkers. What would have prevented a WebWorker from calling new global.Audio() for the Audio Data API?
1. This is going to depend a lot on the app; doing an actual DAW is going to require some pretty heavy processing. It also depends on the performance goal. Truly pro audio would be a 10ms end-to-end latency, which is extremely unforgiving.
2. Some form of WebWorker is obviously where we're going. But does postMessage() have the potential to cause delay in the worker that receives it? (There are ways to solve this but it requires some pretty heavy engineering)
You just can't do that with the same level of tightness of rhythm on low-end hardware with web tech today. Flash was bad, yet Flash also opened up insane possibilities on the web when it comes to multimedia applications that just can't be matched with web tech. asm.js might fill the gap, but I haven't seen any equivalent yet.
I briefly tried Web Audio to implement a Karplus-Strong synthesizer (about the simplest thing in audio synthesis I guess?).
Without using ScriptProcessorNode, there was no way of tuning the synthesizer because of the limitation that any loop in the audio graph has a 128 samples delay at least.
Maybe a more "compilation-oriented" handling of the audio graphs (at the user's choice) could help overcome this?
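For context, the graph-only version looked roughly like this (a sketch; parameter values are arbitrary):

```javascript
// Karplus-Strong without ScriptProcessorNode: noise burst -> delay ->
// damping -> feedback back into the delay. The catch: any cycle in the
// graph carries at least 128 samples of delay, so the loop length (and
// therefore the pitch) can't be set accurately for higher notes.
function pluck(ctx, freq) {
  const burst = ctx.createBufferSource();
  const noise = ctx.createBuffer(1, Math.floor(ctx.sampleRate * 0.02), ctx.sampleRate);
  const data = noise.getChannelData(0);
  for (let i = 0; i < data.length; i++) data[i] = Math.random() * 2 - 1;
  burst.buffer = noise;

  const delay = ctx.createDelay();
  delay.delayTime.value = 1 / freq;        // intended loop period
  const damp = ctx.createBiquadFilter();
  damp.type = 'lowpass';
  damp.frequency.value = 5000;
  const feedback = ctx.createGain();
  feedback.gain.value = 0.98;              // decay per pass

  burst.connect(delay);
  delay.connect(damp);
  damp.connect(feedback);
  feedback.connect(delay);                 // the cycle
  delay.connect(ctx.destination);
  burst.start();
}
```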
Now step back and honestly think about which web API is actually powerful and nice to use and makes the impression that it has been carefully crafted by a domain expert.
Question: is the "point" of Web Audio to expose the native hardware-accelerated functionality of the underlying audio controller, through a combination of the OS audio driver + shims? Or is it more an attempt to implement everything in userspace, in a way equivalent to any random C++ DSP graph library? I've always thought it was the former.
Most consumer-grade audio hardware really only does playback. We've been doing software audio since around the turn of the century.
In Chrome's implementation, none of the mixing, DSP, etc. go through the hardware, and I'm more than certain that's the case for every other browser out there.
Audio controllers do at least do hardware-accelerated decoding of audio streams in e.g. H.264, though, yes?
But my question was more like: is Web Audio a mess mostly because it's an attempt to expose the features of the twenty-odd different OS audio backends on Windows/Mac/Linux, where the odd inclusions and exclusions map to the things that all the OS audio backends happen to share that Chrome can then expose?
> is Web Audio a mess mostly because it's an attempt to expose the features of the twenty-odd different OS audio backends
That is a good guess, but no. The main features of the Web Audio API (built-in nodes, etc.) are not backed by any kind of OS-level backend, it's all implemented in software in the browser. The spec design was based on what someone thought were useful units of audio processing. It's not a wrapping/adaptation of some pre-existing functionality.
If you mean AAC or MP3, which are usually used in the audio track along side H.264 in an MP4 or MP2-TS container, nobody outside of low power/embedded bothers to decode the audio codec in hardware, it's just not worth it.
Web api standardization for VR/AR is currently a work in progress. And it's been... less than pretty.
So if you've been wanting to try some intervention to make web standards less poor, or just want to observe how they end up the way they do, here's an opportunity.
>you can’t directly draw DOM elements to a canvas without awkwardly porting it to an SVG
This is not a wart, this is a security feature. Of course, it wouldn't be a necessary limitation if the web wasn't so complicated, but the web is complicated.
Just one example. The canvas API can grab the image data on the canvas. If you could rasterize arbitrary DOM nodes then you could very easily fingerprint users by, say, checking which fonts are installed. You could also load external resources such as images and iframes bypassing same-origin policy, so if your bank's website was configured incorrectly, a malicious site could steal information by taking screenshots of a canvas.
You can already draw non-same-origin images to the canvas using drawImage. This marks a special "origin-clean" flag which is checked when someone tries to call toDataURL or getImageData on the canvas [0]. I would be OK if drawing any DOM node to the canvas cleared the origin-clean flag.
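That existing behaviour looks like this today (sketch, hypothetical URL):

```javascript
const canvas = document.createElement('canvas');
const g = canvas.getContext('2d');
const img = new Image();
img.src = 'https://other-origin.example/picture.png';   // no CORS approval
img.onload = () => {
  g.drawImage(img, 0, 0);          // allowed, but the canvas is now tainted
  try {
    g.getImageData(0, 0, 1, 1);    // throws a SecurityError on a tainted canvas
  } catch (e) {
    console.log('canvas is tainted:', e.name);
  }
};
```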
Kinda tangential to the thread, but what's the best book for an introduction to audio programming for an experienced, language-agnostic coder (Java, C, C++, Obj-C, etc.)?
I'm not sure about "best", but I got a lot out of Will Pirkle's two books on programming synthesizer / effect plugins (http://www.willpirkle.com/about/books/).
For a more generic guide I've heard a lot of good things about a free (in electronic form) book called DSPGuide (http://dspguide.com/). Haven't had a chance to dive into this one, though.
Not to be pedantic, but that's technically incorrect. Indeed, if WebGL were to be supplanted by a lower-level graphics API, that would make a lot of people happy.[0]
As far as the author's thesis concerning the Web Audio API: I agree that it's a total piece of shit.
I've come to suspect that my phone's autocorrect functionality, HN's two-hour edit window, and my own brain routinely conspire against me to paint a picture of total idiocy.
I've said it before, I'll say it again: it exists in a vacuum, and is run by people who have never done any significant work on the web, with titles like "Senior Specifications Specialist". Huge chunks of their work are purely theoretical (note: not academic, just theoretical) and have no bearing on the real world.
The Web Audio API was the work of the Chrome team, not the W3C in isolation. I'm right there with you as far as W3C criticism is concerned, but they don't deserve the blame in this case.
I disagree with a lot of the assertions in this blog. You have to suspend some of your expectations since this is all JS. You can't have a JS loop feeding single samples to a buffer; JS isn't deterministic to that level of granularity, but overall it's fast enough to generate procedural audio in chunks if you manage the timing. If you check out some of the three.js 3D audio demos you can see some pretty cool stuff being done with all those crazy nodes the author is decrying. Hell, I wrote a Tron game and did the audio using audio node chains, and managed to get something really close to the real Tron cycle sounds, without resorting to sample-level tweaking, and with > 16 bikes emitting procedural audio. I think more focus on the strengths than the weaknesses is in order. And if you really want to peg your CPU, you can still use emscripten/webasm or similar to generate buffers, if that's your thing.
> JS isn't deterministic to that level of granularity
Why not? I linked a test app [0] in my post that generates PCM data on demand, and fast. It works deterministically on all the browsers. Mozilla certainly implemented AudioData back in 2011 and it was fast enough for them.
> Hell, I wrote a Tron game and did the audio using audio node chains, and managed to get something really close to the real Tron cycle sounds, without resorting to sample-level tweaking
Why couldn't this be a high-level userspace library like three.js? Yes, with a lot of creative energy, you can recreate a lot of sounds, I'm willing to believe that. But I think a low-level API would have been more useful from the getgo.
I was expressing my excitement over something neat, which implicitly answers the query in the title: it was designed for me. People who think wave analysis is neat. There are at least 3 other posts on here describing the basic features of wave analysis that are just fine. Thanks for the downvotes!
The major problem with this API is that they couldn't just copy something designed by people with actual knowledge, as in WebGL. So it was design by committee that does so much that the application should handle, yet has such deficient core capabilities that no application can rectify any of it.
Nope. Web Audio was designed almost entirely by a single person, Chris Rogers, an engineer with a long history of working on audio for Google, Apple and Macromedia[1]. Whatever Web Audio's problems, design-by-committee is not their cause.
> Web Audio was designed almost entirely by a single person, Chris Rogers, an engineer with a long history of working on audio for Google, Apple and Macromedia[1].
Who at Apple beat him with a stick to get audio right? Can we get that person to design the audio APIs for the Web and Android?
(Edit: I realized that this was an unfair comment born of my frustration with Audio APIs from Google.
The real issue driving this is that audio is still a dumpster fire on Android. So, if he gives web developers access to audio samples, everybody is going to expect it to work. And, on Android, it will fail miserably. So, better to isolate audio functions, give them "fuzzy" latency which you can bury in C code drivers, and hide the fact that audio on Android is a flaming pile of poo rather than piss off even more developers and get even more bugs filed against Android's shitty audio.)
Yes, in fact the problem here is that Rogers barely considered spec input from outside (and others at Google have continued this behavior). A committee's design would've resulted in a better outcome in this case.
This explanation doesn't fit, because <audio> already solved all the scenarios you're describing. Web Audio attempts to solve other problems, and does a bad job of it.