The human ear detects half a millisecond delay in sound (aalto.fi)
242 points by pizza on Aug 5, 2021 | 142 comments



I guess that makes sense considering we use the delay of sounds reaching each ear separately to determine the direction sound comes from.

Assuming the speed of sound is around 340 m/s, in half a millisecond it travels about 17 cm, which, I would guess, is larger than the average distance between our ears.

So, using that rough estimation, I would cautiously extrapolate that we probably detect a delay under 0.5 ms, but I'd be interested to see what "detect" means exactly.


The headline is odd. There are populations of neurons in the auditory system—the medial superior olive—that receive input from both ears. Even in humans these MSO neurons are exquisitely sensitive to binaural differences. This is how the owl catches the mouse at night. In some species delays of 10-20 microseconds can be detected and encoded by MSO neurons even tracking up to frequencies well above 40 kHz. This is amazing when you realize that neurons cannot fire at a rate above 1 kHz. Phase locking and ensemble encoding are used.

For example, adult mice have small heads, and ear separation is merely 5-7 mm; yet this is sufficient to locate the position of an ultrasonic squeak generated by a mouse pup at 40 kHz. This is a computational feat that requires extreme temporal precision in binaural auditory processing across comparatively noisy wetware (transduction noise, phase locking error, synaptic release noise, conduction velocity smear, dendritic integration in MSO).

And doubly impressive given that the brain has no “given” time base or oscillator to define a compute cycle. We must all build and refine our own internal set of pseudo-clocks for sensory and motor systems, in order to define the cumulative temporal context in which we are embedded.

This is crucial for the mouse to quickly avoid the talons of the owl.

More on timing in brain: @robwilliamsiii (see pinned tweet).


Thanks for this super interesting comment. The complexity of life never ceases to boggle the mind.


It is my understanding that what really happens is far more impressive.

1. When the sound is coming from, say, the back left, the difference between the ears is far less than the width of the head.

2. It is not that the brain hears on one side "later", it is that the brain is constantly hearing two different things, what is registered on the right and what is on the left, simultaneously. Since there are often sounds coming from both directions at once, there is no easy way to match the sounds up.

Instead the subconscious has to hold each of the incoming sounds for some fraction of a second, and then combine the different sounds (with smell, etc.) while matching up the different parts, using the differences to account for direction.

Since this perforce takes time, the subconscious meanwhile sends an "image" to the consciousness with the assumed incoming data based on whatever had happened before [not on what is happening now], while processing the current data to predict what image will be sent to the consciousness during the next incoming round.

This means we perceive a world created by the subconscious's guess of what should be happening now, based on past input. And the current input is held for processing. And STILL, the mouse is able to avoid the owl!


It should be noted that typical human conscious notion of 'now' is pretty much a figment of imagination, a strut for reasoning that falls apart when you start looking with sub-second precision. There is speculation, but also a lot of post-process coalescing that mostly hides/reconciles processing delays and time skew of sensory and non-sensory inputs.


We do much better than that: we use a combination of phase (arrival time of the sound front, if you wish, at low frequencies) and amplitude to determine direction.

Good article on this:

http://alumni.media.mit.edu/~araz/sss/Sound_Localization.htm...


Yeah, I'm thinking the relative phasing of the group delay is essentially having a comb filter effect on the audio, and the difference is perceived qualitatively rather than as anything approximating a time delay.

For anyone not familiar, here's an example of comb filtering, where reflections interfere and you get wobbles in your frequency response (shown in an fft mid video).

https://www.youtube.com/watch?v=Amj4UevyRfU
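
For a concrete picture, here's a minimal sketch (assuming a single equal-amplitude reflection, and Python/numpy; none of this comes from the video) of the comb-filter magnitude response you get when a copy of a signal delayed by 0.5 ms is summed with the original. The notches land at odd multiples of 1 kHz.

    import numpy as np

    tau = 0.5e-3                       # delay of the reflected copy, in seconds
    f = np.linspace(20, 20_000, 2000)  # audible band

    # y(t) = x(t) + x(t - tau)  ->  H(f) = 1 + exp(-j*2*pi*f*tau)
    H = 1 + np.exp(-2j * np.pi * f * tau)
    mag_db = 20 * np.log10(np.abs(H) + 1e-12)

    # Deep notches sit at odd multiples of 1/(2*tau), i.e. 1 kHz, 3 kHz, 5 kHz...
    for k in range(3):
        fn = (2 * k + 1) / (2 * tau)
        print(f"notch near {fn:.0f} Hz: {np.interp(fn, f, mag_db):.1f} dB")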


Don't forget the transformation of the signal by the head, chest and outer ear! It's quite amazing.


It's more involved than just left/right delay. We localize sound in the vertical direction and front and back as well.

This is accomplished by the delays between sound that enters the ear directly and sound reflected off the outer ear folds.

These reflections create "comb" filters in the audio spectrum which we learn to associate with direction. It's remarkable.

One test to prove this was to fill the outer ear with plasticine and perform localization tests on subjects. They could not localize sound in that condition.

The early work, as I recall, was at the Heinrich Hertz Institute in the mid 1970s.

I am suspicious that part of what this article is reporting is due to phase cancellation effects causing similar filtering that people hear as a timbre change rather than as an actual detection of the time delay.

(Source: My recording engineering final paper)


Now I really want some plasticine / modeling clay to attach to parts of my pinnae and experiment with my ability to localize sounds.

Have video games already emulated the spectral effects of sound direction for players using headphones, or even speaker systems with known spatial distribution? I can imagine modulating the sound coming from an in-game object to match its perceived source direction to its location relative to the player.


The problem is that each person's pinnae have a unique geometry, which makes the notches of the comb filter lie at different frequencies for everyone. As far as I know, there's no good way to determine this other than a direct measurement, which requires specialist equipment.


Creative (Sound Blaster) now has a phone app where you take a picture of your ear and it uses a machine-learnt model to produce a somewhat better head-related transfer function from it. It is an improvement on SBX and CMSS before it, but it's not perfect either.


Video games, VR, and other headphone systems do use generalized HRTFs for directional audio.

Edit: I'm currently working on a series of videos to explain sound direction and perception. Some of the code I wrote for my demos is or will be on GitHub.


I recently purchased a system from Dr Jeffery Thompson for sound healing. It creates custom binaural beats based on the body's stress response (as monitored by heart rate variability). The system includes a pair of sophisticated 3D microphones that go over the ear with the pickups located as close to the ear canal as possible. This allows them to record the sound as close as possible to what the ear hears. We use them to record the client singing their tone. It then gets disguised and wrapped back into the binaural beat so you are essentially singing to yourself. I'm eager to go out and record some natural sounds in 3D for use as background.


Link? I’m trying to picture the apparatus you’re describing but it’s not making sense. Are there headphones involved or is it just a wearable microphone?


Some of them do, yes. For example CS:GO has an HRTF model that does this. It has variability though, because everyone's ears are slightly different, so these models are general and may work better or worse for different people.


I think this is called HRTF. The bad news is there are many of them and only a few work for each person.

This often results in being unable to tell front/back apart, frontal sounds perceived as coming from above, etc.


Some games have it. With good headphones, and when it's implemented well, it's great; otherwise I couldn't tell if it was on or off.

There's even a "brand" for it: THX Spatial Audio.


I remember back in the late 90s I had a BeBox and there was this cool nodes-and-connections audio program where you could add various effects. My friend mentioned this delay effect and so we rigged up a graph where one stereo channel was delayed by some number of milliseconds. The effect was pretty uncanny. Wearing headphones, the sound seemed to be coming exclusively from the non-delayed side, until you removed that headphone and the delayed side was clearly still playing.


That's a built-in echo-cancelling module, allowing you to easily get the correct direction to the true source of a sound.


AKA the Haas effect or precedence effect, making the first sound to arrive the most important for localization, even if echoes are louder.


That's a nice estimation, but I would have imagined direction is down to whether a sound is more muffled or quieter in one ear than in the other.


It's a combination of both.

The relevant keywords are "Interaural Time Difference" (ITD; this phenomenon) and "Interaural Intensity (or Level) Difference" (IID/ILD; i.e., volume).

In fact, there are a few other mechanisms too. The shape of the pinna (external ear) does some filtering that allows you to distinguish sounds that produce identical ITD/IID.

The neuroscience of this is really fascinating, and the circuits have been worked out pretty well.
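
If you want to put rough numbers on the ITD part, a textbook back-of-the-envelope is Woodworth's spherical-head approximation. This is only a sketch with an assumed head radius (not anyone's measured HRTF), in Python:

    import numpy as np

    def itd_woodworth(azimuth_deg, head_radius_m=0.0875, c=343.0):
        """Approximate interaural time difference (s) for a spherical head.

        Woodworth model: ITD = (r / c) * (theta + sin(theta)), where theta is
        the source azimuth in radians (0 = straight ahead, 90 = fully lateral).
        The head radius is an assumed adult average.
        """
        theta = np.radians(azimuth_deg)
        return (head_radius_m / c) * (theta + np.sin(theta))

    for az in (0, 15, 45, 90):
        print(f"{az:>2} deg azimuth -> ITD ~ {itd_woodworth(az) * 1e6:.0f} us")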


To piggyback on the sibling comments:

If you only determined direction based on relative volume between your ears, you wouldn't be able to distinguish between sounds in front vs behind you.


You also wouldn't be able to do the cocktail party trick. Your brain's ability to ignore everything that isn't of the desired ITD phase shift is what allows you to selectively ignore most of the noise in the room and focus on a single conversation.


Yep that's why I added "muffled" (i.e. not 100% clear). Maybe it's worth clarifying that I wasn't trying to correct the commenter, just adding my original (naive? wrong?) belief about how this worked :)


Since you sound sincere here ... I read your comment as disagreeing with and correcting the original comment.


Oops, it wasn't intended! I think maybe the "but" in my comment makes it sound that way?


Yeah, to me it reads ”that’s a nice theory you have, but now I’ll tell you mine, which I think is the correct one”.


I just liked how they related the speed of sound to the distance between ears to explain the (possibly less than) ~0.5ms thing, and how it was nicer than the naive explanation I had previously assumed


No, it is a difference in phase.


Only for LF.


That doesn't need to be the case when there are reflections.


I'm not sure I follow


A reflection can easily be louder than the original sound, but the longer time-of-flight allows the ear to distinguish between the two. That's how, even in a hall with echoing walls, you can still pinpoint a sound source with relatively high accuracy.


The first chapter of The Sound Book by Trevor Cox goes into great detail about this (but without getting deep into the math of it). Well worth the read for those interested in the acoustics of architecture and our perception of how it modifies the soundscape.


Another interesting one is 'the acoustical foundations of music'.

ISBN 0393090965


> in half a millisecond it travels about 17 cm, which, I would guess, is larger than the average distance between our ears

What is the speed and variability of neural signals traveling through the brain?


"Detect" for experiments like this is almost certainly an ABX test. That is you get to listen to samples A, B, and X repeatedly and are asked to decide whether X is A or B.


It's amazing how much difference 0.5 ms can make in stereo imaging. Any decent home audio receiver has the ability to set the distance of each speaker individually so that you can compensate for physical placement constraints.

If you get a stereo pair perfectly locked in on delay to eardrums, you can produce an extremely compelling listening experience for those in a very specific region of the room.

Finding out about all this can be revolutionary for some music enthusiasts. Once you get a good listening setup (or headphones), you start going through old things to see how the "stage presence" sounds, or if you are now able to physically place each instrument in the virtual space.


My dad bought an Isuzu Impulse in 1985. It had a fancy OEM stereo system with Technics branding, and a button for the driver to toggle the imaging. The button was on the 7-band graphic EQ that came with the car. (I don't think I've ever seen another factory stereo with physical EQ sliders like that.)

It was magic, if you were sitting in the driver's seat. Toggling that button made the sound switch from "okay" to "magically spatial."

I'm pretty sure that led to a conversation about how we locate sounds. I also wonder how it was implemented in 1985. I doubt there were DSP chips in there. What's the "simple" way of adding delay to some speakers?


> I'm pretty sure that led to a conversation about how we locate sounds.

So, psychoacoustics is incredibly complicated. There's something like 13 different mechanisms that co-operate in sound localization.

However, the bulk of it was known quite a bit before 1985, and had nothing to do with the "spatialize me" button on a specific car stereo.

There's no simple way to add a delay to some speakers unless you're working in the digital domain. In analog you have two basic choices. With passive components, you build a ladder filter, which is, as the name suggests, just a long chain of low-pass or all-pass filters. Each "rung" only adds group delay on the scale of a couple of microseconds, so these get very big and expensive fast. They also suffer from accumulated imprecision issues. With active components you can create a feedback loop through an op amp. This is how guitar delay pedals work, but the more delay you have the more distortion you introduce.

Technically there's a 3rd way: extremely long wires, but that's basically never practical.

Thankfully these days everything starts out in the digital domain, so you just need a controllable fifo before the DAC. Entry level home theater receivers have had this since circa 2000.
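
That controllable FIFO really is just a circular buffer per channel. A minimal Python sketch (the sample rate and delay are illustrative) of the kind of thing a receiver does for speaker-distance compensation:

    import numpy as np

    class SampleDelay:
        """Fixed integer-sample delay implemented as a circular buffer (FIFO)."""

        def __init__(self, delay_samples):
            self.delay = delay_samples
            self.buf = np.zeros(max(delay_samples, 1))
            self.idx = 0

        def process(self, block):
            if self.delay == 0:
                return block.copy()
            out = np.empty_like(block)
            for i, s in enumerate(block):
                out[i] = self.buf[self.idx]   # oldest sample comes out...
                self.buf[self.idx] = s        # ...newest sample goes in
                self.idx = (self.idx + 1) % self.delay
            return out

    fs = 48_000
    delay = SampleDelay(int(round(0.5e-3 * fs)))  # 0.5 ms ~ 24 samples at 48 kHz
    left_block = np.random.randn(1024)            # stand-in for one audio block
    left_delayed = delay.process(left_block)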


The oldschool way is to convert the signal to the physical domain. https://anasounds.com/analog-spring-reverb-how-it-works/

Edit: re: extremely long wires, this is how some of the physical layer testing is done for networking equipment. Have rolls of hundreds of miles of fiber sitting on the ground to simulate large distances between switches.

Edit2: https://www.m2optics.com/products/fiber-test-boxes/multi-spo... :-)


Here are a variety of techniques: https://en.m.wikipedia.org/wiki/Analog_delay_line

I think a better word than "physical" domain is "mechanical" domain. Mechanical waves propagate much more slowly than electromagnetic waves.


> Technically there's a 3rd way: extremely long wires, but that's basically never practical.

Did the math: ~135 km (assuming 300,000 km/s and the speed of light in copper being 90% of that) - that's a long wire; it would also have to be superconducting or very high voltage ;).

Whatever room-temperature superconductors end up costing, audiophiles will be the early adopters ;).


Hell, even if the premise of the conversation was wrong, I still learned something that day :)

DSPs were a thing in the 80s, right? I guess the question is: were they so expensive that it was unlikely to find one in an OEM stereo of a mid-priced car? A sibling comment mentions that this button may have just been a stereo expando kind of thing, rather than localizing the sound stage through signal delays. I'm thinking that may be right, and that I'm misremembering the feature. It would be awesome to find some original owners manuals for the 1985 Impulse that have any mention of this feature. My DDG-fu is failing me right now.

Now, which one of those 13 mechanisms is failing on me when the damn cricket keeps "moving" around the room as I try to follow the chirp?



DSPs were around, but there's no way that car stereo was digitizing the signal, delaying it, then converting it back to analog.

I get it was an impressive experience, but it's essentially certain it's what the other poster said: just boosting the out of phase content between the channels. This was a very in vogue effect at the time. I remember listening to the top 40 on the radio one time and Madonna had some new song where they were hyping it as surround sound and turning it into a whole event. In any case this effect can be more effective than you might assume. After all, the first consumer version of Dolby Surround was just this out of phase content run through a bandpass filter and sent to surround speakers.

I knew a family friend with the 90s version of the Impulse. Neat quirky car from what I remember. As a kid I definitely thought it was very cool.


Sounds like QSound. There was a Roger Waters album mastered with it (Amused to Death) that has a dog barking way outside the normal sound stage.


Do you mean that they simply amplify the difference between the two signals?


Yup, that's the basic idea. It's a very simple circuit, which was the appeal before the digital everything era we live in now.


There's also ultrasonic delay lines. That's how they did reverb back in the analog only days.


a reaaaaaaaalllllllyyy long wire :P

pro audio equipment sometimes used 'bucket brigade' chips to implement a delay line (shame that you can't get them anymore)


Good news. Bucket brigade (BBD) chips are still produced and available from various sources. [0][1] They are used today in a lot of guitar/synth gear.

[0] https://www.electrosmash.com/mn3007-bucket-brigade-devices [1] https://www.coolaudio.com/features-page.php?product=V3205SD

Because of the way they work, clock noise needs to be filtered out of the final signal by design, so relatively heavy low-pass filtering is standard. The result isn't very hi-fi.


whoa ty! just unblocked a 5-year old project :)



And they even have this digitally (fiber optic). Relevant Tom Scott: https://www.youtube.com/watch?v=d8BcCLLX4N4


At that time, the most common way to do stereo enhancement was to decrease the L + R component of the stereo signal. [0] L+R/L-R was (and still is) a common way to encode stereo signals, including for FM radio. [1]

The impact on the stereo field by just changing the mix of these two components is profound. No signal delay needed.

For signals that are simple L and R, sum them to get L+R and take the difference to get L-R. So you can use this technique on any stereo source.

[0] https://www.sweetwater.com/insync/stereo-enhancement-work-mo... [1] https://en.wikipedia.org/wiki/FM_broadcasting#Stereo_FM


I may be misremembering—I was a kid at the time—but I distinctly remember the "wow" effect only working if you were sitting in the driver's seat. To me that suggests that the signals were delayed to the nearer speakers to center the sound stage around the driver rather than some arbitrary point above the center console.

But looking up photos online, I see the button I was talking about labeled as "Ambience", which kind of suggests the method you're describing.

It also turns out that using the Internet to find technical info about a 40-year-old car that was never popular to begin with is very hard!


The simple way is to add some cross-channel antiphase to both channels. So R = (R - cL), and L = (L - cR) where c << 1.

This is very cheap and easy with opamps.

It does indeed sound magical, and makes the stereo image expand beyond the speakers.

It also does weird things to a mono mixdown and if c is too big you get a hole in the middle. But if you keep c small and use it as the final effect just before the speakers that's not a problem.

You could also use BBD[1] chips to add a ms or so of analog delay, but that's less likely because it would have been more complex and expensive.

Digital audio delays had appeared in studios by the late 1970s, but they were still more expensive than BBDs in the mid-80s, so unlikely for in-car use.

[1] Bucket Brigade Delay
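
A minimal sketch of that cross-feed (c is an illustrative value); note it's the same trick as the L+R/L-R enhancement described a few comments up, since subtracting c*R from L just attenuates the mid and boosts the side:

    import numpy as np

    def widen(left, right, c=0.2):
        """Cross-channel antiphase widening: L' = L - c*R, R' = R - c*L, c << 1.

        Equivalent to scaling mid (L+R) by (1-c) and side (L-R) by (1+c), so a
        mono mixdown loses a little level, and a large c carves a hole in the
        middle of the image.
        """
        return left - c * right, right - c * left

    # Toy stereo signal: a centred tone plus a side-panned one.
    t = np.linspace(0, 1, 48_000, endpoint=False)
    centre = np.sin(2 * np.pi * 440 * t)
    side = 0.3 * np.sin(2 * np.pi * 660 * t)
    L, R = centre + side, centre - side
    Lw, Rw = widen(L, R, c=0.2)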


My friend had an acura in the 90s with physical EQ sliders. Can't remember if it was OEM or not but my guess is that it was.


I built a pair of full-range speakers and was therefore introduced to that effect — an effect I had only experienced before with headphones.

No special modern receiver in my case, just simple speakers. It seems the 2-way, 3-way speakers I grew up with kill stereo imaging. (Never mind the crossovers eat power and diminish the efficiency of the speaker — requiring a higher current amp, etc... Lovely what a small ½ Watt tube amp and a pair of full-range drivers can sound like ... and throw in a sub.)


> it seems the 2-way, 3-way speakers I grew up with kill stereo imaging.

Yeah there are ways to build crossover networks that can minimize these issues (phase shift) across the frequency range. The most ideal crossover would dissipate 100% of the undesired acoustic power as heat rather than storing it as reactive energy in inductors, but the frequency domain be a tricky beast to dance with.

The best overall approach is probably the 4th order Linkwitz-Riley filter:

https://en.wikipedia.org/wiki/Linkwitz%E2%80%93Riley_filter#...
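
A sketch of what that looks like digitally, assuming scipy (an LR4 section is two identical 2nd-order Butterworth filters in cascade; the crossover frequency here is illustrative):

    import numpy as np
    from scipy.signal import butter, sosfilt

    def lr4_crossover(x, fc, fs):
        """Split x into low/high bands with a 4th-order Linkwitz-Riley crossover.

        Each band cascades the same 2nd-order Butterworth twice, so both bands
        are -6 dB at fc and their sum has flat magnitude (but not flat phase,
        i.e. there is still group delay)."""
        sos_lp = butter(2, fc, btype="lowpass", fs=fs, output="sos")
        sos_hp = butter(2, fc, btype="highpass", fs=fs, output="sos")
        low = sosfilt(sos_lp, sosfilt(sos_lp, x))
        high = sosfilt(sos_hp, sosfilt(sos_hp, x))
        return low, high

    fs = 48_000
    x = np.random.randn(fs)               # one second of noise as a test signal
    low, high = lr4_crossover(x, fc=2000, fs=fs)
    recombined = low + high               # flat in magnitude, not in phase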


This is why both sets of speaker cables to my amplifier are the same length.

It does make me curious about the recording process and effects if one microphone has, let's say, 50 feet of cable and another on a different musician has 100 feet of cable.


It's easy to do the math and show this is nonsense.

If a foot is a nanosecond for c, then a 50 ft difference is 50 ns, or about 10,000 times smaller than the smallest detectable difference. Even if the speed in a cable were 1% of c, it would still be under the proposed threshold.

Even so, try convincing my dad that it does not matter.


Thinking it's the cable length is a decent hypothesis, but it's almost completely due to the different times it takes for the sound waves to travel from the speaker cone to your ears.

But maybe you meant that the cable lengths helped place the speakers symmetrically.


It's not clear from the article, but I think what they are doing is delaying part of the frequency range of a signal. All this talk of negative delays is really confusing, it's just that they are delaying frequencies outside the range they are considering by a larger amount than within their range of interest. The overall effect is the same, there is going to have to be latency through the system.

So, I presume, their listeners are hearing a timbral shift caused by the harmonics being advanced/retarded. This is not the same effect as feeding a delayed signal into one ear and a non-delayed into the other (the relative phase being used for locating a sound).

They are I guess exploring how accurately they need to reproduce an impulse response to make an accurate transducer, and from the article, I believe they have concluded that they need to be more accurate than they previously thought. Given Genelec make a range of speakers with DSP for this sort of thing, it's I guess partly a marketing campaign to convince people that there is some benefit from their DSP corrected monitors.

Of course, for a producer to have an accurate monitor is useful, but if the listening public have non-aligned drivers, the sonic benefits from worrying about this stuff are of somewhat limited value.
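
As I read it, the frequency-band delay described above amounts to something like the following sketch (the band edges, delay, and scipy-based band split are all my assumptions, not Genelec's method): split off one frequency band, delay it by a fraction of a millisecond relative to the rest, and recombine.

    import numpy as np
    from scipy.signal import butter, sosfiltfilt

    def delay_band(x, fs, band_hz, delay_s):
        """Delay only the given band of x by delay_s, leaving the rest alone.

        This is the 'group-delay distortion' idea: everything still arrives,
        but one band arrives slightly later than the others."""
        sos = butter(4, band_hz, btype="bandpass", fs=fs, output="sos")
        in_band = sosfiltfilt(sos, x)          # zero-phase split, for simplicity
        rest = x - in_band
        n = int(round(delay_s * fs))
        delayed = np.concatenate([np.zeros(n), in_band])[: len(x)]
        return rest + delayed

    fs = 48_000
    t = np.arange(fs) / fs
    clicks = (np.sin(2 * np.pi * 4 * t) > 0.999).astype(float)  # crude click train
    shifted = delay_band(clicks, fs, band_hz=(2000, 6000), delay_s=0.5e-3)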


The claim of novelty for time-travel filtering seems odd, so I wonder if this was a mistake in transcribing the quotation for the article. Any FIR filter can effectively time travel if you consider the peak to be time zero. Negative delay of one part of the sound is positive delay of the rest. It's not new to this article, but it is something that DSP makes easy and that is not so easy without DSP.

The DRC room correction software could achieve excellent results with frequency-dependent temporal correction back in the early/mid 2000s: http://drc-fir.sourceforge.net/


I believe that the BBE Sonic Maximizer (http://www.bbesound.com/products/sonic-maximizers/default.as...) products fix that. They (time) align the components in the frequency spectrum, for lack of a better description. edit: Or rather, it works as a dynamic EQ to dampen/amplify certain frequencies.

Most guitarists that have a rack-setup, probably have or have tried these - in practice, the effect is just a more smoothed sound.


Yes, the way they describe it with "time traveling" is especially confusing. I still don't understand exactly what their claim is or what they intend on selling.


Your comment is sensible, but what's a practical use case where you mix any two natural sounds with a delay and don't get harmonics? So even if the ears are detecting the delay via harmonics, I don't see why that's a negative thing.


Right, it seems people can detect the .5ms phase difference in the timbre. It is a noticeable difference, but they aren't detecting the delay, per se.


Takes me a minute to remember monitors and speakers can be the same exact thing, and not a monitor with speakers built in (like a TV).


I spent lockdown coming up with a music podcast with a friend, it's amazing how much even a little delay can throw you off. I have a hybrid digital/analogue setup where an analogue mixer is fed either from a digital source by a DAC or an analogue source via my microphone or a turntable. The analogue audio then goes through some analogue processing equipment and is then fed into the ADC and recorded/streamed digitally. It's not the most efficient setup in the world, but I have a real thing for flashing lights and physical switches!

If you talk into the mic with your headphones plugged into the analogue side then switch to monitoring the software instead, you can really notice the latency even though it's practically not that much. I read about a technology that's supposed to prevent angry customers from giving poor call centre employees a tirade of abuse by echoing their voices back at them on a very slight delay which is apparently quite intolerable, and I could believe it!


> echoing their voices back at them on a very slight delay which is apparently quite intolerable, and I could believe it!

Sometimes I get this effect on my normal calls, and it is pretty awful. Echo wasn't nearly so bad on circuit-switched calls, but now that everything is digital and has sampling and codec delays, the echoes come back so much slower.


This is known as Delayed Auditory Feedback

https://en.wikipedia.org/wiki/Delayed_Auditory_Feedback


Meanwhile, Android Chrome deliberately introduces a ~300 millisecond audio delay to make web games impossible and force game devs to use the Play Store so that Google gets its pound of flesh. The superpowered[1] database has many measurements showing how bad Android is.

Those of you with Android devices can try this ancient web audio demo to see that the issue remains unfixed all these years later, except on certain select premium devices (certain Samsungs, etc.).

https://webaudiodemos.appspot.com/TouchPad/index.html

You should hear sound the instant a rectangle is touched/clicked and you will with a desktop browser but not with most Androids.

Apple also employs dark patterns on iOS Safari to make web games impossible:

1. No full screen. ("But the user wouldn't know how to exit full-screen - unless they buy from the App Store! We also plan to make full screen PWAs impossible because of, like, Reasons and totally not because of our 30% App Store cut!")

2. Slow WebGL and no WebGL 2.0. ("But we need to run extra shader validation security checks every single frame even though we technically only need to do it once up front! But this extra Privacy and Security[TM] is totally unnecessary if you buy from the App Store!")

3. No device motion tilt control. ("Allowing the user to consent to tilt control would violate the user's Privacy and Security[TM] - unless they buy from the App Store!")

4. No JS debug console unless you register an Apple developer account, download 10+ gig Xcode on a Mac-only computer, and ask Apple for authorization to debug javascript on your own freakin iPhone you paid for with your own damn money.

I was forced to give up using "open" web technologies for games and instead go with native code game engines (Godot, Unreal, Unity, ...).

[1] https://superpowered.com/latency


I remember way back when the original Xbox was out, and modded. Whatever video player it had let you adjust the sound delay for the (new at the time) mpeg4 rips of whatever movie or show you downloaded.

It was like a light switch went off when you got the audio delay synced up perfectly. I guess I recall this vividly because 1) it was a huge problem and 2) the adjustment to audio delay was on the shoulder buttons, while fast forward and rewind were on the triggers. Such a large problem that the adjustment wasn’t buried in menus, but right there on the controller. Prime real estate.


Humans can ignore the desync between audio and video to about -125 ms to +45 ms. Ideally you don't want to be off by more than a frame or two though.

If the desync goes out of these bounds then watching someone speak becomes uncomfortable.


Using a rough rule of thumb that one millisecond equals one foot, it kind of makes sense that we would be able to follow speech coming from someone up to 45 feet away. But the tolerance on the negative side is surprising.


It might have been XBMC (xbox media center). Great software and it still exists today on several platforms (PC, Android TV, etc) under the name of Kodi.


Humans can do way better than that, somewhere in the 20 microsecond range. See for example Mills (1958). https://scholar.google.com/scholar?cluster=11401696136259886...

Edit: in fact, humans are among the best animals of all at this task; I think it is not really clear why. They are on par with highly specialist auditory hunters like barn owls. It may be that humans can use their higher cognitive faculties to somehow get better at the task than the average animal, but it's not clear to me how that would happen.


Since humans rely on groups and communication it is probably to hear/locate who in a group is talking.


I would imagine this might not be an ear mechanics issue, but a brain processing issue. Some animals have a dedicated "processor" and others do not, while in the larger, more agile human brain it's simply one of many features. Maybe?


> The sounds easiest to identify were a castanet, a percussion instrument, and short clicks.

Not coincidentally, this is exactly the type of music where lossy encodings such as MP3, Ogg or AAC fail.


So what does that mean for a large symphony where musicians are spread out much farther than sound travels in a millisecond? Let's say it's 10m, that would be about 29ms. I've heard really good musicians and conductors compensate for that but I'm doubtful.

So we must be talking about more subtle effects, like phase shifts, and not about how the waveform envelopes line up.


To add on to what the other commenters have already said - this is especially true for marching bands where you are even further spread out and your distance to various sections changes over the course of the performance. You have to learn to (at least somewhat) disregard what your ears are telling you and focus solely on the drum majors for staying in time. Listening to the drumline to keep yourself on beat can be a very bad idea when they're half a football field away.

This effect is amplified even more for enclosed stadiums where there tends to be a large amount of echo, and can really be quite disorienting the first time you experience it.


It's one reason that a conductor sends visual cues to the whole orchestra at the speed of light. That's the base clock source


one interesting thing is that you will fairly frequently see professional orchestras be out of phase with the conductor, but everything still works out.


Does a really good conductor train each section to come in on different parts of the visual cue then, e.g. double bass would play before the violins in the front? I guess you can't precisely solve this problem for more than a single listener.


If the listener is relatively further away from the orchestra than the orchestra members are to each other, it mostly all balances out, especially when you add the room and bandshell reverberation (which tends to wash the note onsets to some degree).

This is also helped by the instruments in the back of the orchestra favoring the low frequencies, where the listener has less access to precise timing information.


As a (former) bass player, you need to start a bit early, yeah. But that's not so much for distance but because it takes longer for the sound to start. Pipe organ is of course much worse. I would do one concert a year sitting next to an organist... The keyboard noises and motion from him playing way before me were pretty distracting. And when we stopped during rehearsal, he'd stop on the keys right away, but the organ would keep going for a bit.


Yeah, I noticed that when I started playing upright bass, compared to the bass guitar the sound starts with a noticeable delay.


Since light travels a lot faster than sound, I imagine the visual cues of conductor and/or lead of your section and practice of staying on beat matter more than hearing someone across the symphony.


It’s the same for organists in large cathedrals where there is a huge delay and subsequent reverb, as the pipes are often quite far from the player and sound bounces off every conceivable hard surface before making its way back to the organist. They learn to compensate for this mostly by just being very accurate players and trusting that the outcome is correct.


It can do a whole lot better than that, it can detect phase changes at a very small fraction of a millisecond. It has to because that's how we determine direction of a sound source at low frequencies.


That it can do that with two ears, given the latency of signalling through the brain, is amazing.


You can even hear the interference between two sine waves, without them actually "mixing" in the air, by feeding one to each ear. This doesn't work for every frequency, but when it does it sounds the same as it would if they were both played to both ears, yet it's happening in your brain!

https://en.wikipedia.org/wiki/Beat_(acoustics)#Binaural_beat...
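
A minimal sketch that generates such a pair (frequencies and levels are illustrative); play it over headphones rather than speakers so the two tones only ever meet inside your head:

    import numpy as np
    import wave

    fs, dur = 44_100, 10.0
    t = np.arange(int(fs * dur)) / fs
    left = np.sin(2 * np.pi * 440.0 * t)   # 440 Hz in the left ear
    right = np.sin(2 * np.pi * 444.0 * t)  # 444 Hz in the right ear -> ~4 Hz beat

    stereo = np.stack([left, right], axis=1)
    pcm = (0.3 * stereo * 32767).astype(np.int16)   # keep the level modest

    with wave.open("binaural_beat.wav", "wb") as f:
        f.setnchannels(2)
        f.setsampwidth(2)          # 16-bit samples
        f.setframerate(fs)
        f.writeframes(pcm.tobytes())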


I've done this and it makes you tired extremely quickly for some reason. It's worse than tuning up a piano from scratch.


Well, it's not just latency in many cases. If there's someone typing at the desk to your left, that noise has latency but it also has different loudness in your left and right ears. The sound reaching your right ear has to make its way around your head, so it sounds different, not merely more delayed. Your brain has learned what all of that means.


In addition to loudness and latency differences, there is even a frequency-specific effect due to the shape of the head and in particular your ear (your pinna to be precise). This is actually probably the reason why our ears have the weird shape they do. This is different for every person, which means that headphone-based simulated spatial audio can never be perfect.


Was just thinking about this the other day. In music production you can determine this; you have to be able to hear just a millisecond of difference sometimes to align things well, especially when you're working with instruments that have considerable delay (e.g. almost all Kontakt instruments).


Essentially anything that is MIDI based or has an audio buffer somewhere.


And also with samples. If you take, say, a tuba, the player must start blowing well before the beat, so there is a natural breath you want to preserve. In MIDI there is no real way to preempt that except by playing the sample early as best as possible. A trained tuba player will just know instinctively.


Yes, but what I meant is that even though you place a MIDI note directly on a beat, many instruments still have a pretty considerable delay, on the order of 10-100 ms.


Not only that, those delays themselves tend to change depending on how 'busy' the instrument is, for instance, whether you are using layered patches, multiple notes at once, different instruments at once and so on. It can change from one note to the next, sometimes without any clear hint as to why that is the case.


Oh I knew this. Doing the Sococo media mixer, we understood that audio sensitivity is orders of magnitude more critical than video. Heck, video could stall and delay and be fuzzy and folks didn't say much. But miss a single sample of audio at 44 kHz and folks would speak up.


> castanets and short clicks

I was doing just this experiment three days ago. I have a new audio interface, and used Audacity to try to establish its round-trip delay. I created a rhythm track using a click sound, connected one channel's playback to an input, and recorded it. The latency was about 35.1 ms.

Audacity (the 2.x series) allows adjustment of latency compensation in whole milliseconds.

Edit: I set the latency adjustment for -35 ms so there was only the 0.1-ish residual latency.

Playing one channel of the original track and the recording together (after adjusting the channel balance for equal loudness) was quite disconcerting.
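
For what it's worth, you can get the same number programmatically by cross-correlating the recording against the original click track. A sketch assuming both are already numpy arrays at the same sample rate:

    import numpy as np

    def latency_ms(original, recorded, fs):
        """Estimate how many ms `recorded` lags `original` via cross-correlation."""
        n = min(len(original), len(recorded))
        a = original[:n] - np.mean(original[:n])
        b = recorded[:n] - np.mean(recorded[:n])
        lag = np.argmax(np.correlate(b, a, mode="full")) - (n - 1)
        return 1000.0 * lag / fs

    fs = 48_000
    clicks = np.zeros(fs)
    clicks[::12_000] = 1.0                                    # a click every 250 ms
    loopback = np.concatenate([np.zeros(1685), clicks])[:fs]  # fake ~35.1 ms delay
    print(f"~{latency_ms(clicks, loopback, fs):.1f} ms")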


Your buffer size is probably too high. Try 128 or 64


That is always going to be a compromise between the chance of a buffer underrun and annoying latency. Compared to analog, digital audio solves some problems but introduces a whole raft of new ones.


Yes; in fact, the human ear can detect delays down to ~1/20,000 s in sound...


I don't understand this. Delayed how? Does this mean if you are watching a video, and the person is talking, that you would notice a delay of milliseconds between the person's lips moving and when you hear it? But what does that have to do with castanets? I wouldn't be able to see them, they probably move so fast.

But they don't mention video matched against audio delay, that I read. If you start a song and they delay it by 10 seconds, it just starts 10 seconds later; how would you even know there's a delay? Or if there's a delay in the middle of a song, how would you know any difference, as maybe that's how the music piece is supposed to sound.

What am I not getting?


Highly recommended: Seeing The Visitors at ICA Boston. It's a dozen-ish musicians playing in separate rooms of a big house where they can't see each other and can only hear each other over headphones. It's presented as nine channels of video, each with their own audio.

The effect is pretty wild and magical. The music, I suspect, is intentionally written to not require precise timing, and part of the charm of the piece is the musicians feeling each other out as they play. It definitely plays with your expectations of what constitutes music and how you hear timing in music.

The ICA owns the piece (a copy of the piece?), but it isn't currently on display :-(

https://www.icaboston.org/exhibitions/ragnar-kjartansson-vis...


I saw it at the Broad in LA, and it was mind-blowing.


Most any audio engineer or audiophile will confess the vast majority of a music's spatiality is dictated by the room. For the 3-nines of us who cannot dedicate a room for ideal echo, decay, and absorption, making all the temporal DSP tech overcome the influence of the room is a Sisyphean task.

So when my son was born profoundly deaf in one ear, as an audiophile, I soon realized "time to prove the merits of single channel."

So far that just means my everyday workstation has a single Tivoli Model One radio as the output transducer and it sounds good enough for watching films and playing Spotify on the desktop.

My intended future hifi is a single Lowther speaker cabinet driven by a modestly powered single ended tube triode amplifier.

Long live monaural!

/Acey


There's a saying in real-time audio processing that "the ears don't blink"


When walking along the fence on the right side of this street

[1] https://www.google.de/maps/@53.5546251,9.8651002,3a,75y,192....

it feels like I could count every single echo of my steps from every single element of the fence if I could count fast enough. Similarly when riding my bicycle slowly down there (rarely, because it's downhill, so I'm usually doing 50 kph there) and ringing my bell or clicking my tongue.

edit: I mean, I'm hearing the difference in the delay depending on the distance.


I'm confused about why the methodology that they talk about is innovative. They talk about shifting when a sound event of a given frequency arrives, relative to another sound. But with digital audio, esp that you can prepare in advance, isn't this pretty simple? Like, given signals A and B, apply a band-pass filter to A (for the desired frequency window), truncate off the first k samples based on the time shift of interest, and add to B. Even if you do need it in real time, why isn't this equivalent to delaying B by k samples?

I must be missing something, because the researcher makes this sound like a cool new technique.


Hearing predators approaching from behind is the evolutionary justification for this trait.

Oculus, and other 3D platforms, utilize HRTFs (head-related transfer functions) with subtle millisecond phase shifts to localize objects.


I was playing with my synthesiser the other day and realized what I was really doing is using my eardrum to explore the search space of oscillation patterns, and their combinatorial sets.

Music in this sense is the class of all time series of combinations of such patterns that we find resonating, as we are able to encode and decode emotions and feelings from these sounds (as well as appreciation for them).

I wonder what other cool things we can do with this signal processing ability of ours as we enter the age of brain computer interfaces and psychedelics.


VLC and adding/removing the sound delay. Not half-millisecond timing, but man it got super annoying +/-5ms.


I initially thought about synchronisation with video too (aka lip sync). However, I don't think they are talking about video but merely whether a difference is detectable (rather than acceptable). I suspect the threshold for lip sync acceptability is a lot higher than what they measured here. I would have thought the threshold was higher than 5ms, but I haven't done any rigorous testing.



Thanks, that's really interesting, especially how video lag is more detectable than audio lag. I recently set up a display with a 130ms lag and this explains why it was so bad before I corrected the audio!


Delay compared to what? Video? I doubt normal stereo audio files are out of sync.


The article is quite fluffy, but from what I gather it is about the difference in arrival time when a sound is perceived through multiple filters for different frequency bands, each of which has a programmable delay.

One possible application for something like this would be to design a loudspeaker where the time-of-flight of the soundwave travelling from a tweeter to the listener would be the same as the same sound sent from a midrange speaker. Most designs do not adjust for such differences.

The more interesting version - to me, at least - would be comparing the two arrivals at your left and right ears, which can resolve very small fractions of that, even with relatively high-pitched sounds, allowing for sound source location based on phase.


> One possible application for something like this would be to design a loudspeaker where the time-of-flight of the soundwave travelling from a tweeter to the listener would be the same as the same sound sent from a midrange speaker. Most designs do not adjust for such differences.

Most designs, true. Very high-end speakers do sometimes incorporate all-pass filters in parts of their crossovers to compensate for the set-back of the apparent sound source in woofers. But very few people tilt their speakers so that the distance from the midrange and woofer to their ears is exactly the same as that from the tweeter at the listening position, and getting complicated crossovers to be reliable at 100 watt power levels is very expensive.

Time-of-flight compensation is trivial in the digital domain, so some active crossovers (which operate on line level signals, before the power amplifiers) also do this. The downside, of course, is needing one power amplifier per speaker cone, not just one per box, and the extra wires if using separate amplifier blocks rather than active speakers.

The consensus seems to be that it's not worth the trouble in ordinary listening situations.
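
The compensation itself is a one-liner once the geometry is known; a sketch with made-up distances:

    C_SOUND = 343.0   # m/s, speed of sound at roughly room temperature
    FS = 48_000       # sample rate of the digital crossover

    def alignment_delay_samples(d_near_m, d_far_m, fs=FS, c=C_SOUND):
        """Samples of delay to add to the nearer driver so both wavefronts
        arrive at the listening position at the same time."""
        return int(round((d_far_m - d_near_m) / c * fs))

    # e.g. tweeter 2.48 m from the ear, woofer effectively 2.52 m away
    print(alignment_delay_samples(2.48, 2.52), "samples")  # ~6 samples (~0.12 ms)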


The one-amp-per-driver setup is actually quite common in monitor speakers.


I think most speaker setups have a microphone that you place where the listener is to calibrate. Don't know if that involves delay measurements.


There was the Philips concept called MFB which integrated an accelerometer into the low frequency speakers in order to match the desired waveform with the one that actually went out. Quite a neat little concept, they are still somewhat popular in the HiFi scene.

The microphone systems mostly calibrate for frequency loss across the spectrum. Time-of-flight adjustments I haven't seen, but it's quite possible they're out there somewhere; they would require a pretty nifty delay mechanism, or a digitization step and then turning the signal back into analog again, which would probably cause more problems than it solved.


This is news? :o


> This work demonstrates how the group-delay response of headphones and loudspeakers can be perceptually tested, and leads to a better understanding of how audio systems should be equalized to avoid audible group-delay distortion.


I find all of this hilarious, as what I've learned about audio production is that delays are used as equalizers - the comb filtering effect used in practice. Not just on analogue devices; I was told once by (AFAIR) the DrumGizmo devs that it's done digitally as well!

Btw, DrumGizmo is an amazing open-source drum plugin that you mix like you would mix a real drum kit.


You don't want your loudspeakers to have their own distortions.

You want to reproduce the distortions the audio production made, without adding any of your own.


Isn't this how we hear directionally? 300m/s means 150mm distance. Ears are only about 250mm apart, so if you couldn't hear a 0.5ms diff you'd have no idea where sound was coming from right?


A timing difference is not the only way to tell from which side a sound came. Since the ears are at opposite sides of the head and the external ears have particular (but unchanging [0]) shapes, a sound will be changed in a different but predictable way depending on the direction it comes from. This effect also helps with detecting whether a sound comes from the front/back or top/bottom of the listener.

https://en.wikipedia.org/wiki/Head-related_transfer_function

[0] Change the environment around the ear a bit, e.g. by putting a hand relatively close to the external ear and listen how doing that changes sounds.


Thanks! That was a good read!


For some reason I thought this was about reaction time. Of course you can detect small variations in time. Even visually the eye will see differences between a light on for 1ms and one on for 10ms.


Something I've been wondering is, assuming that the brain works at a couple dozen hertz frequency, how can we perform any sub-10ms task?


Wouldn't the brain send the signals pre-emptively such that they trigger the action at the correct time, accounting for delays? I don't think it's possible to perform any conscious action that quickly, but if you've trained to perform some pattern, your brain would learn to compensate for any processing delays.


There goes my ambition of syncing musicians remotely. I guess introducing a little bit of delay and using GNSS clocks would sort of help.


Given the speed of sound, a 0.5 ms delay occurs when you talk to anybody more than about half a foot away.



