Ask HN: How do browsers isolate internal audio from microphone input?
by dumbest
I've noticed an interesting feature in Chrome and Chromium: they seem to isolate internal audio from the microphone input. For instance, when I'm on a Google Meet call in one tab and playing a YouTube video at full volume in another tab, the video’s audio isn’t picked up by Google Meet. This isolation doesn’t happen if I use different browsers for each task (e.g., Google Meet on Chrome and YouTube on Chromium).
Does anyone know how Chrome and Chromium achieve this audio isolation?
Given that Chromium is open source, it would be helpful if someone could point me to the specific part of the codebase that handles this. Any insights or technical details would be greatly appreciated!
The way this works (and I'm obviously taking a high level view here) is by comparing what is being played to what is being captured. There is an inherent latency between what is called the capture stream (the mic) and the reverse stream (what is being output to the speakers, be it people talking or music or whatever), and by finding this latency and comparing, one can cancel the music from the speech captured.
Within a single process, or tree of processes that can cooperate, this is straightforward (modulo the actual audio signal processing which isn't) to do: keep what you're playing around for a few hundred milliseconds, compare to what you're getting in the microphone, find correlations, cancel.
If the processes aren't related there are multiple ways to do this. Either the OS provides a capture API that does the cancellation, which you can use (this is what happens e.g. on macOS for Firefox and Safari). The OS knows what is being output. This is often available on mobile as well.
Sometimes (Linux desktop, Windows) the OS provides a loopback stream: a way to capture the audio that is being played back, and that can similarly be used for cancellation.
If none of this is available, you mix the audio output and perform cancellation yourself, and the behaviour you observe happens.
Source: I do that, but at Mozilla and we unsurprisingly have the same problems and solutions.
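A minimal Python sketch of the "keep what you played, find the latency, cancel" idea described above. It assumes the echo is a single delayed, scaled copy of the output, which a real room is not, and it is not taken from any browser's code. The function names and the synthetic signals are made up for illustration.

```python
import random

def estimate_delay(reference, captured, max_lag):
    """Return the lag (in samples) where `captured` correlates best with `reference`."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        n = min(len(reference), len(captured) - lag)
        # Cross-correlation at this lag: reference[i] against captured[i + lag].
        score = sum(reference[i] * captured[i + lag] for i in range(n))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def estimate_gain(reference, captured, lag):
    """Least-squares estimate of how loudly the reference shows up at `lag`."""
    n = min(len(reference), len(captured) - lag)
    num = sum(reference[i] * captured[i + lag] for i in range(n))
    den = sum(reference[i] ** 2 for i in range(n))
    return num / den

def cancel(reference, captured, lag, gain):
    """Subtract the delayed, scaled reference from the captured signal."""
    out = list(captured)
    for i, r in enumerate(reference):
        if i + lag < len(out):
            out[i + lag] -= gain * r
    return out

# Synthetic data (assumption): noise stands in for both the played audio and the talker.
random.seed(0)
played = [random.uniform(-1, 1) for _ in range(2000)]        # reverse stream
speech = [0.1 * random.uniform(-1, 1) for _ in range(2000)]  # what we want to keep
true_lag, true_gain = 37, 0.6
mic = list(speech)                                           # capture stream
for i, p in enumerate(played):
    if i + true_lag < len(mic):
        mic[i + true_lag] += true_gain * p

lag = estimate_delay(played, mic, max_lag=200)
gain = estimate_gain(played, mic, lag)
cleaned = cancel(played, mic, lag, gain)
```

A real canceller replaces the single lag and gain with an adaptive filter, since the echo path has many reflections and changes over time.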
This reminds me of:
>The missile knows where it is at all times. It knows this because it knows where it isn't. By subtracting where it is from where it isn't, or where it isn't from where it is (whichever is greater), it obtains a difference, or deviation
the missile is eepy https://youtu.be/Csp_OABIsBM
Up to a point that text makes a lot of sense for describing a PID controller, which is a form of control that only really looks at error and tries to get it to zero.
>a PID controller, which is a form of control that only really looks at error
As the name implies the PID controller relies on proportional, integral and derivative information about the error. What you mean is a purely P controller, which just relies on the error.
Missiles are also not guided by a PID controller, that would be silly. They (or the guidance computer in the airplane) have to take into account the trajectory of the target and guide the missile in a way to intercept that target, which is not something you can accomplish with just a PID controller.
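To make the P versus PID distinction concrete, here is a small Python sketch. The plant (a pure integrator with a constant disturbance), the gains, and the names `PID` and `simulate` are all assumptions chosen for illustration; this has nothing to do with real missile guidance.

```python
class PID:
    """Textbook PID: output = kp*error + ki*integral(error) + kd*d(error)/dt."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt):
        self.integral += error * dt
        # No derivative on the first step, since there is no previous error yet.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def simulate(controller, setpoint=1.0, disturbance=-0.5, dt=0.01, steps=5000):
    """Euler-integrate the toy plant x' = u + disturbance and return the final x."""
    x = 0.0
    for _ in range(steps):
        u = controller.step(setpoint - x, dt)
        x += (u + disturbance) * dt
    return x

p_only = simulate(PID(kp=2.0, ki=0.0, kd=0.0))   # purely proportional
pid = simulate(PID(kp=2.0, ki=1.0, kd=0.05))     # all three terms
```

With these made-up numbers the P-only loop settles short of the setpoint because it needs a nonzero error to fight the disturbance, while the integral term removes that offset.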
It wouldn’t surprise me at all if early heat seeking missiles used just a PID controller, since a big part of what makes PID attractive is the ability to implement it with electrical components. Take a pair of IR photodiodes and wire them such that their difference is the error of your PID control, wire the output of the PID to the steering on your missile, and suddenly you have a missile that points at the nearest IR target (on one axis of course).
Modern missiles do better than this, but a missile wired this way with a proximity fuse would hit the target a reasonable amount of the time. Not silly at all if you haven’t invented microcontrollers yet.
"Although proportional navigation was apparently known by the Germans during World War II at Peenemünde, no applications on the Hs. 298 or R-1 missiles using proportional navigation were reported [2]. The Lark missile, which had its first successful test in December 1950, was the first missile to use proportional navigation. Since that time proportional navigation guidance has been used in virtually all of the world’s tactical radar, infrared (IR), and television (TV) guided missiles [3]. The popularity of this interceptor guidance law is based upon its simplicity, effectiveness, and ease of implementation. Apparently, proportional navigation was first studied by C. Yuan and others at the RCA Laboratories during World War II under the auspices of the U.S. Navy [4]."
From Tactical and Strategic Missile Guidance Sixth Edition.
(To preempt confusion: proportional navigation isn't a simple P controller; the missile is seeking an intercept path.)
>Not silly at all if you haven’t invented microcontrollers yet.
Apparently the Germans did try that during WW2, but such a missile can not be effective, outside of e.g. bomber intercept.
The "magic" of the AIM-9 Series is that it could achieve this without micro controllers.
>The "magic" of the AIM-9 Series is that it could achieve this without micro controllers.
The real magic were the fearless carrier pigeons and self-less kamikaze fighter pigeons missileers.
PID or even just PI are often used to control missile airframe roll rate. Pitch and yaw may be more advanced systems, sure.
I've observed that hacker culture exists because DARPA funded institutions like MIT for AI research, because the military wanted the missile to know where it is.
This is almost weirdly philosophical. I've been thinking about this all morning.
For a little more context on negative feedback to those who want to know more (I believe this is what you're referring to?)
Here's a short historical interview with Harold Black from AT&T on his discovery/invention of the negative feedback technique for noise reduction. It's not super explanatory but a nice historical context: https://youtu.be/iFrxyJAtJ7U?si=8ONC8N2KZwq3Jfsq
Here's a more indepth circuit explanation: https://youtu.be/iFrxyJAtJ7U?si=8ONC8N2KZwq3Jfsq
IIRC the issue was AT&T was trying to get cross-country calling, but to make the signal carry further you needed a louder signal. Amplifying the signal also amplified the distortion.
So Harold came up with this method that ultimately allowed enough signal reduction to allow calls to cross the country within the power constraints available.
For some reason I recall something about Denver being a cutoff point for transmission before the signal was too degraded... But I'm too old and forgetful so I could be misremembering something I read a while ago. If anyone has more specific info/context/citations that'd be great, since this is just "hearsay" from memory, but I think it's something like this.
It just seems more logical for the OS to do that, rather than the application. Basically every application that uses microphone input will want to do this, and will want to compensate for all audio output of the device, not just its own. Why does the OS not provide a way to do this?
> Basically every application that uses microphone input will want to do this
The OS doesn't have more information about this than applications and it's not that obvious whether an application wants the OS to fuck around with the audio input it sees. Even in the applications where this might be the obvious default behavior, you're wrong - since most listeners don't use loudspeakers at all, and this is not a problem when they wear headphones. And detecting that (also, is the input a microphone at all?) is not straightforward.
Not all audio applications are phone calls.
>The OS doesn't have more information about this than applications
the OP pointed out that this only works if he uses a browser monoculture
the OS does have more information than that, it can know what is being played by any/all apps, and what is being picked up by the mic
The "OS" isn't special here, apps can listen to system audio.
fwiw, you only need to know anything about outputs if you are doing AEC. Blind source separation doesn't have that problem and can just process the input stream.
> The "OS" isn't special here, apps can listen to system audio.
Even if this is true, it's easy to imagine such functionality being exploited by malicious apps as a security and/or privacy concern, particularly if the user needs a screen reader.
It definitely makes sense for the operating system to provide this functionality.
The OS can have multiple sound input devices for the application to choose from, "raw" and "fuckarounded with"
That doesn't make sense in the context of default devices. macOS's AVKit (or is it CoreAudio?) APIs that configure the streams created on the device make way more sense, since it's a property of the audio i/o stream and not the devices.
Assuming this isn't parody, the OS doesn't have to do it automatically. Having an application grab a microphone stream and say to the OS "take this and cancel any audio out streams" might be pretty useful.
I agree with that, but the point I'm trying to make is that audio i/o handling is pretty complicated and application specific. The claim I'm challenging is "any app that wants microphone input wants this", which is dubious. I'd say only a small number of the audio applications that care about mic input want background noise reduced, and it makes sense for this to be configured per input stream.
Really what would be nice is if every audio i/o backend supported multiplex i/o streams and you could configure whether or not to cancel audio based on that set of streams but not all output (because multi output-device audio gets tricky).
I'm honestly having trouble thinking of a case where I wouldn't want this.
I'm sure there are some niche cases, but in those cases, the application can specifically request that the OS turn off audio isolation.
The technique introduces latency and distortion because it's subtracting an estimate of sound that's traveling/reflecting in the listening environment, which is imperfect and involves the speed of sound.
That latency is within the tolerance that users are comfortable with for voice chat, and much less than video processing/transfer is introducing for video calls anyway, so it's a very obvious win there. Especially since those users are most interested in just picking out clear words using whatever random mic/speaker configuration happens to be most convenient.
But musicians, for instance, are much more interested in minimizing the delay between their voice or instrument being captured and returned through a monitor, and they generally choose a hardware arrangement that avoids the problem in the first place. And that's not really a niche use case.
Live video or audio chat is basically the only time you do want this. Granted, that’s a big chunk of microphone usage in practice, but any time you are doing higher fidelity audio recording and you have set up the inputs accordingly you absolutely do not want the artefacts introduced by this cancellation. DAWs, audio calibration, and even live audio when you’ve ensured the output cannot impact the recording all would want it switched off.
Default on vs default off is really just an implementation detail of the API though, as you say.
> Live video or audio chat is basically the only time you do want this.
If I'm recording a voice memo, or talking to an AI assistant, I would want this. Basically everything I can imagine doing with a PC microphone outside of (!) professional audio recording work.
That last case is important and we agree there needs to be a way to turn it off. I think defaults are really important though.
My colleague works in a very quiet house, and has no need for noise cancelling. Sometimes, he has it turned on by accident, and the quality is much worse - his voice gets cut out, and volume of his voice goes up and down.
As you say, as long as either option is available, the only question is what the default should be.
I gave an example, when I'm wearing headphones I don't want this enabled. If I'm recording anything, I probably don't want it on either. If I'm using a virtual output, I don't want AEC to treat that as a loudspeaker.
Every normal application already does it through the os because most do not care about this at all.
Music player, browser, games, video player...
Audio is not app specific
The only application where this is true is audio where you want full control and low latency.
I find your take very weird.
> Why does the OS not provide a way to do this?
Some do.
But you need to have a strong-handed OS team that's willing to push everybody towards their most modern and highly integrated interfaces and sunset their older interfaces.
Not everybody wants that in their OS. Some want operating systems that can be pieced together from myriad components maintained by radically different teams, some want to see their API's/interfaces preserved for decades of backwards compatibility, some want minimal features from their OS and maximum raw flexibility in user space, etc
> Some do
Which Operating systems do this?
macOS has done this in recent versions. Similarly it will do all the virtual background and bokeh stuff for webcams outside of the (typically horrific) implementations in video conferencing apps.
Others have already pointed out macOS/Linux, here's Windows:
As others have noted, this is trivial for most macOS and iOS apps to opt in to.
Frankly, I imagine it's also available at the system level on Windows (and maybe Android and Linux) but probably only among applications that happen to be using certain audio frameworks/engines.
It doesn't seem to me that module-echo-cancel in Pulseaudio completely meets the requirements here (only one source), but it looks close, and seems in general like where you would implement something like this.
1. https://www.freedesktop.org/wiki/Software/PulseAudio/Documen...
I think module-null-sink and module-loopback could be used to create a virtual source which combines multiple sources, though the source/sink thing makes my head spin. Or, more simply, I suppose using the loopback of whatever audio output device does the combination (and the same mixing) for you, if you play all audio through one output device (which is most likely)?
> though the source/sink thing makes my head spin
Wait, what other audio paradigms are there?
Dunno, I meant more that it's an unintuitive way of thinking about a data-flow graph to me, moreso when introducing virtual sinks/sources.
so something for systemD then?
Weird macrosensivity
Huh? What's the aggression in characterizing the diverse landscape of operating systems and the users/developers who very reasonably may prefer each?
I think it's very good that we have so many options of what an operating system and its vendors/developers might prioritize, and that these differences in priority have consequential impact on how software gets built on each.
On mac/iOS, you get this using the AVAudioEngine API if you set voiceProcessingEnabled to true on the input node. It corrects for audio being played from all applications on the device.
My first thought in reading the question was “if your browser is doing that, your platform architecture has… some room for improvement”.
Having room for nontrivial improvement is, to be fair, a normal state of affairs for platforms.
This has certainly made conference calls significantly more usable. I feel like it must have come around during 2020, because pre-covid I would go around BEGGING everyone I did calls with to get a headset, because otherwise everyone else's voice would echo back through their microphone 0.75s later. I recently realized I could just literally do calls out loud on my laptop mic and speaker and somehow it works. Nice to know why!
This assumes there is an OS-managed software mixer sitting in the middle of all audio streams between programs and devices. Historically, that wasn't the case, because it would introduce a lot of latency and jitter in the audio. I believe it is still possible for a program to get exclusive access to an audio output device on Windows (WASAPI) and Linux (ALSA).
Historically, true, but nowadays it's pretty much standard for all the big OS.
Being able to get exclusive access/bypass the system via certain means (ASIO would be another) doesn't make it go away.
The OS doesn't know that the application doesn't want feedback from the speaker, and not 100% of applications will want such filtering. I think a best practice from the OS side would be to provide it as an optional flag. (Default could be on or off, with reasonable possibility for debate in either direction, but an app that really knows what it wants should be able to ask for it.)
There is a third place: a common library that all the apps use. If it is in the OS then it becomes brittle. If there's an improvement in the technology which requires an API change, that becomes difficult without keeping backwards compatibility or the previous implementation forever. Instead, there would be a newer generation common library which might eventually replace the first but only if the entire ecosystem chooses to leave the old one behind. Meanwhile there'd be a place for both. Apps that share use of a library would simply dynamically link to it.
This is the way things usually work in the Free Software world. For example: need JPEG support? You'll probably end up linking to libjpeg or an equivalent. Most languages have a binding to the same library.
Is that part of the OS? I guess the answer depends on how you define OS. On a Free Software platform it's difficult to say when a given library is part of the OS and when it is not.
> If it is in the OS then it becomes brittle
My experience is the opposite. When it's part of the OS, it's stable and you just say "you need OS version X or better" and it will just work. When it's a library, you eventually end up in dependency hell of deprecated libraries and differing versions (or worst case, the JavaScript ecosystem when the platform provides almost nothing and you get npm).
Depends on the OS I guess. When it's established enough, all distributions carry a high enough version that it's not an issue. If it's not established enough, I'd argue that it isn't ready to be part of an "OS" anyway (regardless of the definition of that word).
I suppose the OS probably makes something like this available: when using VoiceOver on Mac and presenting in Teams, by default only the mic comes into Teams, and you need to do something to share the other processes' audio.
That's mac of course but in my experience Windows is much more trusting of what it gives applications access to so I suppose the same thing is available there.
How sure are you that basically every application wants this? So should there be a flag at the OS level for enabling the cancellation? How do you control that flag?
> How do you control that flag?
It would be trivial to pass that flag in whatever API the application calls to request access to the microphone stream.
Did you just invent yet another linux audio stack?
At the lowest level it's a Fourier transform over a system (your room, the echo chamber, whose response is known from some test sound), and the expected output going through that transform on its way to the mic is subtracted. Most SoCs and machines have dedicated systems for that. The very same chip produces the echo of the surroundings.
Is there any way to apply this outside the browser? Like, is there a version of this that can be used with Pulseaudio?
To spare others from googling:
If you're still on pulseaudio for some reason, it ships with a similar module named "module-echo-cancel":
Huh, thanks. I was interested in this probably 6-8 years ago, and when I went digging the stackoverflow answer mentioned elsewhere in this thread [0] was as far as I got. I guess the tech has progressed since then.
[0] https://stackoverflow.com/questions/21795944/remove-known-au...
It was there 8 years ago.
It's called Acoustic Echo Cancellation. An implementation is included in WebRTC, which ships in Chrome. A FIR filter (1D convolution) is applied to what the browser knows is coming out of the speakers, and this filter is continually optimized to cancel out as much as possible of what's coming into the microphone (this is a first approximation, the actual algorithm is more involved).
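As a rough illustration of "a FIR filter that is continually optimized", here is a normalized LMS (NLMS) sketch in plain Python. NLMS is swapped in here as a simple textbook adaptive filter; it is not claimed to be what WebRTC actually runs. The room response, the step size and the function name are made-up assumptions, and the example deliberately ignores near-end speech (double-talk).

```python
import random

def nlms_echo_cancel(far_end, mic, taps=16, mu=0.5, eps=1e-8):
    """Adapt FIR weights so that filter(far_end) tracks the echo present in `mic`.

    Returns (residual, weights), where residual is mic minus the estimated echo.
    """
    w = [0.0] * taps
    history = [0.0] * taps            # recent far-end samples, newest first
    residual = []
    for x, d in zip(far_end, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(wi * hi for wi, hi in zip(w, history))
        e = d - echo_estimate         # what is left after cancellation
        norm = sum(h * h for h in history) + eps
        step = mu * e / norm          # normalized step keeps adaptation stable
        w = [wi + step * hi for wi, hi in zip(w, history)]
        residual.append(e)
    return residual, w

# Synthetic test signal (assumption): white noise played through a short "room".
random.seed(1)
far_end = [random.uniform(-1, 1) for _ in range(4000)]
room = [0.0, 0.0, 0.5, 0.0, -0.3, 0.1]   # hypothetical echo path impulse response
mic = [sum(room[k] * far_end[n - k] for k in range(len(room)) if n - k >= 0)
       for n in range(len(far_end))]

residual, w = nlms_echo_cancel(far_end, mic)
early = sum(e * e for e in residual[:200])    # energy before the filter has adapted
late = sum(e * e for e in residual[-200:])    # energy once it has converged
```

After adaptation the learned weights approximate the hypothetical room response, which is why the residual echo energy drops.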
To spare a search: https://webrtc.googlesource.com/src/+/refs/heads/main/module...
Remember that convolution is multiplication in the frequency domain, so this also handles different responses at different frequencies, not just delays
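A quick way to check the "convolution is multiplication in the frequency domain" point, using a naive DFT in plain Python. The signal and filter values are arbitrary, and the equivalence shown is for circular convolution of equal-length (zero-padded) sequences.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for k in range(n))
           for m in range(n)]
    return [v / n for v in out] if inverse else out

def circular_convolve(a, b):
    """Direct circular convolution of two equal-length sequences."""
    n = len(a)
    return [sum(a[k] * b[(m - k) % n] for k in range(n)) for m in range(n)]

signal = [1.0, 2.0, 0.0, -1.0, 0.5, 0.0, 0.0, 0.0]
fir = [0.5, 0.0, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0]   # made-up filter, zero-padded

time_domain = circular_convolve(signal, fir)
# Multiply bin by bin, then transform back; imaginary parts are rounding noise.
spectrum_product = [s * f for s, f in zip(dft(signal), dft(fir))]
freq_domain = [v.real for v in dft(spectrum_product, inverse=True)]
```

Each frequency bin of the filter is a complex number, so the filter can scale and phase-shift every frequency differently, which is the sense in which it models more than a plain delay.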
Search for the compilation flag "CHROME_WIDE_ECHO_CANCELLATION" in the Chromium sources, and you will find your answer.
Can't tell you anything else due to NDAs.
It's kind of nuts that (I'm assuming) the source code is publicly available but the developers who wrote it can't talk about it.
(I realize this situation isn't up to you and I appreciate that you chimed in as you could!)
This is super common.
When I worked at Mozilla, most stuff was open, but I still couldn't talk about stuff publicly because I wasn't a spokesperson for Mozilla. Same at OpenDNS/Cisco, or at Fastly, and now at Amazon. Lots of stuff I can talk about, but I generally avoid threads and comments about Amazon, or if I do, it's strictly to reference public documentation, public releases, or that sort of thing.
It's easier to simply not participate, link a document, or say no comment than it is to cross reference what I might say against what's public, and what's not.
Thanks, I see it's a user toggle too chrome://flags#chrome-wide-echo-cancellation or edge://flags/#edge-wide-echo-cancellation . All these years I was praising my macbook, thinking it was the hardware doing the cancellation, but it was Chromium the whole time.
Side note, this can also cause a bit of difficulty in some situations apparently as seen in a HN post from a few months ago that didn’t get much attention
> I've been working on an audio application for a little bit, and was shocked to find Chrome handles simultaneous recording & playback very poorly. Made this site to demo the issue as clearly as possible
Not sure if it's the whole story, but the latest response in the linked Chrome ticket seems to indicate that the APIs were used incorrectly by the author
> <he...@google.com>
> Status: Won't Fix (Intended Behavior)
> Looking at the sample in https://chrome-please-fix-your-audio.xyz, the issue seems to be that the constraints just aren't being passed correctly [...]
> If you supply the constraints within the audio block of the constraints, then it seems to work [...]
> See https://jsfiddle.net/40821ukc/4/ for an adapted version of https://chrome-please-fix-your-audio.xyz. I can repro the issue on the original page, not on that jsfiddle.
The technical term that you're looking for is acoustic echo cancellation[1].
It's a fairly common problem in signal processing, and comes up in "simple" devices like telephones too.
[1] https://www.mathworks.com/help/audio/ug/acoustic-echo-cancel...
I seem to remember analog telephone lines used a very simple but magic-looking transformer-based circuit of some sort for this purpose. Presumably that worked because they didn’t need to worry about a processing delay?
That's slightly different. Analog phone lines suffer from 'line echo', which is an echo generated by the phone line itself - because the same pair of wires is used for signals traveling in both directions. If you're thinking of the 'hybrid', which connects the phone line to the speaker and mic, that matches the impedance of each of the three pairs of wire (to the mic, speaker, and phone line) to avoid generating echo in the first place. But they weren't perfect (and echo could be generated at other points in the line) so later, digital electronics were used to do first "echo suppression" (just zero any signal below a certain magnitude) and then echo cancellation, which is very similar to acoustic echo cancellation (an AEC can probably do anything an LEC can do, but the LEC is less capable).
I'm not aware of anyone doing echo cancellation using an analog circuit, but that doesn't mean no-one did. I guess it's theoretically possible but I don't see how the adaptation could work.
There used to be a transformer in phones for side tone: a small amount of what you say is piped back into the earpiece. They did this because they found people would shout if they couldn't hear their own voice. I've often wished mobiles would do this.
Google Meet uses source separation technology to denoise the audio. It's a neural net that's been trained to separate speech from non-speech and ensure that only speech is being piped through. It can even separate different speakers from one another. This technology got really good around 2021 when semi-supervised ways of training the models were developed, and is still improving :)
A side effect of echo cancellation. Browser knows what audio it is playing, can correlate that to whatever comes in through the mic, maybe even by outputting inaudible test signals, or by picking widely supported defaults.
This is needed because many people don't use headphones and if you have more than one endpoint with mic and speakers open you will get feedback galore if you don't do something to suppress it.
Have used audio a lot on windows/mac for a long time, and a bit of linux too.
I'd say it depends on the combination of the hardware/software/OS that does pieces of it on how audio routing comes together.
Generally you have to see what's available, how it can or can't be routed, what software or settings could be enabled or added to introduce more flexibility in routing, and then making the audio routing work how you want.
More specifically some datapoints:
SOUND DRIVERS: Part of this can be managed by the sound drivers on the computer. Applications like web browsers can access those settings or list of devices available.
Software drivers can let you pick what's playing on a computer, and then specifically in browsers it can vary.
CHANNELS: There are often different channels for everything. Physical headphone/microphone jacks, etc. They all become devices with channels (input and output).
ROUTING: The input into a microphone can be just the voice, and/or system audio. System audio can further be broken down to be specific ones. OBS has some nice examples of this functionality.
ADVANCED ROUTING: There are some audio drivers that are virtual audio drivers that can also help you achieve the audio isolation or workflow folks are after.
So Chrome and Chromium got this cool trick where they block internal audio from your mic. Like, you can be on a Google Meet call and blast a YouTube vid in another tab, and Meet won’t pick it up. No clue how they do it exactly, but since Chromium’s open-source, someone can probably dig into the code for the deets. If anyone knows the techy stuff behind this, spill the beans!
I think this would be part of echo cancellation: in a meeting you don't want the data from the meeting to be fed back to it. I suppose it uses all the streams from the browser then, though I think in general it would be even better to cancel out everything that comes from the speakers. Maybe it can work this way on some other platforms?
E.g. PulseAudio and Pipewire have a module for echo cancellation.
Since Chrome has the PCM data it's writing to the speaker, it can use that to remove similar sounds from a mic. That's my guess.
There's a similar question on SO: https://stackoverflow.com/questions/21795944/remove-known-au...
"echo cancellation" is what it's called, there's a few general purpose (non-ai) algos out there!
What's really interesting is I can get the algorithm to "mess up" by using external speakers a foot or two away from my computer's mic! Just that little bit of travel time is enough to screw with the algo.
Echo cancellation is often disabled if you have headphones plugged in, under the assumption that headphones won't be audible in the microphone, and it's better to disable it to avoid it degrading your microphone signal.
It might be that whatever program you're using doesn't know the difference between speakers and headphones (possibly because you're using the 3.5mm jack?)
jokes on them, some of us use the headphone jack
There's a whole subfield of "jack detection" (where the final amp stage attempts to determine what's plugged in) and probably a "jack assignment" panel in your driver software that can configure whether the OS thinks of it as "speakers" or "headphones".
Oh probably by the impedance/reactance and the DC resistance I'm guessing!
This concept is known as "Echo Cancellation" https://en.wikipedia.org/wiki/Echo_suppression_and_cancellat...
Since chrome has the data from both sources: the microphone, and the audio stream from YouTube, I imagine you can construct a filter from the impulse response of the YouTube source and then run the microphone through it
Guessing it's a feature of the WebRTC stack if it's to be found anywhere, there's always a requirement for cancelling feedback from other meeting participants
Windows 11 does this! Videos playing in the browser (Edge, tested on YouTube and LinkedIn videos) won't be heard by the meeting folks or folks on a call in Teams!
That's interesting. Following up, is there any reason why it wouldn't be possible for other browsers/applications? It seems like the operating system should be able to generally access the audio from any application
Windows 11 does that!! Videos playing on Edge won't be heard by the folks on Teams!!
You forgot to mention if you're using speakers.
You can use a multidimensional Kalman filter to do this.
You can use a multidimensional Kalman filter to do many things
But that does not really come close to answering the question.
This is something that is usually taken care of by the App that's receiving the input from the microphone (Google Meet, Teams, etc). The App breaks the audio into frequencies, and the ones that correspond to human voice ranges are accepted, and anything else is rejected. This is referred to as, for example, voice isolation, and has been turned on by default in all major meeting Apps for a little while now.
Surprised to hear that it doesn't seem to work for you when the audio is generated by a different browser, this shouldn't make a difference.
Assuming OP is correct, your last sentence implies this isn't the solution being used.
Additionally, many (citation needed) Youtube videos have people talking in them; this method wouldn't help with that.
Isolating vocals in general is significantly more difficult than just relying on frequency range. Any instrument I can think of can generate notes that are squarely in the common range of a human (see: https://www.dbamfordmusic.com/frequency-range-of-instruments...)
Was trying to informally describe the use of Fourier transformations to achieve the isolation. Success will vary depending on the situation, but ML is also used in more recent cases with more uniform end results for the particular use case.
The initial question may be specific to the way one particular browser handles things to certain degree, but the comment was also trying to communicate that it can go beyond the browser and can actually be handled by the application. However, the microphone itself can also be participating at some level if it features noise suppression or some other enhancements.
The surprise about things being different when using a separate browser comes from assuming that any audio reaching the microphone should be processed equally if using FTs (or machine learning if applicable), so the audio source shouldn't matter.
- https://www.nti-audio.com/en/support/know-how/fast-fourier-t...
- https://pseeth.github.io/public/papers/seetharaman_2dft_wasp...