VAD support (no transfer during silence)

Hey,

I have searched everywhere but could not find whether it is possible to use voice activity detection with a volume threshold in mediasoup. I tried detecting volume with an AudioWorklet and triggering pause on the stream, but when the producer is paused the stream stops, so it is no longer possible to detect volume. And if I use two streams (one for sending, a second one for detection), the first couple of chunks after the resume() call are not transferred, so the beginning of the audio is cut off. I need about 100 users connected, and their data transfer must be paused when not needed in order to handle that many users over WebRTC.

Is VAD supported, or is there no way to pause the producer client-side during silence?

Thanks for any info.

I just found out there is a disableTrackOnPause: true option which keeps the track enabled while the producer is paused, so the AudioWorklet can keep detecting volume and post pause/resume events to the main JS thread via this.port.postMessage. That way there should be only minimal delay from the message (although I still think it is going to miss the first chunk because of the delay, ~5 µs, caused by the message).
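
For reference, roughly what I mean; a minimal sketch, assuming the producer already exists (the processor name and the 0.01 threshold are placeholders I picked, not anything from mediasoup):

```js
// vad-processor.js — runs in the AudioWorklet thread
class VadProcessor extends AudioWorkletProcessor {
  constructor() {
    super();
    this.speaking = false;
  }

  process(inputs) {
    const channel = inputs[0][0];
    if (!channel) return true;

    // RMS volume of this 128-sample render quantum
    let sum = 0;
    for (let i = 0; i < channel.length; i++) sum += channel[i] * channel[i];
    const rms = Math.sqrt(sum / channel.length);

    const speaking = rms > 0.01; // placeholder threshold
    if (speaking !== this.speaking) {
      this.speaking = speaking;
      this.port.postMessage(speaking ? 'resume' : 'pause');
    }
    return true; // keep the processor alive
  }
}

registerProcessor('vad-processor', VadProcessor);
```

```js
// main thread — `stream` is the mic MediaStream, `producer` the mediasoup-client Producer
await audioContext.audioWorklet.addModule('vad-processor.js');
const vadNode = new AudioWorkletNode(audioContext, 'vad-processor');
audioContext.createMediaStreamSource(stream).connect(vadNode);
vadNode.connect(audioContext.destination); // keeps the node pulled; it outputs silence

vadNode.port.onmessage = ({ data }) =>
  data === 'resume' ? producer.resume() : producer.pause();
```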

Is this approach correct or am I missing something?

You could clone the track instead and let the AudioWorklet work on the cloned one, without resorting to disableTrackOnPause: true.
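
Something like this, as a sketch (vadNode being the AudioWorkletNode from the earlier snippet):

```js
// Analyse a clone instead of the produced track: pausing the producer
// disables the original track but leaves the clone running.
const clone = track.clone();
const source = audioContext.createMediaStreamSource(new MediaStream([clone]));
source.connect(vadNode);
```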

@snnz's approach is the one I use. Clone the track, connect it to an AnalyserNode and use that to measure volume. All of this happens on the client side; you don't need to involve the server at all. You also don't need mediasoup for it; it's all Web Audio API based.
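
A minimal sketch of what I mean, assuming micTrack is the microphone track and producer the mediasoup-client Producer (the 0.01 threshold and 60 Hz interval are just values I use, not requirements):

```js
const clone = micTrack.clone();
const source = audioContext.createMediaStreamSource(new MediaStream([clone]));
const analyser = audioContext.createAnalyser();
source.connect(analyser);

const samples = new Float32Array(analyser.fftSize);

setInterval(() => {
  analyser.getFloatTimeDomainData(samples);

  // RMS volume of the most recent window
  let sum = 0;
  for (const s of samples) sum += s * s;
  const rms = Math.sqrt(sum / samples.length);

  if (rms > 0.01 && producer.paused) producer.resume();
  else if (rms <= 0.01 && !producer.paused) producer.pause();
}, 1000 / 60); // ~60 analyses per second
```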

Are you sure I don't have to use mediasoup? If I don't call pause(), data is being transferred all the time; with 10 users it is about 2 Mbit/s of bandwidth. Also, even if I connect an AnalyserNode, I still need an AudioWorklet to process every chunk, and the only way to communicate with workers is postMessage, which causes delay, so pause/resume won't happen fast enough for the first chunk of speech and it gets lost. When you say a short word quickly, the other consumers won't hear it.

And setInterval is not fast enough, so I would have to buffer old chunks and then somehow prepend them to the producer stream when speaking starts.

Oh, I see what you're trying to do. Yeah, it's a challenge; you do need to use pause() client-side. You maybe don't need an AudioWorklet for performance reasons (I do this at a rate of 60 analyses per second and it works fine in the main JavaScript thread, even on mobile devices), but that's not really your limiting issue.

You could look into buffering the audio on the client side, but that would introduce delay and make your app non-realtime. That would be purely Web Audio API based. There's also an option with mediasoup where you can process individual RTP packets server-side, but then you wouldn't ever be able to pause.
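
If you ever want to explore that route, it goes through a DirectTransport on the server; a rough sketch, assuming router and audioProducer already exist:

```js
// server side — receive a producer's raw RTP in the Node.js process
const directTransport = await router.createDirectTransport();

const consumer = await directTransport.consume({
  producerId: audioProducer.id,
  rtpCapabilities: router.rtpCapabilities
});

consumer.on('rtp', (rtpPacket) => {
  // rtpPacket is a Buffer containing a full RTP packet
  handlePacket(rtpPacket); // hypothetical handler
});
```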

Are you sure you want to structure it like this? With 100 people connected simultaneously, it's basically guaranteed that people will be talking over each other inadvertently. Plus, people generally want the option to mute themselves. Have you considered doing push-to-talk?

Another thing to consider is enabling DTX in the Opus codec. It reduces bandwidth substantially during periods of silence. You might be able to scale your bitrate down too if you're only transmitting speech: you get excellent results at only 40 kbps and can go as low as 8 kbps while retaining reasonable quality.

https://developer.mozilla.org/en-US/docs/Web/Media/Formats/WebRTC_codecs
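
In mediasoup-client, DTX can be requested per producer via codecOptions; a sketch, assuming sendTransport and micTrack exist (the bitrate is just the 40 kbps example from above):

```js
const producer = await sendTransport.produce({
  track: micTrack,
  codecOptions: {
    opusDtx: true,                // transmit (almost) nothing during silence
    opusMaxAverageBitrate: 40000  // ~40 kbps
  }
});
```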

I am building VoIP for in-game integration: distance attenuation, 3D spatial audio, and effects (reverb, muffling, underwater). So everyone is connected to the same room, but they are muted for each other unless within the required range and loud enough. So I think this is the best approach; maybe I am wrong, but I see it as the most efficient.

Thanks for the info about DTX.

It is just the postMessage delay that sometimes causes the first non-silent chunk to be skipped, maybe 1 in 30 resume() calls. Since it is quite rare, perhaps I am just going to leave it that way and recommend a push-to-talk feature to users.

Consider trying to do the audio processing in the main thread; I don't think it's terribly CPU-intensive, but you won't know until you experiment. I took a look at AudioWorklet on MDN earlier and it's not (yet?) supported in Safari, which has substantial market share and might cause some issues for you.