Is it possible to consume only active speaker audios?

For example, if there are 500 participants in a call, it will not be feasible to consume the audio of all 499 other participants. Is there any way to consume only the audio of the 2-3 people who are currently speaking?

One way I am thinking of is to swap audio tracks based on the active speakers: on the client side I consume 2-3 audio streams, then during the call I use the audio observer to check who the active speakers are and call replaceTrack on those audio consumers accordingly. The issue I can see is that replaceTrack can take 2-3 seconds, which could cause part of the audio to go missing. Is this the right way?

The other way is to merge the audio streams on the server using ffmpeg or similar and send the mix to the client, but that would be quite expensive on the server side.

Can someone guide me on this matter?

Any update on this one? @CosmosisT @jbaudanza

Please don’t mention people in your questions. There are many people in this forum.

Yes, it’s possible to just consume audio for active speakers. Just learn and use the API of mediasoup.

Ok. I mentioned two approaches in the question; can you please tell me which one is better? I couldn’t find any other way in the documentation. If there is one, kindly mention that as well. Thanks.

Pause/resume consumers on demand.

Ok, you mean that if there are 500 participants in a call and 100 of them have audio on, I would consume the audio of those 100 participants initially, then pause each consumer while its participant is not speaking and resume it when they speak, with the help of the active speaker observer?

Do these paused consumers count toward the maximum consumer limit per worker? If 500 consumers are allowed per worker (one worker per CPU core) and 400 of the consumers on that worker are paused, can I safely create 100 more consumers on that worker?

You can create consumers already paused, and such a theoretical limit is about RTP streams actually flowing. Paused consumers count for nothing.
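
To make the pause/resume suggestion concrete, here is a minimal sketch of the bookkeeping, not a definitive implementation. It assumes you already know which producers are currently speaking (for example from an audio level observer) and only decides which consumers need to change state. `AudioConsumerLike`, `planConsumerUpdates` and `applyConsumerPlan` are hypothetical names made up for this example; the interface mirrors only the `producerId`, `paused`, `pause()` and `resume()` members of a server-side mediasoup consumer.

```typescript
// Assumed minimal shape of a consumer; a real mediasoup Consumer has more members.
interface AudioConsumerLike {
  readonly producerId: string;
  readonly paused: boolean;
  pause(): Promise<void>;
  resume(): Promise<void>;
}

interface ConsumerPlan<C> {
  toResume: C[];
  toPause: C[];
}

// Hypothetical helper: compare each consumer's current state with the set of
// speaking producers and list only the consumers whose state has to change.
function planConsumerUpdates<C extends AudioConsumerLike>(
  consumers: readonly C[],
  speakingProducerIds: ReadonlySet<string>,
): ConsumerPlan<C> {
  const plan: ConsumerPlan<C> = { toResume: [], toPause: [] };
  for (const consumer of consumers) {
    const speaking = speakingProducerIds.has(consumer.producerId);
    if (speaking && consumer.paused) {
      plan.toResume.push(consumer);
    } else if (!speaking && !consumer.paused) {
      plan.toPause.push(consumer);
    }
  }
  return plan;
}

// Hypothetical helper: apply the plan by resuming and pausing in parallel.
async function applyConsumerPlan<C extends AudioConsumerLike>(
  plan: ConsumerPlan<C>,
): Promise<void> {
  await Promise.all([
    ...plan.toResume.map((consumer) => consumer.resume()),
    ...plan.toPause.map((consumer) => consumer.pause()),
  ]);
}
```

In a mediasoup app you would typically create each audio consumer with `paused: true` in `transport.consume()`, build the set of speaking producer ids from the `volumes` event of an `AudioLevelObserver` (and pass an empty set on `silence`), then run the two helpers for each receiving peer.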

Ok thanks for your time

You can’t infer anything about someone’s audio beyond whether it is on or off. A dB reading won’t tell you anything but loudness, and loudness only tells you on/off for a speaker, not whether they are the one talking. So when you want to switch speakers, what logic would define that? What happens if more than a few users are making noise?

IMO I’d opt for just pausing/unpausing video when a speaker is loud enough, so that their video shows up on a large screen instead of among many smaller cameras.

I horizontally scale my servers, so that may not work out, and I’ll explain why.

I’d need the media servers I run to watch for and report any changes immediately, and to confirm each change with my routing server so it can relay that user’s state to the other users.

Because the other users could be on different media servers, I’d now have x-many media servers yelling at my router because 10-20 different users are constantly changing state…

Yikes…

I would probably strongly avoid this operation, or plan it carefully so that it’s efficient and you aren’t abusing your signaling server. There are ways to make it work for sure though; heck, you could try using datachannels to transmit this information between servers.

You mean the audio observer only tells us whether audio is on or off, but can’t tell whether someone is actually speaking? So the audio observer will give a positive response if someone’s audio is on, regardless of whether that person is speaking?

As per my understanding, based on the documentation I went through a month ago, it will actually tell us whether someone is speaking or not. Can you confirm?

I will not go with the replaceTrack approach I mentioned in the question. I think the better way is what Inaki Baz mentioned above, in which we pause/resume the consumer based on the speaking activity of the user.

Yes, you are right about the signaling server being overloaded with messages; datachannels are a good way to go.

I’m saying that you’d likely inherit a lot of false positives with this approach, as speakers are determined by comparing users’ dBvo, and loudness is no way to determine who is speaking. A user could have a fan blaring at 0 dBvo while the user actually talking is at -8 dBvo.

Essentially, once a user is louder than the observer’s threshold (which can be set as low as -127 dBvo), they are considered to be talking, even if they are sitting back and only the softest sound is playing.
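
A small sketch of why pure loudness ranking misfires, using the fan example above. This is a simplified model of threshold-plus-ranking selection, not mediasoup’s actual implementation; `LevelReading` and `loudestAboveThreshold` are hypothetical names, and the `threshold` and `maxEntries` parameters are only modeled after the observer options of the same name.

```typescript
interface LevelReading {
  producerId: string;
  volume: number; // average level in dBvo, from -127 (quietest) to 0 (loudest)
}

// Keep readings at or above the threshold, rank them by loudness and return
// the top maxEntries. Nothing here knows whether the sound is speech.
function loudestAboveThreshold(
  readings: readonly LevelReading[],
  threshold: number,
  maxEntries: number,
): LevelReading[] {
  return readings
    .filter((reading) => reading.volume >= threshold)
    .sort((a, b) => b.volume - a.volume)
    .slice(0, maxEntries);
}

const exampleReadings: LevelReading[] = [
  { producerId: "talker", volume: -8 },
  { producerId: "fan", volume: 0 },
  { producerId: "quiet-room", volume: -100 },
];

// With a single slot the fan wins and the real talker is dropped.
console.log(loudestAboveThreshold(exampleReadings, -80, 1).map((r) => r.producerId));
// → [ 'fan' ]

// With two slots the talker is heard, but the fan still takes a slot.
console.log(loudestAboveThreshold(exampleReadings, -80, 2).map((r) => r.producerId));
// → [ 'fan', 'talker' ]
```

Raising the threshold or the number of entries changes who gets through, but neither tells speech apart from noise, which is the false-positive problem described above.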

That’s all. What you’re after is possible but it’d need to be approached uniquely and handled most appropriately.

I think for something like this to work, you’d want to implement a talking-stick system: a user requests to talk and gets slot 1 of 3. When all slots are filled, a new request to talk goes to all three talkers, and one of them can pass their stick to the requester.

If a user is accepted to talk, create their transports and start producing immediately. I think this would solve your problems in a 100+ user room. There’d be three active producers at most, and all you have to sort out from there is handling 500+ users consuming them.
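
Here is a rough sketch of the slot bookkeeping such a talking-stick system might need. Everything in it (`TalkingStickSlots`, `requestToTalk`, `passStick`, `release`) is a hypothetical name for illustration; signaling the request to the current talkers and creating transports/producers would happen elsewhere.

```typescript
type TalkRequestResult =
  | { granted: true; slot: number } // slot numbers start at 1
  | { granted: false; currentTalkers: string[] }; // who to ask for a stick

class TalkingStickSlots {
  private readonly slots: (string | null)[] = [];

  constructor(slotCount: number) {
    for (let i = 0; i < slotCount; i++) {
      this.slots.push(null);
    }
  }

  // Give the user the first free slot, or report who currently holds the sticks.
  requestToTalk(userId: string): TalkRequestResult {
    const existing = this.slots.indexOf(userId);
    if (existing !== -1) {
      return { granted: true, slot: existing + 1 };
    }
    const free = this.slots.indexOf(null);
    if (free === -1) {
      const currentTalkers = this.slots.filter(
        (holder): holder is string => holder !== null,
      );
      return { granted: false, currentTalkers };
    }
    this.slots[free] = userId;
    return { granted: true, slot: free + 1 };
  }

  // A current talker hands their slot to another user.
  passStick(fromUserId: string, toUserId: string): boolean {
    const index = this.slots.indexOf(fromUserId);
    if (index === -1 || this.slots.indexOf(toUserId) !== -1) {
      return false;
    }
    this.slots[index] = toUserId;
    return true;
  }

  // A talker gives up their slot without naming a successor.
  release(userId: string): boolean {
    const index = this.slots.indexOf(userId);
    if (index === -1) {
      return false;
    }
    this.slots[index] = null;
    return true;
  }
}
```

With three slots, a fourth `requestToTalk` call comes back with `granted: false` and the list of current talkers, which is the point where the server would forward the request to those three users.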

This looks good, but it limits the number of slots. Even if we make 10 slots, it will still be a limitation. If we can accurately determine who is speaking, then pause/resume will be the best way to go. But you said that this can be misleading in some scenarios, like the one you mentioned above. I will have to test it to see how it can benefit us.

I really don’t like your idea, no offence. If being heard on your platform were a matter of who can scream loudest and longest into the microphone, I’d not enjoy the platform at all. I don’t think any logic a human can program just yet can pick out the most likely talker without playing favorites.

This is like me talking to two people and they’re cutting each other off, who do I hear out and why? lol

I’d sure love to hear about any progress, however; there could be some wicked optimizations there!

Can you please explain this ‘cutting each other off’ thing? If two people are talking simultaneously, I will resume the consumers for both of them and hear them both at once. How can their voices cut each other off? That seems possible only if I use the ‘active speaker observer’, which gives me a single dominant speaker and could cause the cutting effect you described. It should be fine if I use ‘AudioLevelObserver’. Is that right?

Yes, the dominant speaker observer would give you the loudest speaker; if the other user who was talking was quieter, the mechanism would cut them off.

The same happens if you rely on the AudioLevelObserver and decide from dBvo, which just compares loudness.


For this to ever work the way you want it to, there needs to be a more in-depth comparison of the audio streams, and I hate to say this, but there’s not really any software/voice recognition capable of distinguishing each voice in a party setting. So it’s not really possible yet and may never be; it’s far too complicated in that respect.


At first this seems great, but yeah, it’ll be flawed unless those issues can be solved.

Thanks for the explanation. The dominant speaker approach will surely cause issues, but I will experiment with AudioLevelObserver to see the results. I will update here. Thanks.