Question Regarding Active Speaker Observer Implementation

Hi All,

First of all, I would like to start my first message by thanking you for providing such a high-quality product.

To continue: I have been playing with and testing the Dominant/Active Speaker feature. It was working very well until someone joined with a PC that has continuous fan noise. We realized that with continuous background noise it becomes difficult for the active speaker to switch. I also tried adding a different PC with continuous noise louder than the first, but mostly the speaker did not change.

I started reviewing the C++ code and reading the paper on which the implementation is based, and I have also checked the Jitsi implementation. I would like to share one question regarding the implementation (I hope it is OK to discuss code on this forum; otherwise I can post it wherever you prefer).

Within the CalculateActiveSpeaker method, the implementation divides the dominant speaker's score by the speaker's score. When I look at the algorithm in the paper and at the Jitsi code, I see the opposite. I have not looked in detail at the other important helper methods, so compared to them this logic may be correct. I just want to ask whether the behaviour we see with background noise is related to this or not.

Can you please share your opinions?

for (int interval = 0; interval < this->relativeSpeachActivitiesLen; ++interval)
{
	// Relative speech activity per interval: log ratio of the dominant
	// speaker's activity score to the candidate speaker's score.
	this->relativeSpeachActivities[interval] = std::log(
	  dominantSpeaker->GetActivityScore(interval) / speaker->GetActivityScore(interval));
}
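For comparison, the ordering I see in the paper and in Jitsi would look roughly like this (just a sketch using the same identifiers; I have not verified it against the other helper methods):

// Sketch only, NOT the current mediasoup code: the candidate speaker's
// score is divided by the dominant speaker's score, so a consistently
// louder candidate produces a positive log ratio.
for (int interval = 0; interval < this->relativeSpeachActivitiesLen; ++interval)
{
	this->relativeSpeachActivities[interval] = std::log(
	  speaker->GetActivityScore(interval) / dominantSpeaker->GetActivityScore(interval));
}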

PS: I know you do not like mentions, but maybe a quick look at the post by @steve.mcfarlin could give an instant idea.

It was working very well until someone joined with a PC that has continuous fan noise. We realized that with continuous background noise it becomes difficult for the active speaker to switch.

There’s a lot we can do in the audio world to overcome this, like applying gates/compression/limiters. If fan noise or whatever is blasting through, it’s still a detected dB level, so the determining factors for the active speaker could be skewed, correct. There’s no perfect solution and you’ll see varying attempts made at this.

Overall there is a lot to consider and further tweaks could be required, but I’d truthfully put that on the developer to sort out. For example, some client-level tweaks could be made to enhance this feature and offload the performance cost to the user.

Thanks for the answer. Actually my main question is about that “asking the developer” part. What I want to ask is: is the piece of code I shared above correct? When I compare it with Jitsi and the algorithm in the paper, I see the opposite.
If my guess is correct, maybe the mediasoup active speaker code will need to be updated.

PS: I have also tested today without fan noise. Both participants were using quality headphones, and I still couldn’t see active speaker events fire properly.
Once someone takes the role of active speaker and then stays silent, the role does not always change easily when someone else speaks. I suspect the piece of code above is causing this. If so, with a minor change mediasoup would have this nice implementation/feature working correctly.

It’s possible, and you could make the revisions yourself and share them. I certainly couldn’t say what’s optimal in this case; there are too many scenarios to factor in, plus the fun issues you mention with competing audio. haha

The maintainers may review and respond.

You are not Steve McFarlin, right? :)

Can you comment about this in the corresponding, already merged PR, also referencing the different code in the Jitsi SFU?


If you reverse the division as it is in Jitsi, then you would need to ensure the audio level ranges between -127 and 0. RFC 6464: “The audio level is expressed in -dBov, with values from 0 to 127 representing 0 to -127 dBov.”

I did not debug Jitsi, nor look deeper to see whether the audio levels were expressed in the negative range. With negative values the division would need to be reversed. If you would like to see this calculation’s output, then put in an MS_DUMP statement to output the pre- and post-calculation values.
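For example, something along these lines (a rough sketch; adjust the format specifiers to the actual score types):

// Rough debugging sketch: dump the raw activity scores and the resulting
// log ratio for each interval, using mediasoup's MS_DUMP logging macro.
for (int interval = 0; interval < this->relativeSpeachActivitiesLen; ++interval)
{
	double dominantScore = dominantSpeaker->GetActivityScore(interval);
	double speakerScore  = speaker->GetActivityScore(interval);

	this->relativeSpeachActivities[interval] = std::log(dominantScore / speakerScore);

	MS_DUMP(
	  "interval:%d, dominantScore:%f, speakerScore:%f, logRatio:%f",
	  interval,
	  dominantScore,
	  speakerScore,
	  this->relativeSpeachActivities[interval]);
}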

There is nothing that can be done inside the algorithm to account for background signal levels from a fan or any other source. The audio signal would have to be filtered before being sent to the server, as this algorithm uses the RTP Audio Level extension. If you want a more robust solution that does not require client-side background audio filtering, then you would need to do the following in mediasoup: Decode Audio → Filter Audio → Calculate Dominant Speaker. In the last step you could possibly replace the input to use a sample of the signal; basically, change the algorithm to work on the audio signal directly.
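As a very rough sketch of that pipeline (every type and helper name below is hypothetical and for illustration only; none of them are existing mediasoup APIs):

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical placeholder types and helpers, for illustration only.
struct PcmFrame
{
	std::vector<int16_t> samples;
};

PcmFrame decodeToPcm(const uint8_t* payload, size_t len); // Decode Audio
PcmFrame filterBackgroundNoise(const PcmFrame& frame);    // Filter Audio (e.g. RNNoise)
double computeActivityScore(const PcmFrame& frame);       // score from the signal itself

void OnAudioRtpPayload(const uint8_t* payload, size_t len)
{
	PcmFrame pcm      = decodeToPcm(payload, len);
	PcmFrame filtered = filterBackgroundNoise(pcm);

	// Feed this score into the dominant speaker calculation instead of
	// the value carried in the RTP Audio Level header extension.
	double score = computeActivityScore(filtered);
	(void)score;
}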

You may also look into using RNNoise for background noise reduction.

I would expect something like fan noise to be at least partially removed by WebRTC’s noise suppression mechanism, and ideally people who aren’t speaking would also mute themselves. Without decoding and analyzing the audio, the server wouldn’t be able to distinguish any of that, so it is not really a mediasoup-level issue.

Thanks for the explanations, Steve. They were very helpful.
My main concern was not solving the background noise issue. I just ran some tests where one participant had continuous background fan noise and the other was talking and screaming loudly and continuously. I was at least expecting the dominant speaker to switch to the louder one. When I did not see this behaviour, I checked the implementation and raised the question about the difference in the division. You may be right, because the audio levels are less than zero.

At this point I have no idea why that test did not work. Maybe I will try some new things with the audio level observer.

Apologies for not getting back sooner. We slightly modified the algorithm, specifically the division you pointed out; I had mistakenly implemented the reverse. This is now fixed.

Thanks a lot @steve.mcfarlin