Long story short
We open a mediasoup-based webapp (either our own or the mediasoup demo) in a browser running inside a Docker container, and after some time (20-30 minutes) the videos in it start to freeze one by one. There are no errors in the browser console and nothing suspicious in Firefox's about:webrtc.
Long story long
The challenge
We need to transmit mediasoup-based multi-user webinars to large audiences (thousands of people). At the same time there are only a few active users (speakers), so to save on traffic and server CPU, and also to support webinar recording, we use HLS as the main delivery mechanism for passive participants.
The solution
We came up with the following idea: instead of handling the streams and muxing them manually on a separate server, we launch a real browser instance, capture its screen and audio, and transmit them to the existing media server for delivery. All of this is done inside a Docker container (I've published our implementation on my GitHub: dockerized-browser-streamer). In production we launch it as an AWS ECS task. While it consumes a lot of resources, it lets us easily experiment with layouts, etc. We call this component The Streamer.
The problem
Although everything works perfectly in real users' browsers, the Streamer has an annoying issue: after 20-30 minutes it starts to lose some participants' media streams (usually camera or screenshare, sometimes audio). In the UI this looks like a frozen picture, and the stream doesn't recover: if the participant switches their camera off and on, or re-shares their screen or another window, the Streamer shows the same old frozen picture again. It usually happens after people start or stop a screenshare, or even switch their mic on or off. Conversely, if a frozen participant unmutes themselves, their video may unfreeze and continue to play, probably because that adds an audio consumer to the same transport as the video.
We tried both the latest Firefox and Chrome, tried TCP-only ICE candidates, but with no luck yet.
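For reference, this is roughly what "TCP-only" means on the server side. It is a minimal sketch, assuming the restriction is applied through mediasoup v3 `WebRtcTransportOptions`; the constant name and the IP addresses are placeholders for illustration, not our real configuration.

```typescript
// Hypothetical options object that would be passed to router.createWebRtcTransport().
// The addresses below are documentation placeholders, not real values.
const tcpOnlyTransportOptions = {
  listenIps: [{ ip: "0.0.0.0", announcedIp: "203.0.113.10" }],
  enableUdp: false, // do not offer UDP ICE candidates
  enableTcp: true,  // offer TCP ICE candidates only
  preferTcp: true,
};
```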
There is nothing in the browser's devtools console and nothing suspicious that we can see on FF's about:webrtc page (however, we don't understand much there).
The bug also reproduces in the mediasoup demo application (again after it has been used for 20-30 minutes by several people).
I believe this is either some kind of weird frontend bug or a common logic or concurrency error in the client application (since it exists in both our app and the mediasoup demo). Or it could be some weird network issue (but why only after 20 minutes?).
We are stuck and are looking for help (including paid help, see this message in Job Opportunities) or any advice on where to start digging.
The question
The main question is: how do we detect the source of the problem? The browser console logs don't show anything.
Maybe someone could suggest other ways of localizing the problem? Or maybe someone has seen something like this before?
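One thing we are considering is polling WebRTC statistics inside the Streamer and logging which inbound streams stop advancing, so that a freeze can at least be timestamped and matched against server logs. Below is a minimal sketch of that idea, not something we have running. `InboundSample`, `StatsSource`, `findStalled`, `sampleInbound` and `watch` are hypothetical names; the only assumption about the real APIs is that the object passed in has a `getStats()` returning an `RTCStatsReport`-like value with `inbound-rtp` entries (as `RTCPeerConnection` does, and as far as we know mediasoup-client transports and consumers do too).

```typescript
// The fields we read from an "inbound-rtp" stats entry.
interface InboundSample {
  id: string;
  kind: string;            // "audio" or "video"
  bytesReceived: number;
  framesDecoded?: number;  // only reported for video
}

// Anything with a getStats() that resolves to an RTCStatsReport-like object.
interface StatsSource {
  getStats(): Promise<{ forEach(cb: (stat: any) => void): void }>;
}

// Pure helper: ids of streams whose counters did not advance between two samples.
function findStalled(
  prev: Map<string, InboundSample>,
  curr: Map<string, InboundSample>
): string[] {
  const stalled: string[] = [];
  curr.forEach((now, id) => {
    const before = prev.get(id);
    if (before === undefined) return; // new stream, nothing to compare against yet
    const bytesStuck = now.bytesReceived <= before.bytesReceived;
    const framesStuck =
      now.kind === "video" &&
      now.framesDecoded !== undefined &&
      before.framesDecoded !== undefined &&
      now.framesDecoded <= before.framesDecoded;
    if (bytesStuck || framesStuck) stalled.push(id);
  });
  return stalled;
}

// Collects the inbound-rtp entries from one stats source.
async function sampleInbound(source: StatsSource): Promise<Map<string, InboundSample>> {
  const out = new Map<string, InboundSample>();
  const report = await source.getStats();
  report.forEach((stat: any) => {
    if (stat.type === "inbound-rtp") {
      out.set(stat.id, {
        id: stat.id,
        kind: stat.kind,
        bytesReceived: stat.bytesReceived || 0,
        framesDecoded: stat.framesDecoded,
      });
    }
  });
  return out;
}

// Polls the source and logs streams that made no progress since the previous poll.
function watch(source: StatsSource, intervalMs: number): void {
  let prev = new Map<string, InboundSample>();
  setInterval(async () => {
    const curr = await sampleInbound(source);
    const stalled = findStalled(prev, curr);
    if (stalled.length > 0) console.warn("stalled inbound streams:", stalled);
    prev = curr;
  }, intervalMs);
}
```

A stream whose producer is legitimately paused (a muted mic, a camera switched off) would also be reported by this check, so the output would still need to be compared with the known pause state of each consumer before concluding that a stream is really frozen.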