I’ve been trying to test how my SFU cluster behaves under a specific scenario:
100+ clients connect to the same router
1 Client turns on 1080p video+audio
In my experience, one router (one worker) cannot handle that many clients at once. To deal with this scenario, I added code that polls worker load and moves clients onto other workers/nodes when the worker load exceeds a threshold (90%). However, I find that the worker’s CPU load is seldom above this threshold.
I would assume that the mediasoup-worker would be working tirelessly to route packets to as many consumers as it possibly can, but somehow that doesn’t seem to be the case. Instead, all clients are receiving streams at around 2 fps, with the CPU only occasionally crossing 90% load.
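For reference, a minimal sketch of the load-polling approach described above (function names and the rebalancing hook are illustrative, not my actual code). It assumes mediasoup’s `worker.getResourceUsage()`, whose `ru_utime`/`ru_stime` fields are cumulative CPU time in milliseconds, so the load over a polling interval is the delta divided by the interval:

```javascript
// Compute a worker's CPU load (0..1) from two resource-usage snapshots
// taken `intervalMs` apart. ru_utime/ru_stime are cumulative ms of
// user/system CPU time, so the delta over the interval gives the load.
function workerCpuLoad(prev, curr, intervalMs) {
  const prevCpu = prev.ru_utime + prev.ru_stime;
  const currCpu = curr.ru_utime + curr.ru_stime;
  return (currCpu - prevCpu) / intervalMs;
}

const LOAD_THRESHOLD = 0.9;

// Polling loop: sample every worker, and when one crosses the threshold,
// route clients to a less-loaded worker (rebalancing logic elided).
function pollWorkers(workers, intervalMs = 2000) {
  const prevUsage = new Map();
  setInterval(async () => {
    for (const worker of workers) {
      const usage = await worker.getResourceUsage();
      const prev = prevUsage.get(worker.pid);
      if (prev && workerCpuLoad(prev, usage, intervalMs) > LOAD_THRESHOLD) {
        // overloaded: move/assign clients to another worker or node
      }
      prevUsage.set(worker.pid, usage);
    }
  }, intervalMs);
}
```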
What is the expected behavior when a mediasoup-worker is overburdened with too many clients?
Does the sender use simulcast for screen sharing or a single stream?
I have two producers (audio+video), but the video is just a single stream with video/vp8 codec.
Have you checked sent fps in sender side (via Producer stats)?
I’m not sure how to check this. I got the following from chrome://webrtc-internals. It looks like the sender is only able to send ~5-12 fps. I’m not entirely sure why…
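As an alternative to eyeballing chrome://webrtc-internals, the sent fps can be read programmatically on the sender page. This is only a sketch under the assumption that you use mediasoup-client, whose `producer.getStats()` resolves to the browser’s `RTCStatsReport`; in Chrome the `outbound-rtp` entry carries a `framesPerSecond` field:

```javascript
// Extract the outgoing video frame rate from an RTCStatsReport-like
// object (anything with map-style .values()). Returns undefined if the
// browser does not expose framesPerSecond.
function getSentFps(statsReport) {
  for (const stat of statsReport.values()) {
    if (stat.type === 'outbound-rtp' && stat.kind === 'video') {
      return stat.framesPerSecond;
    }
  }
  return undefined;
}

// Usage in the sending page (videoProducer is a mediasoup-client Producer):
//   const report = await videoProducer.getStats();
//   console.log('sent fps:', getSentFps(report));
```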
The issue I’m running into is that, counter-intuitively, the CPU is not being fully utilized. I expected it to be, but that just doesn’t seem to be the case. I can confirm that iotop shows barely any disk/swap activity, so that’s not it. The server is in the cloud, and upload bandwidth is only 50~80 MBps, which is well below the 450 MBps I have witnessed when all workers are in use, so it isn’t a network bottleneck either.
Is there any way to determine what the worker is doing? Any way to debug this?
There’s another comment where I posted code to compute CPU usage on the SFU backend. Both that load value and top report that CPU usage is well below 90% when a worker has 100+ clients to handle, as described in the original post.
I believe what @footniko said about CPU on the client side might be true. I have witnessed this myself: each tab I add creates a decoder for the incoming stream, and if the client CPU gets overloaded, the frame rate drops. So if 13 tabs is a good number per machine, I would suggest opening the next 13 tabs on a different computer.
If I drop the total number of clients, the framerate increases. I can still maintain 8 tabs per host, and if I set up only 2 hosts (8×2 = 16 tabs), everything works great: all tabs receive 25 fps, which is what the producer is producing.
It says that around 200-300 is roughly the limit, which is not what the stats you’re putting forward show.
Although, to be clear, we have 96 tabs each receiving one audio and one video stream, which makes 2 consumers per tab, so you actually have 192 consumers.
By switching off audio we take that down to 96, which is somewhere around 100. Does that improve things the same way scaling down clients does? That’s the question I’m asking.
I understand that the numbers quoted there obviously depend on the underlying CPU and may vary as a result. I will try what you are suggesting and report back.
That being said, I still don’t understand what the system is doing in this scenario. Clearly the CPU is not overloaded; neither is memory, nor IO. Putting this together with what’s in that document makes me think that there is some kind of throttling going on somewhere, possibly on the producer end.
Perhaps it is sending/receiving far too many PLIs/FIRs, and that is manifesting itself as reduced CPU load on the SFU server? Is there a way I can detect whether this is what is happening?
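One way I could check this on the SFU side, assuming mediasoup’s server-side `producer.getStats()` (which resolves to an array of stats objects that include cumulative `pliCount` and `firCount`): sample the counters twice and compute the rate of keyframe requests the worker is relaying toward the producer. A sketch:

```javascript
// Per-second PLI/FIR request rates from two stats snapshots of the same
// producer taken `intervalMs` apart. pliCount/firCount are cumulative,
// so the delta over the interval gives the request rate.
function keyframeRequestRates(prevStat, currStat, intervalMs) {
  const seconds = intervalMs / 1000;
  return {
    plisPerSecond: (currStat.pliCount - prevStat.pliCount) / seconds,
    firsPerSecond: (currStat.firCount - prevStat.firCount) / seconds
  };
}

// Usage (server side, videoProducer is a mediasoup Producer):
//   const [prev] = await videoProducer.getStats();
//   // ...wait intervalMs, then:
//   const [curr] = await videoProducer.getStats();
//   console.log(keyframeRequestRates(prev, curr, intervalMs));
```

A sustained rate of several PLIs per second per producer would suggest consumers are constantly requesting keyframes, which could explain both the low fps and the surprisingly low CPU load.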