Mediasoup Scaling resource

I’ve been trying to wrap my head around mediasoup scaling and combing through the resources available to understand the constraints. I put this together for my own understanding but if it’s correct, I hope it might also be useful to other people who join the community. I’d appreciate any feedback and also happy to refine this into something more complete, maybe add some diagrams if it’s interesting to people.

This information is not the result of benchmarking myself, instead this is a result of reading the scaling documentation, reviewing forum posts like this one, and occasionally diving into the source code to understand what’s going on under the hood. Here we go.

The bottleneck for mediasoup is CPU. A mediasoup worker can be created for each CPU core on the machine, and the application is responsible for managing load across workers to make sure that there’s enough capacity to handle every person on every call.

A worker can have multiple routers, each created by a worker for a specific call to send media between people on the call. Since this is CPU bound and capacity is based on the worker, creating multiple routers for the same call on the same worker does not increase capacity. However, there are ways to pipe media between workers if the call needs to be bigger than one worker can handle. To do this, you create routers on multiple workers for the call and pipe media between them using the router’s [pipeToRouter](https://mediasoup.org/documentation/v3/mediasoup/api/#router-pipeToRouter) function. Not only can this pipe media between workers on the same machine, but between machines as well; however, just as with routing any data through a network, there is a latency penalty when doing so. One open area of research for me is to test out how much latency there is piping media between different machines within the same VPC and see if it’s too slow for a live video call to feel real time. It should be fine for a unidirectional stream though, like building a Twitch clone, because there’s no back and forth, and it doesn’t really matter whether the audience sees things at exactly the same time.
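To make the fan-out concrete, here is a small sketch (the helper name and data shapes are mine, not mediasoup’s) of which pipes a single producer needs so that participants on every other router can consume it; with mediasoup you would then call `sourceRouter.pipeToRouter({ producerId, router })` once per target:

```javascript
// Hypothetical helper: given the id of the router a producer lives on and
// the ids of all routers serving the call, list the pipes needed so that
// every other router can deliver that producer's media to its participants.
function pipesForProducer(sourceRouterId, routerIds) {
  return routerIds
    .filter((id) => id !== sourceRouterId)
    .map((targetId) => ({ from: sourceRouterId, to: targetId }));
}

// A producer on router 'A' in a three-router call needs two pipes.
console.log(pipesForProducer('A', ['A', 'B', 'C']));
```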

People who want to join calls connect to routers by creating transports. They then send media to the router through a transport by creating producers, and receive media from the router through a transport by creating consumers of those producers. Consumers appear to be the CPU intensive part of this process. This is another area of research for me. The docs recommend limiting each worker to 500 consumers, but this appears to be a pretty squishy number. For instance, anecdotally, if all of these consumers were audio instead of a blend of audio and video, the worker may be able to sustain roughly 20x as many consumers. Variance in the actual hardware may also impact this. I don’t yet have a reliable way to measure how healthy a worker is in real time and what its actual capacity is, however I’d like to develop one.

One of the reasons this is tricky is because mediasoup is an SFU which means that every user is consuming everyone else’s media individually. This means that consumers scale quadratically with the number of people on the call. If every user produces audio and video and consumes everyone else’s audio and video, the formula for determining how many consumers are needed on the call is:

n * 2(n-1), or 2n(n-1)

where n is the number of people on the call. Everyone (n) consumes an audio and video stream from everyone else (n-1). The quadratic scaling makes packing calls across the CPU bound workers hard because a four person call only requires 24 of the 500 consumer capacity, but an 8 person call takes 112, and a 16 person call takes almost the whole worker at 480! Every call starts as a single person and grows from there. If the service is incapable of knowing the size of calls ahead of time, allocating and protecting space is hard without assuming every call is going to be large and leaving an excess of CPU unutilized in reserve.
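The formula is easy to sanity-check in a few lines of JavaScript (mediasoup’s host language; the function name is mine), and the results match the 4, 8, and 16 person examples:

```javascript
// Total consumers for an n-person call on a single router, where every
// participant produces audio + video and consumes everyone else's.
function callConsumers(n) {
  return 2 * n * (n - 1);
}

console.log(callConsumers(4));  // 24
console.log(callConsumers(8));  // 112
console.log(callConsumers(16)); // 480
```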

One way to continue to scale beyond 16 person calls is to leverage the ability to pipe media between routers. The challenge here is that regardless of the router the media is produced on, call participants continue to create consumers of media on the router their transport is associated with. That means that in the 16 person call example, it is not sufficient to simply create and pipe media to a new router for the 17th person because the original 16 people would still need to consume the new person’s audio and video; these additional 32 consumers put the first router above the 500 consumer heuristic limit. I don’t believe the call would fail for everyone if this happens, but I do think that new consumers may not be created successfully. Therefore the application needs to reserve capacity for everyone on the router to consume audio and video from everyone in the call on other routers. Additionally, under the hood, the pipeToRouter function creates a consumer on the source router that media is being piped from, which needs to be included in our capacity overhead as well.

The formula for the number of consumers required on a particular router when utilizing multiple routers for a call is as follows:

2n(n-1) + 2nm + 2n(r-1), or 2n(n + m + r - 2)

where n is the number of people on the call whose transports are on the current router, m is the number of people on the call whose transports are on other routers, and r is the number of routers handling the call. This changes capacity planning pretty significantly.
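Written out as a quick sketch (function name is mine), with each of the three terms labeled; the sample values reproduce the worked examples below:

```javascript
// Consumers needed on one router of a multi-router call:
//   2n(n-1)  - local participants consuming each other
//   2nm      - local participants consuming piped-in remote media
//   2n(r-1)  - pipe consumers created by pipeToRouter, one per local
//              producer per other router
function consumersOnRouter(n, m, r) {
  return 2 * n * (n - 1) + 2 * n * m + 2 * n * (r - 1);
}

console.log(consumersOnRouter(14, 3, 2)); // 476
console.log(consumersOnRouter(7, 23, 5)); // 462
```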

To accomplish a 17 person call, the application could start using an additional router after the 14th person joins:

routerA = 2n(n+m+r-2) = 2*14(14+3+2-2) = 28*17 = 476

routerB = 2n(n+m+r-2) = 2*3(3+14+2-2) = 6*17 = 102

And to accomplish a 30 person call, the application could start using additional routers after every eighth person joins and spread the call across 5 routers:

routerA = 2n(n+m+r-2) = 2*7(7+23+5-2) = 14*33 = 462

routerB = 2n(n+m+r-2) = 2*7(7+23+5-2) = 14*33 = 462

routerC = 2n(n+m+r-2) = 2*7(7+23+5-2) = 14*33 = 462

routerD = 2n(n+m+r-2) = 2*7(7+23+5-2) = 14*33 = 462

routerE = 2n(n+m+r-2) = 2*2(2+28+5-2) = 4*33 = 132
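Assuming participants are spread as evenly as possible and using the 500 consumer heuristic, a small planner (all names are mine) can find the minimum router count for a given call size; it agrees with the examples here — one router through 16 people, two at 17, five at 30:

```javascript
// Consumers on the busiest router when callSize people are spread as
// evenly as possible across r routers (per-router formula 2n(n+m+r-2)).
function worstRouterLoad(callSize, r) {
  const n = Math.ceil(callSize / r); // busiest router's local participants
  const m = callSize - n;            // participants on other routers
  return 2 * n * (n + m + r - 2);
}

// Smallest router count that keeps every router under the limit.
function routersNeeded(callSize, limit = 500) {
  for (let r = 1; r <= callSize; r += 1) {
    if (worstRouterLoad(callSize, r) <= limit) return r;
  }
  return callSize; // worst case: one router per participant
}

console.log(routersNeeded(16)); // 1
console.log(routersNeeded(17)); // 2
console.log(routersNeeded(30)); // 5
```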

There are a couple of caveats here. Some work for us, but some work against us.

  • People want to produce and consume additional media such as screen sharing and watching movies together. This can make things much less predictable from a capacity planning standpoint.
  • Not all producers and consumers may be necessary at all times; if the application doesn’t need them, it can remove them.
  • Producers and consumers that are paused may not have any CPU cost. It may be practical to create consumers and producers to stream everyone to everyone without actually resuming that media until the application needs it.

Mediasoup wasn’t really designed for large calls. The creators have expressed that this is a trade off they made consciously, because they believe a 100 person video call where everyone can talk to everyone is a pretty unpleasant experience, and Zoom probably has that use case pretty well covered. Instead, the developers seem to recommend figuring out how many people actually need to be speaking or presenting at any given time and using that to calculate whether it is manageable given the worker limitations. Workers won’t actually be overloaded if only one or two participants are actively streaming media at a time, even if they’re being consumed by hundreds of people. Applications can do things like paginating what is visible to consumers to mitigate this burden, even for situations with multiple live videos.

I wrote all of this out to try to further my own understanding of how the technology works so I can start to figure out tradeoffs for my own application, but hopefully it’s also helpful to other people trying to figure out what their scaling needs are. I’d love feedback to help to refine this into a resource that’s easily accessible to people trying to build with mediasoup, or simply find out if it’s the right tool for them.

My specific use case is trying to make a social video tool that allows people to dynamically move between smaller video conversations while still seeing the liveliness of the greater call space as side videos. We’re trying to think hard about the packing problem because it is important to us that people can scan through pages of live video of people having conversations and see enough of what’s going on to join one.

However I’m still trying to understand a few things:

  • Is there any way to actually see the health of my worker? Is it simply adding consumers and watching CPU utilization go up? Or is the best bet to experimentally figure out how many consumers can actually live on a worker given the hardware of my cloud provider, and then keep counts of active consumers per worker and try to stay below that?
  • How bad is latency between machines in a VPC? If it’s bad and everyone from a call needs to stay on the same machine, then the service needs to scale with large machines; but if scaling horizontally is still performant, then the service can scale more incrementally with small machines and pipe when necessary.
  • Are created consumers costly, or are only unpaused consumers actually consuming media? Should we invest in creating and destroying consumers on the fly, or in creating them and simply pausing and unpausing them when needed?
  • Are there any failure cases that will ruin ongoing calls across workers for everyone? Or do these limits simply prevent additional consumers and producers from being created?
  • Are there smooth ways to migrate users across routers automatically in case two calls on the same router risk using up all of the available CPU?

Without totally understanding some of the limitations it’s difficult to understand which trade offs to take when truly scaling large calls up. It’s even harder to manage if there are going to be large calls in combination with small calls and the service is going to pack calls without knowing how big they’ll be in advance. I’ve listed some experimental options we’re considering trying out to see if they provide us with the functionality we’re looking for.

  • Spreading participants on calls evenly across all possible workers. This would ensure that each additional participant only consumes worker capacity linearly with the number of people on a call. If the latency penalty for piping across machines were small enough, this could be extended to any worker on any CPU in the cluster. Here we’d trade creating way more consumers/CPU/doing a ton of piping for predictability and more linear consumption on workers.
  • Creating multiple transports for call participants so that if they run out of capacity on a worker, they can continue on a new worker.
  • Transitioning a person to another router by duplicating their streams and trying to switch them out as quickly as possible on the frontend.

Paused consumers do not send RTP, so there is CPU saving here, yes.

There is no magic way to do that other than make it look as “transparent” as possible for the user. Behind the curtains you need to recreate transports, producers and/or consumers in other workers/routers, etc.

Do you think that in Zoom you receive the audio and video of 100 participants at the same time? The answer is no, and the same can be done with mediasoup by pausing producers or consumers.

Just wanted to share some of my own thoughts/strategies here as this is something I have been investigating on and off for a while. It’s great that you have outlined everything so nicely as a reference.

For our current use case the focus is large audio calls and we came up with a pretty simple/naive “sharding” strategy to break up a “room” across routers/workers.

I would say the most important thing before proceeding further is to establish reliable testing infrastructure. For this I made a test harness using an mp4 audio file and used mediasoup-client-aiortc to connect different #'s of clients to a remote EC2 instance. It’s not really something I can share around though, as it has a lot of our application logic embedded in it as far as creating “fake” users, etc. I initially used this to determine the CPU limitations of a single worker for an N person audio call w/ all users “talking” and consuming audio at the same time - to err on the side of over-provisioning.

Our current strategy is focused on scaling locally on a single machine, and for now we are just scaling up machines conservatively. We have a selection algorithm to put “new rooms” on the least-loaded server. So we provision new instances when all existing instances are over ~70% capacity, which leaves reserve capacity for the existing rooms to add more participants. For us, capacity is a measurement of both CPU and bandwidth use. We use the pidusage package to measure the CPU pressure for each worker process, gather ingress/egress bitrate stats from all active transports every 5 seconds, and expose this data from each server to a centralized orchestration service, which is where users request an endpoint from.
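As a rough sketch of that selection logic (the data shape and `pickServer` name are hypothetical; the ~70% threshold comes from the description above):

```javascript
// servers: [{ id, load }] where load is a 0..1 fraction combining the
// CPU and bandwidth measurements each server reports.
function pickServer(servers, threshold = 0.7) {
  // New rooms go to the least loaded server.
  const least = servers.reduce((a, b) => (b.load < a.load ? b : a));
  return {
    server: least,
    // Provision a new instance once everything is over ~70% capacity,
    // leaving headroom for existing rooms to grow.
    provisionNew: servers.every((s) => s.load > threshold),
  };
}

console.log(pickServer([{ id: 'a', load: 0.5 }, { id: 'b', load: 0.3 }]));
```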

Our per-machine scaling strategy is fairly naive, but it works something like this:

From testing we established a value we’ll call CONSERVATIVE_LIMIT for the # of participants we want each router to serve. This is configurable and we actually still experiment with different values. A good starting point might be ~16. I found keeping this value a power of 2 is helpful just for conceptualizing everything. You want a value that is both low enough to “guarantee” good performance, but high enough that the overhead of router piping does not cancel out the benefits.

  • First user connects, provision a new router for this “room”.
  • Each time a user requests a connection to this “room”, we attempt to put them on a router that is serving less than CONSERVATIVE_LIMIT other participants. We pipe their producer to any other active routers.
  • If no such router exists for this room, we create a new one. Since we maintain stats on all workers (cpu, # producers, # consumers, etc), we use these to select the “least loaded” worker on which to place this router.
  • If the selected worker matches that of any existing room routers, we bail out and just use the existing router on that worker (no point in creating 2 routers on the same worker)
  • Otherwise, we create the new router and setup the necessary pipes between it and all other routers in the room.
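The steps above could be sketched as a pure selection function; the data shapes and names here are hypothetical, and the actual router creation and piping are elided:

```javascript
// roomRouters: [{ workerId, participants }], workers: [{ id, load }]
function selectRouter(roomRouters, workers, CONSERVATIVE_LIMIT = 16) {
  // 1. Prefer an existing router with a free participant slot.
  const open = roomRouters.find((r) => r.participants < CONSERVATIVE_LIMIT);
  if (open) return { router: open, isNew: false };
  // 2. Otherwise target the least loaded worker...
  const worker = workers.reduce((a, b) => (b.load < a.load ? b : a));
  // 3. ...but bail out to the room's existing router on that worker,
  //    since two routers on one worker add nothing.
  const existing = roomRouters.find((r) => r.workerId === worker.id);
  if (existing) return { router: existing, isNew: false };
  // 4. Create a new router there (piping it to the others happens elsewhere).
  return { router: { workerId: worker.id, participants: 0 }, isNew: true };
}
```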

Be aware that this approach is going to require some sort of locking mechanism to prevent 2 users connecting at the same time from potentially both getting new routers.

This has worked fairly well for us so far and has definitely spread the load across CPUs much more effectively by reducing the quadratic nature of scaling of a full-participation (n:n) style room.

Our hope (when we have the time to revisit) is that we can eventually expand this naive approach to piping across servers. Possibly even across regions so that we can optimize calls better without forcing users to select their region ahead of time. The logic is essentially all in place to do so, we just need to build out the application-specific signaling mechanisms to take advantage of it.

I don’t think this concept is anything new to you but I did just want to share it to highlight the fact that it is effective. You may want to focus on scaling up per-machine before taking the plunge into scaling horizontally.

My thoughts on some of your questions:

Is there any way to actually see the health of my worker? …

For this we use the observer API and count all added routers/producers/consumers. We ingest all this data into a dashboard (grafana) along with the CPU usage of the worker so we can see in near-realtime and also historically how these values interact to try to find an “ideal” value. I think it is just much easier to find a good static “ideal” value than to try and determine this dynamically as so much depends on the actual amount of RTP traffic flowing at any given time. As I said, this is a value we are still tweaking and change it on different servers and monitor how it behaves.
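For what it’s worth, the counting part can be as simple as the sketch below; wiring it to mediasoup’s observer events (`newrouter` on `worker.observer`, `newtransport` on `router.observer`, `newproducer`/`newconsumer` on `transport.observer`, and each object’s observer `close` event) is left out:

```javascript
// Plain counters of the kind you could feed from mediasoup's observer API
// and export to a dashboard. track() takes any EventEmitter-like observer
// and decrements the count again when that object closes.
class WorkerStats {
  constructor() {
    this.counts = { routers: 0, transports: 0, producers: 0, consumers: 0 };
  }
  track(kind, observer) {
    this.counts[kind] += 1;
    observer.once('close', () => { this.counts[kind] -= 1; });
  }
}
```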

How bad is latency between machines in a VPC?

I can’t speak to this but would love to hear your results. My hunch is that it would be low enough to be acceptable. I believe the threshold for audio-only is ~200ms of latency before users start noticing something is off. I would imagine the latency from user to datacenter and back is an order of magnitude higher than intra-VPC traffic.

Are created consumers costly or only unpaused consumers actually consuming media?

In my experience just pausing/unpausing has been enough. It was not worth the extra layers of application logic to try and destroy/create them on the fly - another concern for us was the latency involved here and the risk of missing audio from users while re-creating everything.

Are there any failure cases that will ruin calls across workers for everyone that are ongoing?

I believe you should really only see degraded performance (dropped packets, robotic audio, etc) on any routers that are on a worker that is maxing out its CPU. However, there could be cascading effects that impact other workers as the OS tries to manage a process that is taking a lot of CPU.

Are there smooth ways to migrate users across routers automatically in case two calls are on the same router that risk using up all of the available CPU?

This is up to the application. One idea I had been kicking around is basically using temporary pipes to move the users to a new router, then setup new producers/consumers on that router, and then remove the pipe and old transports. Hard to say how seamless this would be in practice as it is a lot of balls to keep in the air.

Good luck!


Thanks for the context! Being able to pause consumers, as opposed to creating/destroying them, should simplify things for us quite a bit.

I don’t actually know how Zoom works under the hood, but this context is helpful! Is speaker view in Zoom using individual streams that they turn on and off for everyone I’m on the call with, or is it just a single pipe down which they send whatever video they want?

For tile view, Zoom is compositing large sheets of people server side and piping one down at a time, right?


Thanks for the detailed reply! We’re currently on a single machine with a bunch of CPU but have a plan that’s pretty close to yours for horizontal scaling. I don’t have a sense of the warm up time for a server but we’ll have to consider that when we start to build that infrastructure out.

We’re focused on ways we can include more people in a “call” than is typically found in an SFU (closer to 100) and what creative technical and product decisions we can make to help that happen.

I’ve been able to have mediasoup scale pretty well so far. I’ve run two tests with screenshare and have shared the results below:

Screenshare Parameters

Resolution: 1920x1080
FPS:        30
Audio:      Yes
Codec:      VP8

Test 1

# Clients:     7x14 => 98 (14VMs; 7tabs/VM)
# Rooms:       1
# SFU servers: 3 (3CPU 1GB, 3 workers)

Test 2

# Clients:     8x22 => 176 (22VMs; 8tabs/VM)
# Rooms:       3 (40, 68, 68 participants)
# SFU servers: 12 (1CPU, 1GB, 1 worker)

In both tests, I ensured that I always got the same CPU (Intel Xeon Gold 6140) on all my VMs. Additionally, all VMs were located in the same datacenter.

Were transitions smooth when clients had to be moved off of one SFU onto another? No. But like Alex and IBC mentioned, I think this is entirely up to the application. In my tests, the downtime was negligible (< 1s downtime). Obviously this may increase depending on geo-location of the servers and application logic. With regard to the latencies, I don’t know how to capture them. If someone can chime in on how to capture latency introduced by hops across multiple SFUs, I can integrate that into my test harness. That said, I can attest that video latencies at least were negligible.

If there are other tests you want me to run, let me know and I will try and run them for you. It gives me a chance to revisit my testing harness, make it more abstract, and automate more parts of it. Hopefully someday I can integrate/build Kite’s frontend with my harness as the backend.


I recently encountered this fork of mediasoup-demo which has Prometheus monitoring built in:

I wonder if it might help measure the “health” of a worker.