I’ve been trying to wrap my head around mediasoup scaling and combing through the resources available to understand the constraints. I put this together for my own understanding but if it’s correct, I hope it might also be useful to other people who join the community. I’d appreciate any feedback and also happy to refine this into something more complete, maybe add some diagrams if it’s interesting to people.
This information is not the result of my own benchmarking; it comes from reading the scaling documentation, reviewing forum posts like this one, and occasionally diving into the source code to understand what’s going on under the hood. Here we go.
The bottleneck for mediasoup is CPU. One mediasoup worker can be created per CPU core on the machine, and the application is responsible for managing load across workers to make sure there’s enough capacity to handle every person on every call.
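For reference, the usual pattern looks roughly like this (a minimal sketch; the function names and port ranges are mine, not from the docs):

```ts
import * as os from 'node:os';
import * as mediasoup from 'mediasoup';

// One worker per CPU core, handed out round-robin. A real app would pick
// workers based on measured load rather than round-robin; ports are arbitrary.
const workers: mediasoup.types.Worker[] = [];
let next = 0;

export async function createWorkers(): Promise<void> {
  for (let i = 0; i < os.cpus().length; i++) {
    workers.push(await mediasoup.createWorker({
      logLevel: 'warn',
      rtcMinPort: 40000 + i * 1000,
      rtcMaxPort: 40999 + i * 1000
    }));
  }
}

export function getNextWorker(): mediasoup.types.Worker {
  const worker = workers[next];
  next = (next + 1) % workers.length;
  return worker;
}
```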
A worker can host multiple routers, each created for a specific call to move media between the people on that call. Since the work is CPU bound and capacity belongs to the worker, creating multiple routers for the same call on the same worker does not increase capacity. However, there are ways to pipe media between workers if the call needs to be bigger than one worker can handle: you create routers on multiple workers for the call and pipe media between them using the router’s [pipeToRouter](https://mediasoup.org/documentation/v3/mediasoup/api/#router-pipeToRouter) function. pipeToRouter pipes media between routers on different workers within the same Node.js process; to pipe between machines, you create PipeTransports on each router and connect them over the network yourself. Just as with routing any data through a network, there is a latency penalty when crossing machines. One open area of research for me is to test how much latency there is when piping media between machines within the same VPC, and whether it’s too slow for a live video call to feel real time. It should be fine for a unidirectional stream, like building a Twitch clone, because there’s no back and forth and it doesn’t really matter whether the audience sees things at exactly the same time.
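For reference, piping a producer from one worker’s router to another within the same process is roughly a one-liner (assuming `routerA`, `routerB`, and `producer` already exist):

```ts
// routerA and routerB live on different workers in the same Node.js process;
// `producer` was created on routerA.
const { pipeConsumer, pipeProducer } = await routerA.pipeToRouter({
  producerId: producer.id,
  router: routerB
});

// pipeConsumer is the extra consumer created on routerA (the capacity
// overhead mentioned later); participants on routerB consume pipeProducer.id.
```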
People who want to join a call connect to a router by creating transports. They then send media to the router through a transport by creating producers, and receive media from the router through a transport by creating consumers of those producers. Consumers appear to be the CPU-intensive part of this process, and they’re another area of research for me. The docs recommend limiting each worker to 500 consumers, but that appears to be a pretty squishy number. For instance, anecdotally, if all of those consumers were audio instead of a blend of audio and video, the worker might be able to sustain roughly 20x as many. Variance in the actual hardware may also affect this. I don’t yet have a good way to measure how healthy a worker is in real time or what its actual capacity is, but I’d like to develop one.
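For anyone newer to the API, the server-side flow looks roughly like this (heavily simplified, all signalling omitted, naming mine):

```ts
import type { types as ms } from 'mediasoup';

// Rough sketch: rtpParameters and rtpCapabilities come from the client
// during signalling; transports are assumed to already be connected.
async function produceAndConsume(
  router: ms.Router,
  senderTransport: ms.WebRtcTransport,
  receiverTransport: ms.WebRtcTransport,
  rtpParameters: ms.RtpParameters,
  rtpCapabilities: ms.RtpCapabilities
) {
  // The sender creates one producer per track (audio, video, screen share...).
  const producer = await senderTransport.produce({ kind: 'video', rtpParameters });

  // Every other participant consumes that producer on their own transport,
  // so consumers are what multiply as the call grows.
  if (!router.canConsume({ producerId: producer.id, rtpCapabilities })) return;

  const consumer = await receiverTransport.consume({
    producerId: producer.id,
    rtpCapabilities,
    paused: true // common pattern: start paused, resume after signalling
  });
  return { producer, consumer };
}
```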
One of the reasons this is tricky is because mediasoup is an SFU which means that every user is consuming everyone else’s media individually. This means that consumers scale quadratically with the number of people on the call. If every user produces audio and video and consumes everyone else’s audio and video, the formula for determining how many consumers are needed on the call is:
n × 2(n-1), or equivalently 2n(n-1)
where n is the number of people on the call. Everyone (n) consumes an audio and video stream from everyone else (n-1). The quadratic scaling makes packing calls across the CPU bound workers hard because a four person call only requires 24 of the 500 consumer capacity, but an 8 person call takes 112, and a 16 person call takes almost the whole worker at 480! Every call starts as a single person and grows from there. If the service is incapable of knowing the size of calls ahead of time, allocating and protecting space is hard without assuming every call is going to be large and leaving an excess of CPU unutilized in reserve.
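As a tiny sanity check of that growth (my own helper, nothing official):

```ts
// Consumers needed for a call where everyone produces and consumes
// audio + video from everyone else, routed through the SFU.
const consumersForCall = (n: number): number => 2 * n * (n - 1);

consumersForCall(4);  // 24
consumersForCall(8);  // 112
consumersForCall(16); // 480 -- nearly the whole 500-consumer heuristic
```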
One way to continue scaling beyond 16-person calls is to pipe media between routers. The challenge is that regardless of which router media is produced on, call participants always consume media on the router their transport is associated with. That means in the 16-person call example, it is not sufficient to simply create a new router for the 17th person and pipe media to it, because the original 16 people would still need to consume the new person’s audio and video on the first router; those additional 32 consumers put the first router above the 500-consumer heuristic limit. I don’t believe the call would fail for everyone if this happens, but I do think new consumers may not be created successfully. The application therefore needs to reserve capacity for everyone on the router to consume audio and video from everyone in the call who is on other routers. Additionally, under the hood, pipeToRouter creates a consumer on the source router that media is being piped from, which needs to be included in the capacity overhead as well.
The formula for the number of consumers required on a particular router when utilizing multiple routers for a call is as follows:
2n(n-1) + 2nm + 2n(r-1), or equivalently 2n(n+m+r-2)
where n is the number of people on the call whose transports are on the current router, m is the number of people on the call whose transports are on other routers, and r is the number of routers handling the call. This changes capacity planning pretty significantly.
To accomplish a 17 person call, the application could start using an additional router after the 14th person joins:
routerA = 2n(n+m+r-2) = 2 × 14 × (14+3+2-2) = 28 × 17 = 476
routerB = 2n(n+m+r-2) = 2 × 3 × (3+14+2-2) = 6 × 17 = 102
And to accomplish a 30 person call, the application could cap each router at seven participants and spread the call across 5 routers:
routerA = 2n(n+m+r-2) = 2 × 7 × (7+23+5-2) = 14 × 33 = 462
routerB = 2n(n+m+r-2) = 2 × 7 × (7+23+5-2) = 14 × 33 = 462
routerC = 2n(n+m+r-2) = 2 × 7 × (7+23+5-2) = 14 × 33 = 462
routerD = 2n(n+m+r-2) = 2 × 7 × (7+23+5-2) = 14 × 33 = 462
routerE = 2n(n+m+r-2) = 2 × 2 × (2+28+5-2) = 4 × 33 = 132
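A small helper (again, my own naming) that reproduces those per-router numbers:

```ts
// n: participants whose transports are on this router
// m: participants on other routers
// r: total routers handling the call
const consumersOnRouter = (n: number, m: number, r: number): number =>
  2 * n * (n + m + r - 2);

// 17-person call split 14 / 3 across two routers:
consumersOnRouter(14, 3, 2); // 476
consumersOnRouter(3, 14, 2); // 102

// 30-person call split 7 / 7 / 7 / 7 / 2 across five routers:
consumersOnRouter(7, 23, 5); // 462
consumersOnRouter(2, 28, 5); // 132
```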
There are a couple of caveats here. Some work for us, but some work against us.
- People want to produce and consume additional media such as screen sharing and watching movies together. This can make things much less predictable from a capacity planning standpoint.
- Not all producers and consumers are necessary at all times; if the application doesn’t need them, it can close them.
- Producers and consumers that are paused may not have any CPU cost. It may be practical to create consumers and producers to stream everyone to everyone without actually resuming that media until the application needs it.
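On that last point, here’s a rough sketch of what “create everything paused, resume only what’s needed” could look like; the page/visibility framing is my own, not anything from mediasoup:

```ts
import type { types as ms } from 'mediasoup';

// Resume consumers for peers currently on screen, pause everyone else.
// `consumersByPeer` maps a peer id to all consumers of that peer's media.
async function showPage(
  consumersByPeer: Map<string, ms.Consumer[]>,
  visiblePeerIds: Set<string>
): Promise<void> {
  for (const [peerId, consumers] of consumersByPeer) {
    for (const consumer of consumers) {
      if (visiblePeerIds.has(peerId)) {
        await consumer.resume(); // start forwarding RTP for on-screen peers
      } else {
        await consumer.pause();  // stop forwarding for off-screen peers
      }
    }
  }
}
```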
Mediasoup wasn’t really designed for large calls. The creators have expressed that it’s a trade off they made consciously because they believe a 100 person video call where everyone can talk to everyone is a pretty unpleasant experience and Zoom probably has that base case pretty well covered. Instead the developers seem to recommend trying to figure out how many people actually need to be speaking or presenting at any given time and using that to calculate whether it is manageable given the worker limitations. Workers won’t actually be overloaded if only one or two participants are actively streaming media at a time, even if they’re being consumed by hundreds of people. Applications can do things like paginating what is visible to consumers to mitigate this burden, even for situations with multiple live videos.
I wrote all of this out to try to further my own understanding of how the technology works so I can start to figure out tradeoffs for my own application, but hopefully it’s also helpful to other people trying to figure out what their scaling needs are. I’d love feedback to help to refine this into a resource that’s easily accessible to people trying to build with mediasoup, or simply find out if it’s the right tool for them.
My specific use case is trying to make a social video tool that allows people to dynamically move between smaller video conversations while still seeing the liveliness of the greater call space as side videos. We’re trying to think hard about the packing problem because it is important to us that people can scan through pages of live video of people having conversations and see enough of what’s going on to join one.
However I’m still trying to understand a few things:
- Is there any way to actually see the health of a worker? Is it simply a matter of adding consumers and watching CPU utilization go up? Or is the best bet to experimentally figure out how many consumers a worker can actually sustain on my cloud provider’s hardware, then keep a count of active consumers per worker and try to stay below it?
- How bad is latency between machines in a VPC? If it’s bad and everyone on a call needs to stay on the same machine, then the service has to scale up with large machines; but if piping across machines is still performant, then the service can scale out more incrementally with small machines and pipe when necessary.
- Are created consumers costly, or do only unpaused consumers actually consume media? Should we invest in creating and destroying consumers on the fly, or in creating them once and simply pausing and unpausing them as needed?
- Are there failure cases that will ruin ongoing calls for everyone across workers, or do these limits simply prevent additional consumers and producers from being created?
- Are there smooth ways to migrate users across routers automatically in case two calls on the same router risk using up all of the available CPU?
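On the first question, one approach I’m considering is simply counting live consumers per worker via the observer API (there’s also worker.getResourceUsage() for process-level stats). A rough sketch, where the 500 threshold is the docs’ heuristic rather than anything mediasoup enforces:

```ts
import type { types as ms } from 'mediasoup';

// Approximate "worker health" by counting live consumers per worker.
const consumerCounts = new Map<number, number>(); // worker.pid -> count

export function trackWorker(worker: ms.Worker): void {
  consumerCounts.set(worker.pid, 0);

  worker.observer.on('newrouter', (router) => {
    router.observer.on('newtransport', (transport) => {
      transport.observer.on('newconsumer', (consumer) => {
        consumerCounts.set(worker.pid, (consumerCounts.get(worker.pid) ?? 0) + 1);
        consumer.observer.on('close', () => {
          consumerCounts.set(worker.pid, (consumerCounts.get(worker.pid) ?? 0) - 1);
        });
      });
    });
  });
}

// Check whether a worker can take `needed` more consumers under the heuristic.
export const hasCapacity = (worker: ms.Worker, needed: number, limit = 500): boolean =>
  (consumerCounts.get(worker.pid) ?? 0) + needed <= limit;
```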
Without fully understanding these limitations, it’s difficult to know which trade-offs to make when scaling up truly large calls. It’s even harder to manage if there will be large calls mixed with small calls and the service has to pack calls without knowing how big they’ll be in advance. Below are some experimental options we’re considering trying out to see if they provide the functionality we’re looking for.
- Spreading participants on calls evenly across all possible workers. This would ensure that each additional participant only consumes worker capacity linearly with the number of people on the call. If the latency penalty for piping across machines were small enough, this could be extended to any worker on any CPU in the cluster. Here we’d trade creating far more consumers, burning more CPU, and doing a lot of piping for predictability and more linear consumption on each worker.
- Creating multiple transports for call participants so that, when they run out of capacity on one worker, they can continue on a new worker.
- Transitioning a person to another router by duplicating their streams and switching them out as quickly as possible on the frontend.