many-to-many scaling

Hello,
I’m fairly new to mediasoup, and managed to build out and test my first mediasoup deployment a few days ago. While I do have logic in place for mediasoup to begin using multiple workers, I opted to disable that logic to test how much load one CPU can take before it needs to fan out. During this testing, I noticed some strange behavior.

1 CPU maxed out with 12 producers, 60 consumers
I opened up 6 browser tabs, each of which was transmitting webcam+mic streams.
If I understood the math, this should’ve resulted in 12 producers: (6 (tabs) x 2 (webcam+mic))
Each of these tabs also received all of the other tabs’ streams,
thereby leading to 60 consumers: ((6 * (6-1)) * 2)
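
As a quick sanity check of the arithmetic above, here is a minimal sketch. The helper name `fullMeshCounts` is hypothetical (it is not part of mediasoup), and it assumes every participant publishes the same number of tracks and consumes every track of every other participant.

```typescript
// Hypothetical helper, not a mediasoup API: producer/consumer counts for a
// call where everyone sends to and receives from everyone else via one router.
interface MeshCounts {
  producers: number;
  consumers: number;
}

function fullMeshCounts(participants: number, tracksPerParticipant: number): MeshCounts {
  // Each participant publishes each of its tracks exactly once.
  const producers = participants * tracksPerParticipant;
  // Each participant consumes every track of every *other* participant.
  const consumers = participants * (participants - 1) * tracksPerParticipant;
  return { producers, consumers };
}

// 6 tabs, each sending webcam + mic.
console.log(fullMeshCounts(6, 2)); // { producers: 12, consumers: 60 }
```

Note that consumers grow quadratically with the number of participants while producers grow linearly.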

Given that this number is a long way from the 500 consumers referenced in the Scalability document, I’m wondering whether this is due to my (poor) configuration, or due to having multiple producers. Are producers ( P ) to be treated just the same as consumers ( C ) while doing the math? In terms of scaling, is 1P+3C equivalent to 2P+2C?

Now, obviously, the other variable here is the CPU itself. In case this helps, here’s what I was able to gather from /proc/cpuinfo

vendor_id       : GenuineIntel
cpu family      : 6
model           : 45
model name      : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz

Here’s the configuration I’ve cobbled together so far from various internet sources (forgive the bad underscore/camel-case keys):

mediasoup:
  server:
    num_workers: 4
    worker_settings:
      logLevel: warn
      rtcIPv4: true
      rtcIPv6: true
      rtcMinPort: 30000
      rtcMaxPort: 49999
    router:
      mediaCodecs:
        - kind: audio
          mimeType: audio/opus
          clockRate: 48000
          channels: 2
          parameters:
            useinbandfec: 1
            sprop-stereo: 1
            cbr: 1
            stereo: 1
            minptime: 10
            maxaveragebitrate: 131072
          rtcpFeedback: []
        - kind: video
          mimeType: video/VP8
          clockRate: 90000
          payloadType: 102
          parameters:
            x-google-start-bitrate: 1000
          rtcpFeedback: []
    webrtc_transport:
      max_incoming_bitrate: 16777216
      initial_available_outgoing_bitrate: 10000000

Is it just poor configuration? Are these numbers expected due to having multiple producers?
What am I doing wrong?

Your math looks right to me. In our experience, 60 consumers should not be maxing out a CPU on the hardware you’re testing on. (We generally see 10-15% of a CPU, in a 60-consumer test, running on AWS c5 instances.)

Any idea if this is due to having multiple producers? Maybe producers:consumers is not a 1:1 ratio with regard to the 500 consumers/CPU calculation…

No, I don’t think it’s a producers issue. The 10-15% of cpu number that I mentioned is for a 6-participant call with all cameras and mics on.

Thank you for that.
I think it’s almost certainly down to some kind of configuration issue then. However, I have no idea how to track down those kinds of issues.

I’m not sure your math is right; the only way to be sure is to debug it. You can simply increment a counter each time a new consumer is created. Another way is to enable worker logs and count consumers by id.
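
The counter idea could look roughly like the sketch below. `ConsumerTally` is a hypothetical bookkeeping class, not something mediasoup provides; it assumes your application calls `onCreated` wherever it creates a consumer and `onClosed` wherever it learns that one has closed.

```typescript
// Hypothetical bookkeeping class (not a mediasoup API). It only tracks ids
// that the application reports to it.
class ConsumerTally {
  private readonly liveIds = new Set<string>();
  private createdTotal = 0;

  // Call right after the application creates a consumer.
  onCreated(consumerId: string): void {
    this.createdTotal += 1;
    this.liveIds.add(consumerId);
  }

  // Call when the application is told the consumer has closed.
  onClosed(consumerId: string): void {
    this.liveIds.delete(consumerId);
  }

  // Number of consumers currently believed to be open.
  liveCount(): number {
    return this.liveIds.size;
  }

  // Number of consumers ever created, including ones closed since.
  createdCount(): number {
    return this.createdTotal;
  }
}

const tally = new ConsumerTally();
tally.onCreated("consumer-a");
tally.onCreated("consumer-b");
tally.onClosed("consumer-a");
console.log(tally.liveCount(), tally.createdCount()); // 1 2
```

Comparing `liveCount()` against the expected figure (60 in the test above) shows quickly whether the application creates duplicate consumers or leaks closed ones.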

What I’m trying to say is that your application logic can behave in ways that make other people’s assumptions misleading.

One of the practices the authors suggest is creating consumers in a paused state and then resuming or pausing them as needed.
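
A minimal sketch of that pattern follows. In mediasoup v3, `transport.consume()` accepts a `paused` option and a consumer has `resume()`; the interfaces below are hand-written stand-ins covering only those members so the example runs without mediasoup installed. `consumePaused` and `notifyClient` are hypothetical names for application code.

```typescript
// Hand-written stand-ins for the small part of the API used here.
interface ConsumerLike {
  id: string;
  paused: boolean;
  resume(): Promise<void>;
}

interface ConsumeOptions {
  producerId: string;
  rtpCapabilities: unknown;
  paused?: boolean;
}

interface TransportLike {
  consume(options: ConsumeOptions): Promise<ConsumerLike>;
}

// Create the consumer paused, let the client set up its side, then resume.
async function consumePaused(
  transport: TransportLike,
  producerId: string,
  rtpCapabilities: unknown,
  notifyClient: (consumer: ConsumerLike) => Promise<void>
): Promise<ConsumerLike> {
  // No media flows to this consumer yet because it starts paused.
  const consumer = await transport.consume({ producerId, rtpCapabilities, paused: true });
  // Hypothetical signaling step: tell the client about the new consumer and
  // wait until it reports that it is ready to receive.
  await notifyClient(consumer);
  // Only now let media flow.
  await consumer.resume();
  return consumer;
}
```

The same `resume()` call (and its counterpart for pausing) can also be driven by application state, for example keeping consumers paused for participants who are not currently visible.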

Fair point.

I followed your advice, and confirmed it by doing a count of producers and consumers, and it matched what I referenced above: 12 producers, 60 consumers. Off-topic, it would be really cool if there was a visualization tool for this. I’ll try and spend some time building one out if there isn’t one already.

Just to follow up on my initial results, it seems like the CPU I was running on was pretty outdated. Intel(R) Xeon(R) CPU E5-2620 was launched in 2012. On a more modern CPU (Intel(R) Xeon(R) Gold 6140), I was able to see numbers similar to what @kwindla observed. 12 producers, 60 consumers now use up about 20% of one CPU.
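
For a rough comparison with the 500 consumers/CPU figure, the observations in this thread can be extrapolated as sketched below. This assumes CPU usage grows linearly with the number of consumers, which is an assumption of the sketch and not something measured here; `estimateConsumersPerCore` is a hypothetical helper.

```typescript
// Hypothetical helper: naive linear extrapolation from one observation.
// Assumes CPU cost per consumer stays constant as load grows (unverified).
function estimateConsumersPerCore(observedConsumers: number, observedCpuFraction: number): number {
  if (observedCpuFraction <= 0 || observedCpuFraction > 1) {
    throw new RangeError("observedCpuFraction must be in (0, 1]");
  }
  return Math.round(observedConsumers / observedCpuFraction);
}

// 60 consumers at about 20% of one CPU (Xeon Gold 6140 result above).
console.log(estimateConsumersPerCore(60, 0.2)); // 300
// 60 consumers at 10-15% of one CPU (the AWS c5 figures quoted earlier).
console.log(estimateConsumersPerCore(60, 0.15)); // 400
console.log(estimateConsumersPerCore(60, 0.1)); // 600
```

Under that assumption the reported numbers bracket the documented 500 figure, but the estimate says nothing about how bitrate or codec choices change the per-consumer cost.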

It would be interesting to know the specs of the machine on which we can expect 500 consumers/CPU. I feel it would make for a stronger reference point.

For video, say, I don’t think such specs are possible, since a consumer depends on its producer, which depends on the video quality being uploaded, which in turn affects things like cryptography CPU usage, encoding/decoding if that’s the case, …

That said, I don’t know much about these things; I would have to dig into mediasoup core to see what uses CPU. A valid ask would be to have a list of the things that use CPU so that people can evaluate their own solutions.

Anyhow, I would love to know more about this subject too. I wonder if @ibc would accept some kind of crowdsourced funding to answer some of these questions in some form like a video, blog post, …

We try to make mediasoup as efficient as possible, but I don’t have those numbers. In fact, I expect to learn some of them in this kind of topic.

Adding data to the pile here:

CPU
Intel® Xeon® Platinum 8124M CPU @ 3.00GHz

Connection Model
1 Audio; 3 Datachannels per peer
1 Router on 1 Worker/CPU

25/30 peers connected reliably. CPU begins to hit 100% in spikes near 25 peers, but holds until 30, where it’s mostly at 100%. Much more than that and it begins to back off and produce static on the audio, but connections don’t drop until it hits a channel timeout error that I purposefully don’t catch.

One caveat: I’m not sure if datachannel producers/consumers count towards the math the same way as the media producers/consumers, as they seem to be a different type of connection. For what it’s worth, I was sending data every 50ms on all datachannels. Each message was under a couple hundred bytes.

Hi James,

Thanks for the above data. I have a few questions.
Were all peers sending streams (both audio and video tracks) to all others and receiving streams from all others?

Was there any lag or audio/video synchronization problem during the call?

What are you using for signaling between the peers? Socket.IO?

Sure, I’ll clear up anything I can.

All peers were sending and receiving audio streams/tracks to all other peers. There are no video tracks in my setup.

The only lag/issue I had was when the server got close to max capacity, at around 30 connected peers.

I am using websockets for signaling, but that shouldn’t matter for performance. The signaling just passes JSON between the client app and the server to establish connections and such.

Hi James,

Do you have ping interval/ping timeout settings for your websocket connections?
And when you reached 30 connected peers, were there any websocket disconnections?

In my case I have set the ping interval/ping timeout to 10/5 seconds, and I observe frequent socket disconnections from the server. But my server’s CPU utilization is less than 20%, the number of peers is 10-12, and everyone sends and receives both audio and video tracks. I have synchronization issues between the audio and video tracks as well.

I do have timeouts set on the websockets; I’m using the built-in ping/pong methods to detect stale clients. I think I have it set to 5000ms (5 seconds).
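
For readers unfamiliar with the idea, a common way to detect stale clients with ping/pong is sketched below. This is not my actual code: `HeartbeatTracker` is a hypothetical class that only models the bookkeeping, and it assumes the application sends a ping to every client once per interval, calls `onPong` when a pong arrives, and terminates whatever `sweep` returns.

```typescript
// Hypothetical bookkeeping for ping/pong stale-client detection.
// It is independent of any particular websocket library.
class HeartbeatTracker {
  // true = a pong (or the initial connect) was seen since the last sweep.
  private readonly alive = new Map<string, boolean>();

  // Register a newly connected client; it counts as alive until the next sweep.
  add(clientId: string): void {
    this.alive.set(clientId, true);
  }

  // Record a pong from a known client.
  onPong(clientId: string): void {
    if (this.alive.has(clientId)) {
      this.alive.set(clientId, true);
    }
  }

  // Run once per ping interval, just before sending the next round of pings.
  // Returns the clients that never answered the previous ping and forgets them;
  // every remaining client is marked as "awaiting pong".
  sweep(): string[] {
    const stale: string[] = [];
    this.alive.forEach((isAlive, clientId) => {
      if (isAlive) {
        this.alive.set(clientId, false);
      } else {
        stale.push(clientId);
        this.alive.delete(clientId);
      }
    });
    return stale;
  }
}

const tracker = new HeartbeatTracker();
tracker.add("alice");
tracker.add("bob");
tracker.sweep();         // first interval: nobody is stale yet
tracker.onPong("alice"); // only alice answers the ping
console.log(tracker.sweep()); // [ 'bob' ]
```

With a 5000ms interval, a client that stops answering is noticed after at most two intervals, so if the process handling the pongs is too busy to process them in time, healthy clients can be dropped as well.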

Websockets used to disconnect or “choke” when mediasoup consumed a lot of resources, leaving too few for the websocket. I’m developing under Node.js, so this was a limitation of running both in the same process. I isolated the webserver/websocket from the mediasoup process. There are some technicalities I’ll spare you, but this stopped the issue where websockets begin to “choke” when mediasoup consumes a lot of resources on the main thread.

If the websockets are choking when you are not heavily using the server, it could be any number of things. It’s best to think of signaling very generically. You could easily replace signaling with hand-written signals that you mail via USPS to the server, and it should work exactly the same as whatever signaling is currently in place.

We may want to start a new thread on this, seeing as we are venturing into other topics. Ping me with a new thread and we can sort some stuff out I think. It sounds like you just need to think about how the websocket connection is working.
