handle v3 node crash with live rooms

I am still at the experimentation phase with mediasoup.
My current idea is to put a cluster of v3 nodes behind an AWS ALB, using the roomID for stickiness.

Is there a common practice on the client side for handling the crash of a node that is serving a live room? Can clients handle the event gracefully, without impacting users too much, with help from some server-side code that restores state as needed, or that at least understands the particular signaling the client sends in that case?

Or is all of this basically left to the implementation built on top of the mediasoup library/server, with everyone having their own way of doing it depending on the use case? (Which is what I suspect, and how mediasoup is meant to be used.)

Thanks for any insights for a new user like me.

I think the idea is flawed, because the ALB will be a bottleneck that all streams pass through before distribution.
Also, AWS bandwidth prices are damn expensive; not sure you're aware of that.

Let me clarify.
The LB is only for the signaling part.
All streams connect directly to the media server nodes holding the rooms, via public IPs.

Yes, at least as it stands right now. You'll have to get into (some sort of) devops for your solution to handle node failures gracefully. You might also have to keep (some sort of) state for your live rooms that you can use to recover afterwards. You might need to spin up a new machine, re-establish connections, and so forth, and also scale down when resources are not needed, so your cloud bill doesn't go through the roof.
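To make "keep (some sort of) state of your live rooms" concrete, here is a minimal sketch of a recoverable room-state registry. Everything here (the `RoomStateStore` class, the snapshot fields) is illustrative, not a mediasoup API; in practice you would mirror this into an external store like Redis or DynamoDB so a replacement node can read it after a crash.

```javascript
// Sketch: an in-memory registry of the minimal facts needed to rebuild a
// room on a replacement node after a crash. In a real deployment this map
// would live in an external store (Redis, DynamoDB, ...), not in-process.
// All names here are hypothetical, not mediasoup APIs.
class RoomStateStore {
  constructor() {
    this.rooms = new Map(); // roomId -> { peers, createdAt }
  }

  // Record which peers belong to a room, so they can be re-invited later.
  saveRoom(roomId, { peers }) {
    this.rooms.set(roomId, { peers: [...peers], createdAt: Date.now() });
  }

  // A replacement node calls this to learn which peers to reconnect.
  recoverRoom(roomId) {
    const snapshot = this.rooms.get(roomId);
    return snapshot ? { roomId, peers: [...snapshot.peers] } : null;
  }

  deleteRoom(roomId) {
    this.rooms.delete(roomId);
  }
}
```

The transports, producers, and consumers themselves cannot be snapshotted; the recovering node recreates them and the clients re-join, which is why the client-side handling discussed above still matters.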

How would this work? I remember looking into this some time ago, but I gave up because I couldn’t find a way to use the roomId to pick the target instance. Let me know if you figure it out, because I’d be interested.

Also, I’ll echo what aetheon said. AWS bandwidth is about 10x more expensive than competitors (like DigitalOcean). I’m currently looking to migrate away, for this reason.

Bandwidth cost will be a concern at some point, but other constraints keep me in AWS for now.
But thanks for the reminder that it's gonna hurt :slight_smile:

The way I see it working so far: I'll make sure the frontend code adds a cookie specific to the room (the roomID, or something else the app manages), and enable stickiness on the AWS ALB listener. The ALB then picks the target group, always the same one when stickiness is enabled, AFAIK. Then the target group picks the same node (this may be the ALB under the hood, I'm not sure), based on the cookie.

At least, that's my understanding so far. I still have to try it with mediasoup, but I believe that's how v3 has been designed to scale horizontally.

There's a lot to be said about horizontal scaling; there are plenty of posts here on the forum.

mediasoup provides you with APIs to achieve horizontal scaling, but it was not designed for it per se, as it's a library.

To understand how difficult horizontal scaling is, check out the Kranky Geek videos on the subject; they provide great insight into the different options.

Thanks for the link!
It is indeed not an easy topic, and it depends on many factors.
Glad I confirmed some of my thoughts in this thread.
One step at a time…

[shameless plug]
I'm working on Mafalda SFU, a solution for vertical and horizontal scaling built on top of mediasoup. Horizontal scaling is not fully finished, but vertical scaling can already help with this topic on a powerful server.

Bandwidth cost would be your first worry the day you host 12-24 viewers… heh.
I blow through almost 30-100+ TB a day with just 27 cores.

As for handling crashes, it's best done client-side: if a user sees lag, they can restart ICE (`restartIce`). This process should reconnect them, but there should also be a way to know whether the media server is offline, via a broker or a chat-server ping/pong to it. Your signaling will tell users when a server is offline, and the user determines, based on lag, when to request a new ICE restart.
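A minimal sketch of the ping/pong side of this, on the client. The class and thresholds are made up for illustration; the recovery action it would gate is real, though: in mediasoup-client you would fetch fresh ICE parameters from the server and call `transport.restartIce()`, or fail over to another node if the server is truly gone.

```javascript
// Sketch: client-side liveness tracking over the signaling channel's
// ping/pong. When too many pongs are missed, the app can trigger recovery
// (an ICE restart, or failover to another node). Names and the threshold
// are illustrative.
class ServerLivenessMonitor {
  constructor({ maxMissedPongs = 3 } = {}) {
    this.maxMissedPongs = maxMissedPongs;
    this.missed = 0;
  }

  // Call when a ping times out without a matching pong.
  onPongMissed() {
    this.missed += 1;
  }

  // Call whenever a pong arrives; the server is alive again.
  onPongReceived() {
    this.missed = 0;
  }

  // The app polls this to decide whether to attempt recovery.
  serverLooksOffline() {
    return this.missed >= this.maxMissedPongs;
  }
}
```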

For scaling, there are so many different ways to go about this. But to help people understand…
0 - 1,000 users (maximum user base):
You want it simple, and for that a broker server should be written that all chat, producer, and consumer servers connect to; it then routes all the data to the respective servers (an optimal NoSQL-style solution).

This solution is savage and really cool for keeping zones connected, but it falls short significantly when overloaded. We can't just keep adding servers here (it slows down and gets complicated fast).
1,000 - 50,000:
You can still use the broker method, but performance will depend on core strength and the ability to queue a high volume of messages per second. This is the peak, though; you'll need to consider dedicated resources so the power scales linearly. Imagine 1,000 servers connected and routed through a broker, plus the user messages per second. This is similar to pub/sub: the I/O rate is not scalable.
50,000 - 100,000++:
This is enterprise level. The best suggestion is to have every server hit a read-only node of MySQL (or your preferred database) and determine where it should connect (making this dedicated). This method keeps the connection count very low. Media servers connect to the chat server directly and provide their service. The chat server keeps a ping/pong going and can tell users about failing servers!

This approach takes some smarts and money, all sorts, but if you figure it out, it's stable performance from 1 room to millions of rooms!
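The "hit a read-only node and determine where to connect" step above can be sketched as a pure selection function. The row shape (`{ host, load, healthy }`) is invented for illustration; the rows would come from whatever table the read replica exposes.

```javascript
// Sketch: given rows read from a read-only DB replica describing media
// servers (a hypothetical { host, load, healthy } shape), pick the
// least-loaded healthy one for a new connection.
function pickMediaServer(rows) {
  const healthy = rows.filter((r) => r.healthy);
  if (healthy.length === 0) return null; // nothing to connect to
  return healthy.reduce((best, r) => (r.load < best.load ? r : best));
}
```

Because the selection reads from a replica, thousands of servers can make this decision without piling connections onto the primary, which is the point of this tier.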

Side note:
There's no perfect solution, obviously, if we compare Discord to Facebook to Twitch to many other services. But that's what should encourage developers: they can redefine this standard.

Testing and studying on a user base of hundreds (paying every cent into it myself), I can confidently share this to help folks.

For hosts, here's your current go-to if you're running up the bill, and I'm no affiliate (this could change any day, but you want unlimited bandwidth).


If we could maybe offer a hosting section, or referrals, that would help a lot of you out. But enjoy those two! :slight_smile: