I’m trying to restore my app’s routers after a mediasoup crash. In my case, when a crash happens all the workers die (I don’t know why, still investigating), so I need to recreate them from scratch and then recreate the routers. Here’s what I’m doing (the code has been simplified, ignoring async, await, etc.):
function createWorkers() {
  for (let i = 0; i < numCpus; i++) {
    const worker = mediasoup.createWorker();
    worker.on('died', () => {
      // only rebuild the pool once every worker is gone
      if (allWorkersDied) {
        createWorkers();
      }
    });
  }
}
function createRoom() {
  const worker = getNextWorker();
  if (worker.closed) {
    // probably all the workers are dead so wait for them to be recreated
    waitForWorkers();
  }
  // 'let' because the handler below replaces the router
  let router = worker.createRouter();
  router.on('workerclose', () => {
    router = createRoom();
  });
  return router;
}
The problem is that the Router.workerclose event fires before Worker.died, and therefore before the workers have been recreated, so I have to wait for the workers to become ready, which I’d rather avoid. Creating workers in the Router.workerclose handler is also unwise because of potential race conditions and added code complexity.
OTOH, a crash does not necessarily kill all the workers, so we can’t conclude that if one worker is closed then all the others are closed too.
Yes. We do not experience crashes at all in our deployments. But if you get crashes, please enable core dumps (in Linux) and check the logs when the worker “died” event happens. It should not happen at all.
The crash happens rarely, and I’m trying to enable core dumps for the next one (I guess it’s my code’s fault). Still, it seems wiser to handle the crash than to call process.exit().
To be perfectly clear: if you get any worker “died” event, it’s NOT your code’s fault (unless you are setting super low kernel limits or something like that, which is unlikely if you are testing your app on localhost with a few peers). I insist: a worker “died” event cannot be caused by any JS API usage.
That’s just what the demo app does. mediasoup does not call any process.exit().
I’m stuck here. When the Router.workerclose event occurs, Worker.closed is still false, but attempting to create a new router throws “Channel closed”. Something is wrong.
As you can see here, before the worker instance emits the “died” event, it calls this.close(), which sets this._closed = true; and then iterates over all routers and calls router.workerClosed(); on each of them.
So when router.workerClosed() is called and emits the router.on("workerclose") event, worker.closed must already be true.
So, if the worker has a router and the worker dies (due to a not yet identified bug), the order of events is:
1. router.on("workerclose") event is fired. When this happens worker.closed is already true.
2. worker.on("died") event is fired.
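To make that ordering concrete, here is a minimal sketch using plain Node.js EventEmitter stand-ins. FakeWorker, FakeRouter and simulateCrash() are hypothetical names invented for illustration, not mediasoup’s real classes; they only mimic the close-then-emit sequence described above.

```javascript
const { EventEmitter } = require('node:events');

// Simplified stand-ins (NOT mediasoup's real Router/Worker classes).
class FakeRouter extends EventEmitter {
  workerClosed() {
    this.emit('workerclose');
  }
}

class FakeWorker extends EventEmitter {
  constructor() {
    super();
    this.closed = false;
    this.routers = new Set();
  }

  createRouter() {
    const router = new FakeRouter();
    this.routers.add(router);
    return router;
  }

  close() {
    if (this.closed) return;
    // The flag is set first, then every router is notified.
    this.closed = true;
    for (const router of this.routers) router.workerClosed();
    this.routers.clear();
  }

  // Hypothetical helper that imitates an unexpected worker exit.
  simulateCrash() {
    this.close();
    this.emit('died', new Error('simulated crash'));
  }
}

// Records which event fires first and what worker.closed is at that moment.
function recordCrashOrder() {
  const events = [];
  const worker = new FakeWorker();
  const router = worker.createRouter();
  router.on('workerclose', () =>
    events.push(`workerclose (worker.closed=${worker.closed})`));
  worker.on('died', () =>
    events.push(`died (worker.closed=${worker.closed})`));
  worker.simulateCrash();
  return events;
}

console.log(recordCrashOrder());
// [ 'workerclose (worker.closed=true)', 'died (worker.closed=true)' ]
```

If your app logs worker.closed as false inside the workerclose handler, it is worth checking that the handler is looking at the same worker instance that created that router.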
Anyway, please focus on getting a core dump (or whatever crash report Ubuntu provides). As said before, mediasoup-worker crashes should NOT happen and, if there is any bug, we’ll fix it ASAP.
Thanks for the clarification. I ended up with the following code to handle a worker dying unexpectedly:
function createWorkers() {
  for (let i = 0; i < numCpus; i++) {
    const worker = mediasoup.createWorker();
    worker.on('died', () => {
      if (allWorkersDied) {
        // rebuild the pool first, then recreate the room on a fresh worker
        createWorkers();
        createRoom();
      }
    });
  }
}
function createRoom() {
  const worker = getNextWorker();
  const router = worker.createRouter();
  router.on('workerclose', () => {
    // can not attempt to create the router because the worker
    // has not received 'died' event yet so no worker is available
    waitForWorker();
  });
}
Regarding the core dump, I’m waiting for the next crash.
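For anyone landing here later, one possible variation is to replace each worker individually when it dies instead of waiting for all of them to die. The sketch below is only an illustration under stated assumptions: WorkerPool, its methods, and the onRouter callback are hypothetical names, and the worker factory is injected so the pool does not depend on mediasoup directly. It also assumes, as described earlier in the thread, that 'died' is emitted in the same tick right after 'workerclose', which is why the room is recreated via setImmediate().

```javascript
// Hypothetical helper: keeps `size` workers and replaces each one when it dies.
class WorkerPool {
  constructor(createWorkerFn, size) {
    this.createWorkerFn = createWorkerFn; // e.g. () => mediasoup.createWorker()
    this.size = size;
    this.slots = []; // each slot is a Promise resolving to a worker
    this.next = 0;
  }

  start() {
    for (let i = 0; i < this.size; i++) this.slots[i] = this.spawn(i);
  }

  spawn(index) {
    return Promise.resolve(this.createWorkerFn()).then((worker) => {
      // Replace only the slot whose worker died.
      worker.on('died', () => {
        this.slots[index] = this.spawn(index);
      });
      return worker;
    });
  }

  // Round-robin; returns a Promise so callers wait for a pending respawn.
  getNextWorker() {
    const slot = this.slots[this.next];
    this.next = (this.next + 1) % this.size;
    return slot;
  }
}

// Creates a router and recreates it if its worker goes away.
// onRouter is a hypothetical callback that lets the app swap in the new router.
async function createRoom(pool, onRouter) {
  const worker = await pool.getNextWorker();
  const router = await worker.createRouter();
  router.on('workerclose', () => {
    // Defer so the pool's 'died' handler has already replaced the slot.
    setImmediate(() => {
      createRoom(pool, onRouter).catch((err) => {
        console.error('could not recreate room:', err);
      });
    });
  });
  onRouter(router);
  return router;
}
```

In a real app the pool would presumably be built with something like new WorkerPool(() => mediasoup.createWorker(), numCpus), and createRouter() would need the usual mediaCodecs option. The sketch does not handle a failing factory or a worker closed manually with worker.close(), where no 'died' event follows.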