Best Architecture for Offloading MediaStream from Mediasoup to a Sidecar Node.js Service for Image Processing

Hi team,

We are running a production-grade Mediasoup server and looking to implement a Node.js-based sidecar service (deployed on Kubernetes with HPA) that will perform image processing and detection on all video streams in a room.

Our goal is to offload this processing from the main Mediasoup server since it’s CPU/memory intensive, and we want to keep our media server lightweight and focused solely on media routing (as it’s the core of our application).

Current Setup:
The Mediasoup server is running successfully.

Multiple rooms, each with multiple users.

Each user is publishing a video stream via Mediasoup.

Kubernetes is used to deploy microservices, including the upcoming sidecar service.

Objective:
What would be the best and most efficient architecture to:

Consume all video streams of a room from the Mediasoup server in a Node.js sidecar service (for frame extraction, image processing, etc.).

Ensure the Mediasoup server is not overloaded.

Scale sidecar services horizontally (per room or per load) via Kubernetes HPA.

Specific Questions:
Is it advisable to create a bot-like Mediasoup peer in the sidecar service that joins each room and consumes all producer streams?

Would a single PlainTransport per user stream (or a combined one per sidecar peer) be ideal? (A rough sketch of what I have in mind follows these questions.)

How can I route the video of all users in a room into this sidecar while maintaining performance and network isolation?

Are there examples or best practices for handling this kind of “observation bot” consumer setup in Mediasoup?
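For the PlainTransport option, this is roughly the per-producer plumbing I have in mind on the Mediasoup side (a simplified sketch; the sidecar IP/port names are placeholders and error handling is omitted):

```js
// Per producer: create a PlainTransport on the router, point it at the
// sidecar's RTP/RTCP ports (placeholder values), and consume the producer on it.
const plainTransport = await router.createPlainTransport({
  listenIp: { ip: '0.0.0.0', announcedIp: MEDIASOUP_PUBLIC_IP },
  rtcpMux: false,
  comedia: false,
});

await plainTransport.connect({
  ip: SIDECAR_IP,           // sidecar pod IP (placeholder)
  port: SIDECAR_RTP_PORT,   // placeholder
  rtcpPort: SIDECAR_RTCP_PORT,
});

const consumer = await plainTransport.consume({
  producerId: producer.id,
  rtpCapabilities: router.rtpCapabilities, // consume with the router's own capabilities
  paused: true,
});

// Resume and request a keyframe once the sidecar's decoder is listening.
await consumer.resume();
await consumer.requestKeyFrame();
```

The sidecar would then receive plain RTP on those ports and decode it, with one transport/consumer pair per producer handled by a given sidecar instance.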

Any architectural guidance, working examples, or common pitfalls to avoid would be extremely helpful.

Thanks in advance

If you need to do video processing you need a decoder and hence a proper RTP consuming endpoint (with jitter buffer, NACK/PLI/FIR capabilities, etc.). You cannot just process video RTP packets the way mediasoup receives or forwards them, since they may arrive out of order and some may be missing; also take into account that a single video frame is split across multiple RTP packets.
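One common way to get such an endpoint on the sidecar is to let FFmpeg (or GStreamer) handle RTP depacketization and decoding of the forwarded stream. A rough Node.js sketch follows, assuming the port, payload type and codec shown (in practice they must match the Consumer's rtpParameters), and keeping in mind that plain FFmpeg provides no NACK/PLI feedback, so you would still need to request keyframes from the mediasoup side:

```js
// Sidecar side: describe the incoming RTP stream in an SDP file and let FFmpeg
// depacketize, decode and dump frames for downstream image processing.
// Port, payload type and codec below are assumptions for the sketch.
const { spawn } = require('child_process');
const { writeFileSync } = require('fs');

const RTP_PORT = 5004; // must match the port passed to plainTransport.connect()

const sdp = [
  'v=0',
  'o=- 0 0 IN IP4 127.0.0.1',
  's=mediasoup-sidecar',
  'c=IN IP4 127.0.0.1',
  't=0 0',
  `m=video ${RTP_PORT} RTP/AVP 101`,
  'a=rtpmap:101 VP8/90000',
].join('\n');

writeFileSync('/tmp/stream.sdp', sdp);

// Extract one JPEG frame per second into /tmp/frames for the detection pipeline.
const ffmpeg = spawn('ffmpeg', [
  '-protocol_whitelist', 'file,udp,rtp',
  '-i', '/tmp/stream.sdp',
  '-vf', 'fps=1',
  '-f', 'image2',
  '/tmp/frames/%06d.jpg',
]);

ffmpeg.stderr.on('data', (chunk) => process.stderr.write(chunk));
```

A GStreamer pipeline with rtpbin gives you a proper jitter buffer and RTCP handling if the naive FFmpeg approach is not enough.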

This is silly but it works…

The most efficient approach for the system is having the client capture, process and submit the image. The problems with this, though, are that the client's CPU usage goes up and that the submitted image may not match what the user is actually displaying on the stream.
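As a rough illustration of that client-side approach (the /api/frames endpoint and the sampling interval are made up):

```js
// Browser side: grab a frame from the local video element, downscale it and
// POST it to a hypothetical processing endpoint.
async function captureAndSubmit(videoElement) {
  const canvas = document.createElement('canvas');
  canvas.width = 320;
  canvas.height = 240;
  const ctx = canvas.getContext('2d');
  ctx.drawImage(videoElement, 0, 0, canvas.width, canvas.height);

  const blob = await new Promise((resolve) =>
    canvas.toBlob(resolve, 'image/jpeg', 0.7)
  );

  await fetch('/api/frames', { method: 'POST', body: blob });
}

// e.g. sample one frame every 2 seconds from the local preview video
setInterval(() => captureAndSubmit(document.querySelector('#localVideo')), 2000);
```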