Realtime AI voice agent with mediasoup

Hi, first of all, thank you for building a great library.

I have been building a realtime AI voice agent with mediasoup. I have run into a problem and would like your suggestions.

How my app works:

  • The client sends media to the server over a WebRTC transport.
  • The server consumes the client media with a direct transport.
  • The media goes to SileroVAD for speech detection, then to Deepgram STT over WebSocket for speech-to-text, then to an LLM via LangChain, then to Deepgram TTS for speech synthesis. The synthesized speech is wrapped in RTP packets (RTP header and SSRC) and sent back to the client over the WebRTC transport.
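
For readers following the last step, here is a minimal sketch of wrapping one encoded audio frame in the fixed 12-byte RTP header (RFC 3550). The names `RtpHeaderFields` and `buildRtpPacket` are hypothetical and only for illustration. The sketch assumes the TTS audio is already Opus-encoded and that the payload type and SSRC match what the producer was created with; it is not necessarily how the app above does it.

```typescript
// Fields of the fixed RTP header (RFC 3550, section 5.1).
// All names here are illustrative, not part of mediasoup.
interface RtpHeaderFields {
  payloadType: number;    // 7 bits; assumed to match the negotiated Opus payload type
  sequenceNumber: number; // 16 bits; incremented by 1 for every packet
  timestamp: number;      // 32 bits; in clock-rate units (48000 Hz for Opus)
  ssrc: number;           // 32 bits; assumed to match the producer's SSRC
  marker?: boolean;       // optional marker bit
}

function buildRtpPacket(fields: RtpHeaderFields, payload: Uint8Array): Uint8Array {
  const packet = new Uint8Array(12 + payload.length);
  const view = new DataView(packet.buffer);
  // Version 2, no padding, no header extension, zero CSRCs.
  view.setUint8(0, 0x80);
  // Marker bit in the top bit, payload type in the lower 7 bits.
  view.setUint8(1, (fields.marker ? 0x80 : 0) | (fields.payloadType & 0x7f));
  // DataView writes big-endian (network order) by default.
  view.setUint16(2, fields.sequenceNumber & 0xffff);
  view.setUint32(4, fields.timestamp >>> 0);
  view.setUint32(8, fields.ssrc >>> 0);
  // Encoded audio frame follows the header.
  packet.set(payload, 12);
  return packet;
}
```

With 20 ms Opus frames the timestamp advances by 960 per packet (48000 Hz × 0.02 s) and the sequence number by 1. On a direct transport, the resulting bytes would typically be passed to `producer.send()` as a Node.js `Buffer`.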

Problem:

  • SileroVAD detects speech on mono PCM audio, so I need to decode the Opus media to 16-bit PCM, convert it to a Float32Array, and downmix it to mono before passing the array to the detector.
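
The conversion described above can be sketched as follows, assuming the Opus decoder has already produced interleaved 16-bit PCM. The function name `pcm16ToMonoFloat32` is hypothetical. Opus decoding and resampling to the sample rate the VAD model expects are separate steps and are not shown.

```typescript
// Convert interleaved 16-bit PCM to a mono Float32Array in the range [-1, 1).
// `channels` is the number of interleaved channels in `pcm` (2 for stereo).
function pcm16ToMonoFloat32(pcm: Int16Array, channels: number): Float32Array {
  if (!Number.isInteger(channels) || channels < 1) {
    throw new Error("channels must be a positive integer");
  }
  if (pcm.length % channels !== 0) {
    throw new Error("sample count is not a multiple of the channel count");
  }
  const frames = pcm.length / channels;
  const mono = new Float32Array(frames);
  for (let i = 0; i < frames; i++) {
    // Average all channels of this frame to downmix to mono.
    let sum = 0;
    for (let c = 0; c < channels; c++) {
      sum += pcm[i * channels + c] ?? 0;
    }
    // Scale from the int16 range to floating point.
    mono[i] = sum / channels / 32768;
  }
  return mono;
}
```

For example, `pcm16ToMonoFloat32(decodedStereoFrame, 2)` returns half as many samples as its input, one per stereo frame.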

Questions:

  1. Should I use a PlainTransport to send the media to another server (Server B) that also runs mediasoup, where I can decode it, resample it, and pass it to SileroVAD and the other AI processes?
  2. Or should I use WebSocket or gRPC just to decode, resample, and downmix to mono on Server B, and then send the media back to Server A, which detects the speech and passes it on to the other processes?

Note: SileroVAD uses onnxruntime-node and Deepgram uses WebSockets.

Using a mediasoup PlainTransport should be fine. It provides efficient RTP transport, lower latency than WebSocket/gRPC, and allows decoding/resampling (via ffmpeg, …) with less CPU load on the WebRTC server.
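
As a rough illustration of the ffmpeg route, the sketch below builds the SDP description and the ffmpeg arguments a receiving process could use to turn the incoming Opus RTP into mono 16-bit PCM on stdout. The helper names `buildOpusSdp` and `buildFfmpegArgs` are hypothetical, and the IP, port, payload type, and output rate are assumptions that must match your own setup. On the mediasoup side, this assumes a transport from `router.createPlainTransport()` that is connected to the same IP and port and consumes the client's audio producer.

```typescript
// Describes where the receiving process listens for the Opus RTP stream.
// All values are assumptions to be replaced with your own setup.
interface OpusRtpInput {
  ip: string;          // address ffmpeg listens on
  port: number;        // RTP port the PlainTransport sends to
  payloadType: number; // must match the consumer's Opus payload type
  channels: number;    // 2 for the usual opus/48000/2 codec entry
}

// Build a minimal SDP so ffmpeg knows how to interpret the RTP packets.
function buildOpusSdp(input: OpusRtpInput): string {
  return [
    "v=0",
    `o=- 0 0 IN IP4 ${input.ip}`,
    "s=mediasoup-audio",
    `c=IN IP4 ${input.ip}`,
    "t=0 0",
    `m=audio ${input.port} RTP/AVP ${input.payloadType}`,
    `a=rtpmap:${input.payloadType} opus/48000/${input.channels}`,
    "",
  ].join("\n");
}

// Build ffmpeg arguments: read RTP via the SDP file, write raw
// little-endian 16-bit mono PCM at `outputRate` Hz to stdout.
function buildFfmpegArgs(sdpPath: string, outputRate: number): string[] {
  return [
    "-protocol_whitelist", "file,udp,rtp",
    "-f", "sdp",
    "-i", sdpPath,
    "-f", "s16le",
    "-ac", "1",
    "-ar", String(outputRate),
    "pipe:1",
  ];
}
```

The SDP text would be written to a file and the arguments passed to a spawned `ffmpeg` process; its stdout then carries PCM that can be converted to float samples for the VAD.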

I made something similar using mediasoup: