I am using MediaSoup to build a voice chat application for language-exchange learning. I would like to record the calls and also send the audio to Google’s Speech-to-Text API. The Speech-to-Text API accepts audio in the OGG-OPUS format, so I need to repackage the Opus audio that MediaSoup delivers as RTP packets into an Ogg container. I’m considering two approaches:
Approach #1: Use a DirectConsumer in my node process.
I’ve been using https://github.com/libersys/rtp-ogg-opus to decode the RTP packets and send the frames into libopustools. This works, but it doesn’t handle lost or out-of-order RTP packets. Following the discussion in this GitHub issue, it feels like my naive implementation isn’t enough.
https://github.com/versatica/mediasoup/issues/433
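To make concrete what "handling lost or out-of-order packets" would mean in my Node process, here is a minimal sketch of a reorder buffer keyed on the RTP sequence number. It is only an illustration under my own assumptions: `rtpSequenceNumber`, `seqDelta`, `ReorderBuffer`, and the `maxPending` threshold are hypothetical names, not part of MediaSoup or rtp-ogg-opus, and a gap is simply reported as `null` once too many later packets have queued up.

```typescript
// Hypothetical helpers for illustration; not part of MediaSoup or rtp-ogg-opus.

/** Read the 16-bit sequence number from bytes 2-3 of the fixed RTP header (RFC 3550). */
function rtpSequenceNumber(packet: Uint8Array): number {
  if (packet.length < 12) throw new Error("RTP packet shorter than the fixed header");
  return (packet[2] << 8) | packet[3];
}

/** Signed distance from a to b modulo 2^16, so 65535 -> 0 counts as +1. */
function seqDelta(a: number, b: number): number {
  return (((b - a) & 0xffff) ^ 0x8000) - 0x8000;
}

/** Releases items in sequence order; a packet given up as lost is reported as null. */
class ReorderBuffer<T> {
  private pending = new Map<number, T>();
  private next: number | null = null; // next sequence number to release
  private maxPending: number;         // assumed threshold before declaring a loss

  constructor(maxPending = 16) {
    this.maxPending = maxPending;
  }

  push(seq: number, item: T): Array<T | null> {
    let next = this.next ?? seq; // the first packet seen defines the starting point
    const out: Array<T | null> = [];
    // Packets older than `next` are late or duplicates and are dropped.
    if (seqDelta(next, seq) >= 0) this.pending.set(seq, item);
    while (this.pending.size > 0) {
      if (this.pending.has(next)) {
        out.push(this.pending.get(next) as T);
        this.pending.delete(next);
      } else if (this.pending.size > this.maxPending) {
        out.push(null); // too many later packets are waiting: treat `next` as lost
      } else {
        break; // keep waiting for the missing packet
      }
      next = (next + 1) & 0xffff;
    }
    this.next = next;
    return out;
  }
}
```

In the DirectConsumer handler, each packet would go through `push(rtpSequenceNumber(packet), packet)`, and whatever comes back would be forwarded to the Ogg muxer. A `null` entry marks where a frame is missing; how best to fill that gap (silence, or letting the decoder conceal it) is part of what I am unsure about.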
Question: How much is a DirectConsumer insulated from the unreliability of RTP? I know MediaSoup will request retransmission of dropped packets from a producer. Does this mean the DirectConsumer doesn’t need to worry about loss or reordering?
I think that for the purposes of recording, a jitter buffer isn’t necessary. Is this a reasonable assumption?
Approach #2: Connect each call through a PlainRtpTransport to ffmpeg (or maybe PyAV) and pipe back OGG-OPUS.
This seems to be a pretty standard configuration for MediaSoup, but it’s a little more complicated.
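For reference, the ffmpeg side of this approach can be sketched roughly as below. This is an assumption-laden sketch rather than my exact setup: `buildOpusSdp` and `buildFfmpegArgs` are hypothetical helper names, the port and payload type are placeholders, and I am assuming that ffmpeg can stream-copy the Opus packets into Ogg (if it cannot in a given build, re-encoding with libopus would be the fallback).

```typescript
// Hypothetical helpers for illustration; names, port and payload type are placeholders.

/** Minimal SDP describing one Opus stream that ffmpeg should receive on localhost. */
function buildOpusSdp(rtpPort: number, payloadType: number): string {
  return [
    "v=0",
    "o=- 0 0 IN IP4 127.0.0.1",
    "s=mediasoup-recording",
    "c=IN IP4 127.0.0.1",
    "t=0 0",
    `m=audio ${rtpPort} RTP/AVP ${payloadType}`,
    `a=rtpmap:${payloadType} opus/48000/2`,
    "",
  ].join("\n");
}

/** ffmpeg arguments: read the SDP from stdin, copy the Opus stream, write Ogg to stdout. */
function buildFfmpegArgs(): string[] {
  return [
    "-protocol_whitelist", "pipe,udp,rtp", // allow SDP over a pipe and RTP over UDP
    "-f", "sdp", "-i", "pipe:0",
    "-c:a", "copy",                         // assumption: no re-encode is needed
    "-f", "ogg", "pipe:1",
  ];
}
```

The idea is to spawn `ffmpeg` with these arguments, write the SDP to its stdin, and read the OGG-OPUS stream from its stdout. The payload type in the SDP has to match the one in the consumer’s `rtpParameters`, and the port has to match the address the transport is connected to.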
I have a proof-of-concept working for both approaches. If anyone has any feedback on either, it would be very helpful.
Thank you!