Media server sourcing audio chunks from an external provider via WebSocket

My question is about using a media server that holds an RTCPeerConnection with a client (a React Native mobile app), so that the media server and the app can exchange data in real time.

The audio from the app is simply the user's voice captured in real time; the app can access the microphone and establish a WebRTC connection with the media server using react-native-webrtc.
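
For context, this is roughly what I have in mind on the client — a minimal sketch assuming react-native-webrtc's standard `getUserMedia`/`RTCPeerConnection` API, with signaling elided:

```typescript
import { mediaDevices, RTCPeerConnection } from 'react-native-webrtc';

// Capture the microphone and attach it to a peer connection.
// The signaling exchange (offer/answer, ICE) is elided here.
async function startAudioCapture(): Promise<RTCPeerConnection> {
  const pc = new RTCPeerConnection({
    iceServers: [{ urls: 'stun:stun.l.google.com:19302' }],
  });

  // Audio-only capture of the user's voice.
  const stream = await mediaDevices.getUserMedia({ audio: true, video: false });
  stream.getAudioTracks().forEach((track) => pc.addTrack(track, stream));

  const offer = await pc.createOffer({});
  await pc.setLocalDescription(offer);
  // Send pc.localDescription to the media server over a signaling channel...
  return pc;
}
```

(If mediasoup ends up being the server, I assume I would use mediasoup-client on the device instead of hand-rolling offer/answer, since it has a React Native handler.)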

The media server is then responsible for taking this audio, sending it to an external provider (11Labs), and returning the audio that 11Labs generates back to the user in real time. I know there is inherent latency in the fact that my media server has to wait for 11Labs to process the data it receives, but is this possible?
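
On the server, I picture the provider leg as a plain WebSocket relay, something like the sketch below. Everything here is an assumption: the endpoint URL is a placeholder, and the auth header and message framing would have to come from the 11Labs docs.

```typescript
import WebSocket from 'ws';

// Placeholder endpoint — the real URL, auth scheme, and message framing
// must come from the 11Labs docs; everything below is an assumption.
const PROVIDER_WS_URL = 'wss://provider.example/realtime';

export function createProviderRelay(
  onGeneratedAudio: (chunk: Buffer) => void
): WebSocket {
  const ws = new WebSocket(PROVIDER_WS_URL, {
    headers: { 'xi-api-key': process.env.ELEVENLABS_API_KEY ?? '' },
  });

  // Assume each incoming message is one chunk of generated audio.
  ws.on('message', (data) => onGeneratedAudio(data as Buffer));

  return ws;
}

// For each audio chunk decoded from the WebRTC leg:
//   relay.send(pcmChunk);
```

If this is right, the media server has to bridge two worlds: decode the RTP audio arriving over WebRTC into raw chunks for the WebSocket, and turn the returned audio back into RTP for the client.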

What are the considerations? Do I need to check whether 11Labs supports a real-time (streaming) connection? Is a real-time connection between 11Labs and my media server even relevant here?

My understanding of mediasoup is that it allows me to create a media server that can act as a peer in a PeerConnection. mediasoup also acts as an SFU (Selective Forwarding Unit): it can take audio from one peer in the connection and route it to another peer. Ideally, I want one of my peers to be my audio server that gets data from 11Labs via a WebSocket (see the 11Labs docs).
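
To make the question concrete, here is the kind of server-side setup I'm picturing, using mediasoup's documented Worker/Router/PlainTransport API. The Opus codec settings are just the usual WebRTC defaults, not anything 11Labs-specific:

```typescript
import * as mediasoup from 'mediasoup';

async function createAudioRouter() {
  const worker = await mediasoup.createWorker();

  // Router with a single Opus audio codec — the usual choice for WebRTC voice.
  const router = await worker.createRouter({
    mediaCodecs: [
      {
        kind: 'audio',
        mimeType: 'audio/opus',
        clockRate: 48000,
        channels: 2,
      },
    ],
  });

  // PlainTransport accepts plain RTP (no ICE/DTLS) — mediasoup's documented
  // way to inject media from a non-WebRTC source such as FFmpeg/GStreamer.
  const plainTransport = await router.createPlainTransport({
    listenIp: '127.0.0.1',
    rtcpMux: false,
    comedia: true, // learn the remote address from the first incoming packet
  });

  return { router, plainTransport };
}
```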

If I need to register the external service as a producer of data, would I have to ensure the external service provider has RTP capabilities? (Reference: How to add media stream to a producer.)
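
Concretely, I imagine the registration would look something like this — a sketch based on mediasoup's documented `PlainTransport.produce()` usage, where the `payloadType` and `ssrc` are illustrative values that would have to match whatever process actually sends the RTP:

```typescript
import { types as msTypes } from 'mediasoup';

// Register the RTP stream arriving on the plain transport as a producer.
// payloadType and ssrc must match what the sending process emits (e.g.
// FFmpeg re-encoding the 11Labs audio to Opus/RTP); values are illustrative.
async function produceExternalAudio(
  transport: msTypes.PlainTransport
): Promise<msTypes.Producer> {
  return transport.produce({
    kind: 'audio',
    rtpParameters: {
      codecs: [
        {
          mimeType: 'audio/opus',
          payloadType: 101,
          clockRate: 48000,
          channels: 2,
        },
      ],
      encodings: [{ ssrc: 22222222 }],
    },
  });
}
```

If that's right, then the external provider itself never needs RTP capabilities; my own bridge process (decoding the WebSocket audio and re-encoding it as Opus/RTP) would be the thing that produces into mediasoup. Is that the correct mental model?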