Sometimes when my ISP briefly drops the line, the PipeTransport connected to a remote server won’t recover and fix itself. For some reason this makes the media servers send blanks back and forth.
Are there suggestions on this?
I was considering pings and was told to use DataChannels, but for some reason no receiving server can see that channel. Besides, UDP vs TCP may be a challenge. So I was curious whether there’s a recovery method, or a means to detect a complete stop while using UDP, and if not, what the best option is.
I’ve been wanting to prevent PipeTransports from getting stuck.
I run my servers so that if you broadcast you land on one server, and that server fans you out to x-many remote servers. I know sometimes it’s a location thing: I could have 6 of 7 servers routing fine while 1 server sends no stream.
For users I do ICE and set everything up when the user needs it, but for PipeTransports (server-side) I must assume the connection is there, or will be in time. The stream will eventually be valid to consume. I’m not sure it’s recovering, though: the user connects and gets a stream (audio/video), but the pipes are doing weird stuff.
I think if we check the statistics of the PipeTransport at some interval, we can see when it failed: if no bytes were received for a specific interval, we can re-initiate the piping process to refresh things. But polling it at every interval might be overkill.
Another thing is that on the app side we know when a peer connection goes to the ‘failed’ state, so when the majority of the ‘consumers’ of this ‘pipe transport’ go into failed status, the server can assume the pipe transport got disturbed and re-initiate it to refresh things.
As PipeTransport is Ice-Lite I think there is no direct way to check on server whether PipeTransport is failed.
I’ll confirm if I can gather the statistics per PipeTransport (remotely). I forget if I checked that already, but could be the answer.
Yeah, I can’t see whether a PipeTransport failed or connected; I must manually signal each state and assume it worked. So I’ll see if stats work at all in this scenario to tell me if things are failing.
UPDATE:
I can pull the stats. I’ll have to perform some tests; I’m thinking that if bytes received/sent stop growing for x amount of time, I restart the worker in the hope that it was just that server’s route. I’ll have to get back on this one: I need to do some network trickery on my end to replicate my issues.
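A minimal sketch of that "counter stopped growing" check, in TypeScript. StallDetector is a hypothetical helper, not part of mediasoup; it only assumes you can read a monotonically growing byte counter (for example the bytesReceived field from the transport's stats on the receiving server) and pass in the current time yourself.

```typescript
// Hypothetical helper (not a mediasoup API): flags a stall when a byte
// counter has not changed for at least `stallAfterMs` milliseconds.
class StallDetector {
  private readonly stallAfterMs: number;
  private lastValue: number | null = null;
  private lastChangeAt = 0;

  constructor(stallAfterMs: number) {
    this.stallAfterMs = stallAfterMs;
  }

  // Feed one sample of the counter together with the time it was taken.
  // Returns true once the counter has been flat for stallAfterMs or longer.
  update(counter: number, nowMs: number): boolean {
    if (this.lastValue === null || counter !== this.lastValue) {
      // First sample, or the counter moved: remember it and reset the clock.
      this.lastValue = counter;
      this.lastChangeAt = nowMs;
      return false;
    }
    return nowMs - this.lastChangeAt >= this.stallAfterMs;
  }
}
```

In use you would poll the transport's stats on a timer and call update(bytesReceived, Date.now()); the polling interval and the threshold (20 seconds, say) are yours to pick. Note that a flat counter only says no bytes arrived, not why: a producer that legitimately stopped sending looks the same as a dead pipe.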
You may have an easier time listening for the PipeTransport event sctpstatechange (see mediasoup :: API). This should give you the event you need to act on when the pipe state becomes failed or closed.
That seems a better option if it works on the same transport. I mean, when we enable SCTP (enableSctp) on a PipeTransport and we are sending a track over it, if that transport goes into disconnected or failed status, will it fire the sctpstatechange event?
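A rough sketch of wiring such a listener, in TypeScript. watchSctpState and SctpStateSource are hypothetical names used only for this example; the event name sctpstatechange and the state values follow the mediasoup API docs. A real PipeTransport created with SCTP enabled should be usable in place of the interface, possibly with a cast.

```typescript
// SCTP states as documented by mediasoup.
type SctpState = "new" | "connecting" | "connected" | "failed" | "closed";

// Hypothetical minimal shape of a transport that emits "sctpstatechange".
interface SctpStateSource {
  on(event: "sctpstatechange", listener: (state: SctpState) => void): unknown;
}

// Calls onDown whenever the SCTP association reports "failed" or "closed".
function watchSctpState(
  transport: SctpStateSource,
  onDown: (state: SctpState) => void
): void {
  transport.on("sctpstatechange", (state) => {
    if (state === "failed" || state === "closed") {
      onDown(state);
    }
  });
}
```

This only reacts to state changes the transport itself reports, so treat it as one signal among others rather than a full liveness check.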
Not really. Such an event happens when the remote sends us a DTLS ALERT message. If the remote crashes or its connection to the network is completely down, then it won’t be able to send any DTLS ALERT to us, and we won’t notice any disconnection.
No magic here. This is the same in UDP, TCP and everywhere. Unless you use a ping-pong mechanism you cannot be sure of the connection state.
@ibc besides the ping-pong option, we have this statistics way to determine the state of the transport, by checking whether no bytes are received for a specific interval. Do you think it will be reliable?
What I mean is that we have a pipe transport and on that transport we are sending/receiving a track, right? On the server side we can see the number of bytes being sent/received, so if for some reason no bytes are sent/received for, say, 20 seconds, we can surely say something is wrong and re-create the transport to recover. Will that be reliable?
And if there is no track being sent? And if the Producer of such a track is having network problems so no RTP packets are sent for more than 20 seconds through the pipe transport? Would that mean that the pipe transport that connects two mediasoup Routers (in different servers) has disconnected?
I mean, you can do this as you wish, but I fail to understand why nobody wants to do this properly. No magic here: the only way to know if a connection through the network is still connected is by sending something and receiving an ACK or by checking that the remote sends an expected “ping” message to us every N seconds. That’s true for pipe transport, for WebRTC transports (ICE keep alive is exactly that), for basic TCP connections (in which TCP built-in keepalive can be enabled or app level ping/pong can be implemented), for UDP “connections”, etc etc.
So, if you have a pipe transport connecting 2 mediasoup servers, and you can send periodic pings via SCTP from one to the other, and you can monitor the receipt of ping and/or pong messages… then why would you prefer to rely on other received bytes that may legitimately not happen due to reasons you cannot really control?
If we have a producer on a PipeTransport and we have enabled SCTP on it as well, will those SCTP messages go through the same PipeTransport, or will they use some other mechanism? I mean, if we don’t receive a ping/pong over SCTP for, let’s say, 10 seconds, does that also mean the track is not flowing for that period of time?
Again: I don’t mean “forwarding of messages sent by users”. I mean generate and send a ping directly from one server to the other and vice versa. DirectTransport, DataProducer and DataConsumer.
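The liveness logic of such a server-to-server ping/pong could look roughly like this in TypeScript. PingMonitor, SendFn, and the JSON message shape are all hypothetical and not part of mediasoup; the transport is abstracted behind a send callback. With mediasoup, the assumption is that send would call a DataProducer created on a DirectTransport and piped to the other server, and onMessage would be fed from the matching DataConsumer's message event.

```typescript
// Hypothetical: sends one text message to the remote server.
type SendFn = (payload: string) => void;

// Hypothetical helper (not a mediasoup API): sends pings, answers pings
// with pongs, and treats the remote as dead after timeoutMs of silence.
class PingMonitor {
  private readonly send: SendFn;
  private readonly timeoutMs: number;
  private seq = 0;
  private lastSeenAt: number;

  constructor(send: SendFn, timeoutMs: number, nowMs: number) {
    this.send = send;
    this.timeoutMs = timeoutMs;
    this.lastSeenAt = nowMs;
  }

  // Call every N seconds. Sends a ping and returns true while the remote
  // has been heard from within the last timeoutMs milliseconds.
  tick(nowMs: number): boolean {
    this.seq += 1;
    this.send(JSON.stringify({ type: "ping", seq: this.seq }));
    return nowMs - this.lastSeenAt < this.timeoutMs;
  }

  // Call for every message received from the remote server.
  onMessage(raw: string, nowMs: number): void {
    let msg: { type?: string; seq?: number };
    try {
      msg = JSON.parse(raw);
    } catch {
      return; // ignore anything that is not valid JSON
    }
    if (msg === null || typeof msg !== "object") return;
    if (msg.type === "ping") {
      // The remote is alive and asking: answer and record the contact.
      this.lastSeenAt = nowMs;
      this.send(JSON.stringify({ type: "pong", seq: msg.seq }));
    } else if (msg.type === "pong") {
      this.lastSeenAt = nowMs;
    }
  }
}
```

Each server would run one monitor per pipe and call tick from a timer; when tick returns false, that is the point to tear down and re-create the pipe. Because the pings are generated by the servers themselves, silence cannot be explained by a producer that simply stopped sending media.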
The statistics tell me that if the bitrate received/sent hasn’t changed in several seconds, we’ve timed out.
Last I checked, I couldn’t use Data Channels on PipeTransports. It should work, but for some reason the channel is never present on either server. So unless there’s something I missed I would have loved to use that approach to ping/pong with DataProducer and DataConsumers.
I’m in the middle of bathroom renovations right now, but I would gladly find time to replicate this issue and explain it further. Otherwise I agree that a ping/pong would best reveal higher latency or a complete cut-off.
I’ll test again soon to confirm but data channels were not functioning correctly in my tests.
I do use a VM where I may open several workers and make them act local or remote (over same IP). Maybe I set things up wrong but I’ll open another topic describing this issue when I get to it.
IBC, you work hard; sorry this post led to you being tagged. I’ll update this post with the most suitable answer when I have things set up the way I need, to help others.