Much worse data channel reliability in latest mediasoup-client

jure · March 1, 2023, 10:36am

Hi! We recently upgraded from mediasoup-client 3.6.57 to 3.6.80. We didn’t notice anything wrong immediately, but the reliability of our data channel connections became a lot worse.

At first we thought some of our other code changes impacted this, but when we reverted only the mediasoup-client upgrade, reliability returned to baseline. We have thousands of data channel connections daily, so the change is definitely statistically significant.

I’ve attached a screenshot of our data channels failed vs data channels opened chart, expressed in %, with times of upgrades/reverts annotated too - in summary, the failure rate went from 0-5% to 10-30%.

Do you have any thoughts on which changes in the client could have caused this? I thought it would be good to ask, as we unfortunately do not yet have an isolated reproduction. We will continue digging, however.

Thank you for your help!

ibc · March 1, 2023, 1:20pm

I fail to see how the mediasoup-client version can affect the reliability of data channels. It’s just a wrapper when it comes to the DataChannel API. Which handler are you using? Chrome74 or Chrome111?

jure · March 1, 2023, 2:53pm

We’re as surprised as you are. We’re using new Device() without supplying a specific handler, so leaving it up to the auto-detect.

That would mean that when we switched from 3.6.57 to 3.6.80 the handler also switched? But given that almost nobody is on Chrome 111 yet, the auto-detected handler would likely remain Chrome 74 in most cases.

ibc · March 1, 2023, 3:08pm

device.handlerName will tell you which handler it chose. It’s basically impossible that your problem is related to the mediasoup-client version, especially if the same Chrome74 handler is being used.
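A minimal sketch of how the chosen handler might be recorded with each transport state log, so that failure rates can later be split per handler. It relies only on the `handlerName` getter mentioned above; the `HasHandlerName` type, the `formatStateLog` helper and the log format are assumptions made for illustration and are not part of mediasoup-client.

```typescript
// Structural type covering only the property used here. A mediasoup-client
// Device exposes `handlerName`; everything else below is hypothetical.
interface HasHandlerName {
  readonly handlerName: string;
}

// Builds a log line that carries the handler name with every transport state.
function formatStateLog(
  device: HasHandlerName,
  direction: "Producer" | "Consumer",
  state: string
): string {
  return `${direction} transport state: ${state} [handler=${device.handlerName}]`;
}

// Example with a stand-in object instead of a real Device.
const fakeDevice: HasHandlerName = { handlerName: "Chrome74" };
console.log(formatStateLog(fakeDevice, "Producer", "failed"));
```

With the handler name in every log line, a chart like the one in the first post could be grouped by handler to check whether both versions really used the same one.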

snnz · March 1, 2023, 4:10pm

The legend on the chart says “count(Producer transport failed)”, but what exactly does “transport failed” mean? That the transport (i.e. the peer connection) entered a failed state? That is not necessarily related to data channels. More specific statistics distinguishing the different kinds of failures are needed to understand the problem.

jure · March 1, 2023, 9:23pm

What I posted is a graph of count("Producer transport state: failed") divided by count("Producer transport state: connected"), both of which are logged .on("connectionstatechange") for our producers.

You’re right, what we’re looking at here specifically is the WebRTC producer transport state, not just the associated data channels, but the effects for us were mostly visible in various data channel related features failing, thus the title. Apologies!

Same reliability change occurred for our consumer transports as well.

Interestingly, plotting the disconnected vs connected state displays no such change. So it looks like more disconnects now go into the failed mode. And indeed plotting failed vs disconnected does clearly reveal that.
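The three ratios compared in this post (failed vs connected, disconnected vs connected, failed vs disconnected) could be computed from logged state events roughly as follows. This is a sketch under assumptions: the `LoggedState` type, `computeStateRatios` and the sample data are made up for illustration, and the real chart was produced by a logging backend not shown in this thread.

```typescript
// States as logged from the transport "connectionstatechange" event.
type LoggedState = "connecting" | "connected" | "disconnected" | "failed" | "closed";

interface StateRatios {
  failedVsConnected: number;       // percent
  disconnectedVsConnected: number; // percent
  failedVsDisconnected: number;    // percent
}

// Returns numerator / denominator as a percentage, or NaN when the denominator is 0.
function percent(numerator: number, denominator: number): number {
  return denominator === 0 ? NaN : (numerator / denominator) * 100;
}

// Counts each logged state and derives the three ratios from the counts.
function computeStateRatios(states: LoggedState[]): StateRatios {
  const counts: Record<LoggedState, number> = {
    connecting: 0,
    connected: 0,
    disconnected: 0,
    failed: 0,
    closed: 0,
  };
  for (const state of states) {
    counts[state] += 1;
  }
  return {
    failedVsConnected: percent(counts.failed, counts.connected),
    disconnectedVsConnected: percent(counts.disconnected, counts.connected),
    failedVsDisconnected: percent(counts.failed, counts.disconnected),
  };
}

// Made-up sample: 4 connected, 2 disconnected, 1 failed.
const sample: LoggedState[] = [
  "connected", "connected", "disconnected", "connected",
  "disconnected", "failed", "connected",
];
console.log(computeStateRatios(sample));
```

A stable disconnected vs connected ratio combined with a rising failed vs disconnected ratio is the pattern described above: the same number of disconnects, but more of them reported as failed.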

We will dig a bit deeper and return with more data!

jure · March 5, 2023, 11:05pm

Turns out it’s a bit more complicated, or less complicated, depending on one’s viewpoint.

Basically our perception of transport reliability was impacted by this change: Handle pc.connectionState (when supported) instead of pc.iceConnectio… · versatica/mediasoup-client@3134822 · GitHub

So in other words, while the underlying reliability of our transports has remained the same in 3.6.80 (vs 3.6.57), we now get a lot more “failed” states, which is what we use to signal to our users that their connection has been disrupted. So users complained, since they saw many more network notifications etc. The other side of the coin is worse, however: we had not been telling Chrome users that their connection had “failed” even when it did die, because we were not aware of it, as the reported state stayed at “disconnected”.
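To make the difference concrete, here is a small sketch contrasting the two mappings. The first function follows the `iceconnectionstatechange` switch statement quoted in the GitHub issue further down this thread. The second assumes that the newer code simply passes `pc.connectionState` through, which is an assumption based on the commit title and not verified against the mediasoup-client source. All function and type names are hypothetical.

```typescript
type IceState =
  | "new" | "checking" | "connected" | "completed"
  | "disconnected" | "failed" | "closed";
type PcState =
  | "new" | "connecting" | "connected" | "disconnected" | "failed" | "closed";
type EmittedState =
  | "connecting" | "connected" | "disconnected" | "failed" | "closed";

// Older behaviour, mirroring the switch statement quoted in the issue.
function fromIceConnectionState(ice: IceState): EmittedState | undefined {
  switch (ice) {
    case "checking":
      return "connecting";
    case "connected":
    case "completed":
      return "connected";
    case "failed":
      return "failed";
    case "disconnected":
      return "disconnected";
    case "closed":
      return "closed";
    default:
      return undefined; // "new" emits nothing
  }
}

// Assumed newer behaviour: forward pc.connectionState unchanged.
function fromConnectionState(pc: PcState): EmittedState | undefined {
  return pc === "new" ? undefined : pc;
}

// Scenario from the issue: the retry window expired, so the peer connection
// reports "failed" while the ICE state is still "disconnected".
console.log(fromIceConnectionState("disconnected"));
console.log(fromConnectionState("failed"));
```

In that scenario the older mapping keeps reporting "disconnected" while the newer one reports "failed", which would match the jump in “failed” counts without any change in underlying reliability.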

So the updated version is great, causing us to come to terms with how unreliable these connections can be over the longer term (we get a 10-20% failure rate over say 20-30 minutes, which seems quite high, but I don’t have external data to compare it to - is it high?). As a result of these findings, we’re now implementing connection rescue and proper reconnection mechanisms, as well as digging into transport reliability.
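One possible shape for the decision logic behind such a reconnection mechanism is sketched below. It is not what we or mediasoup-client actually implement; `decideRecovery`, `RecoveryPolicy`, the action names and the numbers in the example policy are all assumptions chosen for illustration.

```typescript
type ObservedState =
  | "new" | "connecting" | "connected" | "disconnected" | "failed" | "closed";
type RecoveryAction = "none" | "wait" | "restart-ice" | "recreate-transport";

interface RecoveryPolicy {
  disconnectedGraceMs: number; // how long "disconnected" is tolerated before acting
  maxIceRestarts: number;      // ICE restarts to attempt before giving up on the transport
}

// Pure decision function: given the current transport state, how long it has
// lasted and how many ICE restarts were already tried, pick the next action.
function decideRecovery(
  state: ObservedState,
  msInState: number,
  iceRestartsSoFar: number,
  policy: RecoveryPolicy
): RecoveryAction {
  const canRestart = iceRestartsSoFar < policy.maxIceRestarts;
  switch (state) {
    case "disconnected":
      // The browser may still recover on its own, so wait out the grace period.
      if (msInState < policy.disconnectedGraceMs) return "wait";
      return canRestart ? "restart-ice" : "recreate-transport";
    case "failed":
      return canRestart ? "restart-ice" : "recreate-transport";
    default:
      return "none";
  }
}

// Example policy with made-up values.
const examplePolicy: RecoveryPolicy = { disconnectedGraceMs: 5000, maxIceRestarts: 2 };
console.log(decideRecovery("disconnected", 1000, 0, examplePolicy)); // "wait"
console.log(decideRecovery("failed", 0, 0, examplePolicy));          // "restart-ice"
console.log(decideRecovery("failed", 0, 2, examplePolicy));          // "recreate-transport"
```

In a real client the "restart-ice" action would typically mean requesting fresh ICE parameters from the server and passing them to `transport.restartIce({ iceParameters })`; the signaling needed for that is application-specific and left out here.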

Some more very relevant discussions on this topic for anyone who stumbles across this in the future:

github.com/versatica/mediasoup-client

Transport `connectionstatechange` event mismatch

opened 11:04AM - 06 Dec 22 UTC

closed 11:39AM - 10 Dec 22 UTC

ezioda004

bug

## Bug Report

Hi,

From the [documentation](https://mediasoup.org/documenta…tion/v3/mediasoup-client/api/#transport-on-connectionstatechange), `transport.on("connectionstatechange", fn(connectionState))`, emits `RTCPeerConnectionState`. However, mediasoup-client internally listens and emits `RTCIceConnectionState`.

### Your environment

- Operating system: MacOS
- Browser version: Chrome 108
- npm version: 7.24
- mediasoup version: 3.11.3
- mediasoup-client version: 3.6.57

### Issue description

The issue with this comes during disconnection. Suppose, you turn off the internet, `connectionstatechange` goes from `connected` -> `disconnect` and there's still a chance that the connection can go `connected` again (there's a 10-second window for the next retry). If the retry is unsuccessful, `RTCPeerConnectionState` emits `failed` event whereas `RTCIceConnectionState` is still in `disconnect` state so transport never emits the `failed` event.

For ice restarts, I'm waiting for this `failed` state, rather than `disconnect` state because of the auto retry. Now, since mediasoup-client only emits `RTCIceConnectionState` on transport, I have to internally listen to this private state via `transport.handler._pc.connectionState` to find out if the actual connection state and then do an ice restart.

This is mediasoup-client code related to this:

```
// Listens to RTCIceConnectionState and not RTCPeerConnectionState
this._pc.addEventListener('iceconnectionstatechange', () => {
  switch (this._pc.iceConnectionState) {
    case 'checking':
      this.emit('@connectionstatechange', 'connecting');
      break;
    case 'connected':
    case 'completed':
      this.emit('@connectionstatechange', 'connected');
      break;
    case 'failed':
      this.emit('@connectionstatechange', 'failed');
      break;
    case 'disconnected':
      this.emit('@connectionstatechange', 'disconnected');
      break;
    case 'closed':
      this.emit('@connectionstatechange', 'closed');
      break;
  }
});
```

Wanted to know if this is intentional. If so, then the documentation should reflect the actually used state. Having the transport emit the `RTCPeerConnectionState` would be useful for the above case mentioned instead of listening on the `RTCPeerConnection` object itself.

zaidiqbal · March 6, 2023, 7:24am

I see. So the issue of the connection state getting stuck at disconnected and never going to the failed state is fixed in the new version. That is great.
