Hi! We recently upgraded from mediasoup-client 3.6.57 to 3.6.80, and while we didn’t notice anything wrong immediately, the reliability of our data channel connections became a lot worse.
At first we thought some of our other code changes impacted this, but when we reverted only the mediasoup-client upgrade, reliability returned to baseline. We have thousands of data channel connections daily, so the change is definitely statistically significant.
I’ve attached a screenshot of our data channels failed vs data channels opened chart, expressed in %, with times of upgrades/reverts annotated too - in summary, the failure rate went from 0-5% to 10-30%.
Do you have any thoughts on what changes in client could have caused this? I thought it would be good to ask, as we unfortunately do not yet have an isolated reproduction. We will continue digging, however.
We’re as surprised as you are. We’re using new Device() without supplying a specific handler, so leaving it up to the auto-detect.
That would mean that when we switched from 3.6.57 to 3.6.80 the handler also switched? But given that almost nobody is on Chrome 111 yet, the auto-detected handler would likely remain Chrome 74 in most cases.
The legend on the chart says “count(Producer transport failed)”, but what exactly does “transport failed” mean? That the transport (i.e. peer connection) entered a failed state? That is not necessarily related to data channels. More specific statistics distinguishing the different kinds of failures are needed to understand the problem.
What I posted is a graph of count("Producer transport state: failed") divided by count("Producer transport state: connected"), both of which are logged .on("connectionstatechange") for our producers.
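For anyone wanting to build the same kind of chart, here is a minimal sketch of the counting described above. The `StateTally` class and its method names are made up for illustration and are not part of mediasoup-client; only the state names and the "connectionstatechange" event come from the library.

```typescript
// Connection states emitted by a mediasoup-client transport's
// "connectionstatechange" event.
type ConnectionState =
  | "new"
  | "connecting"
  | "connected"
  | "disconnected"
  | "failed"
  | "closed";

// Hypothetical helper: counts how often each state was observed.
class StateTally {
  private counts = new Map<ConnectionState, number>();

  record(state: ConnectionState): void {
    this.counts.set(state, (this.counts.get(state) ?? 0) + 1);
  }

  count(state: ConnectionState): number {
    return this.counts.get(state) ?? 0;
  }

  // Ratio of two state counts, e.g. failed / connected.
  // Returns NaN when the denominator state was never seen.
  ratio(numerator: ConnectionState, denominator: ConnectionState): number {
    const d = this.count(denominator);
    return d === 0 ? NaN : this.count(numerator) / d;
  }
}

// Assumed wiring (the transport object comes from your own app):
//   transport.on("connectionstatechange", (state) => tally.record(state));
//   tally.ratio("failed", "connected");     // the chart posted above
//   tally.ratio("failed", "disconnected");  // how many disconnects end up failed
```

Keeping the raw per-state counts rather than a single precomputed percentage makes it possible to plot other pairs of states later without changing the logging.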
You’re right, what we’re looking at here specifically is the WebRTC producer transport state, not just the associated data channels, but the effects for us were mostly visible in various data channel related features failing, thus the title. Apologies!
The same reliability change occurred for our consumer transports as well.
Interestingly, plotting the disconnected vs connected state displays no such change. So it looks like more disconnects now go into the failed mode. And indeed plotting failed vs disconnected does clearly reveal that.
We will dig a bit deeper and return with more data!
So in other words, while the underlying reliability of our transports has remained the same in 3.6.80 (vs 3.6.57), we now get a lot more “failed” states, which is what we use to signal to our users that their connection has been disrupted. So users complained, since they saw many more network notifications etc. The other side of the coin is worse, however: on the old version we had not been telling Chrome users that their connection had “failed” even when it did die, because we were not aware of it, as the state stayed “disconnected”.
So the updated version is great, as it has made us come to terms with how unreliable these connections can be over the longer term (we get a 10-20% failure rate over, say, 20-30 minutes, which seems quite high, but I don’t have external data to compare it to - is it high?). As a result of these findings, we’re now implementing connection rescue and proper reconnection mechanisms, as well as digging into transport reliability.
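To make the idea concrete, here is a rough sketch of one way such a rescue decision could look. It is not our actual implementation: `ConnectionWatchdog`, `graceMs` and the method names are hypothetical. The assumption is that “failed” should trigger recovery right away, while a “disconnected” state that lingers past a grace period should be treated as dead too, since a transport can sit in “disconnected” without ever reporting “failed”.

```typescript
type ConnectionState =
  | "new"
  | "connecting"
  | "connected"
  | "disconnected"
  | "failed"
  | "closed";

// Hypothetical helper that decides when a transport needs rescuing.
class ConnectionWatchdog {
  private graceMs: number;
  private disconnectedSince: number | null = null;

  constructor(graceMs: number) {
    this.graceMs = graceMs;
  }

  // Feed every "connectionstatechange" value in here.
  // Returns true when recovery should start immediately.
  onState(state: ConnectionState, nowMs: number): boolean {
    switch (state) {
      case "connected":
        this.disconnectedSince = null; // recovered on its own
        return false;
      case "disconnected":
        // Remember when the current outage started.
        if (this.disconnectedSince === null) this.disconnectedSince = nowMs;
        return false;
      case "failed":
        this.disconnectedSince = null;
        return true;
      default:
        return false;
    }
  }

  // Poll this periodically: true when the transport has been stuck in
  // "disconnected" for at least graceMs.
  isStale(nowMs: number): boolean {
    return (
      this.disconnectedSince !== null &&
      nowMs - this.disconnectedSince >= this.graceMs
    );
  }
}
```

What the recovery itself does is up to the application. With mediasoup the usual candidates are an ICE restart (restartIce() on the server-side transport, then restartIce({ iceParameters }) on the client transport) or recreating the transport entirely.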
Some more very relevant discussions on this topic for anyone who stumbles across this in the future: