Transports never transition to 'connected', seemingly due to using internal IP for local ICE candidates

We’ve been using mediasoup without issue for over a year now, but in the last couple of weeks we have had intermittent problems in our deployed app: transports fail to connect and then transition to the ‘disconnected’ state within 10 seconds (local testing always works fine). I’m not sure exactly when this started, so I don’t have a known-good version of our deployment to roll back to for comparison.

The weird thing is how intermittent these problems are. Sometimes both the send and receive transports that a client is setting up connect successfully; sometimes both fail; sometimes one succeeds and the other fails.

I’ve had to relearn a lot of WebRTC concepts trying to debug this, since the stability of mediasoup up to this point has led me to focus on other parts of our application. The debug process has me currently looking at chrome://webrtc-internals to try to figure out why these connections aren’t successful. It seems like, when the transports do connect successfully, the local RTCIceCandidate that gets used for the connection has my external IP address, while when the transports fail to connect, all of the local RTCIceCandidates that are listed have internal IPs, e.g. 10.1.214.64, 192.168.0.60, etc. I could see this being the problem, if the client is only giving the server internal addresses, and the server is obviously not able to respond correctly if it’s trying those internal addresses. I’m not sure how mediasoup is getting the local addresses to send to the server, so I’m not sure why it sometimes gets the external address and succeeds and sometimes gets only internal addresses and fails.

This is definitely not the issue, because no client candidates are sent to the server at all. That is not needed, because the server’s IP address is already publicly reachable.

It doesn’t at all: mediasoup never gets the client’s local addresses.

So what in this whole RTC/ICE process is pairing up local candidates with remote candidates? In the webrtc-internals page, I always see 2-3 local ICE candidates and 1 remote candidate. The remote candidate always looks valid, i.e. it’s the IP address of one of our servers running mediasoup; the port also looks valid, as in it’s a port that is publicly available to the outside world that gets remapped to one of the RTC ports that the mediasoup workers can use.

The main difference I see between successful connections and unsuccessful ones is that unsuccessful connections have no local ICE candidates with the public IP of the client that’s trying to connect, while successful ones always have at least one local ICE candidate that’s a public client IP. I don’t know if this is a symptom of another problem, or the problem itself, but it’s very consistently a difference.
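To make that comparison less manual, here is a minimal sketch that parses the candidate lines shown in chrome://webrtc-internals and reports each one's type and whether its address is in a private IPv4 range. The helper names (`parseCandidate`, `isPrivateIPv4`, `hasServerReflexive`) are hypothetical and not part of mediasoup or any browser API, and the sketch only handles IPv4 addresses.

```typescript
// Hypothetical helpers for inspecting ICE candidate lines; IPv4 only.
interface ParsedCandidate {
  address: string;
  port: number;
  type: string; // "host", "srflx", "prflx" or "relay"
  isPrivate: boolean;
}

// True for the RFC 1918 ranges: 10/8, 172.16/12 and 192.168/16.
function isPrivateIPv4(address: string): boolean {
  const parts = address.split(".").map((p) => Number(p));
  if (parts.length !== 4 || parts.some((p) => !Number.isInteger(p) || p < 0 || p > 255)) {
    return false;
  }
  const a = parts[0] ?? -1;
  const b = parts[1] ?? -1;
  return a === 10 || (a === 172 && b >= 16 && b <= 31) || (a === 192 && b === 168);
}

// Parses "candidate:<foundation> <component> <transport> <priority> <address> <port> typ <type> ...".
// Returns null when the line does not have that shape.
function parseCandidate(line: string): ParsedCandidate | null {
  const fields = line.trim().replace(/^a=/, "").replace(/^candidate:/, "").split(/\s+/);
  if (fields.length < 8 || fields[6] !== "typ") {
    return null;
  }
  const address = fields[4] ?? "";
  const port = Number(fields[5]);
  const type = fields[7] ?? "";
  return { address, port, type, isPrivate: isPrivateIPv4(address) };
}

// True when at least one candidate is server reflexive, i.e. was learned via STUN.
function hasServerReflexive(lines: string[]): boolean {
  return lines.map((l) => parseCandidate(l)).some((c) => c !== null && c.type === "srflx");
}
```

A failing connection in the situation described above would be expected to list only `host` candidates with private addresses, so `hasServerReflexive` would return false for it.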

Look up ICE-Lite. This is what mediasoup implements.

So the problem appeared to be due to not properly contacting a STUN server some of the time, resulting in the client not getting its public IP address. I’m not sure why this became a problem in the last few weeks, since there’d never been an issue on that front prior to that. I don’t know what STUN servers are reached out to when making an RTCPeerConnection with no iceServers specified, but when I passed in an array of a few publicly available STUN servers, the problem went away (on Chrome at least, Firefox still doesn’t work and outputs errors saying that it needs a TURN server, but that’s a problem I can investigate later).
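For anyone wanting to try the same workaround, a rough sketch follows. It assumes the STUN servers are passed through the `iceServers` field of the options given to mediasoup-client's `createSendTransport` / `createRecvTransport`; the post does not say exactly where the array was passed. The helper `withIceServers`, the `IceServer` interface and the server URLs are illustrative assumptions, not the configuration actually used above.

```typescript
// Minimal local stand-in for the browser's RTCIceServer dictionary.
interface IceServer {
  urls: string | string[];
  username?: string;
  credential?: string;
}

// Example public STUN servers (an assumption; the post does not name the ones used).
const EXAMPLE_STUN_SERVERS: IceServer[] = [
  { urls: "stun:stun.l.google.com:19302" },
  { urls: "stun:stun1.l.google.com:19302" },
];

// Returns a copy of the server-provided transport parameters with iceServers attached.
function withIceServers<T extends object>(
  params: T,
  iceServers: IceServer[] = EXAMPLE_STUN_SERVERS,
): T & { iceServers: IceServer[] } {
  return { ...params, iceServers };
}

// Hypothetical usage, where `device` is a loaded mediasoup-client Device and `params`
// holds the id, iceParameters, iceCandidates and dtlsParameters sent by the server:
//   const sendTransport = device.createSendTransport(withIceServers(params));
//   const recvTransport = device.createRecvTransport(withIceServers(params));
```

Firefox asking for a TURN server would need an extra entry with `turn:` URLs plus `username` and `credential`, which this sketch leaves out.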

If iceServers is empty, then no servers are contacted; only the interfaces that the browser sees directly on the machine are used.
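As a small illustration of that rule, the sketch below maps a list of ICE server URLs to the local candidate types a browser would normally be able to gather. The function name is hypothetical, and the mapping is a simplification that assumes the default `iceTransportPolicy` of "all" and that the listed servers are actually reachable.

```typescript
type CandidateType = "host" | "srflx" | "relay";

// Which local candidate types can be gathered for a given set of ICE server URLs.
// Simplified: host candidates come from local interfaces, srflx needs a STUN server,
// relay needs a TURN server.
function expectedCandidateTypes(iceServerUrls: string[]): CandidateType[] {
  const types: CandidateType[] = ["host"];
  if (iceServerUrls.some((u) => /^stuns?:/i.test(u))) {
    types.push("srflx");
  }
  if (iceServerUrls.some((u) => /^turns?:/i.test(u))) {
    types.push("relay");
  }
  return types;
}
```

With an empty list this yields only `host`, which matches the case where every local candidate shows an internal address.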