There appears to be a potential race condition in the SimulcastConsumer’s spatial layer switching logic that could lead to the corruption of a key frame if its constituent RTP packets arrive out of order. This can result in a failed NACK recovery, forcing a more disruptive PLI/FIR recovery cycle.
Problem Analysis:
The core of the issue lies in how SimulcastConsumer handles the transition to a new spatial layer. The logic seems to make a simplifying assumption: that the first packet of a key frame from the new target spatial layer to arrive is also the one with the lowest sequence number.
Here is a specific scenario:
The SimulcastConsumer is currently forwarding spatialLayer = 0. The application requests a switch to a higher layer, so targetSpatialLayer becomes 1.
The Producer sends a key frame for spatialLayer = 1. This key frame is large and is fragmented into multiple RTP packets (e.g., with sequence numbers 1000, 1001, 1002).
Due to network conditions, the packets arrive at the SimulcastConsumer in a reordered sequence: 1001, 1000, 1002.
How SimulcastConsumer processes this:
Packet 1001 arrives first: It’s treated as a key frame packet on the targetSpatialLayer, so the switch happens and packet 1001 is processed and forwarded.
Packet 1000 arrives second: It’s identified as an “old” packet previous to the spatial layer switch and is dropped by this check:
if (!shouldSwitchCurrentSpatialLayer && this->checkingForOldPacketsInSpatialLayer)
{
	// If this is a packet previous to the spatial layer switch, ignore the packet.
	if (SeqManager<uint16_t>::IsSeqLowerThan(
	      packet->GetSequenceNumber(), this->snReferenceSpatialLayer))
	{
#ifdef MS_RTC_LOGGER_RTP
		packet->logger.Dropped(
		  RtcLogger::RtpPacket::DropReason::PACKET_PREVIOUS_TO_SPATIAL_LAYER_SWITCH);
#endif
		this->rtpSeqManager->Drop(packet->GetSequenceNumber());
		return;
	}
	else if (SeqManager<uint16_t>::IsSeqHigherThan(
	           packet->GetSequenceNumber(), this->snReferenceSpatialLayer + MaxSequenceNumberGap))
	{
		this->checkingForOldPacketsInSpatialLayer = false;
	}
}
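To make the reported sequence concrete, here is a minimal standalone sketch of it. It is not mediasoup code: `IsSeqLowerThan` is a simplified stand-in for `SeqManager<uint16_t>::IsSeqLowerThan`, `SimulateSwitch` and `Decision` are hypothetical names, and the model takes the report's assumptions at face value (the first packet to arrive triggers the switch and its sequence number becomes the reference).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified stand-in for a wrap-around-aware comparison of 16-bit RTP
// sequence numbers (assumption: not the actual mediasoup implementation).
bool IsSeqLowerThan(uint16_t lhs, uint16_t rhs)
{
	// lhs is "lower" when it lies less than half the number space behind rhs.
	return lhs != rhs && static_cast<uint16_t>(rhs - lhs) < 0x8000;
}

struct Decision
{
	uint16_t seq;
	bool forwarded;
};

// Hypothetical model of the reported behavior: the first packet seen triggers
// the layer switch, and later packets below that reference are dropped.
std::vector<Decision> SimulateSwitch(const std::vector<uint16_t>& arrivalOrder)
{
	std::vector<Decision> decisions;
	bool switched        = false;
	uint16_t snReference = 0;

	for (uint16_t seq : arrivalOrder)
	{
		if (!switched)
		{
			// Assumption from the report: this packet is detected as a key frame.
			switched    = true;
			snReference = seq;
			decisions.push_back({ seq, true });
			continue;
		}

		// Mirrors the "packet previous to the spatial layer switch" check.
		const bool isOld = IsSeqLowerThan(seq, snReference);
		decisions.push_back({ seq, !isOld });
	}

	return decisions;
}

// Replays the arrival order 1001, 1000, 1002 from the scenario.
void ReplayReportedScenario()
{
	const auto decisions = SimulateSwitch({ 1001, 1000, 1002 });

	assert(decisions[0].forwarded);  // 1001 triggers the switch
	assert(!decisions[1].forwarded); // 1000 is dropped as "old"
	assert(decisions[2].forwarded);  // 1002 passes
}
```

Under those assumptions the forwarded stream is 1001, 1002, with 1000 missing. Whether a non-first packet such as 1001 is ever detected as a key frame is exactly what the replies question.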
Consequences:
The SimulcastConsumer forwards an incomplete key frame to the receiver.
The receiver’s decoder fails, leading to visual artifacts (freezing, green screen, etc.).
The receiver’s Jitter Buffer detects the missing packet and sends a NACK request.
The NACK request fails because the dropped packet was never added to the RtpStreamSend history buffer in mediasoup.
After the NACK mechanism times out, the only remaining recovery option for the client is to send a PLI or FIR, forcing a full key frame request from the producer. This introduces significant delay and a noticeable disruption for the end-user.
My analysis is based solely on code inspection. It is entirely possible that I have overlooked another mechanism or a different part of the codebase that already mitigates this issue. If that is the case, I would be very grateful for a pointer to the relevant code or an explanation of the intended behavior.
AFAIK only the first RTP packet of the key frame will be detected as a key frame, and not the subsequent ones, so the issue you mention should never take place. However, it’s an interesting topic, and now that we are implementing Dependency Descriptor I don’t know whether all RTP packets containing chunks of the same key frame may have the “key frame” flag set to 1 in the DD extension.
Could you please open a ticket in mediasoup GitHub and also mention the Dependency Descriptor point I made above? I am out for some days and I don’t want to forget about this.
Assume an AV1 key frame of 3600 bytes that takes 3 RTP packets to carry:
Only one of the RTP packets (typically the first one of the frame) will carry the full Dependency Descriptor with is_key_frame = 1.
According to the AV1 RTP Payload spec and the Dependency Descriptor draft:
The Dependency Descriptor describes a full frame, not just a fragment.
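Usually, only the first RTP packet of the frame (the one marked as the start of the frame, with start_of_frame = 1 in the descriptor; in the AV1 aggregation header, Z=1 marks a continuation from the previous packet, not a frame start) includes: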
The full frame dependency structure
Fields like is_key_frame, frame_number, frame_dependencies, etc.
Subsequent RTP packets of the same frame:
May omit the DD entirely
Or may carry a reduced DD, which often doesn’t include the is_key_frame field again.
The is_key_frame (or is_switch_frame) field won’t be set to 1 in all RTP packets of the frame. It will only appear in the packet that includes the full Dependency Descriptor, typically the first.
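As a rough illustration of how a packet can be recognized as the start of a frame, the sketch below reads only the three mandatory bytes of a Dependency Descriptor. It assumes the layout of the mandatory descriptor fields in the AV1 RTP payload spec (start_of_frame, end_of_frame, a 6-bit frame_dependency_template_id, then a 16-bit frame_number). The struct and function names are hypothetical and this is not how mediasoup parses the extension.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical container for the mandatory descriptor fields.
struct MandatoryDescriptorFields
{
	bool startOfFrame;
	bool endOfFrame;
	uint8_t templateId;   // frame_dependency_template_id (6 bits)
	uint16_t frameNumber; // frame_number (16 bits, big endian)
};

// Parses the first three bytes of a Dependency Descriptor payload.
// Returns std::nullopt when the buffer is too short to hold them.
std::optional<MandatoryDescriptorFields> ParseMandatoryFields(const uint8_t* data, size_t len)
{
	if (data == nullptr || len < 3)
	{
		return std::nullopt;
	}

	MandatoryDescriptorFields fields{};

	fields.startOfFrame = (data[0] & 0x80) != 0;
	fields.endOfFrame   = (data[0] & 0x40) != 0;
	fields.templateId   = static_cast<uint8_t>(data[0] & 0x3F);
	fields.frameNumber  = static_cast<uint16_t>((data[1] << 8) | data[2]);

	return fields;
}

// Example: first packet of frame 42, followed by a middle packet of that frame.
void DescriptorExample()
{
	const uint8_t first[]  = { 0x80, 0x00, 0x2A };
	const uint8_t middle[] = { 0x00, 0x00, 0x2A };

	const auto a = ParseMandatoryFields(first, sizeof(first));
	const auto b = ParseMandatoryFields(middle, sizeof(middle));

	assert(a && a->startOfFrame && !a->endOfFrame && a->frameNumber == 42);
	assert(b && !b->startOfFrame && b->frameNumber == 42);
}
```

Packets that share a frame_number belong to the same frame, so in principle a receiver could associate a reordered fragment with its key frame even when only the first packet carries the key frame information.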
As Iñaki said, the upcoming new version will include Dependency Descriptor support for H264 and it will behave the same way: just the first packet of the key frame contains the whole dependency structure, and only that packet will be considered a key frame by mediasoup.
1. A VP8 keyframe takes two RTP packets with seq numbers 1001 and 1002.
2. The producer device sends both, but 1002 arrives at mediasoup first.
3. mediasoup sends a NACK requesting retransmission of 1001, but it will arrive later anyway.
4. Packet 1002 is not a keyframe, so the SimulcastConsumer that was waiting for a keyframe of that ssrc will ignore it and won’t store it in the retransmission buffer of the consumer.
5. Later packet 1001 arrives and the SimulcastConsumer allows it to pass and switches to that ssrc stream, so the consumer device receives packet 1001, which is the first half of the keyframe.
6. Later the producer sends packet 1003 and the consumer device receives it.
7. Then the consumer device sends a NACK requesting packet 1002.
8. But 1002 is not in the retransmission buffer of the SimulcastConsumer, so mediasoup will not retransmit it.
9. The result is that 1002 will never be sent to the consumer device and hence, after some terrible seconds, the consumer device will send a PLI requesting a new keyframe.
Is this correct? Or is step 8 wrong? If this assumption is correct there is room for improvement.
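The sequence above can be replayed with a toy model. This is only a sketch of the behavior as described in this thread, not of the real SimulcastConsumer or RtpStreamSend: `ToyConsumer`, `OnPacket` and `CanRetransmit` are hypothetical names, and the model assumes that packets dropped while waiting for a keyframe are never stored for retransmission.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

struct Packet
{
	uint16_t seq;
	bool isKeyFrame; // true only for the first packet of a keyframe
};

// Hypothetical, heavily simplified consumer that waits for a keyframe
// before it starts forwarding a stream.
class ToyConsumer
{
public:
	// Returns true if the packet is forwarded to the consumer device.
	bool OnPacket(const Packet& packet)
	{
		if (this->waitingForKeyFrame)
		{
			if (!packet.isKeyFrame)
			{
				// Dropped before reaching the retransmission buffer.
				return false;
			}

			this->waitingForKeyFrame = false;
		}

		// Only forwarded packets are stored for later NACK handling.
		this->retransmissionBuffer.insert(packet.seq);

		return true;
	}

	// Returns true if a NACK for this sequence number could be served.
	bool CanRetransmit(uint16_t seq) const
	{
		return this->retransmissionBuffer.count(seq) != 0;
	}

private:
	bool waitingForKeyFrame{ true };
	std::set<uint16_t> retransmissionBuffer;
};

// Replays the arrival order from the steps: 1002, 1001, 1003.
void ReplayVp8Scenario()
{
	ToyConsumer consumer;

	assert(!consumer.OnPacket({ 1002, false })); // ignored, not stored
	assert(consumer.OnPacket({ 1001, true }));   // keyframe start, switch happens
	assert(consumer.OnPacket({ 1003, false }));  // forwarded normally

	assert(!consumer.CanRetransmit(1002)); // the NACK for 1002 cannot be served
	assert(consumer.CanRetransmit(1001));
}
```

In this model the consumer device ends up with 1001 and 1003 but can never obtain 1002, which matches the outcome described in the last step: recovery is only possible through a PLI.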
A consumer is waiting to switch up to a new spatial layer and needs a key frame from that layer.
A key frame is sent, split into multiple packets (100,101,102).
Due to network reordering, the second packet of the key frame (101) arrives before the first packet (100).
Based on the current logic, my understanding is that only the first packet (100) is identified as a key frame. When 101 arrives first, the consumer will see it’s for the correct target layer but not a key frame, and will therefore drop it.
Later, when 100 arrives, the layer switch will proceed, but 101 has already been permanently lost. This results in a corrupted key frame being sent to the receiver, and since the packet was dropped early, it cannot be recovered via NACK.